Java PDF Blog

Glossary of PDF Terms

This article lists and defines commonly used terminology from the PDF world.

 

AcroForm

AcroForm is a PDF form format that was introduced in PDF 1.2. It uses a dictionary (/AcroForm) that is added to the Catalog. Learn more.

Action

Actions are automatic behaviours triggered by a user interaction or event. These are commonly used to navigate to different pages, or play multimedia content.

Adobe Corporation

Adobe created the original PDF file format, along with key software such as Distiller and Acrobat Reader for creating and viewing PDF files. The standard is now open, but Adobe remains a key player.

AES

Advanced Encryption Standard is a cryptographic cipher used to protect information.

Alt text

Alternative (often just alt) text is descriptive text of an image which can be used by accessibility technology.

Annotation

An annotation is a note, link, or piece of rich media which sits on a page and can be interacted with by the user. Learn more.

Anti-aliasing

Anti-aliasing is a technique to produce smoother edges in rasterized content.

AP

AP stands for appearance and defines how an interactive element (form field or annotation) should look. AP entries typically contain a stream or resource dictionary. Learn more.

Approval signature

Approval signatures are digital signatures that can detect changes in a document and confirm the signer of the document.

Arlington model

The Arlington model is a machine-readable model of all PDF objects.

Array object

An array object is a one-dimensional collection of objects arranged in a sequence and implicitly numbered, starting from 0.

Artifact

Artifacts are pieces of content in the document, such as page numbers or decorative graphics, which are not meant to be read by accessibility technology.

ASCII

American Standard Code for Information Interchange is a common convention for encoding a specific set of 128 characters as binary numbers.
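
As a small illustration of the 128-character limit, the sketch below checks whether a piece of text fits within ASCII. The class and method names are made up for this example.

```java
final class AsciiCheck {
    // Returns true when every character is one of ASCII's 128 code points (0 to 127).
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("PDF"));        // true
        System.out.println(isAscii("caf\u00e9"));  // false: e-acute is outside ASCII
    }
}
```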

AVIF

AV1 image file format is used for storing images and/or videos and is similar to HEIC. Learn more.

Binary data

Binary data is a sequence of bytes which usually requires context to have meaning.

Blending

Blending modes define what happens when two colours are drawn on top of each other. Learn more.
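
To make this concrete, here is a minimal sketch of one blend mode, Multiply, where each resulting color component is the product of the backdrop and source components. It assumes components in the range 0.0 to 1.0 and ignores transparency; the class name is hypothetical.

```java
import java.util.Arrays;

final class MultiplyBlend {
    // Multiply blend for one color component; both inputs are in the range 0.0 to 1.0.
    static double multiply(double backdrop, double source) {
        return backdrop * source;
    }

    // Applies the blend to each component of two colors with the same number of components.
    static double[] multiply(double[] backdrop, double[] source) {
        double[] result = new double[backdrop.length];
        for (int i = 0; i < backdrop.length; i++) {
            result[i] = multiply(backdrop[i], source[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] gray = {0.5, 0.5, 0.5};
        double[] red = {1.0, 0.0, 0.0};
        System.out.println(Arrays.toString(multiply(gray, red))); // [0.5, 0.0, 0.0]
    }
}
```

Multiplying by white (1.0) leaves the backdrop unchanged, while multiplying by black (0.0) always gives black.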

BMP

BMP is a raster graphics image file format. Learn more.

Bookmarks

Bookmarks are the informal name for Outlines.

Boolean object

Boolean objects represent either true or false.

Byte

A byte is eight binary bits.

Catalog

The catalog contains references to other objects defining the document’s contents, outline, article threads, named destinations, and other attributes. Learn more.

CCITT

CCITT is a lossless compression algorithm usually used for black and white images. Learn more.

Certificate

A certificate proves the authenticity of digital content.

Certification signature

A certification signature is very similar to an approval signature, with the addition that it can restrict which changes may be made to the document after signing.

Character

A character is a numeric code representing a letter, number, or symbol, defined by an encoding. Common encodings are ASCII and UTF-8.

CID fonts

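CID fonts are designed for large character sets such as Chinese, Japanese, and Korean. Glyphs are selected by a character identifier (CID), and the glyph descriptions can be based on either Type 1 or TrueType outlines. Learn more.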

CMYK

CMYK is a subtractive color model which uses cyan, magenta, yellow, and key (black). CMYK masks colors on a white background, hence the name subtractive.
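
The subtractive idea can be sketched with the simple, uncalibrated CMYK to RGB formula below. This is only an approximation for illustration; accurate conversion normally relies on ICC color profiles. The class name is hypothetical.

```java
import java.util.Arrays;

final class CmykToRgb {
    // Converts CMYK components (each 0.0 to 1.0) to 8-bit RGB.
    // Each ink subtracts light from white: more ink means a lower RGB value.
    static int[] toRgb(double c, double m, double y, double k) {
        int r = (int) Math.round(255 * (1 - c) * (1 - k));
        int g = (int) Math.round(255 * (1 - m) * (1 - k));
        int b = (int) Math.round(255 * (1 - y) * (1 - k));
        return new int[] {r, g, b};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toRgb(0, 0, 0, 0))); // [255, 255, 255] (no ink: white)
        System.out.println(Arrays.toString(toRgb(1, 0, 0, 0))); // [0, 255, 255] (full cyan)
    }
}
```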

Color spaces

A color space is a collection of colors that allows for reproducible results on different devices and outputs. Learn more.

Comment

Comments in PDF files are a seldom used feature that explain or annotate the file for people reading the source code. They are written using a % symbol.

Compressed object

Compressed objects were introduced in PDF 1.5 and allow objects to be stored in object streams, which can then be compressed. Learn more.

Conformance

Conformance refers to whether a PDF abides by the rules of a certain subset of the PDF specification. Common subsets are PDF/A and PDF/X.

Content stream

A content stream contains graphical elements which are painted on a page.

COS

Carousel Object Syntax refers to the syntax used inside PDF files to describe objects.

Cross reference stream

Cross reference streams were introduced in PDF 1.5. They store the cross reference section in a stream, which takes up far less space.

Cross reference section

Cross reference sections are located near the end of a PDF file, just before the trailer, and list each indirect object along with its byte offset in the file. Learn more.
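
In a classic (non-stream) cross reference table, each entry is a fixed-width line such as `0000000017 00000 n`: a 10-digit number, a 5-digit generation number, and `n` (in use) or `f` (free). The sketch below parses one such line; the class and field names are invented for this example, and it does not handle cross reference streams.

```java
final class XrefEntry {
    final long offset;      // byte offset for in-use entries (next free object number for free ones)
    final int generation;   // generation number of the object
    final boolean inUse;    // true for 'n', false for 'f'

    XrefEntry(long offset, int generation, boolean inUse) {
        this.offset = offset;
        this.generation = generation;
        this.inUse = inUse;
    }

    // Parses a single entry line from a classic cross reference table.
    static XrefEntry parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3 || parts[0].length() != 10 || parts[1].length() != 5) {
            throw new IllegalArgumentException("Not a cross reference entry: " + line);
        }
        boolean inUse;
        if (parts[2].equals("n")) {
            inUse = true;
        } else if (parts[2].equals("f")) {
            inUse = false;
        } else {
            throw new IllegalArgumentException("Unknown entry type: " + parts[2]);
        }
        return new XrefEntry(Long.parseLong(parts[0]), Integer.parseInt(parts[1]), inUse);
    }

    public static void main(String[] args) {
        XrefEntry entry = parse("0000000017 00000 n");
        System.out.println(entry.offset + " " + entry.generation + " " + entry.inUse); // 17 0 true
    }
}
```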

DCT

Discrete Cosine Transform is a technique used in lossy compression, commonly in JPEG and WebP.

Deprecated

Anything described as deprecated is not recommended for use as it may not be supported going forward. Deprecated features in PDF are usually ignored by modern PDF processors. For example, XFA has been deprecated and is no longer supported in most PDF readers.

Dictionary object

A dictionary object contains key-value pairs of other objects.

Direct object

A direct object is the opposite of an indirect object. Rather than point to another object, it inlines the object data.

Document part

A document part is a set of related pages.

Document part hierarchy

Document part hierarchies organise many document parts.

EOL marker

A whitespace character used to create new lines. Either a carriage return or a line feed (or both) are used at the end of a line. Learn more.

EXIF

EXIF is a metadata format for image files. Learn more.

FDF file

Forms Data Format files store form and annotation data from PDF forms.

Filter

A filter allows streams to be encoded, usually to save space. Learn more.

Font

A font is an implementation which creates a typeface. Learn more.

Font program

A font program (also known as a font file) is a file which describes how to draw a font.

Form

PDF forms contain fillable fields and other interactive features where users can input personal information. Learn more.

Generation number

A generation number is a non-negative integer which represents different revisions of the same object. Most of the time the value is zero.

Ghostscript

Ghostscript is an open source interpreter for the PostScript language and PDF files.

GIF

GIF is a lossless image format which supports animated images. Learn more.

Glyph

A glyph is a specific visual form of a character, numeral, or abstract symbol.

Graphics state

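The graphics state is a set of graphics control parameters (such as the current color and line width) which affect the currently executing graphics operators. Graphics states can be saved to and restored from a stack. Learn more.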

HEIC

HEIC is a variant of the High Efficiency Image File Format (HEIF), a lossy image format developed by the Moving Picture Experts Group (MPEG). It is an open standard but primarily used on Apple devices. Learn more.

Hinting

Font hinting refers to instructions that adjust the display of a font so that it lines up with the pixel grid when rasterized. It is essential on low-resolution screens to produce readable text. Learn more.

HTML

Hypertext markup language is the language used to create webpages for display in web browsers. It is often accompanied by JavaScript and CSS.

Incremental updates

Incremental updates allow a PDF file to be modified without rewriting the entire file. Changes are appended to the end of the file, leaving the original contents unchanged.
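
Because each appended update ends with its own `%%EOF` marker, counting those markers gives a rough idea of how many times a file has been saved incrementally. The sketch below is only a heuristic (the marker could also occur inside stream data), and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class RevisionCounter {
    // Counts occurrences of the %%EOF marker in the raw bytes of a PDF file.
    static int countEofMarkers(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one character, so no data is lost.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        String marker = "%%EOF";
        int count = 0;
        int index = text.indexOf(marker);
        while (index != -1) {
            count++;
            index = text.indexOf(marker, index + marker.length());
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n...\n%%EOF\n...\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(countEofMarkers(sample)); // 2
    }
}
```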

Indirect object

An indirect object is labelled with an object identifier and sits between the keywords obj and endobj.

Integer object

An object containing a positive or negative whole number with no fractional part.

ISO 32000

ISO 32000 is the technical specification document which defines the PDF file format. Learn more.

JavaScript

JavaScript is a programming language that is commonly used on websites. It can also be embedded in PDF files for form validation and interactive elements. Because of security concerns, many PDF viewers restrict or disable it, and it is prohibited in PDF/A. Learn more.

JBIG2

JBIG2 is an image compression standard for two color (usually black and white) images. Learn more.

JPEG

JPEG, named after the Joint Photographic Experts Group which created it, is an extremely common lossy image file format. Learn more.

JPEG 2000

JPEG 2000 (also JP2 or JPX) is a file format meant to succeed JPEG which offers better compression and higher quality images. Learn more.

JPEG XL

JPEG XL is the newest version of JPEG and is meant to have better compression and quality than JPEG 2000. Learn more.

Kerning

Kerning refers to adjusting the space between individual pairs of glyphs. It is used to create more visually appealing text.

Key

1. A key is a unique identifier in a key-value pair, used in dictionaries.
2. A key is used to encrypt or decrypt a message.

Linearized PDF

A linearized PDF has been organised in such a way that it allows for more efficient page loading when the document is being streamed. Objects are reordered so what is needed first is at the top of the file. Learn more.

Lossless

Using lossless compression means the data can be identically reproduced when decompressed. Learn more.
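
The round trip can be demonstrated with Deflate, the algorithm behind PDF's FlateDecode filter, using the `java.util.zip` classes from the Java standard library. This is a minimal sketch with hypothetical class and method names, not a PDF filter implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class LosslessRoundTrip {
    // Compresses the input with Deflate.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses Deflate data back to the original bytes.
    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid Deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "BT /F1 12 Tf (Hello) Tj ET ".repeat(20).getBytes(StandardCharsets.US_ASCII);
        byte[] restored = decompress(compress(original));
        System.out.println(Arrays.equals(original, restored)); // true
    }
}
```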

Lossy

Using lossy compression means the data is only approximately reproduced when decompressed. It is a tradeoff between compressed size, speed, and quality. Lossy compression is most commonly used with audio, video, and images. Learn more.

LZW

Lempel-Ziv-Welch is a lossless compression algorithm, commonly used in GIF images. Learn more.

Metadata

Metadata is data which provides information about other data.

Name object

A name object is a symbol represented by a forward slash followed by a sequence of characters.
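
For example, `/Type` is the name Type. Since PDF 1.2, characters can also be written inside a name as `#` followed by two hexadecimal digits, so `/Lime#20Green` is the name "Lime Green". The sketch below decodes a name token under the simplifying assumption that each escaped byte maps to a single character; the class name is hypothetical.

```java
final class PdfName {
    // Decodes a name token such as "/Lime#20Green" into its value "Lime Green".
    static String decode(String token) {
        if (!token.startsWith("/")) {
            throw new IllegalArgumentException("A name must start with a forward slash: " + token);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '#' && i + 2 < token.length() + 0 && i + 3 <= token.length()) {
                // Two hexadecimal digits follow the '#'.
                out.append((char) Integer.parseInt(token.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("/Type"));         // Type
        System.out.println(decode("/Lime#20Green")); // Lime Green
    }
}
```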

Name tree

A name tree is similar to a dictionary; however, all of the keys are strings and are kept in order.

Null object

The null object has no value and is represented by the keyword null.

Number tree

A number tree is similar to a dictionary; however, all of the keys are integers and are kept in order.

Numeric object

A numeric object is either an integer object or a real object.

Object

An object is a basic data structure used to represent information in a PDF file. An object can be an array, boolean, dictionary, integer, name, null, real, stream, or string. They are written using COS syntax. Learn more.

Object number

An object number is an integer greater than zero which is uniquely assigned to each object within a PDF file. They can be in any arbitrary order but there must not be any duplicates.

Object identifier

An object identifier (also known as an object reference) consists of an object number and a generation number followed by either an R or obj.
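
For instance, `12 0 R` refers to object number 12 with generation number 0. A minimal sketch of parsing such a reference is shown below; the class and field names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ObjectReference {
    // Object number, whitespace, generation number, whitespace, then the keyword R.
    private static final Pattern REFERENCE = Pattern.compile("(\\d+)\\s+(\\d+)\\s+R");

    final int objectNumber;
    final int generation;

    ObjectReference(int objectNumber, int generation) {
        this.objectNumber = objectNumber;
        this.generation = generation;
    }

    // Parses text such as "12 0 R"; throws if the text is not a reference.
    static ObjectReference parse(String text) {
        Matcher matcher = REFERENCE.matcher(text.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not an object reference: " + text);
        }
        return new ObjectReference(Integer.parseInt(matcher.group(1)), Integer.parseInt(matcher.group(2)));
    }

    public static void main(String[] args) {
        ObjectReference ref = parse("12 0 R");
        System.out.println(ref.objectNumber + " " + ref.generation); // 12 0
    }
}
```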

OCR

Optical character recognition is the process of converting hand-written or printed text into machine readable text.

Operator

Operators are keywords used within content streams in PDF files and are instructions on how to render content. Many of them are derived from PostScript.

OpenType

OpenType fonts were developed by Microsoft and Adobe and are derived from TrueType fonts.

Outline

The outline of a PDF document describes the structure of its pages and sections, similarly to a contents page, and can be used to navigate.

PDF

Portable Document Format is a file format designed to display documents consistently no matter the device. Learn more.

PDF Association

The PDF Association is the open trade body which supports and develops the PDF file format. Any interested company or individual can join and participate.

PDF Processor

A PDF processor is a piece of software that can read or write PDF files while conforming to the PDF specification.

PDF version

Different versions of the PDF specification are available, with newer ones being more refined and containing the latest features. Download the latest specification here.
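
A PDF file declares its version in a header at the very start of the file, for example `%PDF-1.7`. The sketch below reads that header; the class name is hypothetical, and note that from PDF 1.4 onwards a Version entry in the catalog can override the header, which this example does not check.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfHeaderVersion {
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    // Returns the version string (such as "1.7") from the first bytes of a PDF file.
    static String versionOf(byte[] fileStart) {
        String text = new String(fileStart, StandardCharsets.ISO_8859_1);
        Matcher matcher = HEADER.matcher(text);
        if (!matcher.lookingAt()) {
            throw new IllegalArgumentException("No %PDF header at the start of the data");
        }
        return matcher.group(1);
    }

    public static void main(String[] args) {
        byte[] start = "%PDF-1.7\n1 0 obj".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(versionOf(start)); // 1.7
    }
}
```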

PDF/A

PDF/A is a stripped down version of the PDF specification designed for long term document preservation and compatibility on the maximal number of devices.

PDF/E

PDF/E is a format designed for engineering use as it supports embedding 3D models.

PDF/R

PDF/R is a format designed for storing multi-page raster images.

PDF/UA

PDF/UA is a format designed to work with accessibility technology.

PDF/VCR

PDF/VCR enables variable content replacement for variable data printing.

PDF/VT

PDF/VT is an extension of PDF/X and supports variable data printing.

PDF/X

PDF/X is a format commonly used by graphics designers and print professionals.

PNG

Portable Network Graphics is a lossless image format commonly used on the internet. Learn more.

PostScript

PostScript is a page description language used in electronic documents. PDF is based on a simplified version.

Preflight

Preflight refers to the scanning of a PDF document to ensure it conforms to a number of specified conditions and that it is ready for print production.

Raster

A raster is a matrix of cells which contain color data to represent an image.

Real object

A real object is a floating point number (a number with a fractional part) with limited range and precision.

Rectangle

A rectangle is an array object which describes either locations on a page or bounding boxes. It contains four numbers which represent the bottom left and upper right corners of the rectangle.
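
As an example, an A4 page is commonly described by the rectangle `[0 0 595 842]`. The sketch below, using a hypothetical class name, derives the width and height from the four numbers.

```java
final class PdfRectangle {
    final double llx, lly, urx, ury; // lower-left x/y and upper-right x/y

    PdfRectangle(double llx, double lly, double urx, double ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    // Math.abs keeps the result positive even if the corners are given in the opposite order.
    double width() {
        return Math.abs(urx - llx);
    }

    double height() {
        return Math.abs(ury - lly);
    }

    public static void main(String[] args) {
        PdfRectangle a4 = new PdfRectangle(0, 0, 595, 842);
        System.out.println(a4.width() + " x " + a4.height()); // 595.0 x 842.0
    }
}
```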

Redaction

Redaction involves censoring parts of a document so that it may be published without revealing sensitive information.

Resource dictionary

A resource dictionary associates resource names with their objects, grouped by category (such as /Font).

Running text

Running text is the main text within the document’s body.

SHA

Secure Hash Algorithm is a family of cryptographic hash functions commonly used to protect passwords.
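
A hash function turns any input into a fixed-size digest. The sketch below computes a SHA-256 digest with `MessageDigest` from the Java standard library; the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha256Example {
    // Returns the SHA-256 digest of the text (encoded as UTF-8) as lowercase hexadecimal.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc"));
    }
}
```

The digest is always 64 hexadecimal characters (256 bits), regardless of the length of the input.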

Signature handler

A signature handler is software that implements the creation of digital signatures.

sRGB

sRGB is a standard red, green, and blue color space which is very commonly used.

Stream object

A stream object contains a dictionary, followed by some binary data between the keywords stream and endstream.

String

A string is a sequence of characters.

Structured text

Structured text contains additional information about how the text is laid out. Learn more.

Tagged PDF

A Tagged PDF file contains information about how its content is structured. Learn more.

TIFF

Tagged Image File Format is a format that can store one or more images. Learn more.

Trailer

The trailer is a dictionary at the end of a PDF file. It contains entries such as the size of the cross reference table (/Size), a reference to the document catalog (/Root), and a reference to the info metadata object (/Info).

TrueType Font

TrueType fonts were designed by Apple and Microsoft as a competitor to Adobe’s Type 1 fonts.

Type 1 Font

PostScript Type 1 fonts are a commonly used font format in PDF files as they produce high quality output and you can extract the text values easily. Learn more.

Type 3 Font

Type 3 fonts in PDF files have glyphs defined by streams of graphics operators (in PostScript itself, by the full PostScript language). They do not support hinting and are rarely used in PDF files. Learn more.

Unicode

Unicode refers to a range of character encodings which all map onto the universal character set. Learn more.

Unstructured text

Unstructured text has no model or structure to its layout; it is simply text.

UTF-8

Unicode Transformation Format 8-bit (UTF-8) is the most commonly used character encoding and is backward compatible with ASCII.
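
The relationship with ASCII can be seen in a short sketch: ASCII characters take a single byte in UTF-8 with the same value as in ASCII, while other characters take two to four bytes. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {
    // Returns the number of bytes needed to store the text in UTF-8.
    static int byteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLength("A"));      // 1: same single byte (65) as in ASCII
        System.out.println(byteLength("\u00e9")); // 2: e-acute
        System.out.println(byteLength("\u20ac")); // 3: euro sign
    }
}
```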

Vector

A vector is a quantity with two dimensions, commonly direction and magnitude.

WebP

WebP is an image format created by Google. Learn more.

Whitespace character

Whitespace characters refer to non-printable characters that still have meaning in text. This could be a space, tab, new line, or something else. It is called whitespace because paper is commonly white.
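
Within PDF syntax, six byte values are treated as whitespace: null (0), tab (9), line feed (10), form feed (12), carriage return (13), and space (32). A minimal check, with a hypothetical class name, could look like this:

```java
final class PdfWhitespace {
    // Returns true if the byte value is one of the six PDF whitespace characters.
    static boolean isWhitespace(int b) {
        switch (b) {
            case 0:  // null
            case 9:  // horizontal tab
            case 10: // line feed
            case 12: // form feed
            case 13: // carriage return
            case 32: // space
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWhitespace(' '));  // true
        System.out.println(isWhitespace('\n')); // true
        System.out.println(isWhitespace('A'));  // false
    }
}
```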

XFA

XML Forms Architecture was introduced in PDF 1.5; however, it was deprecated in PDF 2.0.

XFDF

XFDF is very similar to the FDF file format, except data is represented as XML.

XML

Extensible markup language is a file format for storing arbitrary data. Its syntax is similar to HTML.

XMP

Extensible metadata platform is an XML based metadata format which stores information about a file. Learn more.

XObject

XObjects are containers for a sequence of graphics objects. Learn more.

Z-Index

Z-Index refers to the order of overlapping elements. Higher numbered elements appear in front of lower numbered elements.

Sources:
ISO 32000-2:2020-12 PDF 2.0 Specification
PDF Association Glossary of PDF terms