This article lists and defines commonly used terminology from the PDF world.
AcroForm
AcroForm is a PDF form format that was introduced in PDF 1.2. It uses a dictionary (/AcroForm) that is added to the Catalog. Learn more.
Action
Actions are automatic behaviours triggered by a user interaction or event. These are commonly used to navigate to different pages, or play multimedia content.
Adobe Corporation
Adobe created the original PDF file format, along with key software such as Distiller and Acrobat Reader for creating and viewing PDF files. The standard is now open, but Adobe remains a key player.
AES
Advanced Encryption Standard is a cryptographic cipher used to protect information.
Alt text
Alternative (often just alt) text is descriptive text of an image which can be used by accessibility technology.
Annotation
Annotations can be a note, link, or rich media which sits on a page and can be interacted with by the user. Learn more.
Anti-aliasing
Anti-aliasing is a technique to produce smoother edges in rasterized content.
AP
AP stands for appearance and defines how an interactive element (form field or annotation) should look. AP entries typically contain a stream or resource dictionary. Learn more.
Approval signature
Approval signatures are digital signatures that can detect changes in a document and confirm the signer of the document.
Arlington model
The Arlington model is a machine-readable model of all PDF objects.
Array object
An array object is a one-dimensional collection of objects arranged in sequence and implicitly numbered starting from 0.
Artifact
Artifacts are content in the document which is not meant to be read by accessibility technology.
ASCII
American Standard Code for Information Interchange is a common convention for encoding a specific set of 128 characters as binary numbers.
AVIF
AV1 image file format is used for storing images and/or videos and is similar to HEIC. Learn more.
Binary data
Binary data is a sequence of bytes which usually requires context to have meaning.
Blending
Blending modes define what happens when two colours are drawn on top of each other. Learn more.
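As a rough illustration, two of the standard blend modes, Multiply and Screen, can be written as simple functions of a backdrop and a source color component. This is a minimal sketch that assumes each component is a float in the range 0 to 1; the function names are hypothetical.

```python
def blend_multiply(backdrop: float, source: float) -> float:
    """Multiply: the result is never lighter than either input."""
    return backdrop * source


def blend_screen(backdrop: float, source: float) -> float:
    """Screen: the result is never darker than either input."""
    return backdrop + source - backdrop * source


if __name__ == "__main__":
    # Mid-grey drawn over mid-grey.
    print(blend_multiply(0.5, 0.5))
    print(blend_screen(0.5, 0.5))
```

A real renderer applies the chosen function to every color component and then composites the result using the source and backdrop opacity, which this sketch leaves out.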
BMP
BMP is a raster graphics image file format. Learn more.
Bookmarks
Bookmarks are the informal name for Outlines.
Boolean object
Boolean objects represent either true or false.
Byte
A byte is eight binary bits.
Catalog
The catalog contains references to other objects defining the document’s contents, outline, article threads, named destinations, and other attributes. Learn more.
CCITT
CCITT is a lossless compression algorithm usually used for black and white images. Learn more.
Certificate
A certificate proves the authenticity of digital content.
Certification signature
A certification signature is very similar to an approval signature with the bonus that it can block certain actions from signature handlers.
Character
A character is a numeric code representing a letter, number, or symbol, defined by an encoding. Common encodings are ASCII and UTF-8.
CID fonts
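CID fonts are fonts whose glyphs are addressed by character identifiers (CIDs) rather than by name. They can be based on either Type 1 or TrueType outlines and are designed for large character sets such as Chinese, Japanese, and Korean. Learn more.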
CMYK
CMYK is a subtractive color model which uses cyan, magenta, yellow, and key (black). CMYK masks colors on a white background, hence the name subtractive.
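To make the subtractive idea concrete, here is a naive CMYK to RGB conversion. It is only a sketch: it assumes components in the range 0 to 1 and ignores ICC profiles, so the output will not match a color-managed PDF processor. The function name is hypothetical.

```python
def cmyk_to_rgb(c: float, m: float, y: float, k: float) -> tuple[int, int, int]:
    """Naive conversion: each ink removes light from a white (255) background."""
    r = round(255 * (1 - c) * (1 - k))
    g = round(255 * (1 - m) * (1 - k))
    b = round(255 * (1 - y) * (1 - k))
    return (r, g, b)


if __name__ == "__main__":
    print(cmyk_to_rgb(0, 0, 0, 0))  # no ink: the white background shows through
    print(cmyk_to_rgb(0, 1, 1, 0))  # magenta plus yellow ink
    print(cmyk_to_rgb(0, 0, 0, 1))  # full black ink
```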
Color spaces
A color space is a collection of colors that allows for reproducible results on different devices and outputs. Learn more.
Comment
Comments in PDF files are a seldom-used feature that explain or annotate the file for people reading the source code. They are written using a % symbol.
Compressed object
Compressed objects were introduced in PDF 1.5 and allow objects to be stored in object streams, which can then be compressed. Learn more.
Conformance
Conformance refers to whether a PDF abides by the rules of a certain subset of the PDF specification. Common subsets are PDF/A and PDF/X.
Content stream
A content stream contains graphical elements which are painted on a page.
COS
Carousel Object Syntax refers to the syntax used inside PDF files to describe objects.
Cross reference stream
Cross reference streams, introduced in PDF 1.5, store the cross reference section in a stream, which takes up far less space.
Cross reference section
Cross reference sections are located near the end of a PDF file, immediately before the trailer, and contain a list of objects and their positions in the file. Learn more.
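In a classic cross reference table, each entry is a fixed-width line holding a 10-digit number, a 5-digit generation number, and the letter n (in use) or f (free). The sketch below parses one such line; the XrefEntry type and parse_xref_entry function are hypothetical names used for illustration only.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    # For in-use entries this is the byte offset of the object in the file.
    # For free entries it is the object number of the next free object.
    offset: int
    generation: int
    in_use: bool


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse a single entry such as '0000000017 00000 n'."""
    parts = line.split()
    if len(parts) != 3 or parts[2] not in ("n", "f"):
        raise ValueError(f"not a cross reference entry: {line!r}")
    offset, generation, kind = parts
    return XrefEntry(int(offset), int(generation), kind == "n")


if __name__ == "__main__":
    print(parse_xref_entry("0000000017 00000 n"))
    print(parse_xref_entry("0000000000 65535 f"))
```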
DCT
Discrete Cosine Transform is a compression technique commonly used in JPEG and WebP.
Deprecated
Anything described as deprecated is not recommended for use as it may not be supported going forward. Deprecated features in PDF are usually ignored by modern PDF processors. For example, XFA has been deprecated and is no longer supported in most PDF readers.
Dictionary object
A dictionary object contains key-value pairs of other objects.
Direct object
A direct object is the opposite of an indirect object. Rather than point to another object, it inlines the object data.
Document part
A document part is a set of related pages.
Document part hierarchy
Document part hierarchies organise many document parts.
EOL marker
An EOL marker is a whitespace character used to create new lines. Either a carriage return or a line feed (or both) is used at the end of a line. Learn more.
EXIF
EXIF is a metadata format for image files. Learn more.
FDF file
Forms Data Format files store form and annotation data from PDF forms.
Filter
A filter allows streams to be encoded, usually to save space. Learn more.
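One widely used filter is FlateDecode, which is based on the zlib/deflate method. The sketch below uses Python's standard zlib module to show the round trip on some made-up stream data; the helper names are hypothetical, and a real PDF processor would also handle filter parameters such as predictors.

```python
import zlib


def flate_encode(data: bytes) -> bytes:
    """Compress raw stream bytes with zlib (the method behind FlateDecode)."""
    return zlib.compress(data)


def flate_decode(data: bytes) -> bytes:
    """Recover the original stream bytes."""
    return zlib.decompress(data)


if __name__ == "__main__":
    # Repetitive sample data, loosely resembling content stream text operators.
    raw = b"BT /F1 12 Tf (Hello) Tj ET\n" * 50
    encoded = flate_encode(raw)
    print(len(raw), len(encoded))
    print(flate_decode(encoded) == raw)
```

Because the compression is lossless, decoding returns exactly the bytes that were encoded.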
Font
A font is an implementation which creates a typeface. Learn more.
Font program
A font program (also known as a font file) is a file which describes how to draw a font.
Form
PDF forms contain fillable fields and other interactive features where users can input personal information. Learn more.
Generation number
A generation number is a non-negative integer which represents different revisions of the same object. Most of the time the value is zero.
Ghostscript
Ghostscript is an open source interpreter for the PostScript language and PDF files.
GIF
GIF is a lossless image format which supports animated images. Learn more.
Glyph
A glyph is a specific visual form of a character, numeral, or abstract symbol.
Graphics state
The graphics state is a set of graphics control parameters, saved and restored using a stack, which affect the currently executing graphics operators. Learn more.
HEIC
HEIC is the HEVC-encoded variant of the High Efficiency Image File Format (HEIF), an image format standardised by MPEG and popularised by Apple. It is an open standard but primarily used on Apple devices. Learn more.
Hinting
Font hinting refers to instructions that adjust the display of a font so that it lines up with a rasterized grid. It is essential in low resolution screens to produce readable text. Learn more.
HTML
Hypertext markup language is the language used to create webpages for display in web browsers. It is often accompanied by JavaScript and CSS.
Incremental updates
Incremental updates refers to the fact that PDF files can be updated without modifying the entire file. Changes are appended to the end of the file leaving the original contents unchanged.
Indirect object
An indirect object is labelled with an object identifier and sits between the keywords obj and endobj.
Integer object
An object containing a positive or negative whole number with no fractional part.
ISO 32000
ISO 32000 is the technical specification document which defines the PDF file format. Learn more.
JavaScript
JavaScript is a programming language that is commonly used on websites. It can also be embedded in PDF files for form validation and interactive elements, although many PDF viewers restrict or disable it for security reasons. Learn more.
JBIG2
JBIG2 is an image compression standard for two color (usually black and white) images. Learn more.
JPEG
Joint Photographic Experts Group is an extremely common lossy image file format. Learn more.
JPEG 2000
JPEG 2000 (also JP2 or JPX) is a file format meant to succeed JPEG which offers better compression and higher quality images. Learn more.
JPEG XL
JPEG XL is the newest version of JPEG and is meant to have better compression and quality than JPEG 2000. Learn more.
Kerning
Kerning refers to the adjustment of the space between individual pairs of glyphs. It is used to create more visually appealing text.
Key
1. A key is a unique identifier in a key-value pair, used in dictionaries.
2. A key is used to encrypt or decrypt a message.
Linearized PDF
A linearized PDF has been organised in such a way that it allows for more efficient page loading when the document is being streamed. Objects are reordered so what is needed first is at the top of the file. Learn more.
Lossless
Using lossless compression means the data can be identically reproduced when decompressed. Learn more.
Lossy
Using lossy compression means the data is only approximately reproduced when decompressed. It is a tradeoff between compressed size, speed, and quality. Lossy compression is most commonly used with audio, video, and images. Learn more.
LZW
Lempel-Ziv-Welch is a lossless compression algorithm, commonly used in GIF images. Learn more.
Metadata
Metadata is data which provides information about other data.
Name object
A name object is a symbol represented by a forward slash followed by a sequence of characters.
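Characters that cannot appear directly in a name, such as a space, are written as a # followed by two hexadecimal digits. The sketch below decodes a name token into plain text; decode_name is a hypothetical helper, and it assumes the token has already been isolated from the surrounding file.

```python
import re

# A '#' followed by exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(r"#([0-9A-Fa-f]{2})")


def decode_name(token: str) -> str:
    """Strip the leading slash and expand #xx escapes in a name token."""
    if not token.startswith("/"):
        raise ValueError(f"not a name object: {token!r}")
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), token[1:])


if __name__ == "__main__":
    print(decode_name("/Type"))
    print(decode_name("/Lime#20Green"))
```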
Name tree
A name tree is similar to a dictionary; however, all of the keys are strings and are ordered.
Null object
The null object has no value and is represented by the keyword null.
Number tree
A number tree is similar to a dictionary; however, all of the keys are integers and are ordered.
Numeric object
A numeric object is either an integer object or a real object.
Object
An object is a basic data structure used to represent information in a PDF file. An object can be an array, boolean, dictionary, integer, name, null, real, stream, or string. They are written using COS syntax. Learn more.
Object number
An object number is an integer greater than zero which is uniquely assigned to each object within a PDF file. They can be in any arbitrary order but there must not be any duplicates.
Object identifier
An object identifier (also known as an object reference) consists of an object number and a generation number followed by either an R or obj.
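For example, 12 0 R refers to object 12, generation 0, while 12 0 obj begins the definition of that same object. A minimal sketch of splitting such an identifier into its parts is shown below; parse_object_identifier is a hypothetical helper and not part of any particular library.

```python
import re

# Object number, generation number, then the keyword R or obj.
_IDENTIFIER = re.compile(r"(\d+)\s+(\d+)\s+(R|obj)")


def parse_object_identifier(text: str) -> tuple[int, int, str]:
    """Return (object number, generation number, keyword)."""
    match = _IDENTIFIER.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an object identifier: {text!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


if __name__ == "__main__":
    print(parse_object_identifier("12 0 R"))
    print(parse_object_identifier("12 0 obj"))
```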
OCR
Optical character recognition is the process of converting hand-written or printed text into machine readable text.
Operator
Operators are used within content streams in PDF files and are instructions on how to render content. Many of them are modelled on PostScript operators.
OpenType
OpenType fonts were designed by Microsoft and Adobe and are derived from TrueType fonts.
Outline
The outline of a PDF document describes the structure of its pages and sections, similarly to a contents page, and can be used to navigate.
PDF
Portable Document Format is a file format designed to display documents consistently no matter the device. Learn more.
PDF Association
The PDF Association is the open trade body which supports and develops the PDF file format. Any interested company or individual can join and participate.
PDF Processor
A PDF processor is a piece of software that can read or write PDF files while conforming to the PDF specification.
PDF version
Different versions of the PDF specification are available, with newer ones being more refined and containing the latest features. Download the latest specification here.
PDF/A
PDF/A is a stripped down version of the PDF specification designed for long term document preservation and compatibility on the maximal number of devices.
PDF/E
PDF/E is a format designed for engineering use as it supports embedding 3D models.
PDF/R
PDF/R is a format designed for storing multi-page raster images.
PDF/UA
PDF/UA is a format designed to work with accessibility technology.
PDF/VCR
PDF/VCR enables variable content replacement for variable data printing.
PDF/VT
PDF/VT is an extension of PDF/X and supports variable data printing.
PDF/X
PDF/X is a format commonly used by graphics designers and print professionals.
PNG
Portable Network Graphics is a lossless image format commonly used on the internet. Learn more.
PostScript
PostScript is a page description language used in electronic documents. PDF is based on a simplified version.
Preflight
Preflight refers to the scanning of a PDF document to ensure it conforms to a number of specified conditions and that it is ready for print production.
Raster
A raster is a matrix of cells which contain color data to represent an image.
Real object
A real object is a floating point number with limited range and precision.
Rectangle
A rectangle is an array object which describes either locations on a page or bounding boxes. It contains four numbers which represent the lower-left and upper-right corners of the rectangle.
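For instance, a US Letter page is commonly described by the rectangle [0 0 612 792], measured in points. The sketch below computes a rectangle's width and height. Because the specification allows any two diagonally opposite corners to be given, it first reorders the values; the function names are hypothetical.

```python
from typing import Sequence


def normalise_rect(rect: Sequence[float]) -> tuple[float, float, float, float]:
    """Reorder the values so they read lower-left x, y then upper-right x, y."""
    x1, y1, x2, y2 = rect
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def rect_size(rect: Sequence[float]) -> tuple[float, float]:
    """Return (width, height) in the same units as the input."""
    llx, lly, urx, ury = normalise_rect(rect)
    return (urx - llx, ury - lly)


if __name__ == "__main__":
    print(rect_size([0, 0, 612, 792]))
    print(normalise_rect([612, 792, 0, 0]))
```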
Redaction
Redaction involves censoring parts of a document so that it may be published without revealing sensitive information.
Resource dictionary
A resource dictionary associates resource names with the objects they refer to, grouped under categories such as /Font.
Running text
Running text is the main text within the document’s body.
SHA
Secure Hash Algorithm is a family of cryptographic hash functions commonly used to protect passwords.
Signature handler
A signature handler is software that implements the creation of digital signatures.
sRGB
sRGB is a standard red, green, and blue color space which is very commonly used.
Stream object
A stream object contains a dictionary, followed by some binary data.
String
A string is a sequence of characters.
Structured text
Structured text contains additional information about how the text is laid out. Learn more.
Tagged PDF
A Tagged PDF file contains information about how its content is structured. Learn more.
TIFF
Tagged Image File Format is a format that can store one or more images. Learn more.
Trailer
The trailer is a dictionary at the end of a PDF file. It contains things like the largest object reference, the document catalog, and the info metadata object.
TrueType Font
TrueType fonts were designed by Apple and Microsoft as a competitor to Adobe’s Type 1 fonts.
Type 1 Font
PostScript Type 1 fonts are the most commonly used font in PDF files as they produce high quality output and you can extract the text values easily. Learn more.
Type 3 Font
PostScript Type 3 fonts have glyphs defined by the full PostScript language; however, they do not support hinting and are rarely used in PDF files. Learn more.
Unicode
Unicode refers to a range of character encodings which all map onto the universal character set. Learn more.
Unstructured text
Unstructured text has no model or structure to its layout; it is simply text.
UTF-8
Unicode Transformation Format (8-bit) is the most commonly used character encoding and is backward compatible with ASCII.
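The relationship with ASCII can be seen by encoding a few characters: the 128 ASCII characters use one byte each in UTF-8, with the same values as in ASCII, while other characters use two to four bytes. A small sketch using Python's built-in string encoding follows; utf8_length is a hypothetical helper name.

```python
def utf8_length(character: str) -> int:
    """Number of bytes needed to store one character in UTF-8."""
    return len(character.encode("utf-8"))


if __name__ == "__main__":
    # An ASCII letter is the same single byte in both encodings.
    print("A".encode("utf-8") == "A".encode("ascii"))
    # Latin small e with acute, the euro sign, and an emoji.
    for character in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(repr(character), utf8_length(character))
```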
Vector
A vector is a quantity with two components, commonly direction and magnitude.
WebP
WebP is an image format created by Google. Learn more.
Whitespace character
Whitespace characters refer to non-printable characters that still have meaning in text. This could be a space, tab, new line, or something else. It is called whitespace because paper is commonly white.
XFA
XML Forms Architecture was introduced in PDF 1.5; however, it was deprecated in PDF 2.0.
XFDF
XFDF is very similar to the FDF file format, except data is represented as XML.
XML
Extensible Markup Language is a file format for storing arbitrary data. Its syntax is similar to HTML.
XMP
Extensible Metadata Platform is an XML-based metadata format which stores information about a file. Learn more.
XObject
XObjects are containers for a sequence of graphics objects. Learn more.
Z-Index
Z-Index refers to the order of overlapping elements. By the usual convention, higher numbered elements appear in front of lower numbered elements.
Sources:
ISO 32000-2:2020-12 PDF 2.0 Specification
PDF Association Glossary of PDF terms