Understanding the PDF file format - Text, shapes and images


I have recently been looking at an issue for a potential client that required generating different views of a PDF page. It is an interesting problem because it lets me show you the internal workings of the PDF file format rather elegantly. It also seems to be an increasingly common requirement from our clients, who are building web applications to display PDFs and need to separate out the text and images.

What is in a PDF?

A PDF can contain bitmapped images, vector graphics and text (which can be vector or bitmapped depending on the font used). They are all drawn onto the same layer, one after another, so later content can hide things already drawn. If text is drawn under other content, it is usually still selectable but invisible.
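To make the "same layer" idea concrete, here is a minimal sketch in plain java.awt (it is not JPedal code, and the class name PaintOrderDemo is made up for illustration). A black rectangle stands in for a run of text, and a white shape painted afterwards covers it, so the same pixel changes from black to white.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical demo class: models a page as a single canvas painted in order.
class PaintOrderDemo {

    // Paints a black "text" stand-in first, then optionally a white shape over it.
    static BufferedImage paintPage(boolean coverWithShape) {
        BufferedImage page = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        try {
            g.setColor(Color.WHITE);        // blank white page
            g.fillRect(0, 0, 100, 100);
            g.setColor(Color.BLACK);        // stand-in for a run of text
            g.fillRect(20, 40, 60, 10);
            if (coverWithShape) {
                g.setColor(Color.WHITE);    // later content drawn over the same area
                g.fillRect(10, 30, 80, 30);
            }
        } finally {
            g.dispose();
        }
        return page;
    }

    public static void main(String[] args) {
        // Same pixel, sampled with and without the covering shape.
        System.out.println(Integer.toHexString(paintPage(false).getRGB(50, 45)));
        System.out.println(Integer.toHexString(paintPage(true).getRGB(50, 45)));
    }
}
```

In a real PDF the covered text would still exist in the file, which is why it can remain selectable even though nothing of it shows on screen.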

Sometimes, you may be surprised at what you find. While a PDF may look like it contains text, the lettering may actually be part of an image (as in a scan) or drawn as shapes (where the text was converted to paths). Here is a rather nice PDF page showing what is going on…

Here is the complete page

which consists of an image

text and vector graphics

and just the text

(the white text is invisible on a default white background)

The white text, in particular, illustrates how dependent on each other the layers are – we could generate it as a transparent image and add a coloured background if we wanted to highlight the text layer on its own.
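The transparent-image idea can be sketched with the standard java.awt classes. This is only an illustration under my own assumptions: the class and method names are hypothetical, and a white rectangle stands in for the white glyphs that a real renderer would draw.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper: shows a white "text layer" against a coloured background.
class TextLayerHighlight {

    // Builds a stand-in text layer: white marks on a fully transparent image.
    static BufferedImage transparentTextLayer(int width, int height) {
        // A new TYPE_INT_ARGB image starts out fully transparent.
        BufferedImage layer = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = layer.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(20, 40, 60, 10);   // rectangle standing in for white glyphs
        } finally {
            g.dispose();
        }
        return layer;
    }

    // Fills a background colour, then draws the transparent layer on top of it.
    static BufferedImage onBackground(BufferedImage layer, Color background) {
        BufferedImage out = new BufferedImage(
                layer.getWidth(), layer.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setColor(background);
            g.fillRect(0, 0, out.getWidth(), out.getHeight());
            g.drawImage(layer, 0, 0, null);   // transparent pixels let the background show
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

Calling onBackground(transparentTextLayer(100, 100), Color.BLUE) gives an image in which the white marks stand out against blue, whereas on a white background they would disappear.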

How did I generate the separations?

Our JPedal software contains lots of examples that convert PDF files to images. These recognise a JVM flag, “org.jpedal.separation”. If this is set to all, then all the layers will also be output as separate images.
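I am assuming here that "JVM flag" means an ordinary Java system property, which would normally be passed on the command line as -Dorg.jpedal.separation=all before the class or jar name. The sketch below is not JPedal code. It only shows, with a hypothetical SeparationFlag class, how any Java program can check such a property through System.getProperty, and the exact comparison with "all" is my own assumption.

```java
// Hypothetical helper, not part of JPedal: checks a system property by name.
class SeparationFlag {

    // Property name taken from the text above.
    static final String KEY = "org.jpedal.separation";

    // True when the property is set to "all" (exact match is an assumption).
    static boolean separationsRequested() {
        return "all".equals(System.getProperty(KEY));
    }

    public static void main(String[] args) {
        // Reports whether the property was supplied, for example with -Dorg.jpedal.separation=all
        System.out.println("Separations requested: " + separationsRequested());
    }
}
```

The same property could also be set from code with System.setProperty before the conversion starts, which may be handier when the examples are launched from an IDE.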

Our software libraries allow you to

Convert PDF files to HTML

Use PDF Forms in a web browser

Convert PDF Documents to an image

Work with PDF Documents in Java

Read and write HEIC and other Image formats in Java
