Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Understanding the PDF file format – Text, shapes and images

1 min read

I have been looking at an issue for a potential client recently which required the generation of different views of the page. This is interesting because it allows me to show you the internal workings of the PDF file format rather elegantly. It seems to be an increasingly common activity from our clients these days as they build web applications to display PDFs and need to separate out text and images.

What is in a PDF

A PDF can contain bitmapped images, Vector graphics and text (which can be Vector or bitmapped depending on the font used). Sometimes, you may be surprised at what you find. While a PDF may look like it contains text, the lettering may actually be part of the image (as in a scan) or shapes (where the text was converted to paths). Here is a rather nice PDF page showing what is going on…

Here is the complete page

which consists of images

text and vector graphics

and just the text

(the white text is invisible on a default white background)

The white text in particular illustrates how dependent on each other the layers are – we could generate it as a transparent image and add a coloured background if we wanted to highlight the text layer on its own.

How it works

Both ConvertPagesToImages and ConvertPagesToHiResImages recognise a new JVM flag, “org.jpedal.separation”. If this is set to all, then all the layers will be printed out as separate images.

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years worth of PDF knowledge and tips, so click here to visit our series index!

IDRsolutions develop a Java PDF Viewer and SDK, an Adobe forms to HTML5 forms converter, a PDF to HTML5 converter and a Java ImageIO replacement. On the blog our team post anything interesting they learn about.

Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

3 reasons Java developers switch to JDeli from ImageIO

ImageIO is build into the JDK and provides basic image support in Java. JDeli is a commercial image library for Java Developers from IDRsolutions....
Mark Stephens
1 min read

Why we wrote our own Java jpeg2000 libraries

JPEG2000 is an important image file format which offers significant benefits over JPEG. For our specific usage it does generate significantly smaller file sizes...
Mark Stephens
52 sec read

How to choose JPG versus JPEG2000 for image files

Since we started to support both JPG and JPG2000 as image file outputs in our software, we have found that this is a very...
Mark Stephens
1 min read

Leave a Reply

Your email address will not be published. Required fields are marked *

IDRsolutions Ltd 2019. All rights reserved.