I have been looking at an issue for a potential client recently which required the generation of different views of the page. This is interesting because it allows me to show you the internal workings of the PDF file format rather elegantly. It seems to be an increasingly common activity from our clients these days as they build web applications to display PDFs and need to separate out text and images.
What is in a PDF
A PDF can contain bitmapped images, Vector graphics and text (which can be Vector or bitmapped depending on the font used). Sometimes, you may be surprised at what you find. While a PDF may look like it contains text, the lettering may actually be part of the image (as in a scan) or shapes (where the text was converted to paths). Here is a rather nice PDF page showing what is going on…
Here is the complete page
which consists of images
text and vector graphics
and just the text
(the white text is invisible on a default white background)
The white text in particular illustrates how dependent on each other the layers are – we could generate it as a transparent image and add a coloured background if we wanted to highlight the text layer on its own.
How it works
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years worth of PDF knowledge and tips, so click here to visit our series index!
Latest posts by Mark Stephens (see all)
- Introducing the new XFA Parser in FormVu - May 16, 2018
- Moving to JPedal release 8 - May 2, 2018
- Which version of Java SE should I use? - April 25, 2018
- How we are improving our code quality with IDEA in 2018 - March 7, 2018
- How we are improving our code quality with NetBeans in 2018 - March 1, 2018