When you view a PDF file, you see images displayed on the page. In fact, there are ‘several’ versions of each image…
Firstly there is the raw, unclipped version of the image. This may be in an ‘odd colorspace’ – see this previous posting for a good example.
This RAW image may also be much bigger than what you see onscreen. This can be useful if you want to generate the highest quality version of the extracted image, for example when putting content from a catalogue on a website. There is a good example of this on the clipped image tab of the extraction examples page, which links to a documented example. In it we use the high quality raw image (if present) and scale the clip up to the image, rather than scaling the image down to the page.
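To make the idea of scaling the clip up to the image concrete, here is a minimal sketch in plain Java. It is not the code from the documented example: the class and method names are hypothetical, and it assumes the raw image was drawn unrotated into a known rectangle (pageArea) measured in the same top-left-origin pixel space as the clip.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

public class RawClipSketch {

    /**
     * Maps a clip rectangle measured on the page onto the raw image's pixels.
     * Assumption: the raw image fills pageArea with no rotation or shear.
     */
    static Rectangle clipInRawPixels(Rectangle clipOnPage, Rectangle pageArea,
                                     int rawWidth, int rawHeight) {
        // How many raw pixels correspond to one page pixel in each direction
        double sx = rawWidth / (double) pageArea.width;
        double sy = rawHeight / (double) pageArea.height;

        int x = (int) Math.floor((clipOnPage.x - pageArea.x) * sx);
        int y = (int) Math.floor((clipOnPage.y - pageArea.y) * sy);
        int w = (int) Math.ceil(clipOnPage.width * sx);
        int h = (int) Math.ceil(clipOnPage.height * sy);

        // Never ask for pixels outside the raw image
        return new Rectangle(x, y, w, h)
                .intersection(new Rectangle(0, 0, rawWidth, rawHeight));
    }

    /** Crops the raw image at full resolution instead of downscaling it first. */
    static BufferedImage cropRaw(BufferedImage raw, Rectangle clipOnPage, Rectangle pageArea) {
        Rectangle r = clipInRawPixels(clipOnPage, pageArea, raw.getWidth(), raw.getHeight());
        return raw.getSubimage(r.x, r.y, r.width, r.height);
    }
}
```

The extracted crop keeps every pixel the raw image has for that region, which is what you want for reuse on a website. A rotated or sheared image would need the full transformation inverted rather than two simple scale factors.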
The RAW image may also be rotated differently and have a background which is not present in the final PDF. When it is drawn on the page a transformation is applied (which can include scaling, rotation, shearing and clipping). In Java we also convert the images to sRGB.
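The step from raw to final image can be sketched with standard Java 2D calls. This is only an illustration under stated assumptions, not our library's implementation: FinalImageSketch and renderFinal are hypothetical names, the transformation and clip are assumed to be already expressed in page pixels, and the raw image is assumed to be in a colorspace that BufferedImage can represent.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;

public class FinalImageSketch {

    /**
     * Draws the raw image onto a blank page-sized canvas.
     * rawToPage carries the scaling, rotation and shearing; clip hides
     * anything outside the visible area (pass null for no clipping).
     */
    static BufferedImage renderFinal(BufferedImage raw, AffineTransform rawToPage,
                                     Shape clip, int pageWidth, int pageHeight) {
        // TYPE_INT_ARGB uses the default sRGB color model, so drawing into it
        // converts the raw image's colors to sRGB
        BufferedImage page = new BufferedImage(pageWidth, pageHeight, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = page.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            if (clip != null) {
                g.setClip(clip);
            }
            g.drawImage(raw, rawToPage, null);
        } finally {
            g.dispose();
        }
        return page;
    }
}
```

For example, passing AffineTransform.getScaleInstance(0.5, 0.5) with a clip covering only the left half of the page yields a final image at half the raw resolution, with the right half left transparent. The raw pixels that were scaled away or clipped out are exactly the ‘hidden’ data that extraction can still reach.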
The FINAL image is what you see on the PDF page so all of these other versions are ‘hidden’.
When you view the PDF page, you will always see the final version, but if you are doing extraction it can be useful to differentiate between the different versions. Sometimes the raw version is the more useful one, for example when it has a higher resolution than the image shown on the page. What would you use them for?
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!