Did you know that you can not only convert PDF files into images (as I explained last time), but also extract the actual images used on the page? In this post, I am going to tell you more…
How are images stored in PDF files?
First, you need to understand how images are stored in a PDF file. A PDF contains a raw image (which may be much better quality than the scaled version displayed), a transformation (which can scale, shear, rotate or stretch the image), and a clip (which may remove parts of the image).
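To make the gap between the raw image and the displayed version concrete, here is a minimal sketch using only the standard Java AWT classes (it does not use JPedal). In a PDF, an image is painted into a 1 x 1 unit square which the transformation matrix then maps onto the page, so the matrix tells you how large the image appears in points (1/72 inch). The class name, method names and the sample matrix and pixel values are all made up for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class ImagePlacementSketch {

    /** Returns {width, height} in points of the unit square after the matrix is applied. */
    public static double[] displayedSize(AffineTransform matrix) {
        Point2D origin = matrix.transform(new Point2D.Double(0, 0), null);
        Point2D right = matrix.transform(new Point2D.Double(1, 0), null);
        Point2D up = matrix.transform(new Point2D.Double(0, 1), null);
        // Measuring the transformed edges also works when the image is rotated or sheared.
        return new double[] { origin.distance(right), origin.distance(up) };
    }

    /** Pixels per inch along one edge, given the raw pixel count and the displayed length in points. */
    public static double effectiveDpi(int rawPixels, double displayedPoints) {
        return rawPixels / (displayedPoints / 72.0);
    }

    public static void main(String[] args) {
        // Hypothetical values: a 1200 x 600 pixel raw image placed with the PDF matrix [144 0 0 72 100 500].
        // AffineTransform takes its six values in the same order as a PDF matrix [a b c d e f].
        AffineTransform matrix = new AffineTransform(144, 0, 0, 72, 100, 500);
        double[] size = displayedSize(matrix);
        System.out.printf("Displayed size: %.1f x %.1f points%n", size[0], size[1]);
        System.out.printf("Effective resolution: %.0f dpi%n", effectiveDpi(1200, size[0]));
    }
}
```

An image that looks small on the page can therefore still carry far more pixels than a page rendering would give you, which is why extracting the raw image is often worth the effort.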
We can make use of all this when we extract the images from the PDF file. Our previous blog post explains this in more detail. A classic use for this is extracting high quality images of products from existing catalogues for your online store (or all those cute kitten pictures from that PDF you downloaded).
The image data is not stored in a PDF as an image file such as a JPG, PNG or TIFF. Instead, images are stored as XObjects within the file, which contain information about the image. The binary data used for the pixels, the colorspace information and the clipping are all separate and are ‘merged’ together to create the final image when the PDF is displayed. Further details about this can be found in our previous article on how images are stored in PDF files.
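As a rough illustration of that ‘merging’ step, the sketch below builds a Java BufferedImage from raw pixel bytes. It assumes the simplest possible case: data that has already been decompressed, 8 bits per component, in a plain RGB colorspace. Real PDFs use many other colorspaces and bit depths, and this is not how JPedal does it internally; the class and method names are invented for the example.

```java
import java.awt.image.BufferedImage;

public class RawSamplesSketch {

    /**
     * Combines raw samples with colorspace knowledge (assumed here: 8-bit RGB,
     * three bytes per pixel, rows stored top to bottom) into a displayable image.
     */
    public static BufferedImage fromDeviceRgb(byte[] samples, int width, int height) {
        if (samples.length != width * height * 3) {
            throw new IllegalArgumentException("Expected " + (width * height * 3) + " bytes of RGB data");
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Java bytes are signed, so mask each one to get the 0-255 component value.
                int r = samples[i++] & 0xFF;
                int g = samples[i++] & 0xFF;
                int b = samples[i++] & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        // A made-up 2 x 1 image: one red pixel followed by one blue pixel.
        byte[] samples = { (byte) 255, 0, 0, 0, 0, (byte) 255 };
        BufferedImage image = fromDeviceRgb(samples, 2, 1);
        System.out.println("Built image: " + image.getWidth() + " x " + image.getHeight());
    }
}
```

The same bytes would produce a completely different picture if they were interpreted as, say, grayscale or CMYK, which is why the colorspace information has to travel with the pixel data.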
Find out more about image extraction from the PDF
In our JPedal library, we have already done all the hard work of making it possible to extract images from PDFs. We also have lots of example code and documentation on image extraction to get you started. And if you just want to extract the clipped image, we have an option for that too.
Next time we will take a look at some further reading you can do on PDF & Java.