PDF files contain the raw images used on the PDF pages. You can extract and combine this data into a self-contained image. In this post, I am going to give you a brief overview and some useful links to learn how to do this with the JPedal Java PDF library.
How are images stored in PDF files?
First, you need to understand how images are stored in a PDF file. A PDF contains a raw image (which may be much better quality than the scaled version displayed), a transformation (which can scale, shear, rotate or stretch the image), and a clip (which may remove parts of the image).
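To make the point about quality concrete, here is a minimal sketch using only the standard java.awt.geom.AffineTransform class (not the JPedal API). It assumes the usual PDF convention that an image is painted into a 1x1 unit square and placed on the page by a matrix [a b c d e f], with page units of 1/72 inch. The class name ImagePlacement, its helper methods and the sample numbers are all hypothetical and chosen purely for illustration.

```java
import java.awt.geom.AffineTransform;

public class ImagePlacement {

    /** Length, in points, of the unit square's horizontal edge after the transform. */
    static double displayedWidth(AffineTransform ctm) {
        // The vector (1,0) maps to (a, b) = (scaleX, shearY)
        return Math.hypot(ctm.getScaleX(), ctm.getShearY());
    }

    /** Length, in points, of the unit square's vertical edge after the transform. */
    static double displayedHeight(AffineTransform ctm) {
        // The vector (0,1) maps to (c, d) = (shearX, scaleY)
        return Math.hypot(ctm.getShearX(), ctm.getScaleY());
    }

    /** Pixels per inch the raw image provides at the size it is drawn (72 points = 1 inch). */
    static double effectiveDpi(int rawPixels, double displayedPoints) {
        return rawPixels / (displayedPoints / 72.0);
    }

    public static void main(String[] args) {
        // Hypothetical placement: PDF matrix [144 0 0 72 100 500]
        // maps directly onto AffineTransform(a, b, c, d, e, f)
        AffineTransform ctm = new AffineTransform(144, 0, 0, 72, 100, 500);

        // Hypothetical raw image size in pixels
        int rawWidth = 1200;
        int rawHeight = 600;

        System.out.println("Displayed width (pt):  " + displayedWidth(ctm));
        System.out.println("Displayed height (pt): " + displayedHeight(ctm));
        System.out.println("Horizontal DPI: " + effectiveDpi(rawWidth, displayedWidth(ctm)));
        System.out.println("Vertical DPI:   " + effectiveDpi(rawHeight, displayedHeight(ctm)));
    }
}
```

An image drawn two inches wide but stored 1200 pixels across carries far more detail than a 72 or 96 dpi screen rendering of the page shows, which is why extracting the raw image can be worthwhile.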
We can make use of all this when we extract the images from the PDF file. Our previous blog post explains this in more detail. A classic use for this is extracting high quality images of products from existing catalogues for your online store (or all those cute kitten pictures from that PDF you downloaded).
The image data is not stored in PDFs as an image file such as a JPG, PNG or TIFF. Instead, images are stored as XObjects within the file, which contain information about the image. The binary data used for the pixels, the colorspace information and the clipping are all stored separately and ‘merged’ together to create the final image when the PDF is displayed. Further details about this can be found in our previous article on how images are stored in PDF files.
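The ‘merge’ step can be sketched with plain Java 2D. This is not how JPedal does it internally, just an illustration under simplifying assumptions: the pixel data is already decoded to one 8-bit grayscale sample per pixel, and the clip is a simple shape in image coordinates. Real PDFs also involve compression filters, other colorspaces and masks, which are left out here. The class name MergeImageParts and its methods are hypothetical.

```java
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.Shape;
import java.awt.image.BufferedImage;
import java.util.Arrays;

public class MergeImageParts {

    /** Wraps raw 8-bit samples (one byte per pixel, row by row) as a grayscale image. */
    static BufferedImage fromGrayBytes(byte[] samples, int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        img.getRaster().setDataElements(0, 0, width, height, samples);
        return img;
    }

    /** Draws the raw image through a clip; pixels outside the clip stay transparent. */
    static BufferedImage applyClip(BufferedImage raw, Shape clip) {
        BufferedImage out = new BufferedImage(
                raw.getWidth(), raw.getHeight(), BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = out.createGraphics();
        g.setClip(clip);
        g.drawImage(raw, 0, 0, null);
        g.dispose();
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical 4x4 image of uniform mid-grey samples
        byte[] samples = new byte[16];
        Arrays.fill(samples, (byte) 200);

        BufferedImage raw = fromGrayBytes(samples, 4, 4);

        // Keep only the left half of the image
        BufferedImage clipped = applyClip(raw, new Rectangle(0, 0, 2, 4));

        System.out.println("Alpha inside clip:  " + (clipped.getRGB(0, 0) >>> 24));
        System.out.println("Alpha outside clip: " + (clipped.getRGB(3, 0) >>> 24));
    }
}
```

Keeping the raw samples and the clip separate until the last moment is what lets you choose between the full raw image and the clipped version at extraction time.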
Find out more about image extraction from the PDF
In our JPedal library, we have already done all the hard work of making it possible to extract images from PDFs. We also have lots of example code and documentation on image extraction to get you started. And if you just want to extract the clipped image, we have an option for that too.