Back to Basics: The 2017 Guide to PDF Files – Extracting images from PDF files

Did you know that you can not only convert PDF files into images (as I explained last time), but also extract the actual images used on the page? In this post, I am going to tell you more…

How are images stored in PDF files?

First, you need to understand how images are stored in a PDF file. A PDF contains a raw image (which may be much better quality than the scaled version displayed), a transformation (which can scale, shear, rotate or stretch the image), and a clip (which may remove parts of the image).
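
To make the relationship between the raw image, the transformation and the clip more concrete, here is a minimal sketch using only the standard java.awt.geom classes. It is not JPedal code: the class name, the method names, the example matrix and the pixel sizes are all assumptions made up for illustration. It relies on the fact that a PDF paints an image into a 1 x 1 unit square and lets the transformation matrix place it on the page, with 72 points to the inch.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Illustrative names only; none of this is JPedal API.
class ImagePlacementSketch {

    /** Size in points of the image as drawn (the unit square after the transform). */
    static double[] displayedSize(AffineTransform matrix) {
        // Length of the transformed (1,0) and (0,1) edges, so rotation is handled too.
        double width = Math.hypot(matrix.getScaleX(), matrix.getShearY());
        double height = Math.hypot(matrix.getShearX(), matrix.getScaleY());
        return new double[] {width, height};
    }

    /** Raw pixels per inch along one side, given 72 points per inch. */
    static double effectiveDpi(int rawPixels, double displayedPoints) {
        return rawPixels * 72.0 / displayedPoints;
    }

    /** Part of the drawn image's bounding box that survives a rectangular clip. */
    static Rectangle2D visibleBounds(AffineTransform matrix, Rectangle2D clip) {
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        Rectangle2D drawn = matrix.createTransformedShape(unitSquare).getBounds2D();
        return drawn.createIntersection(clip);
    }

    public static void main(String[] args) {
        // Assumed example: a 1200 x 800 pixel raw image drawn 144 x 96 points in size.
        AffineTransform matrix = new AffineTransform(144.0, 0.0, 0.0, 96.0, 72.0, 500.0);
        double[] size = displayedSize(matrix);
        System.out.println("Displayed size (points): " + size[0] + " x " + size[1]);
        System.out.println("Effective DPI across: " + effectiveDpi(1200, size[0]));

        // Assumed clip rectangle that cuts off the left and right of the image.
        Rectangle2D clip = new Rectangle2D.Double(100, 500, 50, 200);
        System.out.println("Visible after clip: " + visibleBounds(matrix, clip));
    }
}
```

A raw image with far more pixels than the displayed size needs is exactly the case where extracting the raw image gives you better quality than converting the page to an image.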

We can make use of all this when we extract the images from the PDF file. More information on this can be found in our previous blog post. A classic use for this is extracting high-quality images of products from existing catalogues for your online store (or all those cute kitten pictures from that PDF you downloaded).

The image data is not stored in a PDF as an image file such as a JPG, PNG or TIFF. Instead, images are stored as XObjects within the file, which contain information about the image. The binary data used for the pixels, the colorspace information and the clipping are all separate and are ‘merged’ together to create the final image when the PDF is displayed. Further details about this can be found in our previous article on how images are stored in PDF files.
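
As a rough illustration of that ‘merging’, the sketch below reads a few entries from a simplified image XObject dictionary and combines them with raw sample bytes to build a BufferedImage. The dictionary keys (Subtype, Width, Height, ColorSpace, BitsPerComponent) are the standard PDF ones, but everything else is an assumption for the example: the class and method names are hypothetical, the dictionary text is hand-written, and the parser only copes with simple "/Key /Name" or "/Key 123" pairs. A real PDF may use arrays, indirect references, compressed streams and other colorspaces, which is the hard work a library does for you.

```java
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative names only; none of this is JPedal API.
class ImageXObjectSketch {

    // Matches "/Key /Name" or "/Key 123" pairs only; arrays, nested dictionaries
    // and indirect references ("12 0 R") are deliberately not handled.
    private static final Pattern ENTRY = Pattern.compile("/(\\w+)\\s+(/\\w+|\\d+)");

    static Map<String, String> parseSimpleDictionary(String dictionary) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(dictionary);
        while (matcher.find()) {
            entries.put(matcher.group(1), matcher.group(2));
        }
        return entries;
    }

    static int componentsFor(String colorSpace) {
        switch (colorSpace) {
            case "/DeviceGray": return 1;
            case "/DeviceRGB": return 3;
            case "/DeviceCMYK": return 4;
            default: throw new IllegalArgumentException("Unsupported colorspace: " + colorSpace);
        }
    }

    /** Number of decoded sample bytes the dictionary implies. */
    static int expectedSampleBytes(Map<String, String> dict) {
        int width = Integer.parseInt(dict.get("Width"));
        int height = Integer.parseInt(dict.get("Height"));
        int bits = Integer.parseInt(dict.get("BitsPerComponent"));
        int components = componentsFor(dict.get("ColorSpace"));
        int bytesPerRow = (width * components * bits + 7) / 8; // each row starts on a byte boundary
        return bytesPerRow * height;
    }

    /** Combines the dictionary with decoded samples; handles 8-bit DeviceRGB only. */
    static BufferedImage toRgbImage(Map<String, String> dict, byte[] samples) {
        if (!"/DeviceRGB".equals(dict.get("ColorSpace")) || !"8".equals(dict.get("BitsPerComponent"))) {
            throw new IllegalArgumentException("Sketch supports 8-bit DeviceRGB only");
        }
        if (samples.length != expectedSampleBytes(dict)) {
            throw new IllegalArgumentException("Sample data does not match the dictionary");
        }
        int width = Integer.parseInt(dict.get("Width"));
        int height = Integer.parseInt(dict.get("Height"));
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int i = (y * width + x) * 3;
                int rgb = ((samples[i] & 0xFF) << 16) | ((samples[i + 1] & 0xFF) << 8) | (samples[i + 2] & 0xFF);
                image.setRGB(x, y, rgb);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        // Hand-written example dictionary for a 2 x 1 pixel image.
        String dictionary = "<< /Type /XObject /Subtype /Image /Width 2 /Height 1 "
                + "/ColorSpace /DeviceRGB /BitsPerComponent 8 /Length 6 >>";
        Map<String, String> dict = parseSimpleDictionary(dictionary);
        byte[] samples = {(byte) 255, 0, 0, 0, 0, (byte) 255}; // one red pixel, one blue pixel
        BufferedImage image = toRgbImage(dict, samples);
        System.out.println("Entries: " + dict);
        System.out.println("Image size: " + image.getWidth() + " x " + image.getHeight());
    }
}
```

The point of the sketch is that the pixel bytes mean nothing on their own: without the width, height and colorspace from the dictionary, there is no way to turn them back into a picture.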

Find out more about image extraction from the PDF

In our JPedal library, we have already done all the hard work of making it possible to extract images from PDFs. We also have lots of example code and documentation on image extraction to get you started. And if you just want to extract the clipped image, we have an option for that too.

Next time we will take a look at some further reading you can do on PDF & Java.

If you’re a first-time reader, or simply want to be notified when we post new articles and updates, you can keep up to date via social media (Twitter, Facebook and Google+) or the Blog RSS.

Bethan Palmer

Developer at IDR Solutions
Bethan is a Java developer at IDR Solutions and was a speaker at JavaOne 2016. She has a degree in English Literature and in her spare time enjoys sports including running and handball.