Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Understanding the PDF file format – How Are Images Stored


How are images stored?

When I was learning the PDF file format, I found images to be quite a complex topic, so I wrote this article to explain them clearly. Please do let me know if you have any suggestions to improve it or if it raises any questions for you.

A PDF file usually stores an image as a separate object (an XObject) which contains the raw binary data for the image. Image XObjects are listed in the Resources object that the page uses, and each has a name (e.g. Im1). It is wrong to think of images embedded inside a PDF as Tif, Gif, Bmp, Jpeg or Png files. They are not.

It is important to appreciate that this is not usually an image in the sense of a Tif, Jpg or Png file. It is the binary data for the pixels, the colorspace used for the image, and other information about the image such as its width, height and bits per component. The image is ripped apart when the PDF is created, and different PDF creation tools may store the same image in very different ways.
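To make the idea of "pixel data plus information about the image" concrete, here is a minimal sketch that reads the descriptive entries from an image XObject. The sample object is written by hand for illustration (the object number, sizes and length are made up), and `read_image_dictionary` is a hypothetical helper, not part of any PDF library. It only copes with flat `/Key /Name` and `/Key 123` pairs, whereas real dictionaries can also hold arrays, indirect references and nested dictionaries.

```python
import re

# Hypothetical image XObject, hand-written for illustration only.
SAMPLE_OBJECT = b"""12 0 obj
<< /Type /XObject /Subtype /Image /Width 600 /Height 400
   /ColorSpace /DeviceRGB /BitsPerComponent 8
   /Filter /DCTDecode /Length 48213 >>
stream
...binary data...
endstream
endobj"""


def read_image_dictionary(obj: bytes) -> dict:
    """Collect simple name and integer entries from a flat dictionary."""
    # Everything before the first 'stream' keyword is the dictionary part.
    header = obj.split(b"stream", 1)[0].decode("latin-1")
    entries = {}
    for key, value in re.findall(r"/(\w+)\s+(/\w+|\d+)", header):
        # Integers stay integers; names lose their leading slash.
        entries[key] = int(value) if value.isdigit() else value[1:]
    return entries


info = read_image_dictionary(SAMPLE_OBJECT)
print(info["Width"], info["Height"], info["ColorSpace"], info["Filter"])
```

Notice that nothing in the dictionary says "this is a JPEG file" or "this is a PNG file". The Filter entry only describes how the stream bytes were compressed.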

Here is an example shown in the PDF object viewer in Acrobat 9:

[Image: PDF object viewer in Acrobat 9]

Sometimes the raw image data is resampled to the size needed for the page, and sometimes it is not, in which case it is scaled up or down when it is drawn. Different PDF creation tools create PDF files in very different ways.
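One way to see whether an image is being scaled when drawn is to compare its pixel size with the size it occupies on the page. The sketch below assumes the image is painted into a unit square that is mapped onto the page by a transformation matrix `[a b c d e f]`, and that one unit of page space is 1/72 inch (the default). The function name and the sample numbers are illustrative assumptions.

```python
import math


def effective_dpi(pixel_width, pixel_height, matrix):
    """Return (horizontal, vertical) pixels per inch of an image as drawn.

    matrix is the six-number transform (a, b, c, d, e, f) in force when the
    image is painted. The image's unit-wide edge maps to the vector (a, b)
    and its unit-high edge to (c, d), so their lengths are the drawn size.
    """
    a, b, c, d, _e, _f = matrix
    drawn_width_pt = math.hypot(a, b)
    drawn_height_pt = math.hypot(c, d)
    return (pixel_width * 72.0 / drawn_width_pt,
            pixel_height * 72.0 / drawn_height_pt)


# A 600 x 400 pixel image drawn 300 x 200 points in size.
print(effective_dpi(600, 400, (300, 0, 0, 200, 50, 500)))
```

Using vector lengths rather than `a` and `d` directly means the calculation still works when the image is rotated.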

The actual pixel data can be compressed, and one of the compression formats (DCTDecode) is the same as that used in a JPEG (JPXDecode is the same as JPEG 2000). If you save this data, it can be opened as a JPEG file, but it may need altering to include the colorspace data.
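As a rough illustration of how the filter decides what you get back, the sketch below handles two filters only. DCTDecode data is passed through unchanged because it is already a JPEG byte stream, while FlateDecode data is inflated into raw samples. The function names are hypothetical, and the sketch ignores predictors, filter chains and colorspace adjustments, all of which real files can need.

```python
import zlib

JPEG_START_MARKER = b"\xff\xd8"  # every JPEG stream begins with these bytes


def decode_image_stream(data: bytes, filter_name: str):
    """Return (kind, payload) for a single-filter image stream."""
    if filter_name == "DCTDecode":
        if not data.startswith(JPEG_START_MARKER):
            raise ValueError("DCTDecode data does not start like a JPEG")
        return "jpeg", data  # can be written to a .jpg file as-is
    if filter_name == "FlateDecode":
        return "raw", zlib.decompress(data)  # bare samples, no file header
    raise NotImplementedError(filter_name)


def expected_raw_size(width, height, components, bits_per_component):
    """Byte count of uncompressed samples; each row is padded to a whole byte."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height


samples = bytes(range(12))  # stand-in for a 2 x 2 RGB image
kind, payload = decode_image_stream(zlib.compress(samples), "FlateDecode")
print(kind, len(payload) == expected_raw_size(2, 2, 3, 8))
```

Comparing the decoded length with `expected_raw_size` is a cheap sanity check that the width, height and colorspace you read from the dictionary really describe the data.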

This image is then drawn in the PDF content stream by a Do operator preceded by the image name (e.g. /Im1 Do). The image can be used multiple times and scaled, rotated or clipped; it takes whatever values are set when the Do operator is executed. Some things which appear as an image to the eye may also be made up of multiple images or not even images at all!
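The sketch below shows the same image being drawn twice at different sizes. The content stream is a hand-written sample, and `find_image_draws` is a hypothetical helper that is deliberately simplistic: it splits on whitespace, remembers only the most recent `cm` matrix rather than combining matrices, and forgets it on `Q` instead of restoring a saved state. A Do operator can also paint a form rather than an image, so a real tool would look the name up in Resources first.

```python
# Hand-written content stream for illustration: Im1 is drawn twice.
SAMPLE_CONTENT = b"""q
300 0 0 200 50 500 cm
/Im1 Do
Q
q
150 0 0 100 50 350 cm
/Im1 Do
Q"""


def find_image_draws(content: bytes):
    """List (name, matrix) for every Do operator in a simple content stream."""
    tokens = content.decode("latin-1").split()
    draws = []
    matrix = None
    for i, token in enumerate(tokens):
        if token == "cm" and i >= 6:
            # The six numbers before 'cm' are the transformation matrix.
            matrix = tuple(float(t) for t in tokens[i - 6:i])
        elif token == "Do" and i >= 1:
            # The operand before 'Do' is the XObject name.
            draws.append((tokens[i - 1].lstrip("/"), matrix))
        elif token == "Q":
            matrix = None  # simplification: forget the matrix on restore
    return draws


for name, matrix in find_image_draws(SAMPLE_CONTENT):
    print(name, matrix)
```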

All this means that if you want to extract images from a PDF, you need to assemble the image from all the raw data – it is not stored as a complete image file you can just rip out.
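As a small example of what "assembling" means in the simplest case, the sketch below wraps uncompressed 8-bit DeviceRGB samples in a PNG container using only the Python standard library. The function names are hypothetical, and the sketch assumes the samples have already been decoded and are plain RGB. Other colorspaces, bit depths, masks and decode arrays all need extra work.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC of type plus data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def raw_rgb_to_png(samples: bytes, width: int, height: int) -> bytes:
    """Wrap 8-bit RGB samples (as stored in the PDF) in a PNG file."""
    row_bytes = width * 3
    if len(samples) != row_bytes * height:
        raise ValueError("sample data does not match the stated dimensions")
    # PNG expects a filter-type byte (0 = none) in front of every row.
    rows = b"".join(
        b"\x00" + samples[y * row_bytes:(y + 1) * row_bytes]
        for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), then
    # compression, filter and interlace methods all set to 0.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", header)
            + _chunk(b"IDAT", zlib.compress(rows))
            + _chunk(b"IEND", b""))


# A 2 x 2 image: red, green on the top row; blue, white on the bottom row.
pixels = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 255, 255, 255])
png_bytes = raw_rgb_to_png(pixels, 2, 2)
print(len(png_bytes))
```

The width and height here come from the image dictionary, not from the pixel data itself, which is why the data cannot simply be ripped out as a finished file.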

There is also a ‘raw’ version of the image (which is sometimes much higher quality and sometimes exactly the same size) and a clipped/scaled version of the image as it appears on the page. Both can be extracted (and you can also scale the clip up onto the raw to produce a higher quality image; see this PDF Clipped Image Extraction example).

As with everything PDF, there is a lot of flexibility and lots of alternatives and options…

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index!


10 Replies to “Understanding the PDF file format – How Are Images…”

  1. Extraction would be cool, but first of all it seems impossible to tell if an image actually EXISTS within a PDF. Using Apache Tika, I am able to extract PDF content, but only the text gets extracted as content. Do you know a way to determine if an image exists in a PDF?

    Thanks.

    1. You would need to scan all the Resources objects (and any Resources objects on XForms) to see if they contain any image objects, and also scan all the streams for inline images.

  2. Thank you so much! Up until I read this post, I thought the images inside a PDF were either PNG or JPEG and was so confused when I couldn’t find PNG or JPEG signatures inside the PDFs.

  3. So is there a way to extract images from PDFs without decompressing and recompressing them?

  4. This was very helpful. I also didn’t know about the images being stored as a separate object and have been working with PDFs (editing, creating, printing) for many years. I’m assuming so, but does that also apply if you have a group of tif & jpeg images and you combine them into a pdf? A common occurrence in our world is our customers requesting printing of PDFs, and then we have to break out the pricing between black & white vs. color images. Is there another way to parse those out, other than sitting down and scrolling through all the PDFs? We are using Adobe Acrobat Pro X. Thank you for this information; I am passing it along to my team.

    1. The image data is stored separately from the colour data. You can compress the image data using CCITT (Tiff) and DCT/JPX (JPEG/JPEG2000) formats, but this is not the same as them being actual images. In some cases (e.g. a DeviceRGB colorspace) they can be equivalent to the final image, but they cannot really be treated as self-contained images.


IDRsolutions Ltd 2019. All rights reserved.