Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Understanding the PDF file format – How Are Images Stored


How are images stored?

When I was learning the PDF file format, I found that images could be quite a complex topic, so I wrote this article to explain them clearly. Please let me know if you have any suggestions to improve it or if it raises any questions for you.

A PDF file usually stores an image as a separate object (an XObject) which contains the raw binary data for the image. Each one is listed in the Resources object for the page or the file and has a name (e.g. Im1). It is wrong to think of images embedded inside a PDF as TIFF, GIF, BMP, JPEG or PNG files. They are not.

It is important to appreciate that this is not usually an image in the sense of a TIFF, JPEG or PNG file. It is the binary data for the pixels, the colorspace used for the image, and other information about the image. The image is ripped apart when the PDF is created, and different PDF creation tools may store the same image in very different ways.
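To make the "ripped apart" idea concrete, here is a minimal Python sketch of the kind of information an image XObject carries alongside its pixel data. The dictionary values are invented for illustration and `expected_sample_bytes` is a hypothetical helper, not part of any library. It assumes one of the three simple device colorspaces and relies on the PDF rule that each row of samples starts on a byte boundary.

```python
# Number of colour components for the simple device colorspaces.
COMPONENTS = {"DeviceGray": 1, "DeviceRGB": 3, "DeviceCMYK": 4}

def expected_sample_bytes(image_dict):
    """Size of the decoded pixel data described by an image dictionary."""
    bits_per_row = (image_dict["Width"]
                    * COMPONENTS[image_dict["ColorSpace"]]
                    * image_dict["BitsPerComponent"])
    # Each row of samples is padded up to a whole number of bytes.
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * image_dict["Height"]

# Illustrative entries only; a real dictionary comes from the PDF file.
example_image = {
    "Type": "XObject",
    "Subtype": "Image",
    "Width": 100,
    "Height": 50,
    "ColorSpace": "DeviceRGB",
    "BitsPerComponent": 8,
    "Filter": "FlateDecode",  # how the pixel data is compressed
}

print(expected_sample_bytes(example_image))
```

Notice that nothing here says "PNG" or "TIFF". The dictionary describes the pixels and the stream holds them, and that is all there is.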

Here is an example shown in the PDF object viewer in Acrobat 9:

[Image: PDF object viewer in Acrobat 9]

Sometimes the raw image data is resampled to the size needed for the page and sometimes it is not, in which case it is scaled up or down when it is drawn. Again, different PDF creation tools create PDF files in very different ways.
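One way to see whether an image has been stored at a different size from the one it is drawn at is to work out its effective resolution. This is a small sketch with a hypothetical helper name, assuming the default PDF unit of 1/72 inch and that you already know the drawn width in points.

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch

def effective_dpi(pixel_width, drawn_width_points):
    """Pixels per inch of the stored image at the size it is drawn on the page."""
    drawn_width_inches = drawn_width_points / POINTS_PER_INCH
    return pixel_width / drawn_width_inches

# A 600 pixel wide image drawn 144 points (2 inches) wide.
print(effective_dpi(600, 144))
```

A very high value suggests the raw data holds far more detail than the page shows, and a very low one means the image is being scaled up when drawn.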

The actual pixel data can be compressed, and one of the compression formats (DCTDecode) is the same as that used in a JPEG file (JPXDecode is the same as JPEG 2000). If you save this data, it can be opened as a JPEG file, but it may need altering to include the colorspace data.
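As a rough illustration of the DCTDecode case, the sketch below checks that a stream's bytes start with the JPEG Start Of Image marker before writing them out unchanged. The function names are hypothetical, and it assumes DCTDecode is the only filter on the stream and that you have already pulled the stream bytes out of the PDF yourself.

```python
JPEG_SOI = b"\xff\xd8"  # JPEG Start Of Image marker

def looks_like_jpeg(stream_bytes):
    """True if the data begins with the JPEG Start Of Image marker."""
    return stream_bytes[:2] == JPEG_SOI

def save_dct_stream(stream_bytes, path):
    """Write a DCTDecode stream to disk as-is, without re-compressing it."""
    if not looks_like_jpeg(stream_bytes):
        raise ValueError("stream does not start with a JPEG SOI marker")
    with open(path, "wb") as out:
        out.write(stream_bytes)
```

The resulting file may still display with the wrong colours (CMYK data is a common culprit), because the colorspace information lives in the PDF image dictionary and not necessarily in the JPEG data.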

This image is then drawn in the PDF content stream by the Do operator and the image name (e.g. /Im1 Do). The image can be used multiple times and scaled, rotated or clipped; it takes whatever values are set in the graphics state when the Do operator is executed. Some things which appear as an image to the eye may also be made up of multiple images, or not even be images at all!
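Here is a small sketch of how the drawn size and rotation fall out of the transformation matrix in force when the image is drawn. A PDF image always fills a 1 x 1 unit square, and the `cm` operator sets the matrix that stretches that square onto the page. The content stream fragment is invented for illustration, `placement_from_cm` is a hypothetical helper, and splitting on whitespace is a shortcut: a real content stream needs a proper tokenizer and the matrices from nested `cm` operators have to be multiplied together.

```python
import math

def placement_from_cm(a, b, c, d, e, f):
    """Drawn width, height, rotation (degrees) and origin for a unit-square image."""
    width = math.hypot(a, b)    # length of the transformed x axis
    height = math.hypot(c, d)   # length of the transformed y axis
    rotation = math.degrees(math.atan2(b, a))
    return width, height, rotation, (e, f)

# Illustrative fragment: save state, set matrix, draw Im1, restore state.
content = "q 200 0 0 100 50 600 cm /Im1 Do Q"
tokens = content.split()
cm_index = tokens.index("cm")
matrix = [float(t) for t in tokens[cm_index - 6:cm_index]]

print(placement_from_cm(*matrix))
```

The same Im1 could be drawn again elsewhere in the stream with a different matrix, which is how one stored image ends up at several sizes on the page.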

All this means that if you want to extract images from a PDF, you need to assemble the image from all the raw data – it is not stored as a complete image file you can just rip out.
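As a minimal sketch of what "assembling" can mean in the simplest case, the code below takes a FlateDecode stream plus the width and height from the image dictionary and wraps the decoded samples in a PPM header so an image viewer can open them. The function name is hypothetical, and it assumes 8-bit DeviceRGB data with no predictor, no Decode array and no soft mask; anything else needs more work.

```python
import zlib

def flate_rgb_to_ppm(stream_bytes, width, height):
    """Turn FlateDecode-compressed 8-bit RGB samples into binary PPM (P6) bytes."""
    samples = zlib.decompress(stream_bytes)
    expected = width * height * 3  # 3 bytes per pixel for 8-bit RGB
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes of samples, got {len(samples)}")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + samples

# Stand-in for stream data taken from a PDF: one red and one green pixel.
pixels = bytes([255, 0, 0, 0, 255, 0])
ppm = flate_rgb_to_ppm(zlib.compress(pixels), 2, 1)
print(len(ppm))
```

The width, height and colorspace all come from the dictionary and not from the stream, which is why the stream alone cannot be opened as an image.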

There is also a ‘raw’ version of the image (sometimes much higher quality, sometimes exactly the same size) and a clipped/scaled version of the image as it appears on the page. Both can be extracted, and you can also scale the clip up onto the raw image to produce a higher quality result (see this PDF Clipped Image Extraction example).

As with everything PDF, there is a lot of flexibility and lots of alternatives and options…

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.


15 Replies to “Understanding the PDF file format – How Are Images…”

  1. Extraction would be cool but first of all it seems impossible to be able to tell if an image actually EXISTS within a pdf. Using Apache Tika, I am able to extract pdf content but only the text gets extracted as content. Do you know a way in which to determine if an image exists in a PDF ?

    Thanks.

    1. You would need to scan all the Resources objects (and any Resources objects on XForms) to see if they contain any image objects, and also scan all the content streams for inline images.

  2. Thank you so much! Up until I read this post, I thought the images inside a PDF were either PNG or JPEG and was so confused when I couldn’t find PNG or JPEG signatures inside the PDFs.

  3. So is there a way to extract images from PDFs without decompressing and re-compressing them?

  4. This was very helpful. I also didn’t know about the images being stored as a separate object and have been working with PDFs (editing, creating, printing) for many years. I’m assuming so, but does that also apply if you have a group of TIFF and JPEG images and you combine them into a PDF? A common occurrence in our world is our customers requesting printing of PDFs and then we have to break out the pricing between black & white vs. color images. Is there another way to parse those out, other than sitting down and scrolling through all the PDFs? We are using Adobe Acrobat Pro X. Thank you for this information; I am passing it along to my team.

    1. The image data is stored separately from the colour data. You can compress the image data using CCITT (TIFF) and DCT/JPX (JPEG/JPEG 2000) formats, but this is not the same as them being actual images. In some cases (e.g. a DeviceRGB colorspace) they can be equivalent to the final image, but they cannot really be treated as self-contained images.

  5. Thank you so much! Very helpful to understand how PDF works.
    But I didn’t find anywhere how to use an image inside a PDF as monochrome.
    I have a problem where my PDF has a barcode image and the printers are identifying my PDF as coloured, but it isn’t.
    Do you know how I can convert it to monochrome, or is there a way to force my entire PDF to be monochrome?

    1. It may well be stored as coloured data even if it looks black and white. Some print drivers will allow you to print as monochrome. You could use a tool like iText to edit the PDF data, or a tool like JPedal to extract the image and make it black and white.

  6. I received some pdf files from a public disclosure request.

    They are of two types:

    1) Content Creator / Encoding Software: Xerox Color C70
    2) Content Creator / Encoding Software Adobe Acrobat Pro 9.0.0

    I have converted both to RTF format using Adobe Pro DC on a 2019 iMac.

    However, I believe #2 converted better. AND, in the #2 converted RTF
    files I am finding text not found in the original PDF files.

    Could anyone suggest how I might get some help with this?

  7. Hi,
    I’m removing some images from PDFs, but sometimes the whole page comes out as an image. I can understand that if the pages were scanned in as images. However, in many of these PDFs I can copy and paste the text out of the pages. So I was surprised the pages came out as images, not just the “pictures”.

    Can you explain this? That is, I can copy and paste the text, but the pages come out as images.

    1. My guess would be the page is OCRed so you have both the picture and the text drawn on the page. The PDF is whatever the PDF creation tool adds, so pages can be just images or a mix.


IDRsolutions Ltd 2020. All rights reserved.