Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Extract raw JPEG images from a PDF file


It is possible to extract some images from a PDF file in their original form. In general, image files do not exist as such inside a PDF: TIFFs and PNGs are pulled apart and their data is stored in separate objects, compressed with one of several schemes (JBIG2, CCITT, Flate, LZW). One of the formats used for image data, however, is DCT. This is actually a JPEG, and if you take the binary data out and save it in a file with a .jpg extension, you can open it. The data includes not just the pixel data but also the JPEG header at the start, so it is a complete file.
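As a quick sanity check on the "complete file" point, a JPEG normally begins with the start-of-image marker (bytes FF D8) and finishes with the end-of-image marker (FF D9). The following is only a minimal sketch: the function name `looks_like_complete_jpeg` is made up for illustration, and it checks the markers only, not that the image actually decodes.

```python
def looks_like_complete_jpeg(data: bytes) -> bool:
    """Rough check that a byte string is a whole JPEG file.

    Hypothetical helper: it only tests for the SOI marker at the start and
    the EOI marker at the end. It does not decode the image.
    """
    soi = b"\xff\xd8"  # start of image
    eoi = b"\xff\xd9"  # end of image
    # A stream copied out of a PDF may carry a trailing end-of-line,
    # so ignore CR/LF bytes after the final marker.
    trimmed = data.rstrip(b"\r\n")
    return trimmed.startswith(soi) and trimmed.endswith(eoi)
```

If this returns False for data taken from a DCTDecode stream, the bytes were probably cut at the wrong place or altered on the way out (for example by copying them as text).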

If you look inside a PDF file, the stored JPEG data appears in the stream of an image XObject. Here is an example.

14 0 obj
<<
/Intent/RelativeColorimetric
/Type/XObject
/ColorSpace/DeviceGray
/Subtype/Image
/Name/X
/Width 2988
/BitsPerComponent 8
/Length 134030
/Height 2286
/Filter/DCTDecode
>>
stream
(binary data)
endstream
endobj

The /Type and /Subtype values show that this is an image. The key entry is the /Filter value: DCTDecode indicates a JPEG. (JPXDecode indicates JPEG 2000 data, which can be extracted in the same way.) The data sits between the stream and endstream keywords. You need to extract the raw bytes for the JPEG file (cutting and pasting text is unlikely to work), and the /Length value tells you how many bytes there are.
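The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a PDF parser: the function name `extract_dct_streams` and the regular expressions are made up for this example, it only recognises a single /DCTDecode filter, and it does not handle encrypted files or objects packed inside object streams. When /Length is an indirect reference rather than a number, it falls back to searching for the endstream keyword.

```python
import re

# One indirect object that carries a stream: "N G obj << dict >> stream<EOL>".
# The (?!endobj) guard stops a match from running across object boundaries.
OBJ_WITH_STREAM = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\s*<<((?:(?!endobj).)*?)>>\s*stream\r?\n",
    re.DOTALL,
)
# A direct /Length value, i.e. a number that is not part of "N G R".
DIRECT_LENGTH = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")
# Assumption: the image uses a single filter, /DCTDecode.
DCT_FILTER = re.compile(rb"/Filter\s*(?:\[\s*)?/DCTDecode\b")


def extract_dct_streams(pdf_bytes: bytes) -> dict:
    """Map object number -> raw stream bytes for each DCTDecode stream found."""
    images = {}
    for match in OBJ_WITH_STREAM.finditer(pdf_bytes):
        obj_number = int(match.group(1))
        dictionary = match.group(3)
        if not DCT_FILTER.search(dictionary):
            continue
        start = match.end()  # first byte after "stream" and its end-of-line
        length = DIRECT_LENGTH.search(dictionary)
        if length:
            end = start + int(length.group(1))
        else:
            # /Length is indirect or missing: fall back to the keyword.
            # The slice may then include the end-of-line before "endstream".
            end = pdf_bytes.find(b"endstream", start)
            if end == -1:
                continue
        images[obj_number] = pdf_bytes[start:end]
    return images
```

To use it, read the PDF in binary mode (`open(path, "rb").read()`), pass the bytes in, and write each returned value to its own .jpg file, also opened in binary mode.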

Lastly, the /ColorSpace is important because it shows the color encoding used in the JPEG. If it is DeviceRGB, the extracted image will look exactly as it does in the PDF display. Not many viewers understand types like DeviceCMYK, so you may need a heavyweight package like Photoshop to see it correctly.
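One way to see what a viewer will be faced with is to read the component count from the JPEG's own frame header and compare it with the /ColorSpace entry. The sketch below is illustrative only: `jpeg_frame_info` and `matches_colorspace` are hypothetical names, and only the three device color spaces are covered (1 component for DeviceGray, 3 for DeviceRGB, 4 for DeviceCMYK). Other color spaces such as ICCBased or Indexed are not handled.

```python
import struct

# Components expected for the simple device color spaces.
COMPONENTS_BY_COLORSPACE = {"DeviceGray": 1, "DeviceRGB": 3, "DeviceCMYK": 4}

# Start-of-frame markers: 0xC0-0xCF except DHT (C4), JPG (C8) and DAC (CC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_frame_info(data: bytes):
    """Return (width, height, components) from the first SOF segment, or None."""
    if not data.startswith(b"\xff\xd8"):
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # not positioned on a marker
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:  # markers with no length
            pos += 2
            continue
        if marker == 0xDA:  # start of scan reached without a frame header
            return None
        (segment_length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 10 > len(data):
                return None
            # Layout after the length: precision (1), height (2), width (2), count (1).
            height, width, components = struct.unpack(">HHB", data[pos + 5:pos + 10])
            return width, height, components
        pos += 2 + segment_length  # the length field counts its own two bytes
    return None


def matches_colorspace(data: bytes, colorspace: str) -> bool:
    """True if the JPEG's component count fits the named device color space."""
    info = jpeg_frame_info(data)
    expected = COMPONENTS_BY_COLORSPACE.get(colorspace)
    return info is not None and expected is not None and info[2] == expected
```

A result of four components is a hint that the extracted file is a CMYK JPEG, which is the case where many viewers display the wrong colors. The width and height returned here can also be compared with the /Width and /Height entries in the dictionary.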

If the image is clipped in the PDF, you may find you can see background details that are not in the PDF display. The image may also be a different size or even upside down, because the page content scales and positions it when it is drawn. But you have extracted the raw image data!

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so visit our series index!


5 Replies to “Extract raw JPEG images from a PDF file”

  1. One caveat: if the PDF has been encrypted then this won’t work. The binary data in the stream that makes up the JPEG will have been encrypted, so the file produced by this procedure won’t be a valid JPEG.

  2. Is it possible to do this with TIFF images as well? (The PDF samples I am looking at have been compressed – FlateDecode)


IDRsolutions Ltd 2020. All rights reserved.