It is possible to extract some images from a PDF file in their raw form. In general, image files do not exist as such inside a PDF: TIFFs and PNGs are taken apart, and their pixel data is stored in separate objects, compressed with one of several filters (JBIG2, CCITT, Flate, LZW). One filter is different, though: DCT. DCT-encoded data is actually a JPEG, and if you take the binary data out and save it in a file with a .jpg extension, you can open it. The data includes not just the pixels but also the JPEG header at the start, so it is a complete file.
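As a quick sanity check on the "complete file" claim, a JPEG file begins with the Start Of Image marker (bytes FF D8) and ends with the End Of Image marker (FF D9). The Python sketch below tests a block of bytes for both; the function name is our own, and it is only a rough check, not a full validation of the JPEG.

```python
SOI = b"\xff\xd8"  # JPEG Start Of Image marker
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def looks_like_complete_jpeg(data: bytes) -> bool:
    """Rough check: data starts with SOI and ends with EOI.

    A trailing end-of-line is ignored because a PDF writer may place one
    between the stream data and the endstream keyword.
    """
    trimmed = data.rstrip(b"\r\n")
    return data.startswith(SOI) and trimmed.endswith(EOI)
```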
How is the JPEG data stored?
If you open a PDF file in a text or hex editor, you will find the JPEG data stored in an image XObject. Here is an example.
14 0 obj
<<
/Intent/RelativeColorimetric
/Type/XObject
/ColorSpace/DeviceGray
/Subtype/Image
/Name/X
/Width 2988
/BitsPerComponent 8
/Length 134030
/Height 2286
/Filter/DCTDecode
>>
stream
(binary data)
endstream
The /Type and /Subtype values together show that this is an image. The key entry is the /Filter value: DCTDecode indicates a JPEG. (JPXDecode indicates JPEG 2000 data, which can be extracted in the same way.) The data sits between the stream and endstream keywords. You need to extract the raw bytes for the JPEG file; copying and pasting the data as text is unlikely to work. The /Length value gives the length of the data in bytes.
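The extraction described above can be sketched in a few lines of Python. This is a deliberately naive scan rather than a real PDF parser: it assumes the file is not encrypted, that DCTDecode is the only filter on the stream, and that the dictionary can be found with a regular expression. The function name and patterns are our own. When /Length is an indirect reference (for example /Length 15 0 R), the sketch falls back to searching for the endstream keyword.

```python
import re

# "N G obj << ... >> stream<EOL>": the dictionary of a stream object.
# The (?!endobj) guard stops a match from running into the next object.
STREAM_RE = re.compile(
    rb"\bobj\s*<<(?P<dict>(?:(?!endobj).)*?)>>\s*stream\r?\n", re.DOTALL
)
# /Filter/DCTDecode or /Filter [/DCTDecode]: DCT is the only filter applied.
DCT_ONLY_RE = re.compile(rb"/Filter\s*(?:/DCTDecode\b|\[\s*/DCTDecode\s*\])")
# A direct /Length value; "/Length 15 0 R" (an indirect reference) is skipped.
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)")


def extract_jpeg_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the raw bytes of every stream whose only filter is DCTDecode."""
    jpegs = []
    for match in STREAM_RE.finditer(pdf_bytes):
        dictionary = match.group("dict")
        if not DCT_ONLY_RE.search(dictionary):
            continue
        start = match.end()  # first byte after "stream" and its end-of-line
        length = LENGTH_RE.search(dictionary)
        if length:
            data = pdf_bytes[start:start + int(length.group(1))]
        else:
            # Fallback: read up to the endstream keyword and drop the
            # end-of-line that may precede it.
            end = pdf_bytes.find(b"endstream", start)
            if end == -1:
                continue
            data = pdf_bytes[start:end].rstrip(b"\r\n")
        jpegs.append(data)
    return jpegs
```

To use it, read the PDF in binary mode (open(path, "rb")), pass the bytes to extract_jpeg_streams, and write each returned item to its own .jpg file, also in binary mode.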
Lastly, the /ColorSpace value is important because it shows the colour space used by the JPEG. If it is DeviceRGB, the extracted image will look just as it does in the PDF display. Few image viewers handle types like DeviceCMYK correctly, so you may need a heavyweight package like Photoshop to see such an image as intended.
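One way to cross-check the /ColorSpace, /Width and /Height entries against the extracted data is to read the JPEG's own frame header (the SOF segment), which records the image size and the number of colour components. The sketch below is a minimal marker walk with names of our own choosing. The mapping from component count to colour space is a rule of thumb, not a guarantee: one component usually means grey, three usually means YCbCr data that decodes to RGB, and four usually means CMYK.

```python
import struct

# Start Of Frame markers (C4, C8 and CC are not frame headers).
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}

# Rule-of-thumb mapping from component count to a PDF colour space name.
LIKELY_COLORSPACE = {1: "DeviceGray", 3: "DeviceRGB", 4: "DeviceCMYK"}


def jpeg_frame_info(data: bytes):
    """Return (width, height, components) from the first SOF segment, or None."""
    if not data.startswith(b"\xff\xd8"):
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # not at a marker: give up rather than guess
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            pos += 2  # standalone marker with no length field
            continue
        (segment_length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 10 > len(data):
                return None
            # Layout: precision (1 byte), height (2), width (2), components (1).
            _precision, height, width, components = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, components
        if marker == 0xDA:
            return None  # scan data started before any frame header
        pos += 2 + segment_length
    return None
```

If the component count does not match what /ColorSpace suggests (looked up in LIKELY_COLORSPACE), treat the extracted file with some suspicion when viewing it outside the PDF.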
If the image is clipped in the PDF, you may see background details that are not visible in the PDF display. The image may also be a different size or even upside down, because the PDF scales and positions the image when it draws the page. But you have extracted the raw image data!
Would you like to learn more about PDF files?
This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 20 years' worth of PDF knowledge and tips, so click here to visit our series index!
Are you working with JPEG Images in Java?
You might like to check out our JDeli image library. It offers many advantages over ImageIO and other free alternatives, including:
- prevent heap related JVM crashes
- support for additional image formats such as HEIC
- reduce output file size
- improve read/write performance
- create smaller files
- control over output
- support threading
- superior image scaling algorithms