Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.

Understanding the PDF file format – OCR PDF files


Some PDF files are generated by scanning in pages as images, and these have their own unique quirks. In this article, I hope to explain some of these quirks and the impact they might have.

You can usually tell an OCR PDF file from its appearance: the text on the pages has a ‘jagged’ bitmapped appearance rather than the smooth look you get with text rendered as vector graphics. If in doubt, you can have a look at the PDF Properties for the Producer or Creator (ABBYY FineReader is a common tool for converting scanned pages into PDF files).
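
If you would rather check the Producer and Creator values in code, here is a minimal Java sketch. It is not how a real PDF library works: it simply scans the raw file for literal-string entries such as /Producer (...). It assumes the file is not encrypted, that the Info dictionary is not stored in a compressed object stream, and that the value is a plain literal string without nested brackets. The names PdfInfoSniffer and findInfoEntry are made up for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfInfoSniffer {

    // Looks for an entry like "/Producer (ABBYY FineReader 12)" in the raw PDF text.
    // Assumption: the value is an unencrypted literal string with no nested brackets.
    // Hex strings and UTF-16 encoded values are not decoded, and escape sequences
    // are returned exactly as they appear in the file.
    static Optional<String> findInfoEntry(String rawPdf, String key) {
        Pattern p = Pattern.compile(
                "/" + Pattern.quote(key) + "\\s*\\(((?:\\\\.|[^\\\\()])*)\\)");
        Matcher m = p.matcher(rawPdf);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled.
        String raw = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.ISO_8859_1);
        System.out.println("Producer: " + findInfoEntry(raw, "Producer").orElse("(not found)"));
        System.out.println("Creator:  " + findInfoEntry(raw, "Creator").orElse("(not found)"));
    }
}
```

A "(not found)" result does not prove anything on its own, because many files keep this data compressed. For anything beyond a quick check, a proper PDF library will read these values more reliably.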

When pages are scanned in, the text is extracted using Optical Character Recognition (OCR) software. This is not always 100% accurate. This might be because the page scan is poor quality, the text is at an angle, the font has very similar-looking letters, and so on. To hide this fact, the PDF creator often places the text behind the image. That way the page still looks perfect, and it is only when you start to search that you will see any errors.

Generally, each page is scanned in as a single high-resolution image, which is usually embedded as a large black and white or grayscale image.

This has two big implications for you as users of PDF files.

First of all, the files are bigger because they contain both the text (or an OCR tool's best guess) and a high-resolution image. Sometimes this page image will itself contain real images (such as page logos).
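
To get a feel for the numbers, here is a rough Java calculation of how big a single scanned page image is before compression. The A4 page size and the 300 dpi scan resolution are assumptions chosen for illustration, and ScanSizeEstimate and rawBytes are made-up names.

```java
class ScanSizeEstimate {

    // Uncompressed size in bytes of one scanned page image.
    static long rawBytes(double widthInches, double heightInches, int dpi, int bitsPerPixel) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        // Each row of image samples starts on a byte boundary, so round the row up.
        long bytesPerRow = (widthPx * bitsPerPixel + 7) / 8;
        return bytesPerRow * heightPx;
    }

    public static void main(String[] args) {
        // A4 is 210 x 297 mm, which is roughly 8.27 x 11.69 inches.
        double w = 8.27, h = 11.69;
        System.out.println("1-bit black and white: " + rawBytes(w, h, 300, 1) + " bytes");
        System.out.println("8-bit grayscale:       " + rawBytes(w, h, 300, 8) + " bytes");
    }
}
```

Under those assumptions the page is 2481 x 3507 pixels, which works out at about 1.1 MB in black and white and about 8.7 MB in grayscale. Compression inside the PDF reduces this considerably, but it is still far more data per page than the same text stored as vector graphics.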

Secondly, just because the file looks like a perfect representation of the page, it does not mean that the hidden text is actually correct or that it can be searched reliably.
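
As an illustration of what this means for search, the Java sketch below shows an exact search failing on a typical OCR slip ("rn" read as "m") and a more forgiving search, based on edit distance, still finding the word. This is one possible workaround rather than anything built into PDF, and the sample text, the class name OcrTolerantSearch and the threshold of 2 edits are all assumptions for the example.

```java
import java.util.Locale;

class OcrTolerantSearch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if any word in the OCR text is within maxEdits of the query word.
    static boolean containsWord(String ocrText, String query, int maxEdits) {
        String q = query.toLowerCase(Locale.ROOT);
        for (String word : ocrText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (editDistance(word, q) <= maxEdits) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical OCR output where "modern" was misread as "modem".
        String ocrText = "A history of the modem world";
        System.out.println(ocrText.contains("modern"));         // false
        System.out.println(containsWord(ocrText, "modern", 2)); // true
    }
}
```

The trade-off is that a looser match also returns more false hits, so the threshold needs tuning for the documents you are working with.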

Sometimes, the original book copy is the only copy available so this is the only way to get hold of the content. Google currently has a big project to scan in lots of old books – many created before computers even existed.

So PDF files created with OCR are okay (and often the only thing available), but not as useful as a ‘proper’ PDF file version if you can get it.

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!




IDRsolutions Ltd 2020. All rights reserved.