Size does matter

Recently I have been looking into an issue in our PDF text extraction. A case was found where text extraction would appear to freeze. This was obviously a major concern, so we began investigating. Once we got out our magnifying glasses and began looking into the issue, it soon became apparent that the extraction was not freezing or stuck in a loop. Instead, the extraction was taking a full 5 minutes to finish a single page. It appears that in our extraction, size does matter.

Now I should explain. The size of the file was not the problem, and neither was the size of the page. The size of the text was the problem in this case: it was too small and the extraction did not handle it well. In order to extract the text correctly we need to take the separate pieces of text content and merge them together. As text coordinates can be specified for different areas and lines of text, we need to add new lines between pieces of text that have significantly different y coordinates whilst merging.
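To make the merging step concrete, here is a minimal sketch in Python. It is not our actual extraction code: the `Fragment` class, the `merge_fragments` function and the `y_tolerance` value are all made up for illustration, and it assumes fragments on the same line share the same y coordinate.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical piece of extracted text with its page position."""
    text: str
    x: float
    y: float  # larger y is higher up the page, as in PDF page space


def merge_fragments(fragments, y_tolerance=2.0):
    """Join fragments into one string, breaking the line when y changes."""
    # Read from the top of the page down, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    parts = []
    previous = None
    for fragment in ordered:
        if previous is not None:
            same_line = abs(previous.y - fragment.y) <= y_tolerance
            # A small y difference means the same line, otherwise a new one.
            parts.append(" " if same_line else "\n")
        parts.append(fragment.text)
        previous = fragment
    return "".join(parts)


fragments = [
    Fragment("world", x=50, y=700),
    Fragment("Hello", x=10, y=700),
    Fragment("Next line", x=10, y=686),
]
print(merge_fragments(fragments))
```

This sketch only ever adds a single new line. Preserving a larger vertical gap means deciding how many new lines that gap is worth, which is where the trouble starts.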

The problem is that this PDF file contained a section of text that was nothing but ‘_’ characters. The characters had a height of 0.03 and the gap between this text segment and the next one beneath it was around 490. So our code tried to add enough new lines to the output text to fill this gap. For those not so quick at mental arithmetic, 490 divided by 0.03 gives us approximately 16333 new lines to be added. Now, as the PDF page wasn’t larger than an A4 sheet of paper and the font size wasn’t ridiculously small, I think we can assume that 16333 empty lines were probably added in error.
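The arithmetic can be checked with a couple of lines of Python. The function name `naive_newline_count` is hypothetical and the code is only a sketch of the idea, assuming the number of new lines is the gap divided by the height of the text above it.

```python
def naive_newline_count(gap, text_height):
    """Estimate the empty lines in a gap using only the upper text's height."""
    return int(gap / text_height)


# Values from the problem file: a gap of around 490 and a text height of 0.03.
print(naive_newline_count(490, 0.03))  # prints 16333
```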

So what do we need to make this work? We need something larger, because size does matter. When we are merging text together for extraction we should ignore smaller than average values. The best solution was to always choose the larger option in cases such as this. With this change the extraction merged the lines in a slightly different order, and the number of new lines found in the case mentioned above was reduced to a modest 3, which matched the gap present in the PDF more accurately.
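One possible reading of "always choose the larger option" is sketched below in Python. This is an assumption about the approach rather than our real code: the function `newline_count` is hypothetical, and the gap of 36 and neighbouring text height of 12 are made-up values chosen only to keep the sums simple.

```python
def newline_count(gap, upper_height, lower_height):
    """Estimate the new lines in a gap using the larger of the two text heights."""
    line_height = max(upper_height, lower_height)
    if line_height <= 0:
        # No usable height, so fall back to a single line break.
        return 1
    # Always emit at least one new line between separate lines of text.
    return max(1, int(gap / line_height))


# Made-up example: tiny 0.03 high text sitting above ordinary 12 high text.
print(newline_count(36.0, 0.03, 12.0))  # prints 3
```

Taking the larger height means one unusually small piece of text can no longer inflate the count on its own. If both neighbours are tiny the count will still be large, so a real implementation may also want an upper limit.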

And to anyone who thinks I only wrote this article because of the innuendo possibilities, shame on you.

This post is part of our “Fonts Articles Index”. In these articles we explore fonts.

Kieran France is a programmer for IDRSolutions. He enjoys tinkering with most things including gadgets, code and electronics. He often has no idea what to write in his blog posts but tries his hardest to make them interesting and entertaining. He also makes no excuses for his odd sense of humor.
