Recently I have been looking into an issue in our PDF text extraction. A case was found where text extraction would appear to freeze. This obviously was a major concern so we began investigating. Once we got out our magnifying glasses and began looking into the issue it soon became apparent that the extraction was not freezing or becoming stuck in a loop. It seems the PDF extraction would take a full 5 minutes to finish a single page. It appears that in our extraction, size does matter.
Now I should explain. The size of the file was not the problem, neither was the size of the page, the size of the text was the problem in this case. It was too small and the extraction did not handle it well. In order to extract the text correctly we need to take the text content and merge it together correctly. As text coordinates can be specified for different areas and lines of text we need to add new lines between text that have significantly different y coordinates whilst merging.
The problem is this PDF file contained a section of text that was nothing but ‘_’ characters. The character had a height of 0.03 and the gap between this text segment and the next beneath it was around 490. So our code tried to add in all the new lines to add this gap to the output text. For those not so quick at mental arithmetic that gives us approximately 16333 new lines to be added. Now as the pdf page wasn’t larger than an A4 sheet of paper and the font size wasn’t ridiculously small I think we can assume that 16333 empty lines was probably found in error.
So what do we need to make this work? We need something larger because size does matter. When we are merging text together for extraction we should ignore smaller than average values. The best solution was to always choose the larger option in cases such as this. In this case the extraction merged the lines in a slightly different order which proved to be slightly and the amount of lines found in the case mentioned above was reduced to a modest 3 lines which matched the gap present in the pdf more accurately.
And to anyone who thinks I only wrote this article because of the innuendo possibilities, shame on you.
This post is part of our “Fonts Articles Index” in these articles we explore Fonts.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.