Because PDF is very much an output and display format, it does not contain much text-formatting information, such as paragraph breaks and spaces, unless optional tags are added (Adobe calls this Marked Content).
With Marked Content, it is possible to extract an almost perfect copy of the text data in a PDF. Otherwise, the software needs to guess such details. This is why it is very hard to extract complex, irregular or multi-column text from a PDF file: the correct definition of what a column is varies with every file. Even spaces and line returns have to be guessed.
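To give a feel for the kind of guessing involved, here is a minimal sketch in Python. It assumes the characters have already been pulled out of the PDF as positioned glyphs. The `Glyph` class, the `rebuild_text` function and the two thresholds are hypothetical names and values chosen for illustration, not part of any particular PDF library.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    # Hypothetical record for one character drawn on the page.
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF y co-ordinates grow upwards
    width: float  # horizontal advance of the glyph


def rebuild_text(glyphs, space_ratio=0.3, line_tolerance=2.0):
    """Guess spaces and line breaks from glyph positions alone."""
    if not glyphs:
        return ""
    # Sort top to bottom, then group glyphs whose baselines nearly match.
    ordered = sorted(glyphs, key=lambda g: -g.y)
    lines = []
    for g in ordered:
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within a line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A gap wider than a fraction of a glyph is treated as a space.
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * max(prev.width, cur.width):
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

The thresholds are assumptions that would need tuning per document, and a real page with several columns would defeat this simple left-to-right ordering, which is exactly the difficulty described above.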
What is available, however, is a lot of information on the text ‘style’, including the font used, its size and even its colour. This information can be very useful for identifying structures on the page (such as page titles, headers and footers) or for making sense of some values in symbolic fonts.
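As a rough illustration of how style information can reveal structure, the sketch below flags text that is noticeably larger than the body text as a likely title or heading. The `TextRun` class, the `find_headings` function and the 1.2 ratio are assumptions made for this example; a real extractor would supply its own data structures.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextRun:
    # Hypothetical record for a run of text sharing one font and size.
    text: str
    font: str
    size: float


def find_headings(runs, ratio=1.2):
    """Return runs whose font size is clearly larger than the body size."""
    if not runs:
        return []
    # Weight each size by character count so the body text dominates.
    weights = Counter()
    for run in runs:
        weights[run.size] += len(run.text)
    body_size = weights.most_common(1)[0][0]
    return [run for run in runs if run.size >= ratio * body_size]
```

Treating the most common size as the body size is only a heuristic. Repeated headers and footers could be picked out in a similar way by combining font details with page position.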
The co-ordinates of every character on the page are also available. So unstructured content can easily be searched for words, but it is not ideal for extracting data in any structured format.
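Searching by co-ordinates might look something like the following sketch, which returns a bounding box for each match of a word. It assumes the glyphs arrive in reading order, one character each, and with no stored spaces. The `Glyph` class and `find_word` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    # Hypothetical record for one character drawn on the page.
    char: str
    x: float
    y: float
    width: float
    height: float


def find_word(glyphs, word):
    """Return one (x1, y1, x2, y2) box per match of word in the glyph stream."""
    if not word:
        return []
    # Spaces are not stored, so the stream is the raw run of characters.
    stream = "".join(g.char for g in glyphs)
    boxes = []
    start = stream.find(word)
    while start != -1:
        hit = glyphs[start:start + len(word)]
        boxes.append((
            min(g.x for g in hit),
            min(g.y for g in hit),
            max(g.x + g.width for g in hit),
            max(g.y + g.height for g in hit),
        ))
        start = stream.find(word, start + 1)
    return boxes
```

This finds where a word sits on the page, which is enough for highlighting search results. It says nothing about which table cell, column or paragraph the word belongs to, which is why structured extraction is so much harder.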