When Adobe created the PDF file format, it designed it as a final output format, not one for editing and reuse. A PDF works like a vector graphics file, not a text document: it contains ‘draw’ commands for images, text and shapes, but no details of structure. There are no styles, no line or paragraph markers, and often not even spaces. The page looks perfect, but the structure is supplied by your brain as it looks at the display; none of it is in the file.
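To make this concrete, here is a minimal Java sketch. It assumes a hand-written, uncompressed content stream; real streams are usually compressed and use font-specific encodings, so this is an illustration and not a working extractor. The class name, method name and sample stream are all hypothetical. It collects the strings drawn by the `Tj` (show text) operator in the order they appear.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamDemo {

    // Hypothetical page content: select a font, move to a position,
    // draw a string, move 40 units right, draw another string.
    static final String SAMPLE =
        "BT /F1 12 Tf 72 720 Td (Hello) Tj 40 0 Td (World) Tj ET";

    // Matches a simple literal string followed by the Tj operator.
    // Assumption: the string has no escaped or nested parentheses.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    // Concatenates every drawn string exactly as stored in the stream.
    static String extractShownText(String contentStream) {
        StringBuilder out = new StringBuilder();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            out.append(m.group(1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractShownText(SAMPLE));
    }
}
```

The two words come back joined together, because the gap between them exists only as a move of 40 units before the second draw command. Any extractor has to guess from the coordinates that a space belongs there.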
It turned out that lots of people wanted to extract text from PDF files and were very disappointed by what they got back. So Adobe extended the specification so that extra metadata describing this structure can be stored in the file and easily retrieved. This is called Marked Content, and the results are very good, but it has to be added to the PDF when it is created. You can find out if it is present by reading my blog post.
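As a rough first check before reaching for a full parser, the sketch below scans the raw bytes of a file for the `/StructTreeRoot` key, which the document catalog of a structured PDF carries. This is only a heuristic under stated assumptions: in files that store the catalog inside a compressed object stream the key will not be visible as plain text, so a negative result proves nothing. The class and method names are hypothetical and this is not how JPedal does it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StructureHint {

    // Heuristic: true if the structure tree key appears uncompressed in the file.
    // Assumption: the catalog is not hidden inside a compressed object stream.
    static boolean looksStructured(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to one char, so binary data survives intact.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/StructTreeRoot");
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksStructured(bytes)
            ? "Structure tree key found"
            : "No structure tree key visible (file may still be structured)");
    }
}
```

Run it with the path to a PDF as the only argument. A proper answer needs a PDF library that can parse the catalog itself.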
Several tools claim to add this information to existing PDF files or to recreate it, with varying degrees of success. But the bottom line is that if you want to extract structured content from a PDF file, the file needs to contain it in the first place.
If you do have structured PDF files and want to extract the text content with our software, we have a tutorial on extracting structured content (if present) using JPedal.