Because PDF is primarily an output and display format, it contains very little structural information such as paragraph breaks and spaces unless these tags are explicitly added (Adobe calls this MarkedContent). When the tags are present, it is possible to extract an almost perfect copy of the text data in a PDF. Otherwise, the software has to guess these details. This is why it is very hard to extract complex, irregular or multi-column text from a PDF file: the correct definition of what a column is varies with every file. Even spaces and returns have to be guessed.
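To give a feel for the kind of guessing involved, here is a minimal sketch in Java. It is not JPedal code: the `Glyph` class, the `rebuild` method and both threshold values are assumptions made up for illustration. It inserts a space when the horizontal gap between two characters is large relative to the font size, and a line break when the baseline moves.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: guesses spaces and line breaks from glyph positions. */
class SpacingGuesser {

    /** Hypothetical positioned character (not a JPedal type). */
    static final class Glyph {
        final char ch;
        final double x;        // left edge of the glyph
        final double y;        // baseline position
        final double width;    // advance width
        final double fontSize;

        Glyph(char ch, double x, double y, double width, double fontSize) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    // Assumed thresholds; real documents usually need tuning.
    static final double SPACE_GAP_RATIO = 0.25;  // gap wider than 25% of font size: space
    static final double LINE_SHIFT_RATIO = 0.5;  // baseline moved by 50% of font size: new line

    /** Rebuilds text, assuming the glyphs already arrive in reading order. */
    static String rebuild(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double baselineShift = Math.abs(g.y - prev.y);
                double gap = g.x - (prev.x + prev.width);
                if (baselineShift > LINE_SHIFT_RATIO * g.fontSize) {
                    out.append('\n');
                } else if (gap > SPACE_GAP_RATIO * g.fontSize) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('H', 0, 700, 7, 10));
        glyphs.add(new Glyph('i', 7, 700, 3, 10));
        glyphs.add(new Glyph('P', 14, 700, 6, 10)); // 4pt gap after 'i'
        glyphs.add(new Glyph('D', 20, 700, 7, 10));
        glyphs.add(new Glyph('F', 27, 700, 6, 10));
        glyphs.add(new Glyph('o', 0, 688, 6, 10));  // baseline drops by 12pt
        glyphs.add(new Glyph('k', 6, 688, 5, 10));
        System.out.println(rebuild(glyphs));
    }
}
```

Note the assumption that the glyphs arrive in reading order. On a multi-column page that often does not hold, which is exactly where simple rules like these start to produce interleaved text.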
What is available, however, is a lot of information on the text ‘style’, including the font used, its size and even its colour. This information can be very useful for identifying structures on the page (such as page titles, headers and footers) or for making sense of some values in Symbolic fonts.
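As a rough illustration of how style information can reveal structure, the sketch below treats the font size that covers the most characters as body text and flags noticeably larger runs as likely titles. The `TextRun` class, the 1.2 ratio and the sample values are all assumptions for this example, not part of JPedal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: uses font size to guess which runs are headings. */
class HeadingGuesser {

    /** Hypothetical run of text sharing one font (not a JPedal type). */
    static final class TextRun {
        final String text;
        final String fontName;
        final double fontSize;

        TextRun(String text, String fontName, double fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    // Assumed ratio: text at least 20% larger than body text is treated as a heading.
    static final double HEADING_RATIO = 1.2;

    /** Returns the font size that covers the most characters (assumed to be body text). */
    static double bodyFontSize(List<TextRun> runs) {
        Map<Double, Integer> charsPerSize = new HashMap<>();
        for (TextRun r : runs) {
            charsPerSize.merge(r.fontSize, r.text.length(), Integer::sum);
        }
        double best = 0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Returns the text of every run that is noticeably larger than the body text. */
    static List<String> likelyHeadings(List<TextRun> runs) {
        double body = bodyFontSize(runs);
        List<String> headings = new ArrayList<>();
        for (TextRun r : runs) {
            if (r.fontSize >= body * HEADING_RATIO) {
                headings.add(r.text);
            }
        }
        return headings;
    }

    public static void main(String[] args) {
        List<TextRun> runs = new ArrayList<>();
        // Font names and sizes here are made-up sample values.
        runs.add(new TextRun("Data Feature", "Helvetica-Bold", 18));
        runs.add(new TextRun("Selected statistics on agriculture and trade,", "Times-Roman", 10));
        runs.add(new TextRun("natural resources, and rural America", "Times-Roman", 10));
        System.out.println(likelyHeadings(runs));
    }
}
```

The same idea extends to headers and footers, for example by also checking whether a run repeats at the same position on every page, but that is beyond this sketch.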
Most of the time customers want just the text, and building XML trees is a relatively slow process in Java. So our JPedal library extracts just the text by default, because that is fast and is what most people want. The style metadata is still read as part of the process, though (you need to know the font to make sense of the text encoding), so we offer an option to include it in the output as XML.
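The difference between the two modes can be pictured with a small sketch. This is not JPedal's API or its actual tag format: the `TextRun` class, the method names and the `<font face="..." size="...">` layout are hypothetical, chosen only to show plain text next to the same text wrapped in style tags.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: the same runs written as plain text and as styled XML. */
class StyledTextWriter {

    /** Hypothetical run of text sharing one font (not a JPedal type). */
    static final class TextRun {
        final String text;
        final String fontName;
        final double fontSize;

        TextRun(String text, String fontName, double fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /** Escapes the characters that would otherwise break XML markup. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Plain mode: just the text, one run per line. */
    static String toPlainText(List<TextRun> runs) {
        List<String> lines = new ArrayList<>();
        for (TextRun r : runs) {
            lines.add(r.text);
        }
        return String.join("\n", lines);
    }

    /** Styled mode: each run wrapped in a made-up font tag carrying its style. */
    static String toStyledXml(List<TextRun> runs) {
        List<String> lines = new ArrayList<>();
        for (TextRun r : runs) {
            lines.add("<font face=\"" + escapeXml(r.fontName) + "\" size=\"" + r.fontSize + "\">"
                    + escapeXml(r.text) + "</font>");
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        List<TextRun> runs = new ArrayList<>();
        // Font names and sizes here are made-up sample values.
        runs.add(new TextRun("Diet & Health", "Helvetica-Bold", 14));
        runs.add(new TextRun("Recent accolades for ERS researchers", "Times-Roman", 10));
        System.out.println(toPlainText(runs));
        System.out.println(toStyledXml(runs));
    }
}
```

Even in this toy version the styled output needs extra string building and escaping for every run, which hints at why the plain-text path is the cheaper default.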
The best way to see the XML tags available is to run the JPedal text extraction example (in the demo or full version), or to use the text extraction menu option in our example Viewer. You will see there is a surprising amount of useful data in XML mode.
Policymakers strive to make efficient use of taxpayer dollars The transition from welfare to work is proving more difficult
low and serving program goals. AMBER WAVESin rural than in urban areas, especially in remote, sparselypopulated areas where job opportunities are few.
in the design and administration of USDA’s food assistance
programs. Balance must be struck between keeping costs
12 DATA FEATURE
Trends in U.S. Per Capita Consumption of
Dairy Products, 1909 to 2001
Selected statistics on agriculture and trade, diet and health,
natural resources, and rural America
Snapshots of recent events at ERS, highlights of new
publications, and previews of research in the works
Recent accolades for ERS researchers
This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms.