PDF format and style information

Because PDF is primarily an output and display format, it does not contain much structural information such as paragraph breaks and spaces unless these tags are explicitly added (Adobe calls this Marked Content). When the tags are present, it is possible to extract an almost perfect copy of the text data in a PDF. Otherwise, the software needs to guess such details. This is why it is very hard to extract complex, irregular or multi-column text from a PDF file: what counts as a column varies with every file. Even spaces and line breaks have to be guessed.
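To give a feel for what "guessing" a space involves, here is a minimal sketch of one common heuristic: sort the text fragments on a line by position and insert a space wherever the horizontal gap between neighbours is larger than a threshold. The `Fragment` type, the `joinLine` method and the threshold value are all assumptions made for this illustration; this is not JPedal's actual code or API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SpaceGuesser {

    /** A positioned piece of text on one line (hypothetical type for this sketch). */
    static final class Fragment {
        final String text;
        final double x;      // left edge in page units
        final double width;  // horizontal advance in page units

        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the fragments of a single line in left-to-right order, inserting a
     * space wherever the gap between two neighbours exceeds gapThreshold.
     */
    static String joinLine(List<Fragment> fragments, double gapThreshold) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble(f -> f.x));

        StringBuilder line = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : sorted) {
            if (previous != null) {
                double gap = current.x - (previous.x + previous.width);
                if (gap > gapThreshold) {
                    line.append(' ');
                }
            }
            line.append(current.text);
            previous = current;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = new ArrayList<>();
        line.add(new Fragment("format", 131.0, 30.0)); // roughly 3 units after "DF"
        line.add(new Fragment("P", 100.0, 6.0));
        line.add(new Fragment("DF", 106.2, 21.8));     // roughly 0.2 units after "P"
        System.out.println(joinLine(line, 1.5));
    }
}
```

The right threshold depends on the font and its size, which is one reason the result is a guess rather than a certainty. Column detection is the same problem in a harder form: the gaps are larger, but so is the variety of layouts.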

What is available, however, is a lot of information on the text ‘style’, including the font used, its size and even its colour. This information can be very useful for identifying structures on the page (such as page titles or headers and footers) or making sense of some values in symbolic fonts.
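As an illustration of how style information can reveal structure, the sketch below treats the font size covering the most characters as the body size and flags any run that is noticeably larger as a likely title. The `StyledRun` type, the method names and the ratio are assumptions for this example only, not part of any library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeadingFinder {

    /** A run of text together with its style (hypothetical type for this sketch). */
    static final class StyledRun {
        final String text;
        final String fontName;
        final double fontSize;

        StyledRun(String text, String fontName, double fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /** Returns the font size that covers the most characters, taken to be the body size. */
    static double bodySize(List<StyledRun> runs) {
        Map<Double, Integer> charsPerSize = new HashMap<>();
        for (StyledRun run : runs) {
            charsPerSize.merge(run.fontSize, run.text.length(), Integer::sum);
        }
        double best = 0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> entry : charsPerSize.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Returns the text of every run whose size is at least ratio times the body size. */
    static List<String> likelyHeadings(List<StyledRun> runs, double ratio) {
        double body = bodySize(runs);
        List<String> headings = new ArrayList<>();
        for (StyledRun run : runs) {
            if (run.fontSize >= body * ratio) {
                headings.add(run.text);
            }
        }
        return headings;
    }
}
```

The same idea extends to headers and footers, for example by looking for runs with the same style and position that repeat on every page.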

Most of the time customers want just the text (and building XML trees is a relatively slow process in Java), so our JPedal library extracts plain text by default because it is fast. The style metadata is still decoded as part of the process (you need to know the font to make sense of the text encoding), so we offer an option to include this information in the output as XML.
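The trade-off between plain text and styled output can be pictured with a small sketch. The `font` tag, its attributes and the `render` method below are invented for illustration and are not JPedal's actual tag set or API; the point is only that the style data is already known, so including it is a matter of switching it on.

```java
class StyledTextWriter {

    /** Escapes the characters that have special meaning in XML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Returns the text unchanged when includeStyle is false; otherwise wraps it
     * in a hypothetical font tag that carries the font name and size.
     */
    static String render(String text, String fontName, double fontSize, boolean includeStyle) {
        if (!includeStyle) {
            return text;
        }
        return "<font face=\"" + escape(fontName) + "\" size=\"" + fontSize + "\">"
                + escape(text) + "</font>";
    }
}
```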

The best way to see the XML tags available is to run the JPedal text extraction example (the demo or full version) and see how it works or use the text extraction menu option in our example Viewer. You will see there is a surprising amount of useful data in XML mode.

JUNE 2003

3

Policymakers strive to make efficient use of taxpayer dollars The transition from welfare to work is proving more difficult

low and serving program goals. AMBER WAVESin rural than in urban areas, especially in remote, sparselypopulated areas where job opportunities are few.

in the design and administration of USDA’s food assistance

programs. Balance must be struck between keeping costs

12 DATA FEATURE

Trends in U.S. Per Capita Consumption of

Dairy Products, 1909 to 2001

46 INDICATORS

Selected statistics on agriculture and trade, diet and health,

natural resources, and rural America

50 GLEANINGS

Snapshots of recent events at ERS, highlights of new

publications, and previews of research in the works

52 PROFILES

Recent accolades for ERS researchers

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips, so click here to visit our series index!


Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.


Thoughts on “PDF format and style information”

  1. Does your code work if there are multiple columns of text in a PDF document?

    • ExtractTextAsTable will try to guess it. Does the PDF file contain structured content?
