Mark Stephens Mark has been working with Java and PDF since 1999 and is a big NetBeans fan. He enjoys speaking at conferences. He has an MA in Medieval History and a passion for reading.


What text format and style information is in a PDF file?

September 3, 2009

Because PDF is very much an output and display format, it does not contain much text formatting information, such as paragraph breaks and spaces, unless optional tags are added (Adobe calls this Marked Content).

With Marked Content, it is possible to extract an almost perfect copy of the text data in a PDF. Without it, the software has to guess such details. This is why it is very hard to extract complex, irregular or multi-column text from a PDF file: the correct definition of what a column is varies with every file. Even spaces and returns have to be guessed.
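To give a feel for what "guessing" spaces and returns means, here is a minimal sketch in Java. It is not how any particular library does it. The Glyph class, the rebuild method and the two gap thresholds are hypothetical names made up for this illustration, and the sketch assumes the glyphs already arrive in reading order, which is exactly what multi-column pages do not guarantee.

```java
import java.util.List;

final class SpacingGuesser {

    /** One positioned glyph. Hypothetical type, not taken from any PDF library. */
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the glyph
        final double y;      // baseline of the glyph
        final double width;  // advance width of the glyph

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from glyphs that are assumed to be in reading order.
     * A baseline jump larger than lineGap is treated as a return;
     * a horizontal gap larger than spaceGap is treated as a space.
     */
    static String rebuild(List<Glyph> glyphs, double spaceGap, double lineGap) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                if (Math.abs(g.y - prev.y) > lineGap) {
                    out.append('\n');
                } else if (g.x - (prev.x + prev.width) > spaceGap) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }
}
```

The thresholds are the weak point: a value that works for one font size or one layout will insert or drop spaces in another, which is why the result is only ever a guess.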

What is available, however, is a lot of information on the text ‘style’, including the font used, its size and even its colour. This information can be very useful for identifying structures on the page (such as page titles or headers and footers) or for making sense of some values in Symbolic fonts.
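As a rough illustration of using style information to find structure, the sketch below treats any run of text set noticeably larger than the most common font size as a likely title. The TextRun class, the method names and the ratio parameter are all assumptions made for this example. Real extraction code would probably also look at font name, colour and position on the page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HeadingFinder {

    /** A run of text with one font size. Hypothetical type for illustration. */
    static final class TextRun {
        final String text;
        final double fontSize;

        TextRun(String text, double fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Returns the font size that covers the most characters (assumed to be body text). */
    static double bodySize(List<TextRun> runs) {
        Map<Double, Integer> charsPerSize = new HashMap<>();
        for (TextRun r : runs) {
            charsPerSize.merge(r.fontSize, r.text.length(), Integer::sum);
        }
        double best = 0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Returns the runs whose size is at least ratio times the body size. */
    static List<String> headings(List<TextRun> runs, double ratio) {
        double body = bodySize(runs);
        List<String> result = new ArrayList<>();
        for (TextRun r : runs) {
            if (r.fontSize >= body * ratio) {
                result.add(r.text);
            }
        }
        return result;
    }
}
```

The same idea run in the other direction (text smaller than the body size, near the top or bottom edge) is one plausible way to spot headers and footers.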

You can also get the co-ordinates of every character on the page. So unstructured content can easily be searched for words, but it is not ideal for extracting data in any structured format.
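Searching by co-ordinates can be sketched as follows: join the characters into one string, search it, and map each match back to the position of its first character. PositionedChar and find are hypothetical names for this example, and the sketch assumes one list entry per character, in the order the characters were extracted.

```java
import java.util.ArrayList;
import java.util.List;

final class WordLocator {

    /** A character and where it sits on the page. Hypothetical type for illustration. */
    static final class PositionedChar {
        final char ch;
        final double x;
        final double y;

        PositionedChar(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the {x, y} position of the first character of every match of word. */
    static List<double[]> find(List<PositionedChar> chars, String word) {
        List<double[]> hits = new ArrayList<>();
        if (word.isEmpty()) {
            return hits;
        }
        // One char per list entry, so a string index is also a list index.
        StringBuilder sb = new StringBuilder();
        for (PositionedChar c : chars) {
            sb.append(c.ch);
        }
        String text = sb.toString();
        int from = text.indexOf(word);
        while (from >= 0) {
            PositionedChar first = chars.get(from);
            hits.add(new double[] {first.x, first.y});
            from = text.indexOf(word, from + 1);
        }
        return hits;
    }
}
```

This finds words wherever they sit on the page, but it says nothing about which column, row or table cell a match belongs to, which is the structured part that is hard to recover.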




3 Replies to “What text format and style information is in a…”

PDFTEXTExtractor says:
November 5, 2010 at 9:35 pm
Does your code if there are multiple columns of text in a pdf document ?
PDFTEXTExtractor says:
November 5, 2010 at 9:35 pm
Sorry, I meant: does your code work if there are multiple columns of text in a PDF document?
Mark Stephens says:
November 6, 2010 at 11:33 am
ExtractTextAsTable will try to guess it. Does the PDF file contain structured content?
