Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

Extract Shape information from a PDF file (solving common issues with PDF files)


Introduction

Welcome to a new series on our blog! We get lots of inquiries about solving problems with PDF files, so in each post we will explain one problem and give you the generic solution.

If you then want to solve the problem using our products, we will also give you a link to the support pages. So let us know your problem!

Today’s problem

Our first problem is getting the Shape information from a PDF file. You need to know the following to solve this problem.

1. The PDF file format is based on PostScript, so page content is drawn by a stream of operands and commands.

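2. Each page includes a block of code to draw it (the content stream), which contains THREE types of Shape (F, S, B). F is a Fill shape, S is a Stroke shape and B is both Stroke and Fill. In the stream these appear as path-painting commands such as f, S and B, along with variants such as f*, s and B*.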

3. Other PostScript commands set up other values in the GraphicsState (color, stroke, size, clipping), and commands in the stream affect all subsequent commands, so you need to decode the whole stream to get the correct data.

4. Shapes can include lines, rectangles and complex structures.

5. The co-ordinates are PDF co-ordinates (by default the origin is at the bottom left of the page and y increases upward), so they may need converting if you want to use them elsewhere.
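
To make the points above concrete, here is a minimal sketch of decoding a tiny, already-decompressed fragment of a page stream. It is not JPedal code and not a real PDF parser: the names ContentStreamSketch and PaintedShape are made up for illustration, and it assumes whitespace-separated tokens, no transformation matrix (cm), no q/Q state stack and only the commands w, m, l, re, f, S and B. It shows how a value set earlier in the stream (the line width) carries over to later shapes, and how PDF co-ordinates can be flipped to a top-left origin.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: these class names are hypothetical,
// not part of JPedal or any other PDF library.
class ContentStreamSketch {

    /** One painted path, how it was painted, and the line width in force. */
    static final class PaintedShape {
        final String paintType;   // "fill", "stroke" or "fill+stroke"
        final double lineWidth;   // value held in the graphics state
        final Rectangle2D bounds; // bounding box, top-left origin

        PaintedShape(String paintType, double lineWidth, Rectangle2D bounds) {
            this.paintType = paintType;
            this.lineWidth = lineWidth;
            this.bounds = bounds;
        }
    }

    /** Decodes the whole fragment in order, because earlier commands affect later ones. */
    static List<PaintedShape> decode(String stream, double pageHeight) {
        List<PaintedShape> shapes = new ArrayList<>();
        Deque<Double> operands = new ArrayDeque<>(); // operands come before the command
        Path2D.Double path = new Path2D.Double();
        double lineWidth = 1.0; // PDF default line width

        for (String token : stream.trim().split("\\s+")) {
            switch (token) {
                case "w": // set line width in the graphics state
                    lineWidth = operands.removeLast();
                    break;
                case "m": { // start a new subpath
                    double y = operands.removeLast();
                    double x = operands.removeLast();
                    path.moveTo(x, pageHeight - y);
                    break;
                }
                case "l": { // straight line to a point
                    double y = operands.removeLast();
                    double x = operands.removeLast();
                    path.lineTo(x, pageHeight - y);
                    break;
                }
                case "re": { // rectangle: x y width height (positive sizes assumed)
                    double h = operands.removeLast();
                    double w = operands.removeLast();
                    double y = operands.removeLast();
                    double x = operands.removeLast();
                    path.append(new Rectangle2D.Double(x, pageHeight - (y + h), w, h), false);
                    break;
                }
                case "f":
                case "S":
                case "B": { // painting ends the current path
                    String type = token.equals("f") ? "fill"
                            : token.equals("S") ? "stroke" : "fill+stroke";
                    shapes.add(new PaintedShape(type, lineWidth, path.getBounds2D()));
                    path = new Path2D.Double();
                    break;
                }
                default:
                    try {
                        operands.addLast(Double.parseDouble(token));
                    } catch (NumberFormatException e) {
                        operands.clear(); // command this sketch does not handle
                    }
            }
        }
        return shapes;
    }

    public static void main(String[] args) {
        String fragment = "2 w 10 20 100 50 re S 0 0 m 30 40 l B";
        for (PaintedShape s : decode(fragment, 800)) {
            System.out.println(s.paintType + " width=" + s.lineWidth + " bounds=" + s.bounds);
        }
    }
}
```

A real decoder also has to handle curves, clipping, colors, the transformation matrix and the saved/restored graphics state, which is why using a library with a callback is usually less work than writing this yourself.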

Generic solutions

There are two ways to solve this problem, and you will need to parse the file in each case:

1. Write out the shapes as the PDF is decoded (or include a callback so that users can track them). Some PDF libraries offer this feature, or you could hack it into one of the open source PDF libraries out there.

2. Turn the PDF into something where this information is extractable from the converted file (for example HTML, SVG, EPS). An image is not suitable because the shapes will be an integrated part of the rendered image.
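
As a rough illustration of option 2, the sketch below lists the shape elements found in an SVG document using only the XML parser that ships with Java. The class name SvgShapeLister is made up for this example, and it assumes the converter writes shapes as rect, line or path elements with fill and stroke attributes; real converter output may use styles or groups instead, so treat it as a starting point.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative sketch only: SvgShapeLister is a hypothetical helper name.
class SvgShapeLister {

    /** Returns one line per rect, line or path element, in document order. */
    static List<String> listShapes(String svg) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // SVG elements live in the SVG namespace
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(svg.getBytes(StandardCharsets.UTF_8)));

            List<String> found = new ArrayList<>();
            NodeList all = doc.getElementsByTagNameNS("*", "*"); // every element
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                String name = e.getLocalName();
                if ("rect".equals(name) || "line".equals(name) || "path".equals(name)) {
                    // getAttribute returns an empty string when the attribute is absent
                    found.add(name + " fill=" + e.getAttribute("fill")
                            + " stroke=" + e.getAttribute("stroke"));
                }
            }
            return found;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse SVG", ex);
        }
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns='http://www.w3.org/2000/svg'>"
                + "<rect x='10' y='730' width='100' height='50' fill='none' stroke='black'/>"
                + "<g><path d='M0 800 L30 760' fill='red' stroke='black'/></g>"
                + "</svg>";
        for (String shape : listShapes(svg)) {
            System.out.println(shape);
        }
    }
}
```

The fill and stroke attributes map back onto the F, S and B shape types, but you only see what the converter chose to write out.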

I personally would use option 1.

How we would solve it with our software

Here is my solution:

1. Download the JPedal trial jar.
2. JPedal has several custom interfaces so that users can add callbacks into their code.
3. The ShapeTracker interface looks like an exact match for this problem.
4. There is a commented-out section in our ConvertPagesToImages example. If you copy this code into your IDE, you can try it and adapt it to your exact needs. [link]

/**
 * code to track shapes
 */
org.jpedal.external.ShapeTracker myShapeTracker = new TestShapeTracker();
decode_pdf.addExternalHandler(myShapeTracker, org.jpedal.external.Options.ShapeTracker);

5. Remember we have a support forum if you need any help or further details.


IDRsolutions develop a Java PDF Viewer and SDK, an Adobe forms to HTML5 forms converter, a PDF to HTML5 converter and a Java ImageIO replacement. On the blog our team post anything interesting they learn about.
