Java PDF Blog

Extract Shape information from a PDF file (solving common issues with PDF files)

Introduction

Welcome to a new series on our blog! We get lots of inquiries about solving problems with PDF files, so in each post we will explain one problem and give you a generic solution.

If you then want to solve the problem using our products, we will also give you a link to the support pages. So let us know your problem!

Today’s problem

Our first problem is getting the Shape information from a PDF file. You need to know the following to solve this problem.

1. The PDF page description is based on PostScript: each page is drawn by a stream of PostScript-style commands, with the values first and the command after them.

2. It includes a block of code to draw each page which contains THREE basic types of Shape, named after the commands that paint them (f, S, B). f is a Fill shape (F is an older equivalent), S is a Stroke shape and B is both Stroke and Fill. Variants such as f*, B* and b also exist.

3. Other commands set up values in the GraphicsState (color, stroke, size, clipping). Each command in the stream affects all subsequent commands, so you need to decode the whole stream to get the correct data.

4. Shapes can include lines, rectangles and complex structures.

5. Shapes are defined in PDF co-ordinates (by default the origin is the bottom-left corner of the page and y increases upwards), so you may need to convert them before you use them elsewhere.
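
To make points 2 to 5 concrete, here is a minimal sketch in Java of how a decoder might walk a page stream. It is an illustration only, not how any particular library works: the class ContentStreamSketch and its methods are hypothetical names, and it assumes a whitespace-separated stream containing only numbers and the w, m, l, re, h, f, S and B commands. A real decoder also has to handle curves, colors, clipping, transforms (cm), saved state (q/Q), text and malformed input.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical, heavily simplified page-stream walker (illustration only). */
final class ContentStreamSketch {

    /** One painted shape plus the line width in force when it was painted. */
    static final class PaintedShape {
        final String operator;     // "f", "S" or "B"
        final Rectangle2D bounds;  // bounding box in PDF co-ordinates
        final double lineWidth;

        PaintedShape(String operator, Rectangle2D bounds, double lineWidth) {
            this.operator = operator;
            this.bounds = bounds;
            this.lineWidth = lineWidth;
        }
    }

    static List<PaintedShape> parse(String stream) {
        List<PaintedShape> shapes = new ArrayList<>();
        Deque<Double> operands = new ArrayDeque<>();
        Path2D.Double path = new Path2D.Double();
        double lineWidth = 1.0; // PDF default; changed by "w" and kept until changed again

        for (String token : stream.trim().split("\\s+")) {
            // Numbers are operands: they are stored until the next command arrives.
            if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.addLast(Double.parseDouble(token));
                continue;
            }
            switch (token) {
                case "w": // set line width in the graphics state
                    lineWidth = operands.removeLast();
                    break;
                case "m": { // begin a new subpath at x y
                    double y = operands.removeLast();
                    double x = operands.removeLast();
                    path.moveTo(x, y);
                    break;
                }
                case "l": { // straight line to x y
                    double y = operands.removeLast();
                    double x = operands.removeLast();
                    path.lineTo(x, y);
                    break;
                }
                case "re": { // rectangle: x y width height
                    double h = operands.removeLast();
                    double w = operands.removeLast();
                    double y = operands.removeLast();
                    double x = operands.removeLast();
                    path.moveTo(x, y);
                    path.lineTo(x + w, y);
                    path.lineTo(x + w, y + h);
                    path.lineTo(x, y + h);
                    path.closePath();
                    break;
                }
                case "h": // close the current subpath
                    path.closePath();
                    break;
                case "f":
                case "S":
                case "B": // painting commands end the path and yield a shape
                    shapes.add(new PaintedShape(token, path.getBounds2D(), lineWidth));
                    path = new Path2D.Double();
                    break;
                default:
                    break; // everything else is ignored in this sketch
            }
            operands.clear();
        }
        return shapes;
    }

    /** Converts bounds from bottom-left-origin PDF space to a top-left-origin space. */
    static Rectangle2D toTopLeft(Rectangle2D b, double pageHeight) {
        return new Rectangle2D.Double(b.getX(), pageHeight - (b.getY() + b.getHeight()),
                b.getWidth(), b.getHeight());
    }
}
```

Parsing "2 w 10 20 100 50 re S 0 0 m 100 100 l 100 0 l h f" with this sketch gives a stroked rectangle and a filled triangle. Both report a line width of 2 because the w command is still in force for the second shape, which is why the whole stream has to be decoded in order.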

Generic solutions

There are 2 ways to solve this problem and you will need to parse the file in each case:-

1. Write out the shapes as the PDF is decoded (or include a callback so that users can track them). Some PDF libraries offer this feature, or you could hack it into one of the Open Source PDF libraries out there.

2. Convert the PDF into a format from which this information can be extracted (for example HTML, SVG, EPS). An image is not suitable because the shapes become part of the rendered pixels and can no longer be separated out.
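
As a rough idea of what the callback in option 1 could look like, here is a small Java sketch. All of the names in it (ShapeListener, SketchDecoder, ShapeLog) are made up for illustration and do not belong to any real library. It assumes the decoder calls firePainted each time it reaches a fill or stroke command.

```java
import java.awt.Shape;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical callback interface; not part of any real PDF library. */
interface ShapeListener {
    void shapePainted(int pageNumber, String operator, Shape shape);
}

/** Stand-in for a PDF decoder; a real one would call firePainted while decoding. */
final class SketchDecoder {
    private final List<ShapeListener> listeners = new ArrayList<>();

    void addShapeListener(ShapeListener listener) {
        listeners.add(listener);
    }

    /** Notifies every registered listener that a shape has just been painted. */
    void firePainted(int pageNumber, String operator, Shape shape) {
        for (ShapeListener listener : listeners) {
            listener.shapePainted(pageNumber, operator, shape);
        }
    }
}

/** Example listener that records one line of text per shape. */
final class ShapeLog implements ShapeListener {
    final List<String> lines = new ArrayList<>();

    @Override
    public void shapePainted(int pageNumber, String operator, Shape shape) {
        Rectangle2D b = shape.getBounds2D();
        lines.add(String.format(Locale.ROOT, "page %d %s x=%.1f y=%.1f w=%.1f h=%.1f",
                pageNumber, operator, b.getX(), b.getY(), b.getWidth(), b.getHeight()));
    }
}
```

The user code only has to implement the listener and register it before decoding starts. The decoder stays in charge of the parsing, so the listener sees each shape with the correct settings already applied.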

I personally would use option 1.

How we would solve it with our software

Here is my solution:-

1. Download the JPedal trial jar.
2. JPedal has several custom interfaces so that users can add callbacks into their code.
3. The ShapeTracker interface is an exact match for this problem.
4. There is a commented-out section in our ConvertPagesToImages example. If you copy this code into your IDE, you can try it and adapt it to your exact needs. [link]

/**
 * code to track shapes
 */
org.jpedal.external.ShapeTracker myShapeTracker=new TestShapeTracker();
decode_pdf.addExternalHandler(myShapeTracker, org.jpedal.external.Options.ShapeTracker);

5. Remember we have a support page if you need any help or further details.