Our first problem is getting the Shape information from a PDF file. You need to know the following to solve this problem.
1. The PDF file format is based on PostScript.
2. Each page includes a block of code to draw it, which contains THREE basic types of Shape (F, S, B). F is a Fill shape, S is a Stroke shape and B is both Stroke and Fill. (The PDF operators also have variants such as f*, s, b and B*.)
3. Other PostScript commands set up other values (color, stroke, size, clipping in the GraphicsState), and commands in the stream affect all subsequent commands, so you need to decode the whole stream to get the correct data.
4. Shapes can include lines, rectangles and complex structures.
5. The co-ordinates are PDF co-ordinates (by default the origin is at the bottom-left of the page and y increases upwards), so you may need to convert them before using them elsewhere.
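As a rough illustration of point 5, here is a minimal sketch of converting PDF co-ordinates into the top-left-origin co-ordinates used by images and screens. The class and method names (PdfCoords, pdfToTopLeft) are made up for this example, and it assumes an unrotated page whose MediaBox starts at (0, 0); a real page may need its crop box and rotation taken into account as well.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper for this example only (not part of any PDF library).
final class PdfCoords {

    /**
     * Builds a transform from PDF user space (origin bottom-left, y up)
     * to a space with the origin at the top-left and y pointing down.
     * Assumes an unrotated page whose MediaBox starts at (0, 0).
     */
    static AffineTransform pdfToTopLeft(double pageHeight, double scale) {
        AffineTransform t = new AffineTransform();
        t.scale(scale, -scale);       // applied second: scale and flip the y axis
        t.translate(0, -pageHeight);  // applied first: move the top edge to y = 0
        return t;
    }

    public static void main(String[] unused) {
        // A4 is roughly 595 x 842 units; US Letter is 612 x 792.
        AffineTransform t = pdfToTopLeft(792, 1.0);

        // A point on the top edge of the page ends up at y = 0.
        Point2D top = t.transform(new Point2D.Double(72, 792), null);
        System.out.println(top);

        // A rectangle keeps its size but is measured from the top edge.
        Rectangle2D box = t.createTransformedShape(
                new Rectangle2D.Double(72, 692, 100, 50)).getBounds2D();
        System.out.println(box);
    }
}
```

Passing a scale of 96.0 / 72.0 would, under the same assumptions, map the 72-units-per-inch PDF space onto 96 dpi pixels.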
There are 2 ways to solve this problem and you will need to parse the file in each case:-
1. Write out the shapes as the PDF is decoded (or include a callback so that users can track). Some PDF libraries offer this feature or you could hack it into one of the Open Source PDF libraries out there.
2. Convert the PDF into a format from which this information can be extracted (for example HTML, SVG, EPS). An image is not suitable because the shapes become an integrated part of the rendered image.
I personally would use option 1.
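To make option 1 more concrete, here is a minimal sketch of a decoder that reports each shape through a callback as it walks the drawing commands. It is not how JPedal or any other library does it: ShapeScanner and ShapeListener are names invented for this example, and it assumes an already-decompressed stream containing only numbers and operators (no strings, arrays, inline images or transforms). It does show why the whole stream has to be decoded: the line width set by an earlier command is part of the state reported with each later shape.

```java
import java.awt.geom.Path2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch for this example only (not a JPedal class).
final class ShapeScanner {

    /** Hypothetical callback; kind is 'F' (fill), 'S' (stroke) or 'B' (both). */
    interface ShapeListener {
        void shape(char kind, Path2D path, double lineWidth);
    }

    static void scan(String content, ShapeListener listener) {
        List<Double> args = new ArrayList<>();          // operands seen since the last operator
        Deque<Double> savedWidths = new ArrayDeque<>(); // graphics state saved by q, restored by Q
        double lineWidth = 1.0;                         // default line width in PDF
        Path2D.Double path = new Path2D.Double();

        for (String tok : content.trim().split("\\s+")) {
            Double number = parseNumber(tok);
            if (number != null) { args.add(number); continue; }

            char kind = 0; // stays 0 unless this operator paints the current path
            switch (tok) {
                case "q": savedWidths.push(lineWidth); break;
                case "Q": if (!savedWidths.isEmpty()) lineWidth = savedWidths.pop(); break;
                case "w": lineWidth = args.get(0); break;
                case "m": path.moveTo(args.get(0), args.get(1)); break;
                case "l": path.lineTo(args.get(0), args.get(1)); break;
                case "c": path.curveTo(args.get(0), args.get(1), args.get(2),
                                       args.get(3), args.get(4), args.get(5)); break;
                case "re": { // rectangle: x y width height
                    double x = args.get(0), y = args.get(1), w = args.get(2), h = args.get(3);
                    path.moveTo(x, y); path.lineTo(x + w, y);
                    path.lineTo(x + w, y + h); path.lineTo(x, y + h);
                    path.closePath();
                    break;
                }
                case "h": path.closePath(); break;
                case "n": path = new Path2D.Double(); break; // end path without painting
                case "f": case "F": case "f*": kind = 'F'; break;
                case "s": path.closePath(); // close, then stroke (falls through)
                case "S": kind = 'S'; break;
                case "b": case "b*": path.closePath(); // close, then fill and stroke (falls through)
                case "B": case "B*": kind = 'B'; break;
                default: break; // colors, clipping and everything else are ignored here
            }
            if (kind != 0) {
                listener.shape(kind, path, lineWidth);
                path = new Path2D.Double(); // painting ends the current path
            }
            args.clear();
        }
    }

    /** Returns the token as a number, or null if it is an operator. */
    private static Double parseNumber(String tok) {
        try { return Double.valueOf(tok); } catch (NumberFormatException e) { return null; }
    }

    public static void main(String[] unused) {
        scan("q 2 w 10 10 100 50 re S Q 0 0 m 20 0 l 20 20 l h f",
             (kind, p, w) -> System.out.println(kind + " width=" + w + " bounds=" + p.getBounds2D()));
    }
}
```

A real decoder would also have to track colors, the current transformation matrix and clipping in the same way, which is why using a library that already exposes a callback is usually less work.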
How we would solve it with our software
Here is my solution:-
1. Download the JPedal trial jar.
2. JPedal has several custom interfaces so that users can add callbacks to their code.
3. The ShapeTracker interface looks like an exact match for this problem.
4. There is a commented-out section in our ConvertPagesToImages example. If you copy this code into your IDE, you can try it and adapt it to your exact needs. [link]
    /**
     * code to track shapes
     */
    org.jpedal.external.ShapeTracker myShapeTracker = new TestShapeTracker();
    decode_pdf.addExternalHandler(myShapeTracker, org.jpedal.external.Options.ShapeTracker);
5. Remember we have a support forum if you need any help or further details.