In a previous article, I wrote about the problems with editing the text in PDF files. PDF files are very different from other file formats such as Word or OpenOffice, which store the data as a set of objects that are then rendered as needed. The PDF file format was really designed for final display.
A PDF file is more like a vector image file. It contains a set of pages, each described by the commands that draw it so that it looks perfect. Underneath there are very few structures, so editing can be a nightmare. Essentially, what you have in a PDF is the PostScript-style drawing commands that show the content, not the content itself.
It becomes difficult when you want to change the actual content on the page. Because the structure of words, paragraphs and text flow no longer exists, it is very difficult to alter the text, especially if you need to reflow it. You end up having to hack the PostScript-style command stream and guess what is going on. PDF files which look identical can be structured very differently internally.
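To make that last point concrete, here is a minimal sketch in Python. The three content streams are hand-written for illustration (they are not taken from a real file), and the `text_fragments` helper is a hypothetical name. It only handles simple literal strings, ignoring escapes, nested parentheses, hex strings and font encodings, all of which a real PDF parser has to deal with. All three streams would draw roughly the same "Hello World" on the page, depending on the font, yet the text is split up quite differently in each.

```python
import re

# Hand-written, illustrative content streams (assumptions, not real-file data).
# A: the whole phrase is shown with a single Tj operator.
STREAM_A = "BT /F1 12 Tf 72 700 Td (Hello World) Tj ET"
# B: a TJ array mixes text fragments with positioning adjustments;
#    the gap between the words is an adjustment, not a space character.
STREAM_B = "BT /F1 12 Tf 72 700 Td [(Hel) -20 (lo) -250 (W) 15 (orld)] TJ ET"
# C: each word is shown separately, with a Td move in between.
STREAM_C = "BT /F1 12 Tf 72 700 Td (Hello) Tj 33 0 Td (World) Tj ET"

# Simplification: matches only literal strings with no escapes or nesting.
LITERAL = re.compile(r"\(([^()\\]*)\)")


def text_fragments(stream):
    """Return the literal string fragments in the order they appear."""
    return LITERAL.findall(stream)


if __name__ == "__main__":
    for name, stream in (("A", STREAM_A), ("B", STREAM_B), ("C", STREAM_C)):
        print(name, text_fragments(stream))
```

In this sketch, stream A yields one fragment, stream C yields two, and stream B yields four with no space character anywhere. An editor that wants to change "World" to something longer first has to work out that those fragments form one word, and then decide what to do with the positioning numbers around them.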
The PDF file format is great for displaying content, securing it, allowing users to add comments and providing interaction via forms. It is less suited as an intermediate editable format, which is why there are lots of creation, display and splitting tools but only a few basic editing tools.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.