JPedal now contains an Apache Tika Parser which can parse and extract unstructured text from PDF files.
How to use an Apache Tika PDF Parser
Just like any other Apache Tika Parser, you must call the parse() method with four arguments.
First, pass a TikaInputStream created from the path to your PDF file.
Second, pass a ContentHandler. It is advisable to set its character limit to -1, which disables the limit; otherwise, the whole PDF file may not be parsed.
Third, pass a Metadata object. This can be a blank instance, or it can contain the password to the PDF file if it is encrypted.
Finally, a ParseContext is not needed, so the last argument can be null.
The extracted text is now stored in the ContentHandler!
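The steps above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not JPedal's official sample. The class name TikaPdfTextExample and the method extractText are hypothetical. The parser is passed in as a plain Tika Parser because the parser's class name is not given here. BodyContentHandler is assumed as the ContentHandler implementation, since its constructor accepts a write limit. The Metadata is left blank because the key used to supply a password is not specified here.

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is for illustration only.
public final class TikaPdfTextExample {

    /**
     * Extracts plain text from a PDF using the supplied Tika Parser.
     *
     * @param pdfParser an instance of the PDF parser (assumed to be created by the caller)
     * @param pdfFile   path to the PDF file
     */
    static String extractText(Parser pdfParser, Path pdfFile)
            throws IOException, SAXException, TikaException {
        // A write limit of -1 disables the character limit,
        // so long documents are not cut short.
        BodyContentHandler handler = new BodyContentHandler(-1);

        // Blank metadata; an encrypted file would need its password supplied here.
        Metadata metadata = new Metadata();

        // First argument: a TikaInputStream opened on the PDF path.
        try (TikaInputStream stream = TikaInputStream.get(pdfFile)) {
            // Last argument is null, as described above for this parser.
            // Other Tika parsers may expect a non-null ParseContext.
            pdfParser.parse(stream, handler, metadata, null);
        }

        // The handler now holds the extracted text.
        return handler.toString();
    }
}
```

A caller would then write something like String text = TikaPdfTextExample.extractText(parser, Paths.get("example.pdf")), where parser is an instance of the JPedal Tika parser and example.pdf is a placeholder file name.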
Learn More
You can find more information about our Apache Tika Parser here.