Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

New APIs to handle PDF files in JPedal 6 – Text

In my first article I covered PDF files and image handling. In this second part I will look at the text features in JPedal 6.

Search a PDF file for text

JPedal makes it easy to scan the pages of a PDF file for text, and its search features include regular expressions. Here is a simple example (Javadoc).

FindTextInRectangle extract=new FindTextInRectangle("C:/pdfs/mypdf.pdf");
 //extract.setPassword("password");
 if (extract.openPDFFile()) {
      int pageCount=extract.getPageCount();
      for (int page=1; page<=pageCount; page++) {
 
          //co-ordinates of the matches found on this page
          float[] coords=extract.findTextOnPage(page, "textToFind", SearchType.MUTLI_LINE_RESULTS);
      }
 }
 
 extract.closePDFfile();

Extract Unstructured Text from a PDF file

If a PDF contains unstructured, extractable text, this API allows it to be extracted from the page (Javadoc).

ExtractTextInRectangle extract=new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
 //extract.setPassword("password");
 if (extract.openPDFFile()) {
     int pageCount=extract.getPageCount();
     for (int page=1; page<=pageCount; page++) {
 
        String text=extract.getTextOnPage(page);
     }
 }
 
 extract.closePDFfile();
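
Once the text of a page is available as a String, ordinary Java tools can be applied to it. The sketch below uses only java.util.regex from the standard library and does not call JPedal at all: the class name PageTextSearch, the method findMatches and the sample text are hypothetical stand-ins for whatever getTextOnPage(page) returns for your file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JPedal.
final class PageTextSearch {

    // Returns every substring of pageText that matches the regular expression.
    static List<String> findMatches(String pageText, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(pageText);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample text standing in for the String returned by getTextOnPage(page).
        String pageText = "Invoice INV-1042 issued 2014-03-01, replaces INV-0999.";

        // Prints [INV-1042, INV-0999]
        System.out.println(findMatches(pageText, "INV-\\d{4}"));
    }
}
```

In a real program the call to findMatches would sit inside the page loop above, so that each page's text is checked as it is extracted.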

Extract Structured Text from a PDF file

If a PDF was correctly created with structured, extractable text, this API allows the text content to be extracted from the page as structured content in a Java Document (Javadoc).

ExtractTextInRectangle extract=new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
 //extract.setPassword("password");
 if (extract.openPDFFile()) {
     int pageCount=extract.getPageCount();
     for (int page=1; page<=pageCount; page++) {
 
        String text=extract.getTextOnPage(page);
     }
 }
 
 extract.closePDFfile();

Extract Wordlist from a PDF file

Many customers use JPedal to pre-index the text content of their PDF files in a database. This API makes it easy to extract all the words from a PDF file with their text positions onscreen (Javadoc).

ExtractTextAsWordlist extract=new ExtractTextAsWordlist("C:/pdfs/mypdf.pdf");
 //extract.setPassword("password");
 if (extract.openPDFFile()) {
      int pageCount=extract.getPageCount();
      for (int page=1; page<=pageCount; page++) {
 
        List wordList=extract.getWordsOnPage(page);
      } 
 }
 
 extract.closePDFfile();
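
To give an idea of what pre-indexing might look like, here is a minimal sketch of an in-memory index that maps each word to the pages it appears on. It is plain Java and makes no JPedal calls. The WordIndex class and its methods are hypothetical, and it assumes the words have already been separated from their co-ordinates, since the exact layout of the List returned by getWordsOnPage is not shown in this article.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of JPedal.
final class WordIndex {

    // Maps each lower-cased word to the sorted set of page numbers it appears on.
    private final Map<String, SortedSet<Integer>> pagesByWord = new TreeMap<>();

    // Records that every word in the list appears on the given page.
    void addPage(int page, List<String> words) {
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            pagesByWord.computeIfAbsent(key, k -> new TreeSet<>()).add(page);
        }
    }

    // Returns the pages containing the word, or an empty set if it was never seen.
    SortedSet<Integer> pagesContaining(String word) {
        SortedSet<Integer> pages = pagesByWord.get(word.toLowerCase(Locale.ROOT));
        return pages == null ? new TreeSet<>() : pages;
    }

    public static void main(String[] args) {
        WordIndex index = new WordIndex();
        // Assumed sample words standing in for the words extracted from each page.
        index.addPage(1, Arrays.asList("PDF", "text", "search"));
        index.addPage(2, Arrays.asList("pdf", "outline"));

        // Prints [1, 2]
        System.out.println(index.pagesContaining("pdf"));
    }
}
```

A database-backed version would write the same word and page pairs to a table rather than a map, and could store the positions alongside them.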

Extract Document outline from PDF files

PDF files often contain a document outline to provide a table of contents (Javadoc).

ExtractOutline extract=new ExtractOutline("C:/pdfs/mypdf.pdf");
 //extract.setPassword("password");
 if (extract.openPDFFile()) {
     Document pdfOutline=extract.getPDFTextOutline();
 }
 
 extract.closePDFfile();
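
If the outline comes back as a standard org.w3c.dom.Document (an assumption, as the article only says Document), it can be walked with the usual DOM calls. The sketch below builds a small stand-in document rather than calling JPedal. The OutlineWalker class, the "title" element and the "title" attribute are hypothetical names chosen for illustration, so check the Javadoc for the structure JPedal really produces.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of JPedal.
final class OutlineWalker {

    // Returns one line per element, indented two spaces per nesting level.
    static List<String> describe(Document outline) {
        List<String> lines = new ArrayList<>();
        if (outline.getDocumentElement() != null) {
            collect(outline.getDocumentElement(), 0, lines);
        }
        return lines;
    }

    private static void collect(Element element, int depth, List<String> lines) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        // "title" is an assumed attribute name; fall back to the tag name if it is absent.
        line.append(element.hasAttribute("title") ? element.getAttribute("title") : element.getTagName());
        lines.add(line.toString());

        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, depth + 1, lines);
            }
        }
    }

    // Builds a small stand-in outline; a real one would come from the PDF.
    static Document sampleOutline() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            Element chapter = doc.createElement("title");
            chapter.setAttribute("title", "Chapter 1");
            root.appendChild(chapter);
            Element section = doc.createElement("title");
            section.setAttribute("title", "Section 1.1");
            chapter.appendChild(section);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }

    public static void main(String[] args) {
        // Prints "root", then "  Chapter 1", then "    Section 1.1" on separate lines.
        describe(sampleOutline()).forEach(System.out::println);
    }
}
```

The recursive walk keeps the nesting of the outline, which is usually what you want when rebuilding a table of contents.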

The JPedal API provides a great deal of easy-to-use functionality for PDF files and text handling. Are there any other features you would like to see?

Next time we will look at some general features (page count, page sizes, etc.) of the JPedal library that are easily available through a new API.

