Extracting text references from a PDF file

JPedal is used for all sorts of extraction tasks. A recent request came from a client who wanted to extract the references (links) and their text from a PDF file. Here is what you need to do to achieve this, along with some sample code if you would like to use our PDF library.

Links are stored as Annotation objects in a PDF file. Every page can have a list of the Annotation objects on that page (if no annotations are present, there will be no value). Annotation objects allow PDF files to contain animations and interactions. They are stored separately from the page text and drawn over the page by the PDF renderer.

There are several types of Annotation object. The ones we are interested in are the Annot objects with a /Subtype of Link. Here is what the raw data might look like.

23 0 obj<<

/F 4

/A<</URI(http://www.jpedal.org/link.html)/Type/Action/S/URI>>

/BS<</W 0>>

/Subtype/Link

/StructParent 1

/Rect[60.72 684 86.88 696]

>>

endobj

The values we are interested in are:

/Subtype value, which tells us it is a link (it could be a video, a sound, a form, a popup note or lots of other cool features)

/Rect value, which gives the PDF co-ordinates of the rectangle occupied by the link. If you click on this area, the link will activate

/A value, which is the action value that tells us what to do. In this case we have a URL which we open in a browser

These are easy to extract from the Annot object.
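To make those three values concrete, here is a minimal sketch in plain Java that pulls them out of the raw dictionary text shown above. It is not how JPedal reads the object (the library parses the PDF properly). The class name LinkAnnotSketch and its methods are made up for this illustration, and the regular expressions assume simple, unescaped values like the ones in the sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex scan of the raw dictionary text, not a real PDF parser.
// It will not cope with escaped brackets, nested strings or indirect references.
class LinkAnnotSketch {

    // matches /Subtype/Link or /Subtype /Link
    private static final Pattern SUBTYPE = Pattern.compile("/Subtype\\s*/(\\w+)");
    // matches /Rect[x1 y1 x2 y2]
    private static final Pattern RECT = Pattern.compile("/Rect\\s*\\[\\s*([^\\]]+?)\\s*\\]");
    // matches the string form /URI(...), not the name in /S/URI
    private static final Pattern URI = Pattern.compile("/URI\\s*\\(([^)]*)\\)");

    // returns the Subtype name (for example "Link") or null if absent
    static String subtype(String raw) {
        Matcher m = SUBTYPE.matcher(raw);
        return m.find() ? m.group(1) : null;
    }

    // returns the four Rect co-ordinates or null if absent or malformed
    static float[] rect(String raw) {
        Matcher m = RECT.matcher(raw);
        if (!m.find()) return null;
        String[] parts = m.group(1).trim().split("\\s+");
        if (parts.length != 4) return null;
        float[] coords = new float[4];
        for (int i = 0; i < 4; i++) coords[i] = Float.parseFloat(parts[i]);
        return coords;
    }

    // returns the URI string from the action dictionary or null if absent
    static String uri(String raw) {
        Matcher m = URI.matcher(raw);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String raw = "23 0 obj<< /F 4 "
                + "/A<</URI(http://www.jpedal.org/link.html)/Type/Action/S/URI>> "
                + "/BS<</W 0>> /Subtype/Link /StructParent 1 "
                + "/Rect[60.72 684 86.88 696] >> endobj";
        System.out.println("Subtype=" + subtype(raw));
        System.out.println("URI=" + uri(raw));
        float[] r = rect(raw);
        System.out.println("Rect=" + r[0] + " " + r[1] + " " + r[2] + " " + r[3]);
    }
}
```

In real files the object may be compressed or held in an object stream, so treat this as a way to see what the keys mean rather than as an extraction tool.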

As mentioned earlier, the text is stored separately so we need to decode the page and extract the text from the area of the page. Because of the way PDF works, you cannot be sure what is at any page location unless you parse the whole page.
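The idea can be sketched without any PDF library at all. Assume the page has already been decoded into a list of text fragments, each with a bounding box in PDF co-ordinates. The Fragment class and the textInRect method below are hypothetical and only stand in for whatever the decoder produces. Finding the link text is then a matter of keeping every fragment whose box overlaps the Annot rectangle.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: Fragment is a made-up stand-in for decoded page text.
class TextInRectSketch {

    // a piece of text with its bounding box (x1,y1 bottom-left, x2,y2 top-right)
    static final class Fragment {
        final String text;
        final float x1, y1, x2, y2;

        Fragment(String text, float x1, float y1, float x2, float y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    // true if the fragment box overlaps rect = {x1, y1, x2, y2}
    // (assumes both boxes are normalised so that x1 < x2 and y1 < y2)
    static boolean overlaps(Fragment f, float[] rect) {
        return f.x1 < rect[2] && f.x2 > rect[0]
            && f.y1 < rect[3] && f.y2 > rect[1];
    }

    // joins the text of all overlapping fragments, in list order, with spaces
    static String textInRect(List<Fragment> page, float[] rect) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : page) {
            if (overlaps(f, rect)) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(f.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("Click", 61f, 685f, 75f, 695f));
        page.add(new Fragment("here", 76f, 685f, 86f, 695f));
        page.add(new Fragment("Other", 200f, 685f, 230f, 695f));
        float[] linkRect = {60.72f, 684f, 86.88f, 696f};
        System.out.println(textInRect(page, linkRect));
    }
}
```

Note that every fragment on the page has to be examined, which is why the whole page must be decoded first.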

I have found that there can be a slight mismatch between the Annot rectangle and the exact area the text occupies, so you may want to allow a small margin for error (i.e. use a slightly larger rectangle).
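One way to apply that margin is sketched below. The padToInts helper is my own name for this illustration and is not part of JPedal. It rounds the float co-ordinates outwards before adding the margin, so the resulting integer rectangle can never end up smaller than the original, which a plain (int) cast on the upper edge could allow. It also swaps the corners if the Rect was written with them reversed.

```java
// Illustrative only: grows an Annot Rect into a slightly larger integer rectangle.
class RectMarginSketch {

    // rect is {x1, y1, x2, y2} in PDF co-ordinates; margin is in the same units
    static int[] padToInts(float[] rect, int margin) {
        // normalise first in case the corners were stored in the opposite order
        int x1 = (int) Math.floor(Math.min(rect[0], rect[2])) - margin;
        int y1 = (int) Math.floor(Math.min(rect[1], rect[3])) - margin;
        int x2 = (int) Math.ceil(Math.max(rect[0], rect[2])) + margin;
        int y2 = (int) Math.ceil(Math.max(rect[1], rect[3])) + margin;
        return new int[] {x1, y1, x2, y2};
    }

    public static void main(String[] args) {
        int[] padded = padToInts(new float[] {60.72f, 684f, 86.88f, 696f}, 1);
        System.out.println(padded[0] + " " + padded[1] + " " + padded[2] + " " + padded[3]);
    }
}
```

How large the margin should be depends on the file, so a value of 1 here is only a starting point to experiment with.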

So, that is how you would extract the references and their text, and here is the documented code example using our PDF library.

/**
 * setup JPedal for text extraction
 */
PdfDecoder decodePdf = new PdfDecoder(true);
decodePdf.init(true);
PdfDecoder.setFontReplacements(decodePdf);
PdfDecoder.useTextExtraction();

String filename="/Users/markee/Downloads/sampelFile.pdf";

try {
decodePdf.openPdfFile(filename);
} catch (PdfException e) {
e.printStackTrace();
}

/**
 * main code loop here - scan all pages for Annots and text
 */
for(int page=1;page<decodePdf.getPageCount()+1;page++){

//get Annots on the page
PdfArrayIterator annotListForPage=decodePdf.getFormRenderer().getAnnotsOnPage(page);

//if we have Annots, decode page as well and get data
if(annotListForPage!=null && annotListForPage.getTokenCount()>0){ //can have empty lists

try {
decodePdf.decodePage(page);
} catch (Exception e1) {
e1.printStackTrace();
}

//work through list getting values
while(annotListForPage.hasMoreTokens()){

//get ID of annot which has already been decoded and get actual object
String annotKey=annotListForPage.getNextValueAsString(true);

Object[] rawObj=decodePdf.getFormRenderer().getCompData().
                        getRawForm(annotKey);

if(rawObj!=null){

FormObject annotObj=(FormObject)rawObj[0];

int subtype=annotObj.getParameterConstant(PdfDictionary.Subtype);

//the type of annot we are interested in
if(subtype==PdfDictionary.Link){

System.out.println("\nlink object");
float[] coords=annotObj.getFloatArray(PdfDictionary.Rect);
System.out.println("Rect= "+coords[0]+" "+coords[1]
                                         +" "+coords[2]+" "+coords[3]);

//text in A subobject
PdfObject aData=annotObj.getDictionary(PdfDictionary.A);
if(aData!=null &&
                                          aData.getNameAsConstant(PdfDictionary.S)==PdfDictionary.URI){
String text=aData.getTextStreamValue(PdfDictionary.URI);
System.out.println("link text="+text);
}

/**
 * get data at location
 */
try {

PdfGroupingAlgorithms currentGrouping = decodePdf.
                                                getGroupingObject();

//we need a small margin of error to ensure we get the text
int x1 = (int)coords[0]-1;
int y1 = (int)coords[1]-1;
int x2 = (int)coords[2]+1;
int y2 = (int)coords[3]+4;

/**The call to extract the text*/
String text = currentGrouping.
                                                 extractTextInRectangle(x1,y1,x2,y2,page,false,true);
System.out.println("text at location="+text);

} catch (Exception e) {
e.printStackTrace();
}
}
}
}
}
}

/**close the pdf file*/
decodePdf.closePdfFile();

This post is part of our “Understanding the PDF File Format” series. In each article, we discuss a PDF feature, bug, gotcha or tip. If you wish to learn more about PDF, we have 13 years' worth of PDF knowledge and tips in our series index.

Mark Stephens

System Architect and Lead Developer at IDRSolutions
Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX. He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.