How are text links defined in a PDF file?


When you view a PDF file, you may notice that, like a web page, it can contain blue clickable links. These are defined in two ways:

Viewer generated links

Several PDF viewers will spot that text on the page looks like a link (for example, it starts with http:// or www.) and add a link themselves. Each viewer does this on an ad hoc basis, and this data is not in the PDF file. It can be a bit hit and miss, especially with values that run over several lines.
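To make the "hit and miss" point concrete, here is a minimal sketch of the kind of heuristic a viewer could use. It is an assumption for illustration only, not how any particular viewer works; the class name `UrlSpotter` and the pattern are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlSpotter {
    // Deliberately simple heuristic: anything starting with http://, https://
    // or www. and running up to the next whitespace character.
    static final Pattern URL_LIKE =
            Pattern.compile("\\b(?:https?://|www\\.)\\S+", Pattern.CASE_INSENSITIVE);

    /** Returns every run of text on the page that looks like a URL. */
    static List<String> findUrlLikeText(String pageText) {
        List<String> hits = new ArrayList<>();
        Matcher m = URL_LIKE.matcher(pageText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Both values sit on one line, so both are found.
        System.out.println(findUrlLikeText(
                "See http://www.jpedal.org/link.html or www.example.com today"));
        // The URL is split over two lines, so only the first part is found.
        System.out.println(findUrlLikeText("See http://www.jpedal.org/\nlink.html"));
    }
}
```

The second call shows why multiline values are a problem: the match stops at the line break, so the guessed link points at the wrong address.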

Actual links

The PDF file can contain actual links. These are not stored in the text but as Annotations: separate objects in the PDF file that are drawn by the PDF viewer.

Every PDF page has an optional list of annotation objects on that page (if no annotations are present, there will be no value). Annotation objects allow PDF files to contain animations and interactions. They are stored separately from the page text and drawn on top by the PDF renderer.

There are several types of Annotation object. The one we are interested in is the Annot with a /Subtype of Link. Here is what the raw data might look like.

23 0 obj<<
/F 4
/A<</URI(http://www.jpedal.org/link.html)/Type/Action/S/URI>>
/BS<</W 0>>
/Subtype/Link
/StructParent 1
/Rect[60.72 684 86.88 696]
>>
endobj

The values we are interested in are:

/Subtype value, which tells us it is a link (it could be a video, a sound, a form, a popup note or lots of other cool features).

/Rect value, which gives the PDF coordinates of the rectangle occupied by the link. If you click in this area, the link will activate.

/A value, which is the action that tells us what to do. In this case we have a URL, which we open in a browser.
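The click test on the /Rect value can be sketched as below. This is a minimal illustration, assuming the click has already been converted into PDF page coordinates (by default these start at the bottom-left of the page, unlike most screen coordinates). The class and method names are made up for this example.

```java
class LinkHitTest {
    /** Returns true if (x, y), in PDF page coordinates, lies inside the /Rect array. */
    static boolean contains(double[] rect, double x, double y) {
        // /Rect holds two diagonally opposite corners; normalise in case
        // they are not given as lower-left followed by upper-right.
        double minX = Math.min(rect[0], rect[2]);
        double maxX = Math.max(rect[0], rect[2]);
        double minY = Math.min(rect[1], rect[3]);
        double maxY = Math.max(rect[1], rect[3]);
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }

    public static void main(String[] args) {
        double[] rect = {60.72, 684, 86.88, 696}; // /Rect from the example object
        System.out.println(contains(rect, 70, 690)); // true: inside the link area
        System.out.println(contains(rect, 70, 700)); // false: above the link area
    }
}
```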

These are easy to extract from the Annot object. This is a feature offered by our JPedal PDF software.
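As a rough illustration of what that extraction involves, the sketch below pulls the three values out of the raw text of the example object using regular expressions. It is not the JPedal API, and `AnnotSketch` and its methods are hypothetical names. It also assumes the object is available as plain, uncompressed text; a real PDF parser has to deal with compressed streams, indirect references and escaped characters in strings.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnnotSketch {
    static final Pattern SUBTYPE = Pattern.compile("/Subtype\\s*/(\\w+)");
    static final Pattern URI = Pattern.compile("/URI\\s*\\(([^)]*)\\)");
    static final Pattern RECT = Pattern.compile(
            "/Rect\\s*\\[\\s*([-\\d.]+)\\s+([-\\d.]+)\\s+([-\\d.]+)\\s+([-\\d.]+)\\s*\\]");

    // Returns the first captured group, or null if the key is absent.
    private static String first(Pattern p, String raw) {
        Matcher m = p.matcher(raw);
        return m.find() ? m.group(1) : null;
    }

    static String subtype(String raw) { return first(SUBTYPE, raw); }

    static String uri(String raw) { return first(URI, raw); }

    /** Returns the four /Rect numbers, or null if there is no /Rect entry. */
    static double[] rect(String raw) {
        Matcher m = RECT.matcher(raw);
        if (!m.find()) return null;
        double[] r = new double[4];
        for (int i = 0; i < 4; i++) {
            r[i] = Double.parseDouble(m.group(i + 1));
        }
        return r;
    }

    public static void main(String[] args) {
        String raw = "23 0 obj<<\n/F 4\n"
                + "/A<</URI(http://www.jpedal.org/link.html)/Type/Action/S/URI>>\n"
                + "/BS<</W 0>>\n/Subtype/Link\n/StructParent 1\n"
                + "/Rect[60.72 684 86.88 696]\n>>\nendobj";
        System.out.println(subtype(raw));                // Link
        System.out.println(uri(raw));                    // http://www.jpedal.org/link.html
        System.out.println(Arrays.toString(rect(raw)));  // the four /Rect numbers
    }
}
```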

As mentioned earlier, the text is stored separately, so to find the text of a link we need to decode the page and extract the text from that area of the page. Because of the way PDF works, you cannot be sure what is at any page location unless you parse the whole page.
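Once the whole page has been decoded, matching text to a link comes down to comparing positions. The sketch below assumes a decoder has already produced a list of text fragments with page coordinates; the `TextFragment` class, the method name and the sample positions are all hypothetical and only there to show the idea.

```java
import java.util.Arrays;
import java.util.List;

class LinkTextSketch {
    /** Hypothetical positioned piece of text produced by decoding a page. */
    static final class TextFragment {
        final String text;
        final double x;
        final double y;

        TextFragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the fragments whose position lies inside the link's /Rect. */
    static String textInside(List<TextFragment> fragments, double[] rect) {
        double minX = Math.min(rect[0], rect[2]);
        double maxX = Math.max(rect[0], rect[2]);
        double minY = Math.min(rect[1], rect[3]);
        double maxY = Math.max(rect[1], rect[3]);
        StringBuilder sb = new StringBuilder();
        for (TextFragment f : fragments) {
            if (f.x >= minX && f.x <= maxX && f.y >= minY && f.y <= maxY) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(f.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up positions for three words on one line of the page.
        List<TextFragment> page = Arrays.asList(
                new TextFragment("Click", 20, 686),
                new TextFragment("here", 62, 686),
                new TextFragment("for", 90, 686));
        double[] rect = {60.72, 684, 86.88, 696}; // /Rect from the example object
        System.out.println(textInside(page, rect)); // here
    }
}
```

Every fragment on the page has to be checked, which is why the whole page must be parsed before the link text can be known.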

Our software libraries allow you to

Convert PDF files to HTML

Use PDF Forms in a web browser

Convert PDF Documents to an image

Work with PDF Documents in Java

Read and write HEIC and other Image formats in Java

3 Replies to “How are text links defined in a PDF file?”

krishna says:
August 2, 2012 at 5:57 pm
good morning sir
sir i want java code for extraction of references from pdf
1. Mark Stephens says:
  August 2, 2012 at 8:48 pm
  It is part of the reader classes used internally. If you wish to buy a commercial license, we will add an article documenting it for you.
  1. krishna says:
    August 3, 2012 at 4:09 pm
    plz give me the compelete code sir
