So what is punctuation?
This may seem like a simple question, yet I find myself asking it more and more often whilst working on our PDF search and text extraction. Today I found myself asking it once again: what is punctuation?
According to dictionary.com, punctuation is:
“The practice or system of using certain conventional marks or characters in writing or printing in order to separate elements and make the meaning clear, as in ending a sentence or separating clauses.”
Unfortunately this is not all that useful, as the English language has many different forms of punctuation and often uses the same symbols in a multitude of ways. Punctuation is even used for purposes other than sentence structure, for example in emoticons.
For instance, the character ‘.’ could be a full stop, it could be a decimal point, or it could even be a part of ‘…’
In a PDF the character ‘.’ could also be used in a multitude of other ways to help format a page and improve the flow of the text.
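To make the ambiguity concrete, here is a minimal sketch of how the role of a ‘.’ might be guessed from its immediate neighbours. The class name DotClassifier, the DotKind categories and the rules themselves are assumptions made up for illustration; they are not part of any library and will misjudge plenty of real text (abbreviations, version numbers and so on).

```java
// Hypothetical helper for illustration only: guesses the role of a '.'
// by looking at the characters directly before and after it.
class DotClassifier {

    enum DotKind { DOT_RUN, DECIMAL_POINT, FULL_STOP_OR_OTHER }

    static DotKind classify(String text, int index) {
        if (text.charAt(index) != '.') {
            throw new IllegalArgumentException("No '.' at index " + index);
        }
        boolean hasBefore = index > 0;
        boolean hasAfter = index + 1 < text.length();

        // A neighbouring '.' suggests an ellipsis typed as "..." or a row
        // of dots used for layout (for example in a table of contents).
        boolean dotBefore = hasBefore && text.charAt(index - 1) == '.';
        boolean dotAfter = hasAfter && text.charAt(index + 1) == '.';
        if (dotBefore || dotAfter) {
            return DotKind.DOT_RUN;
        }

        // A digit on both sides suggests a decimal point, as in "3.14".
        boolean digitBefore = hasBefore && Character.isDigit(text.charAt(index - 1));
        boolean digitAfter = hasAfter && Character.isDigit(text.charAt(index + 1));
        if (digitBefore && digitAfter) {
            return DotKind.DECIMAL_POINT;
        }

        // Everything else is left undecided: probably a full stop, but it
        // could just as easily be an abbreviation or something else.
        return DotKind.FULL_STOP_OR_OTHER;
    }
}
```

Even this small sketch has to fall back on an "or other" category, which is rather the point: the character alone does not say what it means.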
This is just one trivial example of many. I keep finding cases where a search for whole words only, or an extraction of text as a word list, returns results that are thrown off by punctuation.
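As a rough illustration of how a whole-word search can go wrong, the sketch below compares two approaches: splitting on whitespace and comparing tokens exactly, and a regular expression that only requires the match not to touch a letter or digit on either side. The class and method names are invented for this example and say nothing about how any particular search implementation works.

```java
import java.util.regex.Pattern;

// Hypothetical example: two ways of asking "does this text contain the word?"
class WholeWordSearch {

    // Naive approach: split on whitespace and compare tokens exactly.
    // Punctuation stays glued to the token, so "end." never equals "end".
    static boolean naiveContainsWord(String text, String word) {
        for (String token : text.split("\\s+")) {
            if (token.equals(word)) {
                return true;
            }
        }
        return false;
    }

    // Boundary approach: the match must not have a letter or digit
    // immediately before or after it. Punctuation counts as a boundary.
    static boolean boundaryContainsWord(String text, String word) {
        Pattern pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(word) + "(?![\\p{L}\\p{N}])");
        return pattern.matcher(text).find();
    }
}
```

With the text "This is the end." the naive version does not find "end" while the boundary version does. The boundary version also finds "tasking" inside "multi-tasking", which may or may not be what the person searching wanted.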
When searching or extracting text, what about the ‘-’ character?
Is the term “multi-tasking” one word or two?
If it’s one word, should we allow it to contain the ‘-’?
How do we check whether this is a valid use within a word?
What of one word split across two lines, with a ‘-’ at the end of the first line?
Is this one word or two?
What of the ‘-’?
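One possible heuristic for the line-break case, sketched below, is to join the two halves and look the result up in a word list: if the joined form is a known word, drop the ‘-’, otherwise keep it. The class LineEndHyphenJoiner and the knownWords set are assumptions for illustration; a real word list would have to come from somewhere, and the sketch only handles the plain ASCII hyphen.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical heuristic: decide whether a '-' at the end of a line is a
// line-break hyphen (drop it) or part of the word itself (keep it).
class LineEndHyphenJoiner {

    static String join(String firstLine, String secondLine, Set<String> knownWords) {
        if (!firstLine.endsWith("-")) {
            // No trailing hyphen: treat the lines as separate words.
            return firstLine + " " + secondLine;
        }

        // Text of the first line without its trailing hyphen.
        String head = firstLine.substring(0, firstLine.length() - 1);

        // Last fragment of the first line and first fragment of the second.
        String firstHalf = head.substring(head.lastIndexOf(' ') + 1);
        int firstSpace = secondLine.indexOf(' ');
        String secondHalf = firstSpace < 0 ? secondLine : secondLine.substring(0, firstSpace);

        // Candidate word with the hyphen removed, ignoring anything that is
        // not a letter (such as a trailing comma or full stop).
        String candidate = (firstHalf + secondHalf)
                .replaceAll("[^\\p{L}]", "")
                .toLowerCase(Locale.ROOT);

        // Known word: assume the hyphen only marked the line break.
        // Unknown word: keep the hyphen, as in "multi-tasking".
        boolean dropHyphen = knownWords.contains(candidate);
        return head + (dropHyphen ? "" : "-") + secondLine;
    }
}
```

With a word list containing only "extraction", the lines "text extrac-" and "tion is hard" join to "text extraction is hard", while "good at multi-" and "tasking" join to "good at multi-tasking". Add "multitasking" to the list and the second answer changes, which shows how much the result depends on the word list rather than on the document.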
I’m not writing this to provide a concrete solution, nor am I looking to be given one. I don’t believe one exists, given the many ways punctuation can be used in text documents.
These questions come up often because everyone who produces PDFs does so in a different style. They are just a few of the things that make my job interesting.
IDRsolutions develop a Java PDF library, a PDF forms to HTML5 converter, a PDF to HTML5 or SVG converter and a Java Image Library that doubles as an ImageIO replacement. On the blog our team post about anything interesting they learn about.