Today I will walk through a worked example showing how to build a PDF translator using our PDF toolkit JPedal and the Translator library. It translates the text of a PDF document from one language to another (in this case, English to Chinese).
You can get a copy of JPedal here.
Extracting text
First, we will need to extract the text from the document so that it can be passed to a translation API.
JPedal has lots of different methods to extract text based on what you need. I am going to use the paragraph estimation feature so we can translate one paragraph at a time.
We can do this by decoding the page with PdfDecoderServer and calling the getParagraphAreasAs2dArray() method on the TextLines object it returns.
final PdfDecoderServer pdfDecoderServer = new PdfDecoderServer();
pdfDecoderServer.openPdfFile("inputFile.pdf");
pdfDecoderServer.decodePage(1);
final TextLines textLines = pdfDecoderServer.getTextLines();
final int[][] paragraphs = textLines.getParagraphAreasAs2dArray(1, 5);
Next, we will need to convert our paragraph rectangles from X,Y,W,H format to X0,Y0,X1,Y1 so that we can pass them to the grouping algorithm which extracts the text.
private static void convertRectangles(final int[][] input) {
    for (int i = 0; i < input.length; i++) {
        final int x = input[i][0];
        final int y = input[i][1];
        input[i][2] += x; // width becomes X1
        input[i][3] += y; // height becomes Y1
    }
}
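To make the conversion concrete, here is a small standalone sketch that runs the same arithmetic on one made-up rectangle. The class name RectangleConversionDemo and the sample values are assumptions for illustration only; nothing here calls JPedal.

```java
import java.util.Arrays;

final class RectangleConversionDemo {

    // Same arithmetic as convertRectangles above: width and height
    // are replaced in place by the coordinates of the far corner.
    static void convertRectangles(final int[][] input) {
        for (final int[] rect : input) {
            rect[2] += rect[0]; // x1 = x + width
            rect[3] += rect[1]; // y1 = y + height
        }
    }

    public static void main(final String[] args) {
        // Hypothetical paragraph at (50, 700), 200 wide and 40 tall
        final int[][] paragraphs = {{50, 700, 200, 40}};
        convertRectangles(paragraphs);
        System.out.println(Arrays.toString(paragraphs[0])); // prints [50, 700, 250, 740]
    }
}
```

Because the array is modified in place, the method should only be called once per set of rectangles; a second call would add X and Y again.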
Now, for each paragraph, we can extract the words.
final PdfGroupingAlgorithms groupingObject = pdfDecoderServer.getGroupingObject();
// x0, y0, x1, y1 are the converted corners of the current paragraph; 1 is the page number
final List<String> words = groupingObject.extractTextAsWordlist(x0, y0, x1, y1, 1, true, "&:=()!;.,\\/\"\"");
final StringBuilder paragraphString = new StringBuilder();
// Each word is followed by its four coordinates, so step through the list five entries at a time
for (int j = 0; j < words.size(); j += 5) {
    paragraphString.append(words.get(j)).append(" ");
}
final String pureText = Strip.convertToText(paragraphString.toString(), true);
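The step of 5 in the loop is easier to see on sample data. The sketch below assumes the word list is laid out as a word followed by four coordinate values, which is what the loop above relies on. The class name WordListDemo, the joinWords helper, and the sample values are hypothetical, and the list is built by hand rather than extracted from a PDF.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class WordListDemo {

    // Entries per word: the word itself plus four coordinates (assumed layout)
    private static final int STRIDE = 5;

    // Picks out every fifth entry and joins the words with single spaces
    static String joinWords(final List<String> wordList) {
        final StringJoiner joiner = new StringJoiner(" ");
        for (int j = 0; j < wordList.size(); j += STRIDE) {
            joiner.add(wordList.get(j));
        }
        return joiner.toString();
    }

    public static void main(final String[] args) {
        // Hypothetical data in the assumed layout: word, x1, y1, x2, y2
        final List<String> words = Arrays.asList(
                "Hello", "72", "720", "100", "708",
                "world", "104", "720", "135", "708");
        System.out.println(joinWords(words)); // prints Hello world
    }
}
```

Using a StringJoiner instead of appending a space after every word avoids a trailing space at the end of the paragraph, which may or may not matter to the translation API you choose.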
Learn more about extracting text.
Translating the text
Second, we need to connect to a translation API to get the translated text.
I have chosen to use Translator because it is easy to use and works well, but you could use any library.
final Translator translator = new Translator();
// The target language comes first, followed by the source language
final Translation translation = translator.translateBlocking(pureText, Language.CHINESE_SIMPLIFIED, Language.ENGLISH);
final String translatedText = translation.getTranslatedText();
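Since any translation library could be used here, it can help to hide the call behind a small wrapper. The following is only a sketch of one possible design, not part of JPedal or Translator: ParagraphTranslator is a hypothetical class that takes the translation call as a function, skips blank paragraphs, and reuses the result for paragraphs it has already seen (for example, a header repeated on every page).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

final class ParagraphTranslator {

    // The actual translation call, supplied by the caller (any library)
    private final UnaryOperator<String> backend;

    // Remembers translations so identical paragraphs are only sent once
    private final Map<String, String> cache = new HashMap<>();

    ParagraphTranslator(final UnaryOperator<String> backend) {
        this.backend = backend;
    }

    String translate(final String paragraph) {
        if (paragraph == null || paragraph.trim().isEmpty()) {
            return ""; // nothing to translate, so skip the API call
        }
        return cache.computeIfAbsent(paragraph, backend);
    }
}
```

With the Translator calls shown above, the wrapper could be created as new ParagraphTranslator(text -> translator.translateBlocking(text, Language.CHINESE_SIMPLIFIED, Language.ENGLISH).getTranslatedText()), and swapping in a different library would only change that one lambda.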
Annotating text
Finally, we need to insert the translated text as an annotation which overlays each paragraph on the page.
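We can use JPedal’s PdfManipulator class to efficiently perform bulk edits to a PDF file. The snippets below assume a pdfManipulator instance that has already loaded the input file, and that paragraph is the converted rectangle for the current paragraph.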
final float[] rect = toFloatArray(paragraph);
final float[] red = new float[] {1.0f, 0.0f, 0.0f};
final int flags = Annotation.getFlagsValue(false, false, true, false, true, false, true, true, false, true);
pdfManipulator.addAnnotation(1, new FreeText(rect, flags, translatedText, red, 1.0f, 1.0f, BaseFont.Helvetica, 10, Quadding.LEFT_JUSTIFIED));
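The toFloatArray helper used above is not shown in this post. A minimal sketch of what it might look like is below, assuming it simply widens each int coordinate of the paragraph rectangle to a float; the version in the complete source may differ, and the class name RectUtil is hypothetical.

```java
final class RectUtil {

    // Copies an int[] rectangle into a float[] of the same length.
    // This is an assumed implementation of the helper, for illustration.
    static float[] toFloatArray(final int[] values) {
        final float[] result = new float[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = values[i]; // implicit int-to-float widening
        }
        return result;
    }
}
```

For example, RectUtil.toFloatArray(new int[] {50, 700, 250, 740}) gives a four-element float array holding the same coordinates.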
Once all the annotations have been added and we are outside the paragraph loop, we can apply the queued edits and write the result to a new file.
pdfManipulator.apply();
pdfManipulator.writeDocument(new File("outputFile.pdf"));
Learn more about manipulating PDF documents.
Results
You can find the complete source code for this on our GitHub profile.