How to translate PDF files in Java (Tutorial)

Table of Contents show

Today I will demonstrate a worked example to show how you can create a PDF translator using our PDF toolkit JPedal and Translator. This will convert any PDF Document from one language to another (in this case English to Chinese).

You can get a copy of JPedal here.

Extracting text

First, we will need to extract the text from the document so that it can be passed to a translation API.

JPedal has lots of different methods to extract text based on what you need. I am going to use the paragraph estimation feature so we can translate one paragraph at a time.

We can do this by decoding the file with PdfDecoder and calling the getParagraphAreasAs2dArray() method.

final PdfDecoderServer pdfDecoderServer = new PdfDecoderServer();

pdfDecoderServer.openPdfFile("inputFile.pdf");

pdfDecoderServer.decodePage(1);

final TextLines textLines = pdfDecoderServer.getTextLines();

final int[][] paragraphs = textLines.getParagraphAreasAs2dArray(i, 5);

Next, we will need to convert our paragraph rectangles from X,Y,W,H format to X0,Y0,X1,Y1 so that we can pass them to the grouping algorithm which extracts the text.

private static void convertRectangles(final int[][] input) {

    for (int i = 0; i < input.length; i++) {

        final int x = input[i][0];

        final int y = input[i][1];

        input[i][2] += x;

        input[i][3] += y;

    }

}

Now, for each paragraph, we can extract the words.

final PdfGroupingAlgorithms groupingObject = pdfDecoderServer.getGroupingObject();

final List words = groupingObject.extractTextAsWordlist(x0, y0, x1, y1, 1, true, "&:=()!;.,\\/\"\"''");

final StringBuilder paragraphString = new StringBuilder();

for (int j = 0; j < words.size(); j += 5) {

    paragraphString.append(words.get(j)).append(" ");

}

final String pureText = Strip.convertToText(paragraphString.toString(), true);

Learn more about extracting text.

Translating the text

Second, we need to connect to a translation API to get the translated text.

I have chosen to use Translator because it is easy to use and works well, but you could use any library.

final Translator translator = new Translator();

final Translation translation = translator.translateBlocking(pureText, Language.CHINESE_SIMPLIFIED, Language.ENGLISH);

final String translatedText = translation.getTranslatedText();

Annotating text

Finally, we need to insert the translated text as an annotation which overlays each paragraph on the page.

We can use JPedal’s PdfManipulator class to efficiently perform bulk edits to a PDF file.

final float[] rect = toFloatArray(paragraph);

final float[] red = new float[] {1.0f, 0.0f, 0.0f};

final int flags = Annotation.getFlagsValue(false, false, true, false, true, false, true, true, false, true);

pdfManipulator.addAnnotation(1, new FreeText(rect, flags, translatedText, red, 1.0f, 1.0f, BaseFont.Helvetica, 10, Quadding.LEFT_JUSTIFIED));

Once all the annotations are added, and we are outside of the loop, we can then apply the queued edits and write them to the file.

pdfManipulator.apply();

pdfManipulator.writeDocument(new File("outputFile.pdf"));

Learn more about manipulating PDF documents.

Results

Before

After

You can find the complete source code for this on our GitHub profile.

We can help you better understand the PDF format as developers who have been working with the format for more than 2 decades!

The JPedal PDF library allows you to solve these problems in Java

//Convenience static method (see class for additional options)
ExtractClippedImages.writeAllClippedImagesToDir("inputFileOrDirectory", "outputDir", "outputImageFormat", new String[] {"imageHeightAsFloat", "subDirectoryForHeight"});

final PdfManipulator pdf = new PdfManipulator();
pdf.loadDocument(new File("inputFile.pdf"));
pdf.addPage(1, PaperSize.A4_LANDSCAPE);
pdf.addText(1, "Hello World", 10, 10, BaseFont.HelveticaBold, 12, 1, 0.3f, 0.2f);
pdf.addImage(1, new BufferedImage(), new float[] {0, 0, 100, 100});
pdf.rotatePage(1, 90);
pdf.apply();
pdf.writeDocument(new File("outputFile.pdf"));

Viewer viewer = new Viewer();
viewer.setupViewer();
viewer.executeCommand(ViewerCommands.OPENFILE, "pdfFile.pdf");

//Convenience static method (see class for additional options)
ExtractTextAsWordList.writeAllWordlistsToDir("inputFileOrDirectory", "outputDir", -1);

PdfMerge.mergeFiles(new File("inputFile1.pdf"), new File("inputFile2.pdf"), new File("outputFile.pdf"));

PdfManipulator.splitInHalf(new File("inputFile.pdf"), new File("outputFolder"), pageToSplitAt);

PrintPdfPages print = new PrintPdfPages("C:/pdfs/mypdf.pdf");

if (print.openPDFFile()) {
    print.printAllPages("Printer Name");
}

//Convenience static method (see class for additional options)
ExtractClippedImages.writeAllClippedImagesToDir("inputFileOrDirectory", "outputDir", "outputImageFormat", new String[] {"imageHeightAsFloat", "subDirectoryForHeight"});

//Convenience static method (see class for additional options)
ArrayList resultsForPages = FindTextInRectangle.findTextOnAllPages("/path/to/file.pdf", "textToFind");

java -jar jpedal.jar --inspect "inputFile.pdf"

PdfSigner.signPdf(
        "inputFile.pdf",
        "outputFile.pdf",
        "keystorePassword",
        "keystoreFile.p12",
        "signerName",
        "signerLocation",
        "signingReason",
        ACCESS_PERMISSION.P1
);

How to translate PDF files in Java (Tutorial)

Extracting text

Translating the text

Annotating text

Results

The JPedal PDF library allows you to solve these problems in Java

What is JPedal?

Why use JPedal?

What licenses are available?

How to use JPedal?

Manipulate PDF files in the JPedal Viewer

How to remove annotations from PDF files in Java…

Mastering Server-Side PDF Processing in Java

How to translate PDF files in Java (Tutorial)

Extracting text

Translating the text

Annotating text

Results

Related posts:

The JPedal PDF library allows you to solve these problems in Java

What is JPedal?

Why use JPedal?

What licenses are available?

How to use JPedal?

Manipulate PDF files in the JPedal Viewer

How to remove annotations from PDF files in Java…

Mastering Server-Side PDF Processing in Java