Customer question - Extracting overlapping clipped images from a PDF file

We had an interesting inquiry about extracting clipped images from a PDF file this week. We already have an example to do this, but the twist here was that overlapping images should be merged together. So I sat down to figure out how to do this…

We have the images and their location, so we can easily see which overlap. In theory we just see which images overlap and merge them. However, there are 2 challenges to this problem.

Firstly, as we merge the images, the combined size of the images changes. Once you have merged 2 images, you may find that they jointly overlap an image (which separately they did not). So you need to keep repeating the process of looking for overlaps from the start until no more matches occur. Luckily, you can do this comparison with some simple maths to compare locations.

The second problem is that the images need to be merged in the order they appear, so that overlapping images appear as they would in the PDF. This means that you cannot do the merging until the end – otherwise the order may be wrong if you merge matches as you find them.

So we have added some code into our clipped image extraction example to extract merged images. There are also some little tweaks (like an option to only extract merged Images). The code will be in the next monthly update (or there is a sneak preview below) to adapt further to alter exactly how the merge works and with which images.

Here is some sample output

/**
 * ===========================================
 * Java Pdf Extraction Decoding Access Library
 * ===========================================
 *
 * Project Info:  http://www.jpedal.org
 * (C) Copyright 1997-2012, IDRsolutions and Contributors.
 *
 * This file is part of JPedal
 *
 @LICENSE@
  *
  * ---------------
  * ExtractClippedImages.java
  * ---------------
 */
package org.jpedal.examples.images;

import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.ResourceBundle;

import com.sun.imageio.plugins.jpeg.JPEGImageWriter;
import org.jpedal.PdfDecoder;
import org.jpedal.io.JAIHelper;
import org.jpedal.objects.PdfImageData;
import org.jpedal.utils.LogWriter;
import org.jpedal.utils.Messages;
import org.w3c.dom.Element;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.plugins.jpeg.JPEGImageWriteParam;
import javax.imageio.stream.ImageOutputStream;

/**
 * Sample code providing a workflow which extracts clipped images and places versions
 * scaled to specific heights
 *
 * It is run using the format
 *
 * java -cp libraries_needed org/jpedal/examples/ ExtractClippedImages $inputDir $processedDir $logFile h1 dir1 h2 dir2 ... hn dirn 
 *
 * Values with SPACES but be surrounded by "" as in "This is one value"
 * The values passed are
 *
 * $inputDir - directory containing files
 * $processedDIr - directory to put files in
 * $log - path and name of logfile
 *
 * Any number of h - height required in pixels as an integer for output (-1 means keep current size) dir1 - directory to write out images
 *
 * So to create 3 versions of the image (one at original size, one at 100 and one at 50 pixels high), you would use
 *
 * java -cp libraries_needed org/jpedal/examples/ ExtractScalesImages /export/files/ /export/processedFiles/ /logs/image.log -1 /output/raw/ 100 /output/medium/ 50 /output/thumbnail/
 *
 * Note image quality depends on the raw image in the original.
 *
 * This can be VERY memory intensive 
 *
 */public class ExtractClippedImages
{

    /**flag to show if we print messages*/    public static boolean outputMessages = true;

    /**directory to place files once decoded*/    private static String processed_dir="processed";

    /**used for regression tests by IDR solutions*/    public static boolean testing=false;

    /**rootDir containing files*/    private static String inputDir="";

    /**number of output directories*/    private static int outputCount;

    /**sizes to output at -1 means unchanged*/    private static float[] outputSizes;

    /**target directories for files*/    private static String[] outputDirectories;

    /**the decoder object which decodes the pdf and returns a data object*/    PdfDecoder decode_pdf = null;

    /**correct separator for OS */    static final private String separator = System.getProperty( "file.separator" );

    /**location output files written to*/    private String output_dir="clippedImages";

    /**type of image to save*/    private String imageType = "tiff";

    /**background colour to add to JPEG*/    private Color backgroundColor=Color.WHITE;

    /**test method to extract the images from a directory*/    public ExtractClippedImages( String rootDir,String outDir )
    {
        /**
         * setup messages
         */        try{
            Messages.setBundle(ResourceBundle.getBundle("org.jpedal.international.messages"));
        }catch(Exception e){
            e.printStackTrace();
            System.out.println("Exception loading resource bundle");
        }

        /**
         * setup tests
         */        /**read output values*/        String[] args={"500","high"};
        outputCount=(args.length)/2;

        /**read and create output directories*/        outputSizes=new float[outputCount];
        outputDirectories=new String[outputCount];
        for(int i=0;i 0 )
                        LogWriter.writeLog("page"+ ' ' +page+"contains "+image_count+ " images");
                    else
                        LogWriter.writeLog("No bitmapped images on page "+page);

                    LogWriter.writeLog("Writing out images");

                    //location of images
                    float[] x1=new float[image_count],y1=new float[image_count],w=new float[image_count],h=new float[image_count];

                    //used to merge images
                    float[] rawX1=new float[image_count],rawY1=new float[image_count],rawH=new float[image_count];

                    String[] image_name=new String[image_count];
                    BufferedImage[] image=new BufferedImage[image_count];

                    boolean[] isMerged=new boolean[image_count];

                    //work through and get each image details
                    for( int i = 0;i < image_count;i++ ){

                        image_name[i] =pdf_images.getImageName( i );

                        //we need some duplicates as we update some values on merge but still need originals at end
                        //so easiest just to store
                        x1[i]=pdf_images.getImageXCoord(i);
                        rawX1[i]=pdf_images.getImageXCoord(i);
                        y1[i]=pdf_images.getImageYCoord(i);
                        rawY1[i]=pdf_images.getImageYCoord(i);
                        w[i]=pdf_images.getImageWidth(i);
                        h[i]=pdf_images.getImageHeight(i);
                        rawH[i]=pdf_images.getImageHeight(i);

                        image[i] =decode_pdf.getObjectStore().loadStoredImage(  "CLIP_"+image_name[i] );

                    }

                    //merge overlapping images
                    boolean mergeImages=true;
                    if(mergeImages){
                        boolean imagesMerged=true;
                        boolean[] isUsed=new boolean[image_count];
                        ArrayList[] imagesUsed=new ArrayList[image_count];

                        //get list of images to merge for each block
                        //we repeat from start each time as areas grow
                        while(imagesMerged){

                            //if no overlaps found we will exit, otherwise repeat from start as sizes changed
                            imagesMerged=false;

                            //for each image
                            for( int i = 0;i < image_count;i++ ){

                                //compare against all others
                                for( int i2 = 0;i2 < image_count;i2++ ){

                                    if(i==i2)
                                        continue;

                                    /**
                                     * look for overlap
                                     */                                    if(!isUsed[i] && !isUsed[i2] && image[i]!=null && image[i2]!=null && x1[i]>=x1[i2]&& x1[i]<=(x1[i2]+w[i2]) && y1[i]>=y1[i2]&& y1[i]<=(y1[i2]+h[i2])){

                                        //work out the new combined size
                                        float newX=x1[i2];
                                        float newY=y1[i2];
                                        float newX2=x1[i]+w[i];
                                        float altNewX2=x1[i2]+w[i2];
                                        if(newX2(image_count);

                                            imagesUsed[i2].add(i2);

                                        }

                                        imagesUsed[i2].add(i);

                                        isUsed[i]=true;

                                        //merge any items attached to this image
                                        if(imagesUsed[i]!=null){
                                            for(Object currentImage:imagesUsed[i].toArray()){
                                                imagesUsed[i2].add(currentImage);
                                            }

                                            isMerged[i]=false;
                                            imagesUsed[i]=null;

                                        }

                                        //restart
                                        imagesMerged=true;
                                        i=image_count;
                                        i2=image_count;
                                    }
                                }
                            }
                        }

                        //now put together in correct order
                        for(int i=0;i\n", image_name[i], image[i], i);
                            //}
                        }
                    }

                    //flush images in case we do more than 1 page so only contains
                    //images from current page
                    decode_pdf.flushObjectValues(true);

                }
            }catch( Exception e ){
                decode_pdf.closePdfFile();
                LogWriter.writeLog( "Exception " + e.getMessage() );

                e.printStackTrace();
            }
        }

        /**close the pdf file*/        decode_pdf.closePdfFile();

    }

    private void generateVersions(String file_name, int page, String s, String image_name, BufferedImage bufferedImage, int i) {

        for(int versions=0;versions0){

                    scaling=outputSizes[versions]/newHeight;

                    if(scaling>1){
                        scaling=1;
                    }else{

                        Image scaledImage= image_to_save.getScaledInstance(-1,(int)outputSizes[versions],BufferedImage.SCALE_SMOOTH);

                        image_to_save = new BufferedImage(scaledImage.getWidth(null),scaledImage.getHeight(null) , BufferedImage.TYPE_INT_ARGB);

                        Graphics2D g2 =image_to_save.createGraphics();

                        g2.drawImage(scaledImage, 0, 0,null);
                        //ImageIO.write((RenderedImage) scaledImage,"PNG",new File(outputName));


                    }
                }

                String tiffFlag=System.getProperty("org.jpedal.compress_tiff");
                String jpgFlag=System.getProperty("org.jpedal.jpeg_dpi");
                boolean compressTiffs = tiffFlag!=null;

                //if(compressTiffs)
                JAIHelper.confirmJAIOnClasspath();

                //no transparency on JPEG so give background and draw on
                if(imageType.startsWith("jp")){

                    int iw=image_to_save.getWidth();
                    int ih=image_to_save.getHeight();
                    BufferedImage background=new BufferedImage(iw,ih, BufferedImage.TYPE_INT_RGB);

                    Graphics2D g2=(Graphics2D)background.getGraphics();
                    g2.setPaint(backgroundColor);
                    g2.fillRect(0,0,iw,ih);

                    g2.drawImage(image_to_save,0,0,null);
                    image_to_save= background;

                }

                if(testing){ //used in regression tests
                    decode_pdf.getObjectStore().saveStoredImage( outputName, image_to_save, true, false, imageType );
                }else if(JAIHelper.isJAIused() && imageType.startsWith("tif")){

                    LogWriter.writeLog("Saving image with JAI " + outputName + '.' + imageType);

                    com.sun.media.jai.codec.TIFFEncodeParam params = null;

                    if(compressTiffs){
                        params = new com.sun.media.jai.codec.TIFFEncodeParam();
                        params.setCompression(com.sun.media.jai.codec.TIFFEncodeParam.COMPRESSION_DEFLATE);
                    }

                    FileOutputStream os = new FileOutputStream(outputName+".tif");

                    javax.media.jai.JAI.create("encode", image_to_save, os, "TIFF", params);

                    os.flush();
                    os.close();

                } else if (jpgFlag != null && imageType.startsWith("jp") && JAIHelper.isJAIused()) {

                    saveAsJPEG(jpgFlag, image_to_save, 1, new FileOutputStream(output_dir + page + image_name + '.' + imageType));

                } else{

                    //save image
                    LogWriter.writeLog("Saving image "+outputName+ '.'+imageType);
                    BufferedOutputStream bos= new BufferedOutputStream(new FileOutputStream(new File(outputName+ '.'+imageType)));
                    ImageIO.write(image_to_save, imageType, bos);
                    bos.flush();
                    bos.close();
                    //decode_pdf.getObjectStore().saveStoredImage( outputName, image_to_save, true, false, imageType );
                }
                //save an xml file with details
                /**
                 * output the data
                 */                //LogWriter.writeLog( "Writing out "+(outputName + ".xml"));
                OutputStreamWriter output_stream = new OutputStreamWriter(new FileOutputStream(outputName + ".xml"),"UTF-8");

                output_stream.write(
                        "\n");
                output_stream.write(
                        "\n");
                output_stream.write("\n\n\n");
                output_stream.write(s);
                output_stream.write(""+file_name+"\n");
                output_stream.write(""+newHeight+"\n");
                output_stream.write(""+image_to_save.getHeight()+"\n");
                output_stream.write(""+scaling+"\n");
                output_stream.write("\n");
                output_stream.close();
            }catch( Exception ee ){
                LogWriter.writeLog( "Exception " + ee + " in extracting images" );
            }
        }
    }

    /**
     * @return Returns the output_dir.
     */    public String getOutputDir() {
        return output_dir;
    }

    /**
     * main routine which checks for any files passed and runs the demo
     */    public static void main( String[] args )
    {

        long start=System.currentTimeMillis();

        Messages.setBundle(ResourceBundle.getBundle("org.jpedal.international.messages"));

        if(outputMessages)
            System.out.println( "Simple demo to extract images from a page at various heights" );

        /**exit and report if wrong number of values*/        if((args.length >= 5) && ((args.length % 2) == 1)) {

            LogWriter.writeLog("Values read");
            LogWriter.writeLog("inputDir="+inputDir);
            LogWriter.writeLog("processedDir="+processed_dir);
            LogWriter.writeLog("logFile="+LogWriter.log_name);
            LogWriter.writeLog("Directory and height pair values");

            /**count output values*/            outputCount = (args.length-3) / 2;

            for(int i=0; i=0 && JPEGcompression<=1f){

            //old compression
            //jpegEncodeParam.setQuality(JPEGcompression,false);

            // new Compression
            JPEGImageWriteParam jpegParams = (JPEGImageWriteParam) imageWriter.getDefaultWriteParam();
            jpegParams.setCompressionMode(JPEGImageWriteParam.MODE_EXPLICIT);
            jpegParams.setCompressionQuality(JPEGcompression);

        }

        //old write and clean
        //jpegEncoder.encode(image_to_save, jpegEncodeParam);

        //new Write and clean up
        imageWriter.write(imageMetaData, new IIOImage(image_to_save, null, null), null);
        ios.close();
        imageWriter.dispose();

    }
}

Our software libraries allow you to

Convert PDF to HTML in Java

Convert PDF Forms to HTML5 in Java

Convert PDF Documents to an image in Java

Work with PDF Documents in Java

Read and Write AVIF, HEIC, WEBP and other image formats

Are you a Java Developer working with PDF files?

Customer question – Extracting overlapping clipped images from a PDF file

Our software libraries allow you to

What is the AVIF image format?

Comparing WebP compression algorithms

How to read PDF files in Java?

Viewing Products

SDK Products

Are you a Java Developer working with PDF files?

Customer question – Extracting overlapping clipped images from a PDF file

Our software libraries allow you to

What is the AVIF image format?

Comparing WebP compression algorithms

How to read PDF files in Java?