
A (really) quick guide to extracting and converting pdfs

Getting scans ready for zooniverse is easy with cpdf and sips!

I’ve previously posted about our zooniverse project "castaway", which is part of an IMLS funded initiative to integrate three major orphaned fish collections and their data into our collection.

Part of this project requires taking pdf scans of datasheets and splitting out specific pages that are then converted to png format for upload.

With thousands of records that contain multiple pages each, this can be an overwhelming task. I wanted to offer a quick guide for how we’ve been automating this procedure using available tools.

Extracting a pdf page from the command line


The first challenge is splitting the pdf pages. Fortunately, there is a free executable you can deploy for this task. Meet cpdf.

cpdf is short for ‘Coherent PDF Command Line Tools’ and offers a fantastic set of free tools available through github. For our purposes, splitting pdfs is done using a single line of code.

Assume we have a pdf called “DanMoore-0004.pdf”, which corresponds to multiple pages (in this case 2) of one of our Dan Moore vessel stations.

Example of a two page pdf that needs to have page 1 extracted.

Place this file into the folder where you have cpdf (or add cpdf to your PATH). Open terminal and navigate to this folder. Now enter the following command:

./cpdf DanMoore-0004.pdf 1  -o DanMoore-0004_page1.pdf

That’s it! The 1 indicates the page you want, and -o is followed by the name of the output file. You can easily turn this into a script to go through thousands of pdfs in seconds.

Using the above line, a single page from a pdf is readily extracted.
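As a rough sketch of what such a script could look like, the function below runs the same cpdf command on every pdf in the current folder. The function name, the "page1" output folder, and the CPDF and OUTDIR variables are my own illustrative choices, not part of cpdf; it assumes cpdf sits in the same folder as the pdfs.

```shell
# Extract page 1 from every pdf in the current folder using cpdf.
# Assumptions (illustrative, not from cpdf itself): cpdf lives in this
# folder unless CPDF points elsewhere, and results go into "page1".
CPDF="${CPDF:-./cpdf}"
OUTDIR="${OUTDIR:-page1}"

extract_first_pages() {
    mkdir -p "$OUTDIR"
    for f in *.pdf; do
        # If no pdf matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # Same command as above: input file, page number, -o output file.
        "$CPDF" "$f" 1 -o "$OUTDIR/${f%.pdf}_page1.pdf"
    done
}
```

Paste this into terminal (or save it in a file and source it), then run extract_first_pages from the folder that holds your pdfs. The quotes around "$f" keep file names with spaces from breaking the loop.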

Converting from pdf to png in terminal


Converting to png is also super easy on a mac and requires a single line of code. Move all your single page pdfs into a new folder and navigate there in terminal. Now enter:

for i in *.pdf; do sips -s format png "$i" --out "${i%.pdf}.png"; done


This takes the files in your folder matched by the wildcard (*) and uses the mac command line tool sips to convert each one to a png. This is incredibly useful when you have hundreds or thousands of pdfs.
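If you would rather keep the pngs separate from the pdfs and find out when a file did not convert, a slightly longer version might look like the sketch below. The function name, the "png" folder, and the SIPS and PNGDIR variables are hypothetical names I picked for illustration; the sips call itself is the same one used above.

```shell
# Convert every pdf in the current folder to png with sips (macOS).
# Assumptions (illustrative only): output goes into a "png" folder,
# and SIPS can be overridden if sips is not on your PATH.
SIPS="${SIPS:-sips}"
PNGDIR="${PNGDIR:-png}"

convert_to_png() {
    mkdir -p "$PNGDIR"
    failed=0
    for f in *.pdf; do
        # If no pdf matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        out="$PNGDIR/${f%.pdf}.png"
        "$SIPS" -s format png "$f" --out "$out" >/dev/null
        # Check for the output file rather than trusting the exit status.
        [ -f "$out" ] || { echo "could not convert: $f" >&2; failed=$((failed + 1)); }
    done
    echo "$failed file(s) failed"
    [ "$failed" -eq 0 ]
}
```

Run convert_to_png from the folder of single page pdfs. Each png keeps the name of its pdf, with the .pdf extension swapped for .png, which makes it easy to match uploads back to the original scans.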

From here, simply upload to zooniverse. I posted this for our lab and to share with anyone who might find it useful. I’m sure there are other ways to accomplish the same thing, but regardless, this certainly beats spending days doing this manually!
