This article contains several useful tricks for manipulating PDF files.
The focus of this article is on free and open-source software that is available for UNIX-like operating systems. These tools are meant to be used from the command line of a shell.
Adding password restrictions to a PDF file
PDF files can have two passwords:
- user password (Must be supplied to read a document.)
- owner password (Can restrict printing, editing, copying. Not necessary to read the document.)
Adding restrictions is done by “encrypting” the PDF with an owner password. Since this password is easily removed, you don’t need to remember it. So I tend to generate one automatically.
The following command uses the SHA-256 checksum of the original file as the owner password.
> qpdf --encrypt '' `sha256 -q unrestricted.pdf` 128 \
    --extract=n --modify=none --use-aes=y --cleartext-metadata -- \
    unrestricted.pdf restricted.pdf
As given, it prevents copying (--extract=n) and modification (--modify=none), but leaves the document metadata unencrypted. By default, printing is allowed. The user password is an empty string, leaving read access open.
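The same steps can be scripted. Below is a minimal Python sketch, assuming qpdf is installed and on the PATH. The function names are made up for illustration; the qpdf options are the ones from the command above.

```python
import hashlib
import subprocess


def owner_password(path):
    """Return the SHA-256 hex digest of a file, for use as a throwaway owner password."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large PDF files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def qpdf_encrypt_args(src, dst, owner_pw):
    """Build the qpdf argument list; the user password is an empty string."""
    return [
        "qpdf", "--encrypt", "", owner_pw, "128",
        "--extract=n", "--modify=none", "--use-aes=y", "--cleartext-metadata",
        "--", src, dst,
    ]


def restrict(src, dst):
    """Run qpdf on src and write the restricted copy to dst."""
    subprocess.run(qpdf_encrypt_args(src, dst, owner_password(src)), check=True)
```

Calling restrict('unrestricted.pdf', 'restricted.pdf') should then be equivalent to the shell command. Passing the arguments as a list avoids shell quoting problems with unusual file names.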
Running both through pdfinfo (from the poppler-utils package) shows the file restrictions. First the unrestricted file.
Contrast that with the output for the restricted file (trimmed for brevity).
> pdfinfo restricted.pdf
...
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
...
Note that this only protects your documents from laypeople, since qpdf can also remove such restrictions, as shown below.
If you need stronger access control, you should set the user password or use other kinds of encryption that would prevent people from reading the file without knowing the password.
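If you want to check restrictions from a script, the value of the Encrypted line in the pdfinfo output can be split into its parts. This is a small sketch that assumes the line has the layout shown above for the restricted file; other poppler versions may print it differently, and parse_encrypted is a name I made up.

```python
def parse_encrypted(value):
    """Parse the value of pdfinfo's "Encrypted:" line.

    Returns a tuple (encrypted, details), where details maps names like
    "print" or "copy" to the strings printed by pdfinfo.
    """
    status, _, rest = value.strip().partition(" ")
    details = {}
    # The remainder looks like "(print:yes copy:no ... algorithm:AES)".
    for item in rest.strip("() ").split():
        key, _, val = item.partition(":")
        details[key] = val
    return status == "yes", details


encrypted, details = parse_encrypted(
    "yes (print:yes copy:no change:no addNotes:no algorithm:AES)"
)
print(encrypted, details["copy"])
```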
Removing restrictions from a PDF file
If a document only has an owner password, you can use qpdf to remove it, without having to provide the owner password!
Note that this only works with one of the standard encryption handlers (RC4 and AES). If a document was encrypted with a custom encryption handler this will not work.
> qpdf --decrypt restricted.pdf unrestricted2.pdf
> pdfinfo unrestricted2.pdf
...
Encrypted: no
...
So an owner password is not a protection against knowledgeable people.
Changing the metadata in a PDF file
The exiftool program can be used to change the Info dictionary and XMP tags in a PDF file.
For example, I’ve seen an e-book application on an Android device use the “title” from the Info dictionary to label PDFs in the user interface. However, in some PDF files the title is either empty or bears no resemblance to the actual contents. In cases like this you really want to update the metadata.
> exiftool -Title='Alexit hardener 405-25' \
    -overwrite_original ALEXIT-Hardener_405-25_DE.pdf
    1 image files updated
Another way to change the metadata is to use pdfmarks with ghostscript. First, we create a file called pdfmarks. This contains the new document information.
[ /Title (New title)
  /Author (Author name)
  /Subject (New Subject)
  /Keywords (comma, separated, keywords)
  /ModDate (D:20170909130029)
  /CreationDate (D:20061204092842)
  /Creator (Name of the application used to create the original document)
  /Producer (GPL Ghostscript 9.16)
  /DOCINFO pdfmark
The dates can be formatted using the date command. To get the current date in PDF format, use the following command.
In Python you can generate these dates as follows.
In : from datetime import datetime as dt
In : dt.strftime(dt(2014, 8, 21, 14, 34, 0), '/CreationDate (D:%Y%m%d%H%M%S)')
Out: '/CreationDate (D:20140821143400)'
In : dt.strftime(dt.now(), '/ModDate (D:%Y%m%d%H%M%S)')
Out: '/ModDate (D:20170909130654)'
The ghostscript program is used to change the PDF file.
> gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf \
    input.pdf pdfmarks
This modifies both the Info dictionary and the XMP tags.
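The pdfmarks file described above can also be generated from a script. The sketch below is one possible way to do it and assumes plain ASCII values. The helper names are hypothetical; the only PostScript rule it relies on is that backslashes and parentheses inside a string literal are escaped with a backslash.

```python
def pdf_string(text):
    """Wrap text in parentheses as a PostScript string literal."""
    # Escape the backslash first, so the escapes added for the
    # parentheses are not escaped a second time.
    for ch in ("\\", "(", ")"):
        text = text.replace(ch, "\\" + ch)
    return "(" + text + ")"


def make_pdfmarks(info):
    """Render a dict of Info dictionary entries as a DOCINFO pdfmark."""
    entries = ["/{} {}".format(key, pdf_string(value)) for key, value in info.items()]
    return "[ " + "\n  ".join(entries + ["/DOCINFO pdfmark"]) + "\n"


marks = make_pdfmarks({"Title": "New title", "Author": "Author name"})
print(marks, end="")
```

Writing the result to a file called pdfmarks gives input for the gs command shown above.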
Overlaying text and images in a PDF file
This is such a substantial topic that it is located in a separate article.
Converting PDF to bitmap formats
Sometimes a PDF needs to be converted to bitmap format, e.g. for display on a webpage. (This is assuming that generating the same image in SVG format is not possible.)
The convert program used here is part of the ImageMagick suite of tools.
> convert -density 1200 -units PixelsPerInch \
    input.pdf \
    -scale 25% \
    output.png
The first option (which needs to come before the name of the input file) tells it to convert the image to a bitmap at 1200 pixels per inch (“PPI”). The standard resolution used by convert is only 72 PPI.
After the input file, -scale 25% is used to scale the image back. This reduces the effective resolution to 300 PPI, but averages the pixels, giving a less pixelated look.
> convert -density 1200 -units PixelsPerInch \
    input.pdf \
    -background white -flatten \
    -scale 25% \
    output.jpg
Here the -background white and -flatten options are needed to prevent a black background on some PDF files.
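The arithmetic behind the density and scale values is simple enough to check in a few lines of Python. This is only an illustration: the function names are mine, and the pixel sizes ImageMagick actually produces may differ by a pixel because of rounding.

```python
MM_PER_INCH = 25.4


def effective_ppi(density, scale_percent):
    """Resolution that remains after rendering at density and scaling down."""
    return density * scale_percent / 100


def pixel_size(width_mm, height_mm, ppi):
    """Approximate bitmap size in pixels for a page of the given size."""
    return (
        round(width_mm / MM_PER_INCH * ppi),
        round(height_mm / MM_PER_INCH * ppi),
    )


ppi = effective_ppi(1200, 25)       # the values from the commands above
print(ppi, pixel_size(210, 297, ppi))  # 210 x 297 mm is an A4 page
```

Keeping an eye on these numbers helps, because the intermediate image at 1200 PPI is sixteen times as many pixels as the final one.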
Creating a PDF from scanned pages
In the following it is assumed that the pages are scanned in A4 format and have their resolution embedded in the metadata.
> convert page*.jpg -adjoin intermediate.tiff
> tiff2pdf -j -o output.pdf intermediate.tiff
> rm -f intermediate.tiff
If the images do not contain resolution information, you have to specify it. In the example below, the image resolution is 150 PPI.
> convert -density 150 -units PixelsPerInch page*.jpg \
    -adjoin intermediate.tiff
> tiff2pdf -r 150 -j -o output.pdf intermediate.tiff
> rm -f intermediate.tiff
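When the resolution is not embedded, a quick sanity check is to see whether the pixel dimensions of a scan match A4 at the resolution you intend to specify. The sketch below does that calculation; the tolerance of 1 mm and the function names are arbitrary choices of mine.

```python
A4_MM = (210.0, 297.0)
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, ppi):
    """Physical size in millimeters of a bitmap at the given resolution."""
    return (width_px / ppi * MM_PER_INCH, height_px / ppi * MM_PER_INCH)


def is_a4(width_px, height_px, ppi, tolerance_mm=1.0):
    """Return True if the bitmap is a portrait A4 page within the tolerance."""
    width_mm, height_mm = page_size_mm(width_px, height_px, ppi)
    return (
        abs(width_mm - A4_MM[0]) <= tolerance_mm
        and abs(height_mm - A4_MM[1]) <= tolerance_mm
    )


# A 1240 x 1754 pixel image is an A4 page at 150 PPI, but not at 300 PPI.
print(is_a4(1240, 1754, 150), is_a4(1240, 1754, 300))
```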
Decompressing a PDF file
The stream objects in PDF files are often compressed to save space. It is easier to study PDF files when the streams are not compressed. Here is how to decompress them (thanks to the Hand-coded PDF tutorial).
> ps2pdf -dCompressPages=false input.pdf output.pdf
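To see why uncompressed streams are easier to study: compressed streams most commonly use the FlateDecode filter, which is zlib/deflate compression. The Python sketch below compresses and restores a made-up page content stream. It does not parse a real PDF file; it only shows what the filter does to the data.

```python
import zlib

# A hypothetical content stream: draw the text "Hello" twenty times.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 20

# This is what ends up between "stream" and "endstream" with /FlateDecode:
# binary data that is unreadable in a text editor.
compressed = zlib.compress(content)

# Decompressing yields the readable drawing operators again.
restored = zlib.decompress(compressed)

print(len(content), len(compressed), restored == content)
```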
Retrieving the info dict from a PDF file in Python
This code uses the pdfinfo program (from poppler-utils) internally for convenience.
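A minimal sketch of what such code could look like follows; it is not necessarily the original implementation. It assumes pdfinfo is on the PATH and prints one "Key: value" pair per line. Note that pdfinfo also reports things that are not in the Info dictionary, such as the number of pages. The function names are made up.

```python
import subprocess


def parse_pdfinfo(text):
    """Convert the text output of pdfinfo into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdfinfo(path):
    """Run pdfinfo on a file and return its output as a dict."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

With this, pdfinfo('input.pdf').get('Title') would return the title, or None if the document has none. Separating the parsing from the subprocess call makes the parser easy to test without a PDF file at hand.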
Cropping a PDF file
For this, I use pdfcrop (part of the TeXLive distribution) in combination with a viewer like mupdf. On the console I run the following command to do automatic cropping:
pdfcrop input.pdf output.pdf
Manual cropping is used to cut off parts of the image around the edges, like this:
pdfcrop --bbox "55 200 540 800" input.pdf output.pdf
The numbers are the coordinates of the bounding box in “<left> <bottom> <right> <top>” order. These are given in PostScript points, and the origin of the page coordinate system is at the bottom left of the page.
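Since the bounding box is given as corner coordinates, the size of the cropped page follows by subtraction. A small Python sketch, with function names of my own choosing, that also converts points (1/72 inch) to millimeters:

```python
def bbox_size(bbox):
    """Return (width, height) in points for a "left bottom right top" string."""
    left, bottom, right, top = (float(v) for v in bbox.split())
    return right - left, top - bottom


def points_to_mm(points):
    """Convert PostScript points (72 per inch) to millimeters."""
    return points / 72 * 25.4


width, height = bbox_size("55 200 540 800")  # the box from the example above
print(width, height, points_to_mm(width), points_to_mm(height))
```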