PDF tricks
This article contains several useful tricks for manipulating PDF files.
The focus of this article is on free software that is available for UNIX-like operating systems. These tools are meant to be used from the command line of a shell.
Adding password restrictions to a PDF file
PDF files can have two passwords:
- user password (Must be supplied to read a document.)
- owner password (Can restrict printing, editing, copying. Not necessary to read the document.)
You can use qpdf (see also qpdf on GitHub) to add restrictions.
Adding restrictions is done by “encrypting” the PDF with an owner password. Since this password is easily removed, you don’t need to remember it, so I tend to generate one automatically.
The following command uses the SHA-256 checksum of the original file as the owner password.
> qpdf --encrypt '' `sha256 -q unrestricted.pdf` 128 \
--extract=n --modify=none --use-aes=y --cleartext-metadata -- \
unrestricted.pdf restricted.pdf
As given, it prevents copying (--extract=n) and modification (--modify=none), but leaves the document metadata unencrypted. By default, printing is allowed. The user password is an empty string, leaving read access open.
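The sha256 utility with its -q option is found on BSD systems; GNU systems ship sha256sum instead. As a minimal sketch of a portable alternative, the following Python builds the same qpdf argument list, computing the checksum with hashlib. The function name restrict_command is made up for this illustration; the qpdf options are the ones from the command above.

```python
import hashlib


def restrict_command(src, dst):
    """Return the qpdf argument list that restricts src, writing dst."""
    # Use the SHA-256 checksum of the original file as the owner password.
    with open(src, 'rb') as f:
        owner_pw = hashlib.sha256(f.read()).hexdigest()
    # The empty string is the user password, so reading stays unrestricted.
    return [
        'qpdf', '--encrypt', '', owner_pw, '128',
        '--extract=n', '--modify=none', '--use-aes=y',
        '--cleartext-metadata', '--', src, dst,
    ]
```

The returned list can be passed to subprocess.run(..., check=True) to actually run qpdf.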
Running both through pdfinfo (from the poppler-utils package) shows the file restrictions. First the unrestricted file.
> pdfinfo unrestricted.pdf
Subject: ...
Keywords: ...
Author: ...
Creator: ...
Producer: ...
CreationDate: Tue Mar 1 21:17:23 2016 CET
ModDate: Tue Mar 1 21:17:23 2016 CET
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 2
Encrypted: no
Page size: 841.89 x 595.276 pts (A4)
Page rot: 0
File size: 152342 bytes
Optimized: no
PDF version: 1.7
Contrast that with the output for the restricted file (trimmed for brevity).
> pdfinfo restricted.pdf
...
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
...
Note that this only protects your documents from laypeople, since qpdf can also remove such restrictions, as shown below.
If you need stronger access control, you should set the user password or use other kinds of encryption that would prevent people from reading the file without knowing the password.
Removing restrictions from a PDF file
If a document only has an owner password, you can use qpdf to remove it, without having to provide the owner password!
Note that this only works with one of the standard encryption handlers (RC4 and AES). If a document was encrypted with a custom encryption handler this will not work.
> qpdf --decrypt restricted.pdf unrestricted2.pdf
> pdfinfo unrestricted2.pdf
...
Encrypted: no
...
So an owner password is not a protection against knowledgeable people.
Changing the metadata in a PDF file
Using exiftool
The exiftool program can be used to change the Info dictionary and XMP tags in a PDF file.
For example, I’ve seen an e-book application on an Android device use the “title” from the Info dictionary to label PDFs in the user interface. However, in some PDF files the title is either empty or bears no resemblance to the actual contents. In cases like this you really want to update the metadata.
> exiftool -Title='Alexit hardener 405-25' \
-overwrite_original ALEXIT-Hardener_405-25_DE.pdf
1 image files updated
Using ghostscript
This is done using the pdfmarks functionality (thanks to Ghostscript PDF reference & tips).
First, we create a file called pdfmarks. This contains the new document information.
[ /Title (New title)
/Author (Author name)
/Subject (New Subject)
/Keywords (comma, separated, keywords)
/ModDate (D:20170909130029)
/CreationDate (D:20061204092842)
/Creator (Name of the application used to create the original document)
/Producer (GPL Ghostscript 9.16)
/DOCINFO pdfmark
The dates can be formatted using the date command. To get the current date in PDF format, use the following command.
date +D:%Y%m%d%H%M%S
In Python you can generate these dates as follows.
In [1]: from datetime import datetime as dt
In [2]: dt.strftime(dt(2014, 8, 21, 14, 34, 0), '/CreationDate (D:%Y%m%d%H%M%S)')
Out[2]: '/CreationDate (D:20140821143400)'
In [3]: dt.strftime(dt.now(), '/ModDate (D:%Y%m%d%H%M%S)')
Out[3]: '/ModDate (D:20170909130654)'
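Going the other way is occasionally useful as well, for instance when checking the dates in an existing pdfmarks file. The following is a minimal sketch; the function name parse_pdf_date is an assumption for this example, and a time zone suffix after the seconds is simply ignored.

```python
from datetime import datetime


def parse_pdf_date(text):
    """Convert a PDF date such as 'D:20170909130029' to a datetime."""
    # Drop the optional 'D:' prefix.
    if text.startswith('D:'):
        text = text[2:]
    # Only the 14 digits YYYYMMDDHHMMSS are used; anything after them
    # (a time zone offset) is ignored in this sketch.
    return datetime.strptime(text[:14], '%Y%m%d%H%M%S')
```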
The ghostscript program is used to change the PDF file.
> gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf \
input.pdf pdfmarks
This modifies both the Info dictionary and the XMP tags.
This has been automated by the pdfsetinfo script.
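The pdfmarks file itself can also be generated from a program. The following is a minimal sketch and not the pdfsetinfo script mentioned above; the function name pdfmarks_text is an assumption made for this example, and it makes no attempt to encode non-ASCII text.

```python
def pdfmarks_text(info):
    """Return the contents of a pdfmarks file for a dict of Info entries."""
    lines = []
    for key, value in info.items():
        # Escape backslashes first, then parentheses, so that the value
        # is always a valid PostScript string.
        for ch in ('\\', '(', ')'):
            value = value.replace(ch, '\\' + ch)
        lines.append(f'/{key} ({value})')
    return '[ ' + '\n  '.join(lines) + '\n  /DOCINFO pdfmark\n'
```

Writing the result of pdfmarks_text({'Title': 'New title', 'Author': 'Author name'}) to a file called pdfmarks gives a file in the layout shown earlier.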
Overlaying text and images in a PDF file
This is such a substantial topic that it is located in a separate article.
Converting PDF to bitmap formats
Sometimes a PDF needs to be converted to bitmap format, e.g. for display on a webpage. (This is assuming that generating the same image in SVG format is not possible.)
The programs are from the ImageMagick suite of tools.
to PNG
> convert -density 1200 -units PixelsPerInch \
input.pdf \
-scale 25% \
output.png
The first option (which needs to come before the name of the input file) tells it to convert the image to a bitmap at 1200 pixels per inch (“PPI”). The standard resolution used by convert is only 72 PPI.
After the input file, -scale 25% is used to scale the image back. This reduces the effective resolution to 300 PPI, but averages the pixels giving a less pixelated look.
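The arithmetic behind these numbers is simple enough to check. As a rough sketch (the function name bitmap_size is hypothetical, and ImageMagick's own rounding may differ by a pixel), the size of the resulting bitmap follows from the page size in points, the density and the scale factor:

```python
def bitmap_size(width_pts, height_pts, density=1200, scale=0.25):
    """Return the approximate (width, height) in pixels of the bitmap."""
    # Rendering at `density` PPI and then scaling gives this effective PPI.
    ppi = density * scale
    # There are 72 PostScript points in an inch.
    return round(width_pts / 72 * ppi), round(height_pts / 72 * ppi)
```

For the landscape A4 page from the pdfinfo output earlier (841.89 x 595.276 pts), this gives 3508 by 2480 pixels, which is A4 at 300 PPI.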
to JPEG
> convert -density 1200 -units PixelsPerInch \
input.pdf \
-background white -flatten \
-scale 25% \
output.jpg
Here the -background white and -flatten options are needed to prevent a black background on some PDF files.
Creating a PDF from scanned pages
In the following it is assumed that the pages were scanned at A4 size and have their resolution embedded in the metadata.
> convert page*.jpg -adjoin intermediate.tiff
> tiff2pdf -j -o output.pdf intermediate.tiff
> rm -f intermediate.tiff
If the images do not contain resolution information, you have to specify it. In the example below, the image resolution is 150 PPI.
> convert -density 150 -units PixelsPerInch page*.jpg \
-adjoin intermediate.tiff
> tiff2pdf -r 150 -j -o output.pdf intermediate.tiff
> rm -f intermediate.tiff
The convert program is from the ImageMagick suite of tools, while tiff2pdf is part of libtiff.
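One thing to watch with the page*.jpg pattern is that the shell expands it in lexicographic order, so page10.jpg comes before page2.jpg unless the numbers are zero-padded. A minimal sketch of a sort key that orders the pages by number instead (the name natural_key is an assumption for this example):

```python
import re


def natural_key(name):
    """Sort key that orders embedded numbers by value, not by character."""
    # Split 'page10.jpg' into ['page', '10', '.jpg'] and turn the digit
    # runs into integers so that 2 sorts before 10.
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', name)]
```

With this, sorted(glob.glob('page*.jpg'), key=natural_key) yields the file names in page order, and that list can be placed between convert and -adjoin when the command is run through subprocess.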
Decompressing a PDF file
The stream objects in PDF files are often compressed to save space. It is easier to study PDF files when the streams are not compressed. Here is how to decompress them (thanks to the Hand-coded PDF tutorial).
> ps2pdf -dCompressPages=false input.pdf output.pdf
Retrieving the info dict from a PDF file in Python
This code uses the pdfinfo program (from poppler-utils) internally for convenience.
from datetime import datetime as dt
import subprocess as sp
import re


def pdfinfo(path):
    """Retrieves the Info dictionary from a PDF file.

    The information is converted to a Python dictionary.
    The values are converted to a suitable format.

    Arguments:
        path: String that indicates the location of the PDF file.

    Returns:
        A Python dictionary containing the file's info.
    """
    # Extract the info from a PDF file.
    text = sp.check_output(['pdfinfo', path]).decode('utf-8')
    # Convert info to a dictionary. Each line is split at its first colon,
    # so that an empty value cannot swallow the following line.
    info = dict(re.findall(r'^([^:\n]+):[ \t]*(.*?)[ \t]*$', text, re.MULTILINE))
    # Convert dates to datetime objects.
    # Note that parsing with '%c %Z' depends on the current locale.
    keys = info.keys()
    for key in keys & ('CreationDate', 'ModDate'):
        info[key] = dt.strptime(info[key], '%c %Z')
    # Convert suitable values to integers.
    for key in keys & ('File size', 'Pages', 'Page rot'):
        info[key] = int(info[key].split()[0])
    # Convert suitable values to booleans.
    for key in keys & ('Encrypted', 'JavaScript', 'Optimized', 'Suspects',
                       'Tagged'):
        info[key] = info[key].split()[0] in ("yes", "true", "t", "1")
    return info
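The function above reduces the Encrypted entry to a single boolean, which drops the permission flags that pdfinfo prints for restricted files. If those are needed, they can be parsed from the raw text of that entry. A minimal sketch, with the hypothetical name permissions:

```python
import re


def permissions(encrypted):
    """Parse the value of pdfinfo's 'Encrypted' line into a dictionary."""
    # Collect the name:value pairs between the parentheses.
    flags = dict(re.findall(r'(\w+):(\w+)', encrypted))
    # Turn yes/no into booleans; keep other values (the algorithm) as text.
    return {k: {'yes': True, 'no': False}.get(v, v) for k, v in flags.items()}
```

For an unencrypted file the value is just "no", and the result is an empty dictionary.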
Cropping a PDF file
For example, let’s say we want to extract from page 2 of a document the material with lower left coordinates (49,190) and upper right coordinates (556,810).
Using pdfcrop
First we extract the required page with ghostscript.
gs -q -DNOPAUSE -DBATCH -sDEVICE=pdfwrite -dFirstPage=2 -dLastPage=2 \
-sOutputFile=page2.pdf input.pdf
Here I use pdfcrop (part of the TeX Live distribution) in combination with a viewer like mupdf. On the console I run the following command to do automatic cropping:
pdfcrop page2.pdf output.pdf
Manual cropping is used to cut off parts of the image around the edges, like this:
pdfcrop --bbox "49 190 556 810" page2.pdf output.pdf
The numbers are the coordinates of the bounding box in “<left> <bottom> <right> <top>” order. They are given in PostScript points, and the origin of the page coordinate system is at the bottom left of the page.
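Since it is easy to mix up the order of the four numbers, a quick check of the box before running pdfcrop can help. The following is only a sketch; the names bbox_size and points_to_mm are assumptions for this example.

```python
def bbox_size(left, bottom, right, top):
    """Return the (width, height) of a bounding box in points."""
    if right <= left or top <= bottom:
        raise ValueError('bounding box is empty or inverted')
    return right - left, top - bottom


def points_to_mm(points):
    """Convert PostScript points (1/72 inch) to millimeters."""
    return points * 25.4 / 72
```

For the box used in the pdfcrop command, bbox_size(49, 190, 556, 810) gives 507 by 620 points, or roughly 179 by 219 mm.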
The resulting file sizes are:
> du input.pdf page2.pdf output.pdf
108 input.pdf
60 page2.pdf
56 output.pdf
Finally, clean up intermediate files:
rm page2.pdf
Using ghostscript
First, we extract the required page with ghostscript and convert it to EPS.
gs -q -DNOPAUSE -DBATCH -sDEVICE=eps2write -dFirstPage=2 -dLastPage=2 \
-sOutputFile=page2.eps input.pdf
Next, we use sed to update the BoundingBox and (important) set the CropBox.
sed -e 's/^%%BoundingBox.*$/%%BoundingBox: 49 190 556 810/' \
-e 's/^%%HiResBoundingBox.*$/%%CropBox: 49 190 556 810/' page2.eps > page2-mod.eps
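The same two substitutions can be made in Python when the cropping is part of a larger script. This is a minimal sketch that mirrors the sed expressions; the function name set_boxes is an assumption, and it operates on the EPS contents held in a string.

```python
import re


def set_boxes(eps_text, bbox):
    """Mirror the two sed substitutions for EPS contents in a string."""
    box = ' '.join(str(n) for n in bbox)
    # Replace the whole %%BoundingBox line.
    eps_text = re.sub(r'^%%BoundingBox.*$', f'%%BoundingBox: {box}',
                      eps_text, flags=re.MULTILINE)
    # Turn the %%HiResBoundingBox line into a %%CropBox line.
    return re.sub(r'^%%HiResBoundingBox.*$', f'%%CropBox: {box}',
                  eps_text, flags=re.MULTILINE)
```

Because EPS files can contain binary data, reading and writing the file with encoding='latin-1' is one way to make sure every byte round-trips unchanged.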
Finally, we convert the modified EPS file back to PDF.
gs -q -DNOPAUSE -DBATCH -dEPSCrop -sDEVICE=pdfwrite -sOutputFile=output.pdf page2-mod.eps
The resulting file sizes are:
> du input.pdf page2* output.pdf
108 input.pdf
384 page2-mod.eps
384 page2.eps
56 output.pdf
Lastly, clean up temporary files:
rm page2*