Roland's homepage

My random knot in the Web

PDF tricks

This article contains several useful tricks for manipulating PDF files.

The focus of this article is on free and open-source software that is available for UNIX-like operating systems. These tools are meant to be used from the command line of a shell.

Adding password restrictions to a PDF file

PDF files can have two passwords:

  • user password (Must be supplied to read a document.)
  • owner password (Can restrict printing, editing, copying. Not necessary to read the document.)

You can use qpdf (see also qpdf on GitHub) to add restrictions.

Adding restrictions is done by “encrypting” the PDF with an owner password. Since this password is easily removed, you don’t need to remember it, so I tend to generate one automatically.

The following command uses the SHA-256 checksum of the original file as the owner password.

> qpdf --encrypt '' `sha256 -q unrestricted.pdf` 128 \
--extract=n --modify=none --use-aes=y --cleartext-metadata -- \
unrestricted.pdf restricted.pdf

As given, it prevents copying (--extract=n) and modification (--modify=none), but leaves the document metadata unencrypted. By default, printing is allowed. The user password is an empty string, leaving read access open.
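
The same command line can be assembled from a script. What follows is a minimal Python sketch, not a finished tool: the helper names owner_password and encrypt_command are made up for illustration, and hashlib stands in for the sha256 utility. The qpdf options are the ones from the command above.

```python
import hashlib


def owner_password(path):
    """Return the SHA-256 hex digest of a file, for use as a throwaway owner password."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large PDF files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def encrypt_command(src, dst):
    """Build the qpdf argument list from the example above (hypothetical helper)."""
    return [
        "qpdf", "--encrypt", "", owner_password(src), "128",
        "--extract=n", "--modify=none", "--use-aes=y", "--cleartext-metadata",
        "--", src, dst,
    ]
```

Assuming qpdf is installed, the resulting list can be handed to subprocess.run(..., check=True). Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.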

Running both files through pdfinfo (from the poppler-utils package) shows the file restrictions. First, the unrestricted file.

> pdfinfo unrestricted.pdf
Subject:        ...
Keywords:       ...
Author:         ...
Creator:        ...
Producer:       ...
CreationDate:   Tue Mar  1 21:17:23 2016 CET
ModDate:        Tue Mar  1 21:17:23 2016 CET
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          2
Encrypted:      no
Page size:      841.89 x 595.276 pts (A4)
Page rot:       0
File size:      152342 bytes
Optimized:      no
PDF version:    1.7

Contrast that with the output for the restricted file (trimmed for brevity).

> pdfinfo restricted.pdf
...
Encrypted:      yes (print:yes copy:no change:no addNotes:no algorithm:AES)
...

Note that this only protects your documents from laypeople, since qpdf can also remove such restrictions, as shown below.

If you need stronger access control, you should set the user password or use other kinds of encryption that would prevent people from reading the file without knowing the password.
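
If you want to check the restrictions from a script, the value of the “Encrypted” line can be split into separate permissions. This is a small Python sketch that assumes the “yes (key:value ...)” layout shown in the pdfinfo output above; the function name parse_encrypted is made up for this example.

```python
import re


def parse_encrypted(value):
    """Split the value of pdfinfo's 'Encrypted' line into a dictionary.

    Assumes the 'yes (key:value ...)' layout shown above; an unencrypted
    file just reports 'no'.
    """
    result = {"encrypted": value.split()[0] == "yes"}
    # Pick up every key:value pair inside the parentheses, if there are any.
    for key, val in re.findall(r"(\w+):([^\s)]+)", value):
        # Map yes/no to booleans, keep anything else (like the algorithm) as text.
        result[key] = {"yes": True, "no": False}.get(val, val)
    return result


print(parse_encrypted("yes (print:yes copy:no change:no addNotes:no algorithm:AES)"))
```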

Removing restrictions from a PDF file

If a document only has an owner password, you can use qpdf to remove it, without having to provide the owner password!

Note that this only works with one of the standard encryption handlers (RC4 and AES). If a document was encrypted with a custom encryption handler this will not work.

> qpdf --decrypt restricted.pdf unrestricted2.pdf
> pdfinfo unrestricted2.pdf
...
Encrypted:      no
...

So an owner password is not a protection against knowledgeable people.

Changing the metadata in a PDF file

Using exiftool

The exiftool program can be used to change the Info dictionary and XMP tags in a PDF file.

For example, I’ve seen an e-book application on an Android device use the “title” from the Info dictionary to label PDFs in the user interface. However, in some PDF files the title is either empty or bears no resemblance to the actual contents. In cases like this you really want to update the metadata.

> exiftool -Title='Alexit hardener 405-25' \
-overwrite_original ALEXIT-Hardener_405-25_DE.pdf
    1 image files updated

Using ghostscript

This is done using the pdfmark functionality (thanks to Ghostscript PDF reference & tips).

First, we create a file called pdfmarks. This contains the new document information.

[ /Title (New title)
  /Author (Author name)
  /Subject (New Subject)
  /Keywords (comma, separated, keywords)
  /ModDate (D:20170909130029)
  /CreationDate (D:20061204092842)
  /Creator (Name of the application used to create the original document)
  /Producer (GPL Ghostscript 9.16)
  /DOCINFO pdfmark

The dates can be formatted using the date command. To get the current date in PDF format, use the following command.

date +D:%Y%m%d%H%M%S

In Python you can generate these dates as follows.

In [1]: from datetime import datetime as dt

In [2]: dt.strftime(dt(2014, 8, 21, 14, 34, 0), '/CreationDate (D:%Y%m%d%H%M%S)')
Out[2]: '/CreationDate (D:20140821143400)'

In [3]: dt.strftime(dt.now(), '/ModDate (D:%Y%m%d%H%M%S)')
Out[3]: '/ModDate (D:20170909130654)'
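
Building on this, the whole pdfmarks file can be generated. The following is a sketch only: the function names pdf_string and pdfmarks are made up for this example, and it assumes plain ASCII values (other characters need a different string encoding in PDF). Backslashes and parentheses are escaped because they have a special meaning inside a PDF literal string.

```python
from datetime import datetime


def pdf_string(text):
    """Escape backslashes and parentheses and wrap the text in parentheses."""
    # The backslash has to be handled first, or the added escapes get doubled.
    for ch in ("\\", "(", ")"):
        text = text.replace(ch, "\\" + ch)
    return "(" + text + ")"


def pdfmarks(info, moddate=None):
    """Return the contents of a pdfmarks file for the given Info entries."""
    moddate = moddate or datetime.now()
    lines = [f"/{key} {pdf_string(value)}" for key, value in info.items()]
    lines.append(moddate.strftime("/ModDate (D:%Y%m%d%H%M%S)"))
    lines.append("/DOCINFO pdfmark")
    # Same layout as the hand-written file: '[ ' up front, entries indented.
    return "[ " + "\n  ".join(lines) + "\n"


print(pdfmarks({"Title": "New title", "Author": "Author name"}))
```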

The ghostscript program is used to change the PDF file.

> gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf \
input.pdf pdfmarks

This modifies both the Info dictionary and the XMP tags.

This has been automated by the pdfsetinfo script.

Overlaying text and images in a PDF file

This is such a substantial topic that it is located in a separate article.

Converting PDF to bitmap formats

Sometimes a PDF needs to be converted to bitmap format, e.g. for display on a webpage. (This is assuming that generating the same image in SVG format is not possible.)

The convert program used below is part of the ImageMagick suite of tools.

to PNG

> convert -density 1200 -units PixelsPerInch \
input.pdf \
-scale 25% \
output.png

The first option (which needs to come before the name of the input file) tells it to convert the image to a bitmap at 1200 pixels per inch (“PPI”). The standard resolution used by convert is only 72 PPI.

After the input file, -scale 25% is used to scale the image back. This reduces the effective resolution to 300 PPI, but averages the pixels giving a less pixelated look.
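
To get a feel for the numbers, the arithmetic can be written out. This is a rough Python sketch (the function name bitmap_size is made up); it takes the page size in points as reported by pdfinfo and uses the fact that there are 72 points to the inch. The pixel counts that convert actually produces may differ by a pixel because of rounding.

```python
def bitmap_size(width_pt, height_pt, density=1200, scale=0.25):
    """Return (effective PPI, width in pixels, height in pixels).

    The page is rendered at 'density' PPI and then scaled by 'scale'.
    """
    ppi = density * scale
    return ppi, round(width_pt / 72 * ppi), round(height_pt / 72 * ppi)


# The landscape A4 page from the pdfinfo output earlier in this article.
print(bitmap_size(841.89, 595.276))
```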

to JPEG

> convert -density 1200 -units PixelsPerInch \
input.pdf \
-background white -flatten \
-scale 25% \
output.jpg

Here the -background white and -flatten options are needed to prevent a black background on some PDF files.

Creating a PDF from scanned pages

In the following it is assumed that the pages are scanned in A4 format and have their resolution embedded in the metadata.

> convert page*.jpg -adjoin intermediate.tiff
> tiff2pdf -j -o output.pdf intermediate.tiff
> rm -f intermediate.tiff

If the images do not contain resolution information, you have to specify it. In the example below, the image resolution is 150 PPI.

> convert -density 150 -units PixelsPerInch page*.jpg \
-adjoin intermediate.tiff
> tiff2pdf -r 150 -j -o output.pdf intermediate.tiff
> rm -f intermediate.tiff

The convert program is from the ImageMagick suite of tools, while tiff2pdf is part of libtiff.
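
One thing to watch out for: the shell expands page*.jpg in lexical order, so page10.jpg comes before page2.jpg unless the numbers are zero-padded. The Python sketch below works around that by sorting the names numerically before building the convert command. The helper names natural_key and convert_command are made up, and the file names are just an assumed naming scheme.

```python
import re


def natural_key(name):
    """Split a name into text and number parts, so that page2 sorts before page10."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def convert_command(pages, output="intermediate.tiff"):
    """Build the convert argument list with the pages in numeric order."""
    return ["convert", *sorted(pages, key=natural_key), "-adjoin", output]


print(convert_command(["page10.jpg", "page2.jpg", "page1.jpg"]))
```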

Decompressing a PDF file

The stream objects in PDF files are often compressed to save space. It is easier to study PDF files when the streams are not compressed. Here is how to decompress them (thanks to the Hand-coded PDF tutorial).

> ps2pdf -dCompressPages=false input.pdf output.pdf
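
To illustrate what compression does to a stream: streams are commonly compressed with the /FlateDecode filter, which is zlib (deflate) compression. The Python sketch below compresses and decompresses a tiny, made-up page content stream. It only demonstrates the principle and does not parse a real PDF file.

```python
import zlib

# A tiny page content stream: show the text "Hello" in font F1 at 12 points.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# Roughly what a stream with /Filter /FlateDecode holds: unreadable binary data.
compressed = zlib.compress(content)
print(compressed)

# Decompressing brings back the readable PDF operators.
print(zlib.decompress(compressed).decode("ascii"))
```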

Retrieving the info dict from a PDF file in Python

This code uses the pdfinfo program (from poppler-utils) internally for convenience.

from datetime import datetime as dt
import subprocess as sp
import re


def pdfinfo(path):
    """Retrieves the Info dictionary from a PDF file.

    The information is converted to a Python dictionary.
    The values are converted to a suitable format.

    Arguments:
        path: String that indicates the location of the PDF file.

    Returns:
        A Python dictionary containing the file's info.
    """
    # Extract the info from a PDF file.
    text = sp.check_output(['pdfinfo', path]).decode('utf-8')
    # Convert info to a dictionary.
    info = dict(re.findall(r'(.*)?:\s+(.*)?\s+', text, re.MULTILINE))
    # Convert dates to datetime objects.
    keys = info.keys()
    for key in keys & ('CreationDate', 'ModDate'):
        info[key] = dt.strptime(info[key], '%c %Z')
    # Convert suitable values to integers
    for key in keys & ('File size', 'Pages', 'Page rot'):
        info[key] = int(info[key].split()[0])
    # Convert suitable values to boolean
    for key in keys & ('Encrypted', 'JavaScript', 'Optimized', 'Suspects',
                    'Tagged'):
        info[key] = info[key].split()[0] in ("yes", "true", "t", "1")
    return info
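
To see what the parsing step does without having pdfinfo installed, the same pattern can be applied to a few lines of captured output. This is just a sketch; the sample text is copied from the pdfinfo output shown earlier in this article.

```python
import re

# A few lines of pdfinfo output, copied from the example earlier in this article.
SAMPLE = (
    "Pages:          2\n"
    "Encrypted:      no\n"
    "Page size:      841.89 x 595.276 pts (A4)\n"
    "File size:      152342 bytes\n"
)

# Same pattern as in pdfinfo(): key, colon, whitespace, value, line end.
info = dict(re.findall(r'(.*)?:\s+(.*)?\s+', SAMPLE))
# The same conversions, applied to the keys present in the sample.
info['Pages'] = int(info['Pages'].split()[0])
info['File size'] = int(info['File size'].split()[0])
info['Encrypted'] = info['Encrypted'].split()[0] in ("yes", "true", "t", "1")
print(info)
```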

Cropping a PDF file

For example, let’s say we want to extract from page 2 of a document the material with lower left coordinates (49,190) and upper right coordinates (556,810).

Using pdfcrop

First we extract the required page with ghostscript.

gs -q -DNOPAUSE -DBATCH -sDEVICE=pdfwrite -dFirstPage=2 -dLastPage=2 \
-sOutputFile=page2.pdf input.pdf

Next, I use pdfcrop (part of the TeX Live distribution) in combination with a viewer like mupdf. On the console I run the following command to do automatic cropping:

pdfcrop page2.pdf output.pdf

Manual cropping is used to cut off parts of the image around the edges, like this:

pdfcrop --bbox "49 190 556 810" page2.pdf output.pdf

The numbers are the coordinates of the bounding box in “<left> <bottom> <right> <top>” order. They are given in PostScript points, and the origin of the page coordinate system is at the bottom left of the page.
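
Since these numbers are corner coordinates and not a width and height, it can help to work out the size of the cropped area. Here is a small Python sketch (the function name bbox_size is made up) that also converts to millimetres, using the fact that a PostScript point is 1/72 inch.

```python
def bbox_size(bbox):
    """Return ((width, height) in points, (width, height) in mm) for a bbox string."""
    left, bottom, right, top = (float(v) for v in bbox.split())
    width, height = right - left, top - bottom
    mm = 25.4 / 72  # one PostScript point is 1/72 inch, one inch is 25.4 mm
    return (width, height), (round(width * mm, 1), round(height * mm, 1))


print(bbox_size("49 190 556 810"))
```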

The resulting file sizes are:

> du input.pdf page2.pdf output.pdf
108 input.pdf
60  page2.pdf
56  output.pdf

Finally, clean up intermediate files:

rm page2.pdf

Using ghostscript

First, we extract the required page with ghostscript and convert it to EPS.

gs -q -DNOPAUSE -DBATCH -sDEVICE=eps2write -dFirstPage=2 -dLastPage=2 \
-sOutputFile=page2.eps input.pdf

Next, we use sed to update the BoundingBox and (important) set the CropBox.

sed -e 's/^%%BoundingBox.*$/%%BoundingBox: 49 190 556 810/' \
-e 's/^%%HiResBoundingBox.*$/%%CropBox: 49 190 556 810/' page2.eps > page2-mod.eps
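
For those who prefer to do this step from Python, the two substitutions can be sketched with re.sub. The header below is a made-up example of the first lines of an EPS file; the values in a real file will differ.

```python
import re

# First lines of a hypothetical EPS header; real files will have other values.
header = (
    "%!PS-Adobe-3.0 EPSF-3.0\n"
    "%%BoundingBox: 0 0 596 842\n"
    "%%HiResBoundingBox: 0.00 0.00 595.28 841.89\n"
)

box = "49 190 556 810"
# The same two substitutions as the sed command; re.MULTILINE makes
# ^ and $ match at the start and end of every line.
header = re.sub(r"^%%BoundingBox.*$", f"%%BoundingBox: {box}",
                header, flags=re.MULTILINE)
header = re.sub(r"^%%HiResBoundingBox.*$", f"%%CropBox: {box}",
                header, flags=re.MULTILINE)
print(header)
```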

Finally, we convert the modified EPS file back to PDF.

gs -q -DNOPAUSE -DBATCH -dEPSCrop -sDEVICE=pdfwrite -sOutputFile=output.pdf page2-mod.eps

The resulting file sizes are:

> du input.pdf page2* output.pdf
108 input.pdf
384 page2-mod.eps
384 page2.eps
56  output.pdf

Lastly, clean up temporary files:

rm page2*

For comments, please send me an e-mail.

