PDF Tricks for the Linux Command-Line


There is a myriad of PDF utilities, each with its own idiosyncrasies. This article brings several of them together and gives you the exact syntax you need to get going without delay.

In the January 2023 edition of OSFY, I wrote an article on how to programmatically create HTML, ODT, DOCX and PDF files. In this article, I focus on PDF so that you can use it as a barebones ready-reference for your oft-required PDF-processing tasks.

Here are some PDF tricks:

Creating PDFs from scratch

Old Markdown is passé. CommonMark is the new standard, or rather, it is standardised Markdown. If you prefer the original Markdown, then you have to use its slow Perl script (markdown.pl). If you can use CommonMark, then get the cmark executable from my website. I built it from its C source code, and it is blisteringly fast.

# Convert Markdown to HTML
perl markdown.pl jokebook.md > content.htm
# or
cmark --unsafe --validate-utf8 \
  jokebook.md > content.htm
# Place the converted HTML in an HTML template
echo '<!DOCTYPE html><html><head><title>2020 Jokebook</title></head><body>' > jokebook.htm
cat content.htm >> jokebook.htm
echo '</body></html>' >> jokebook.htm
# Embed images in the HTML
libreoffice \
  --convert-to "html:HTML:EmbedImages" \
  jokebook.htm
# creates jokebook.html with self-contained
# (base64-encoded) images
# Convert HTML to ODF
libreoffice --convert-to "odt" jokebook.html
# creates jokebook.odt
# Convert ODF to PDF
libreoffice --convert-to "pdf" jokebook.odt
# creates jokebook.pdf
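
The template step above can also be written as a small shell function with a here-document, which keeps the opening and closing tags in one place. This is only a sketch: the function name wrap_html and its two arguments are made up for illustration, and the title is inserted as is, without HTML escaping.

```shell
# wrap_html TITLE CONTENT_FILE
# Prints a complete HTML page whose body is the given HTML fragment.
# (wrap_html is a hypothetical helper, not part of any tool named above.)
wrap_html() {
    title=$1
    content_file=$2
    # Opening part of the page; the title is NOT HTML-escaped
    cat <<EOF
<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>${title}</title></head><body>
EOF
    # The converted Markdown goes into the body unchanged
    cat "$content_file"
    printf '%s\n' '</body></html>'
}

# Usage:
#   wrap_html "2020 Jokebook" content.htm > jokebook.htm
```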

Converting images to PDF

Sometimes, you have to create PDF pages from images. ImageMagick is the preferred Linux utility to convert images.

# Convert images to PDFs
magick front-cover.png -resize 100% front.pdf
magick back-cover.png -resize 100% back.pdf
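
If you have a whole directory of images, a loop saves typing. The sketch below assumes the images are PNG files and that each PDF should take the base name of its image; the function name images_to_pdfs is illustrative.

```shell
# images_to_pdfs DIR
# Converts every PNG in DIR to a one-page PDF with the same base name.
# (images_to_pdfs is a hypothetical helper name.)
images_to_pdfs() {
    dir=$1
    for image in "$dir"/*.png; do
        # If the glob matched nothing, the pattern itself is returned; skip it
        [ -e "$image" ] || continue
        magick "$image" "${image%.png}.pdf"
    done
}

# Usage:
#   images_to_pdfs ./covers
```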

Concatenating several PDFs

PDFtk is a powerful PDF-processing utility that can perform tasks such as merging, splitting, encrypting, decrypting, stamping, and watermarking.

# Concatenate several PDFs
pdftk front.pdf inner-pages.pdf back.pdf \
output book.pdf
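
In a script, it helps to check that every input file exists before calling pdftk, so that a typo in a file name is reported clearly. A minimal sketch follows; merge_pdfs is a made-up name, and the output file comes first so that any number of inputs can follow it.

```shell
# merge_pdfs OUTPUT INPUT...
# Concatenates the inputs, in the order given, into OUTPUT.
# (merge_pdfs is a hypothetical helper name.)
merge_pdfs() {
    output=$1
    shift
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "merge_pdfs: missing input: $f" >&2
            return 1
        fi
    done
    # 'cat' without page ranges takes all pages of every input
    pdftk "$@" cat output "$output"
}

# Usage:
#   merge_pdfs book.pdf front.pdf inner-pages.pdf back.pdf
```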

Encrypting PDFs

PDF documents can be encrypted using two passwords: an owner password, which controls the permissions, and a user password, which is needed to open the document.

pdftk book.pdf output book-encrypted.pdf \
encrypt_128bit \
owner_pw RcHrDsTlMn^012 \
user_pw FrSfTWrFnDtn^321

If you set an empty user password, you can let users view the PDF without a password prompt. You can specify additional restrictions using the ‘allow’ option. Possible values are Printing, DegradedPrinting, ModifyContents, Assembly, CopyContents, ScreenReaders, ModifyAnnotations, FillIn and AllFeatures. If you do not specify the allow option, none of these features will be available. Well, the PDF standard specifies that these features should not be available. In reality, the restrictions are not strictly implemented by many PDF viewer applications.
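
As an illustration of the two points above, the following sketch leaves out user_pw (so the PDF opens without a prompt) and grants only printing. The function name encrypt_printable is made up, and, as noted, whether a viewer honours the restriction is up to the viewer.

```shell
# encrypt_printable INPUT OUTPUT OWNER_PASSWORD
# No user_pw is given, so the PDF opens without a password prompt.
# 'allow Printing' grants printing and nothing else.
# (encrypt_printable is a hypothetical helper name.)
encrypt_printable() {
    pdftk "$1" output "$2" \
        encrypt_128bit \
        allow Printing \
        owner_pw "$3"
}

# Usage:
#   encrypt_printable book.pdf book-printable.pdf 'RcHrDsTlMn^012'
```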

Removing PDF password

Use the input_pw option to specify the password.

pdftk book-encrypted.pdf \
input_pw RcHrDsTlMn^012 \
output book-decrypted.pdf

If you start this command with a space, then the password (RcHrDsTlMn^012) will not be stored in the ‘bash’ shell history, provided that the HISTCONTROL variable is set to ignorespace or ignoreboth. Another trick is to use an interactive dialog.

sPassword=$(zenity --password \
  --title "Decrypt PDF" \
  --text "Type the password")
pdftk book-encrypted.pdf \
  input_pw "$sPassword" \
  output book-decrypted.pdf

If you want a console-only prompt, then disable the input echo.

stty -echo
read -r -p "Type the password: " sPassword
stty echo
echo   # move to a new line after the hidden input
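
The console prompt can be wrapped in a function so that the password lands on standard output and the prompt on standard error. The sketch below avoids bash-only features; the name read_password is made up, and the echo is switched off only when the input really is a terminal, so the function also works when the password is piped in.

```shell
# read_password
# Prompts on stderr, reads one line without echoing it, and prints
# the line on stdout. (read_password is a hypothetical helper name.)
read_password() {
    printf 'Type the password: ' >&2
    # stty only makes sense when stdin is a terminal
    if [ -t 0 ]; then stty -echo; fi
    IFS= read -r password_input
    if [ -t 0 ]; then stty echo; fi
    printf '\n' >&2
    printf '%s\n' "$password_input"
}

# Usage:
#   sPassword=$(read_password)
#   pdftk book-encrypted.pdf input_pw "$sPassword" output book-decrypted.pdf
```

pdftk also accepts the keyword PROMPT in place of a password (input_pw PROMPT), in which case it asks for the password itself.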

Converting PDF pages to images

You can use the pdftoppm utility to convert PDF pages to images in JPEG and PNG format. You can also specify the pixel density and page range.

# Export pages 2 to 12 with 96 dpi
pdftoppm -png -r 96 -f 2 -l 12 book.pdf page
# creates numbered images with the prefix 'page'
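
When you need only one page, say for a thumbnail, set the first and last page to the same number. The sketch below also uses the -singlefile option of pdftoppm, which leaves the page number out of the file name; the function name export_page is made up for illustration.

```shell
# export_page PDF PAGE_NUMBER OUTPUT_PREFIX
# Writes the given page as OUTPUT_PREFIX.png at 96 dpi.
# (export_page is a hypothetical helper name.)
export_page() {
    pdftoppm -png -r 96 \
        -f "$2" -l "$2" \
        -singlefile \
        "$1" "$3"
}

# Usage:
#   export_page book.pdf 1 cover     # creates cover.png
```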

Rasterising a PDF

Sometimes, you cannot give a PDF as is to someone else or put it online. To foil content scrapers (and since ‘AI’ is now an unabashed content scraper), it is best to convert even the text to JPEG images. Why JPEG? Because it is lossy.

pdftoppm -jpeg -r 96 book.pdf page
magick page*.jpg book-rasterized.pdf

Some content scrapers do have the ability to read text from images, but that task requires extra resources. If you lower the resolution to 72 DPI, reading the text gets even more demanding. For further annoyance, stamp it or watermark it.

magick watermark.png watermark.pdf
pdftk book-rasterized.pdf \
stamp watermark.pdf \
output pages-watermarked.pdf
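
The two rasterising commands leave one JPEG per page in the current directory. The following sketch does the same work in a temporary directory and removes it afterwards. The function name rasterize_pdf and the optional DPI argument (default 96) are assumptions made for illustration.

```shell
# rasterize_pdf INPUT OUTPUT [DPI]
# Renders every page of INPUT as a JPEG and packs the JPEGs into OUTPUT.
# (rasterize_pdf is a hypothetical helper name.)
rasterize_pdf() {
    input=$1
    output=$2
    dpi=${3:-96}
    workdir=$(mktemp -d) || return 1
    # pdftoppm zero-pads the page numbers, so the glob keeps the page order
    pdftoppm -jpeg -r "$dpi" "$input" "$workdir/page" &&
        magick "$workdir"/page*.jpg "$output"
    status=$?
    rm -rf "$workdir"
    return "$status"
}

# Usage:
#   rasterize_pdf book.pdf book-rasterized.pdf 72
```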

Converting PDF to DjVu

Many PDF books on Archive.org are scanned from library copies of physical books. These PDFs are extremely heavy as the scans are usually dense or big images. Ordinary PDF readers on tablets struggle to read such PDFs. My solution is to convert the PDFs to DjVu. (DjVu viewer applications are optimised for reading from images. They load DjVu pages in a snap.)

pdf2djvu --dpi=220 \
--output=tablet.djvu \
library.pdf

Depending on the page size in the PDF, you will have to adjust the resolution with the --dpi option.
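
To judge a DPI value, it helps to work out how many pixels a page will have. PDF page sizes are measured in points (72 points to an inch), so the pixel count is points × DPI ÷ 72. A small sketch of that arithmetic follows; the function name pixels_at_dpi is made up, and the result is rounded down because the shell does integer arithmetic.

```shell
# pixels_at_dpi POINTS DPI
# Prints the number of pixels that POINTS occupy at the given DPI.
# (pixels_at_dpi is a hypothetical helper name.)
pixels_at_dpi() {
    echo $(( $1 * $2 / 72 ))
}

# An A4 page is 595 x 842 points
pixels_at_dpi 595 220    # width:  1818
pixels_at_dpi 842 220    # height: 2572
```

The page size in points is part of the output of the pdfinfo utility, which comes from the same package as pdftoppm.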

Figure 1: To stamp or watermark a PDF, first create a translucent image, convert it to PDF and then use it with pdftk

Splitting PDFs

There is no ‘pdftk’ option to remove pages. However, you can use its ‘cat’ option and specify pages that need to be left in the output document.

pdftk book.pdf cat 6-end \
output story.pdf
# eliminates pages 1 to 5
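
To drop pages from the middle of a document, give ‘cat’ two ranges: the pages before the gap and the pages after it. Here is a sketch of that idea; the function name remove_pages is made up, and it assumes that the last page to be removed is not the last page of the document.

```shell
# remove_pages INPUT OUTPUT FIRST LAST
# Copies INPUT to OUTPUT without pages FIRST to LAST.
# Assumes LAST is smaller than the number of pages in INPUT.
# (remove_pages is a hypothetical helper name.)
remove_pages() {
    input=$1
    output=$2
    first=$3
    last=$4
    if [ "$first" -gt 1 ]; then
        # Keep the pages before the gap and the pages after it
        pdftk "$input" cat "1-$((first - 1))" "$((last + 1))-end" \
            output "$output"
    else
        # Nothing to keep before the gap
        pdftk "$input" cat "$((last + 1))-end" output "$output"
    fi
}

# Usage:
#   remove_pages book.pdf trimmed.pdf 10 12   # keeps 1-9 and 13-end
```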

pdftk does have a burst option to convert each page into a separate PDF.

pdftk book.pdf burst \
output page%02d.pdf

Specify the format mask of the page number in the output PDF name, similar to how you display numbers with the printf function of the C standard library.
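
The shell has a printf command that understands the same masks, so you can preview the file names before you burst a document. In the mask used above, %02d means a whole number padded with zeros to at least two digits.

```shell
# Preview the names that the mask page%02d.pdf produces
for n in 1 2 10; do
    printf 'page%02d.pdf\n' "$n"
done
# prints:
#   page01.pdf
#   page02.pdf
#   page10.pdf
```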

Other tasks

pdftk can do a lot of other tasks. Just consult its help output. While it can add file attachments to a PDF and extract them from it, do remember that if you send a PDF with a file attachment by email, some mail servers may pre-emptively block the mail or remove it from the inbox of the recipient.
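
For reference, attachments are handled by the attach_files and unpack_files operations of pdftk. A minimal sketch follows; the two function names are made up, and all file and directory names are up to the caller.

```shell
# attach_file PDF FILE OUTPUT
# Writes a copy of PDF to OUTPUT with FILE attached to the document.
# (attach_file is a hypothetical helper name.)
attach_file() {
    pdftk "$1" attach_files "$2" output "$3"
}

# extract_attachments PDF OUTPUT_DIR
# Saves every attachment found in PDF into OUTPUT_DIR.
# (extract_attachments is a hypothetical helper name.)
extract_attachments() {
    mkdir -p "$2" &&
        pdftk "$1" unpack_files output "$2"
}

# Usage:
#   attach_file book.pdf notes.txt book-with-notes.pdf
#   extract_attachments book-with-notes.pdf ./attachments
```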
