Formatting tool for text extracted from PDF

I have lots of PDF articles to read. However, my small-screened Sony Reader doesn’t show PDFs very well. I use a Linux utility called pdftotext to extract the raw text from the PDF in a very simple layout: either all mashed together, or with whitespace that follows the layout of the page.

The problem is that these extracted text files are often very difficult to read. Either the lines are completely mashed together, which makes titles, headers, footnotes, and new paragraphs difficult to spot, or they follow the literal page layout, so that every printed line of the original becomes its own line in the file. That would only be comfortable if my reader screen were wide enough to fit an entire line of text.

I’ve written a short Linux bash script called formatpar that takes text files generated with pdftotext’s -layout option (literal layout) and joins the lines of each paragraph back together the way they belong. If a given line is more than X (say, 70) characters long, formatpar treats it as running on and appends the next line of text to it. Shorter lines, such as titles, headers, and the last line of a paragraph, are left to end where they are. The result is a pretty close approximation of the original paragraph structure.
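To make the joining rule concrete, here is a minimal sketch of it in Python. This is not the formatpar script itself, which is written in bash; the function name join_paragraphs and the parameter threshold are made up for illustration. It also assumes that a blank line is never pulled onto the line before it, which the description above does not spell out.

```python
def join_paragraphs(lines, threshold=70):
    """Rejoin paragraph lines from `pdftotext -layout` output.

    If a physical line is longer than `threshold` characters, the next
    line is appended to it. Shorter lines end the current output line.
    """
    out = []
    prev_was_long = False
    for raw in lines:
        # Drop the trailing newline and padding; leading indentation
        # still counts toward the length, as it does in the text file.
        line = raw.rstrip()
        if prev_was_long and out and line.strip():
            # The previous line ran past the threshold, so treat this
            # line as a continuation of the same paragraph.
            out[-1] = out[-1] + " " + line.strip()
        else:
            # Assumption: blank lines and lines that follow a short line
            # start a new output line.
            out.append(line)
        # Judge the original physical line, not the joined result.
        prev_was_long = len(line) > threshold
    return out
```

Passing in the lines of an extracted text file (for example, open(path).read().splitlines()) returns a list in which each paragraph sits on a single line, so the reader can rewrap it to the width of its own screen. A lower threshold joins more aggressively and risks gluing headers onto body text; a higher one leaves more paragraphs split.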

Get or preview the script:

Formatpar is under a Creative Commons license. You are free to use it, modify / add to it, and share it, for personal, public, or commercial use, as long as you give me credit as author and ensure that users are aware of my licensing terms.
