Old 03-17-2007, 08:37 PM   #1
leha began at the beginning.
Posts: 14
Karma: 10
Join Date: Nov 2006
Device: prs-500
Poor boy's way of editing pdf files (mostly linux, cygwin)

Hi all,

I want to start a thread about editing pdfs with command line tools (mostly linux, cygwin). It might be boring for most folks but I find it interesting because I don't want to pay for a commercial pdf editor and there are no good free ones for linux (at least for the things that I want to do).

WARNING: the post is long and boring

The main reason for fooling around with pdfs is to make them more readable on my sony prs500. Quite a few books are available online, but they are not formatted for a small screen: the fonts are thin, the text is gray, and there are huge margins with page numbers, titles etc.

Considering that the pdf/ps formats are well documented and the files are essentially text, it is possible to edit them with some text utils. An example of what can be achieved with standard tools is attached at the end (this page is from an apress book; I don't want to promote them, but apress sells pdfs for 50% of the paper copy price. Real pdfs, no subscription stupidity. Quote: "Printing, document assembly, content copying or extraction, and content extraction for accessibility are permissible, for personal use only."). Back to the topic. Below is the way to do this conversion. I am not very good with bash scripting and editing pdfs, so any input would be interesting.

The first thing to do is to convert the pdf to ps. The program I used is pdftops from the xpdf package (pdf2ps is a different one and not good). It has a couple of options that might be useful:

-f <int>            : first page to print
-l <int>            : last page to print
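
As a rough sketch, converting only a page range could look like the following. The function name pdf_pages_to_ps, the PDFTOPS override variable and the file names are made up for illustration; only the -f and -l options come from the list above.

```shell
# Hypothetical wrapper: convert pages $2..$3 of pdf $1 into ps file $4.
# PDFTOPS is an assumed override variable; it defaults to the real pdftops.
pdf_pages_to_ps() {
    "${PDFTOPS:-pdftops}" -f "$2" -l "$3" "$1" "$4"
}
```

For example, pdf_pages_to_ps book.pdf 30 40 book.ps would write pages 30 to 40 of book.pdf to book.ps.
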
The resulting ps file can be opened in any editor and improved by hand. It has a simple structure. For example, you can find page 30 by searching for "%%Page: 30", or find font information by looking for things like "/F1234_0" (1234 is a font number). I put together a script that prints a list of font ids, font names and how many times each font id occurs in the file. It requires the filename as a parameter (the ps file created by pdftops):

# this script grabs all fonts from a postscript file and dumps them on stdout

# the first argument is the file name to work with

# get all font names and ids, replace " " with ":", "for" does not like spaces
fonts=`grep -E "/F[0-9]{1,5}_0 /" "$1" | sed "s/ /:/g"`

# get the number of times particular font id occurs in the file
for i in $fonts
do
	fontid=`echo $i | cut -f1 -d:` # the thing that we are going to look for
	fontname=`echo $i | cut -f2 -d: | sed 's/\///'`
	fontfreq=`grep -c -E "$fontid" "$1"` # number of lines that mention this id
	echo $fontid $fontname $fontfreq
done
The result looks like this (I saved the output of the script to fontfreq.txt, which is used in the examples below):

/F154_0 HelveticaNeue-MediumCond 2
/F155_0 HelveticaNeue-Condensed 3
/F156_0 ZapfDingbats 2
/F157_0 FAADGE+TimesNewRoman 3
/F100_0 HelveticaNeue-BoldCond 2
/F103_0 Utopia-Regular 7
/F108_0 TheSansMonoConSemiLight 6
/F111_0 HelveticaNeue-BoldCond 55
/F112_0 ZapfDingbats 48
/F109_0 HelveticaNeue-MediumCond 67
/F110_0 Utopia-Regular 481
/F113_0 TheSansMonoConSemiLight 430
/F122_0 Utopia-Italic 64
/F117_0 HelveticaNeue-HeavyCond 9
/F119_0 Utopia-Semibold 34
/F123_0 FAADGE+TimesNewRoman_0 29
/F129_0 HelveticaNeue-Condensed 102
/F130_0 HelveticaNeue-CondensedObl 9
/F135_0 TheSansMonoConSemiLight-Italic 4
/F132_0 Utopia-Bold 6
/F136_0 Symbol 2
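
To see quickly which font id carries most of the text, a list like the one above can be sorted by its third column. This is only a sketch; the function name rank_fonts is made up, and it assumes the "id name count" layout shown above.

```shell
# Hypothetical helper: print a font list sorted by usage count, most used first.
# $1 is a file in the "id name count" layout, e.g. fontfreq.txt.
rank_fonts() {
    sort -k3,3nr "$1"
}
```

Something like rank_fonts fontfreq.txt | head -n 3 then shows the three most used font ids.
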
Using this information it is possible to replace regular fonts with bold ones (I also prefer straight fonts like HelveticaNeue to Times; straight fonts look better on my reader). For example, to make the main text bold you should replace all "/F110_0" with "/F111_0" or "/F132_0" (don't touch the header though). The third column gives you a hint about which font is applied to the main text and which to headers etc. In this case "/F110_0 Utopia-Regular" is used for the bulk of the text and "/F113_0 TheSansMonoConSemiLight" is used for code. It is possible to use the sed command for this, but there is the problem of multiple ids being used for the same font (not so severe in this case). Utopia-Regular, for example, is both /F103_0 and /F110_0. To tackle this you can use the following workaround:

cat fontfreq.txt | grep "Utopia-Regular" | ./ /F111_0
fontfreq.txt is the above list of fonts, "Utopia-Regular" is the name of the font you want to improve, /F111_0 is the id of the better font, and the script is

# the first argument is the file to work on
# the second argument is the id of the font to use in place of incoming list of fonts

while read fontstring
do
	fontid=`echo $fontstring | cut -f1 -d" "` # first column is supposed to be font id
	echo $fontid
	sed "s/\\$fontid 1/\\$2 1/" $1 >
	mv $1
done
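
A self-contained sketch of the same idea, in case it helps. The function name replace_fonts and the "$1.tmp" temporary file are my own placeholders, not names from the script above. It keeps the trailing " 1" in the pattern, as the sed line above does, so the font definitions in the header are left alone.

```shell
# Hypothetical helper: read font list lines on stdin and replace each font id
# (first column) with the id given as $2 inside the ps file $1.
replace_fonts() {
    psfile=$1
    newid=$2
    while read -r fontid _rest
    do
        # "|" as the sed delimiter avoids escaping the "/" in the ids
        sed "s|$fontid 1|$newid 1|" "$psfile" > "$psfile.tmp" &&
            mv "$psfile.tmp" "$psfile"
    done
}
```

It would be used like grep "Utopia-Regular" fontfreq.txt | replace_fonts book.ps /F111_0, where book.ps stands for whatever ps file you are editing.
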
I also wanted more readable code listings, so I did

cat fontfreq.txt | grep "TheSansMonoConSemiLight" | ./ /F132_0
cat fontfreq.txt | grep "HelveticaNeue-Condensed" | ./ /F111_0
to the file. Now the pdf looks better on my reader, but there is still a lot of empty space around the text (the prs500 considers headers and page numbers important, but I don't). To crop the ps file you need to insert something like

[/CropBox [110 100 500 670] /PAGES pdfmark
after "%%BeginProlog", where 110 100 is the bottom left corner and 500 670 is the top right one (0 0 is at the bottom left of the page). To find the crop coordinates I used the "gv" viewer, which shows the position of the cursor. If you don't have it you can do it by trial and error. Useful information in this case is at the head of the ps file; look for something like

%%DocumentMedia: plain 612 792 0 () ()
%%BoundingBox: 0 0 612 792
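
For the trial and error route, the crop box can be worked out from the %%BoundingBox line and the margins you want to cut off. A minimal sketch, assuming the header contains a numeric %%BoundingBox line like the one above; the function name cropbox_from_margins is made up.

```shell
# Hypothetical helper: print "llx lly urx ury" for a crop box.
# $1 = ps file, $2..$5 = left, bottom, right and top margins in points.
cropbox_from_margins() {
    awk -v l="$2" -v b="$3" -v r="$4" -v t="$5" '
        # use the first numeric bounding box line and stop reading
        /^%%BoundingBox: [0-9]/ { print $2 + l, $3 + b, $4 - r, $5 - t; exit }
    ' "$1"
}
```

With the 612 by 792 page above, cutting 110 points on the left, 100 at the bottom, 112 on the right and 122 at the top gives the 110 100 500 670 box used in this post.
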
Opening big files in editors is an ugly thing, so I used a script to insert the cropbox command:
./ 110 100 500 670 > 

# inserts crop box command into the file passed as the first argument
# values for crop are 2..5th cl arguments

prologln=`grep -m 1 -n "^%%BeginProlog$" $1 | cut -f1 -d:`

head -n $prologln $1
echo "[/CropBox [$2 $3 $4 $5] /PAGES pdfmark"
tail -n +`expr $prologln + 1` $1
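
The same insertion can probably be done in one pass with awk instead of head and tail. This is just a sketch, assuming awk is installed; the function name insert_cropbox is made up.

```shell
# Hypothetical helper: copy ps file $1 to stdout and add the crop box line
# right after the first "%%BeginProlog" line. $2..$5 are the crop values.
insert_cropbox() {
    awk -v box="[/CropBox [$2 $3 $4 $5] /PAGES pdfmark" '
        { print }
        /^%%BeginProlog$/ && !seen { print box; seen = 1 }
    ' "$1"
}
```

As with the script above, the output goes to stdout, so it has to be redirected into a new file.
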
Most ps viewers don't care about our cropbox command, but ps2pdf from the gs package does:

ps2pdf edited.pdf
The result contains everything (including headers etc.) but pdf viewers will show only stuff that is inside our cropbox (if you try to print you will get the whole page with headers). This way of cropping pdfs is not very straightforward but it works (pdfcrop produces garbage for some reason on my system).

Any comments on how to edit pdf/ps files will be appreciated.
Attached Files
File Type: pdf original.pdf (48.1 KB, 655 views)
File Type: pdf edited.pdf (63.9 KB, 668 views)