Old 03-17-2007, 08:37 PM   #1
leha
Member
Posts: 14
Karma: 10
Join Date: Nov 2006
Device: prs-500
Poor boy's way of editing pdf files (mostly linux, cygwin)

Hi all,

I want to start a thread about editing pdfs with command line tools (mostly linux, cygwin). It might be boring for most folks but I find it interesting because I don't want to pay for a commercial pdf editor and there are no good free ones for linux (at least for the things that I want to do).

WARNING: the post is long and boring

The main reason for fooling around with pdfs is to make them more readable on my Sony PRS-500. Quite a few books are available online, but they are not formatted for a small screen: the fonts are thin, the text is gray, and the margins are huge and filled with page numbers, titles etc.

Since the pdf/ps formats are well documented, and ps files in particular are essentially text files, it is possible to edit them with standard text utilities. An example of what can be achieved is attached at the end. (The page is from an Apress book. I don't want to promote them, but Apress sells pdfs for 50% of the paper copy price: real pdfs, no subscription stupidity. Quote: "Printing, document assembly, content copying or extraction, and content extraction for accessibility are permissible, for personal use only.") Back to the topic. Below is the way to do this conversion. I am not very good with bash scripting or with editing pdfs, so any input would be interesting.

The first thing to do is to convert the pdf to ps. The program I used is pdftops from the xpdf package (pdf2ps is a different program and not as good). It has a couple of interesting options that might be useful:

Code:
-f <int>            : first page to print
-l <int>            : last page to print
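As a minimal sketch of how these two options fit into a command line, the helper below converts only a page range. The function name pdf_pages_to_ps and the file names are made up for the illustration; only the -f and -l options come from pdftops itself.

```shell
#!/bin/bash
# Convert only pages $2..$3 of the pdf file $1 into the ps file $4.
# pdf_pages_to_ps is a hypothetical helper name, not part of xpdf.
pdf_pages_to_ps() {
	pdftops -f "$2" -l "$3" "$1" "$4"
}

# Example call (commented out because it needs a real pdf file):
# pdf_pages_to_ps original.pdf 30 40 pages30-40.ps
```

Working on a short page range first keeps the ps file small while you experiment with fonts and crop values.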
The resulting ps file can be opened in any editor and improved by hand. It has a simple structure: for example, you can find page 30 by searching for "%%Page: 30", or find font information by looking for things like "/F1234_0" (1234 is a font number). I put together a script that dumps a list of font ids, font names, and how many times each font id occurs in the file. It takes the file name as a parameter (the ps file created by pdftops):

Code:
#!/bin/bash
# this script grabs all fonts from a postscript file and dumps them on stdout
# output columns: font id, font name, number of lines that mention the id

# the first argument is the file name to work with

# font definition lines start with the font id followed by "/FontName";
# read splits each line on spaces, so no ":" substitution trick is needed
grep -E "/F[0-9]{1,5}_0 /" "$1" | while read -r fontid fontname rest
do
	fontname=${fontname#/}                  # drop the leading slash
	fontfreq=$(grep -cF -- "$fontid" "$1")  # lines that contain this font id
	echo "$fontid $fontname $fontfreq"
done
The result looks like this (the output of the script was saved to fontfreq.txt, which is used in the examples below):

Code:
/F154_0 HelveticaNeue-MediumCond 2
/F155_0 HelveticaNeue-Condensed 3
/F156_0 ZapfDingbats 2
/F157_0 FAADGE+TimesNewRoman 3
/F100_0 HelveticaNeue-BoldCond 2
/F103_0 Utopia-Regular 7
/F108_0 TheSansMonoConSemiLight 6
/F111_0 HelveticaNeue-BoldCond 55
/F112_0 ZapfDingbats 48
/F109_0 HelveticaNeue-MediumCond 67
/F110_0 Utopia-Regular 481
/F113_0 TheSansMonoConSemiLight 430
/F122_0 Utopia-Italic 64
/F117_0 HelveticaNeue-HeavyCond 9
/F119_0 Utopia-Semibold 34
/F123_0 FAADGE+TimesNewRoman_0 29
/F129_0 HelveticaNeue-Condensed 102
/F130_0 HelveticaNeue-CondensedObl 9
/F135_0 TheSansMonoConSemiLight-Italic 4
/F132_0 Utopia-Bold 6
/F136_0 Symbol 2
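A list like this is easier to read once it is sorted by the count column and grouped by font name. The sketch below is only an illustration: it uses a few lines copied from the list above, and the file names fontfreq_sample.txt and by_name.txt are made up.

```shell
#!/bin/bash
# A few lines copied from the list above, standing in for fontfreq.txt
cat > fontfreq_sample.txt <<'EOF'
/F103_0 Utopia-Regular 7
/F111_0 HelveticaNeue-BoldCond 55
/F110_0 Utopia-Regular 481
/F113_0 TheSansMonoConSemiLight 430
/F132_0 Utopia-Bold 6
EOF

# Most used font ids first (numeric sort on the third column, descending)
sort -k3,3nr fontfreq_sample.txt

# Total count per font name, followed by all ids that share that name
awk '{ total[$2] += $3; ids[$2] = ids[$2] " " $1 }
     END { for (name in total) print total[name], name ids[name] }' \
	fontfreq_sample.txt | sort -nr > by_name.txt
cat by_name.txt
```

The font with the largest total is most likely the main text, and the second part shows right away when one font name hides behind several ids.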
Using this information it is possible to replace regular fonts with bold ones (I also prefer "straight" (sans-serif) fonts like HelveticaNeue to Times; they look better on my reader). For example, to make the main text bold you should replace all "/F110_0" with "/F111_0" or "/F132_0" (don't touch the header though). The third column gives you a hint about which font is applied to the main text, which to headers, etc. In this case "/F110_0 Utopia-Regular" is used for the bulk of the text and "/F113_0 TheSansMonoConSemiLight" is used for code. It is possible to use a plain sed command for this, but there is the problem of multiple ids being used for the same font (not so severe in this case). Utopia-Regular, for example, is both /F103_0 and /F110_0. To tackle this you can use the following workaround:

Code:
cat fontfreq.txt | grep "Utopia-Regular" | ./fontreplace.sh original.ps /F111_0
fontfreq.txt is the above list of fonts, "Utopia-Regular" is the name of the font you want to improve, /F111_0 is the id of the better font, and fontreplace.sh is

Code:
#!/bin/bash
# reads font list lines (as in fontfreq.txt) from stdin
# the first argument is the file to work on (it is edited in place)
# the second argument is the id of the font to use in place of incoming list of fonts

while read -r fontid rest # first column is supposed to be font id
do
	echo "$fontid"
	# "|" as the sed delimiter, so the "/" in font ids needs no escaping
	sed "s|$fontid 1|$2 1|g" "$1" > tmp.ps
	mv tmp.ps "$1"
done
I also wanted more readable code listings, so I did

Code:
cat fontfreq.txt | grep "TheSansMonoConSemiLight" | ./fontreplace.sh original.ps /F132_0
cat fontfreq.txt | grep "HelveticaNeue-Condensed" | ./fontreplace.sh original.ps /F111_0
to the file. Now the pdf looks better on my reader, but there is still a lot of empty space around the text (the prs500 considers headers and page numbers important, but I don't). To crop the ps file you need to insert something like

Code:
[/CropBox [110 100 500 670] /PAGES pdfmark
after "%%BeginProlog", where 110 100 is the bottom left corner and 500 670 is the top right one (0 0 is the bottom left corner of the page). To find the crop coordinates I used the "gv" viewer, which shows the position of the cursor. If you don't have it, you can do it by trial and error; the useful information in this case is at the head of the ps file, look for something like

Code:
%%DocumentMedia: plain 612 792 0 () ()
%%BoundingBox: 0 0 612 792
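If you think in margins rather than in corners, the crop box can be worked out from the page size given in the BoundingBox line. This is just a sketch of the arithmetic; the margin values are picked so that they reproduce the box used in this post.

```shell
#!/bin/bash
# Page size in points, taken from "%%BoundingBox: 0 0 612 792"
page_w=612
page_h=792

# Margins to cut off, in points (example values, chosen to match the box above)
left=110; bottom=100; right=112; top=122

# CropBox wants the bottom left and the top right corner
llx=$left
lly=$bottom
urx=$((page_w - right))
ury=$((page_h - top))
echo "[/CropBox [$llx $lly $urx $ury] /PAGES pdfmark"
```

With these values the script prints the same pdfmark line as shown earlier, [/CropBox [110 100 500 670] /PAGES pdfmark.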
Opening big files in an editor is an ugly thing, so I used a script to insert the cropbox command. The first line below shows how to call it; the rest is the script itself (insertcrop.sh):
Code:
./insertcrop.sh original.ps 110 100 500 670 > original_wcrop.ps

#!/bin/bash
# inserts crop box command into the file passed as the first argument
# and prints the result on stdout
# values for crop are 2..5th cl arguments
# ($2 $3 bottom left corner, $4 $5 top right corner)

# line number of the first "%%BeginProlog" line
prologln=$(grep -m 1 -n '^%%BeginProlog$' "$1" | cut -f1 -d:)

head -n "$prologln" "$1"
echo "[/CropBox [$2 $3 $4 $5] /PAGES pdfmark"
tail -n +"$((prologln + 1))" "$1"
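The same insertion can also be done in a single pass with awk, which is an alternative to the head/tail approach and not what the script above does. In this sketch mini.ps is a made-up miniature file that stands in for the real pdftops output.

```shell
#!/bin/bash
# Made-up miniature ps file for the demonstration
printf '%s\n' '%!PS-Adobe-3.0' '%%BeginProlog' '/some prolog' '%%EndProlog' > mini.ps

# Copy every line; right after the first "%%BeginProlog" line print the pdfmark
awk -v box="110 100 500 670" '
	{ print }
	/^%%BeginProlog$/ && !done { print "[/CropBox [" box "] /PAGES pdfmark"; done = 1 }
' mini.ps > mini_wcrop.ps

cat mini_wcrop.ps
```

The "done" flag makes sure the line is inserted only once, even if "%%BeginProlog" shows up again later in the file.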
Most ps viewers don't care about our cropbox command, but ps2pdf from the gs package does

Code:
ps2pdf original_wcrop.ps edited.pdf
The result contains everything (including headers etc.), but pdf viewers will show only the stuff that is inside our cropbox (if you try to print, you will get the whole page with headers). This way of cropping pdfs is not very straightforward, but it works (pdfcrop produces garbage on my system for some reason).

Any comments on how to edit pdf/ps files will be appreciated.
Attached Files
File Type: pdf original.pdf (48.1 KB, 1379 views)
File Type: pdf edited.pdf (63.9 KB, 1456 views)
Old 03-25-2007, 11:37 PM   #2
EatingPie
Blueberry!
Posts: 888
Karma: 133343
Join Date: Mar 2007
Device: Sony PRS-500 (RIP); PRS-600 (Good Riddance); PRS-505; PRS-650; PRS-350
Thanks for these utilities! There seems to be an undue emphasis on Windows for conversion utilities, with Linux and Mac sitting, dejected, on the sidelines.

I hope to give these a run on Mac OS X, and if it works you can add that to the subject. Bash comes standard, and Fink (Mac OS X port of debian's apt) has xpdf and ghostscript, so those are easy.

I don't have my PRS500 yet, so it'll be a few days before I can test.

-Pie
Old 03-30-2007, 11:14 PM   #3
EatingPie
As promised, I demoed these, and they work perfectly under Mac OS X.

However, I only tried them with the Reader Manual (how idiotic is it that this thing is nigh impossible to read?). To say the least, this was far too complicated, as there are like 20 or 30 fonts, and it is insanely difficult to find text strings in a PostScript file.

All in all this is a very complicated process and needs more automation. Don't know if that's possible. But there it is.

-Pie
Old 04-02-2007, 03:26 PM   #4
leha
Quote:
Originally Posted by EatingPie
As promised, I demoed these, and they work perfectly under Mac OS X.

However, I only tried them with the Reader Manual (how idiotic is it this thing is nigh impossible to read?). To say the least, this was far too complicated, as there are like 20 or 30 fonts, and insanely difficult to find text strings in a Postscript.

All in all this is a very complicated process and needs more automation. Don't know if that's possible. But there it is.

-Pie
Nice to know that these scripts work on Mac. The original post was not meant to be a complete howto on editing pdfs; I thought that it could stir some interest in the topic. Anyway, there is a chance I will write a utility that does font replacement more or less automatically. There is a problem with the pdf format though: the positions of words are fixed, but the positions of letters are calculated from the letter widths of the font (how stupid is this). In other words, if you replace a font with another one that has different letter widths, the text will be screwed up (words will overlap or there will be huge spaces between them).
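To put numbers on the width problem, here is a toy calculation. All values are hypothetical (they are not measured from any real font): a five letter word starts at x=100, the next word is fixed at x=130, and the average letter width is either 5 or 7 points.

```shell
#!/bin/bash
# Hypothetical layout: two words placed at fixed x positions (in points)
word1_x=100
word2_x=130   # where the original layout starts the second word
letters=5     # number of letters in the first word

# Hypothetical average letter widths for a narrow and a wide font
narrow=5
wide=7

for width in $narrow $wide
do
	word1_end=$((word1_x + letters * width))
	gap=$((word2_x - word1_end))
	echo "letter width $width: first word ends at $word1_end, gap to next word $gap"
done
```

With the narrow font the gap is 5 points; with the wide one it becomes -5, meaning the first word runs into the second.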