An algorithm to render PDF in small devices


caritas
04-22-2008, 08:52 AM
Hi,

I have been interested in ebook readers for quite a while. But after trying a 6-inch e-ink reader (Hanlin V3), I found it almost useless for reading normal PDF files on these machines. The font size is too small, while the page is too wide.

So I thought up and prototyped a method to render PDFs for these small devices. The details are as follows:

1. Convert the PDF to images. I use pdftoppm from xpdf, for example:
pdftoppm -r 180 -f 245 -l 245 -gray -aa yes a.pdf a

2. Analyse the generated images and break each page into lines.

3. Divide each line that is long enough into two segments.

4. Rearrange the segments into a new page with half the width.
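
Steps 2 to 4 can be sketched roughly as follows. This is only an illustrative Python sketch, not the attached program: the image is assumed to be a plain list of pixel rows (0 = black, 255 = white), and the names find_lines and halve_page, the ink threshold and the gap size are all made up for the example. It cuts each line at the exact middle and does not look for a gap between words.

```python
WHITE = 255

def find_lines(image, ink_threshold=128):
    """Step 2: return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    lines, start = [], None
    for y, row in enumerate(image):
        has_ink = any(pixel < ink_threshold for pixel in row)
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines

def halve_page(image, gap=1, ink_threshold=128):
    """Steps 3 and 4: stack the left and right half of each line in a half-width page."""
    width = len(image[0])
    half = (width + 1) // 2
    out = []
    for top, bottom in find_lines(image, ink_threshold):
        band = image[top:bottom]
        for left in (0, half):
            segment = [row[left:left + half] for row in band]
            # A line is treated as "long enough" only if its right half has ink.
            if left and not any(p < ink_threshold for row in segment for p in row):
                continue
            for row in segment:
                out.append(row + [WHITE] * (half - len(row)))  # pad odd widths
            out.extend([[WHITE] * half for _ in range(gap)])   # spacing between lines
    return out
```

Calling halve_page on a page read from a PGM file would give a page that is half as wide and up to twice as tall.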

Example images before and after conversion are attached to this post. I think the result is acceptable.

The source code is attached to this post too. It is released under the GPL v2/v3.

Best Regards,
Huang Ying

Basic Usage for version 0.4:

tar -xjf pi_0.4.tar.bz2
cd pi
. env.sh
cd test
pi_format.py chap.conf
# output goes in the out directory
img_dir_to_pdf.sh out chap-rf.pdf


2008-09-20 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.8

* overall: Reorganize program in a more modular way.

* pi.image: Add unpaper support for scanned book

* pi.image: Add column compress support for scanned book

* pi.divide: Add simple divider for divide = 1

2008-08-30 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.7

* pi.py: Add LRF output support.

* pi.py: Add TOC support for LRF output format

* pi.py: Add output rotate support.

* pdfminfo: Add pdfminfo to extract PDF information such as TOC,
title, author, etc.

* overall: Add initial windows support, thanks ashkulz of
mobileread forum.

2008-08-11 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.6

* pi.py: Initial implementation of embolden.

* pi.py: Use norm coordinate in class Page and Line.

* pi.py: Add edge trimming support.

* pi.py: Add run pages mode.

* pi.py: Add page range support.

* pi.py: Re-work ImageOutput, split multi-page image.

* pi.py: Rotate during scale if appropriate.

* img_dir_to_pdf.sh: Add color reduction support.

2008-05-17 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.5

* pi.py: Detect word, and break lines at word end when possible.

* pi.py: Re-align the 'split line segment' (second half of line)
to align with the next line's indenting when appropriate. This
will make the first line indent and bullet items line up better.

* img_dir_to_pdf.sh: Added to convert from images to pdf.

2008-05-10 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.4

* Some algorithms are configurable

* For text that may cause problems, present both merged and divided
versions.


2008-05-03 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.3

* Rewrite most algorithms in Python except the image parsing (breaking
the image into lines and characters). This will make it easier to
add new algorithms (hacks).

* pi.py: Add some hacks to deal with equation and figure.


2008-04-29 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.2

* Split lines in two equal halves or optional equal thirds or
equal quarters

* Separate output image into customizable page size

* Flex can be designated by user configuration

* Calculate DPI for each page

* Figure detecting and special processing. The figures are scaled
to page width and output twice, scaled and split.


2008-04-23 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.1

radius
04-22-2008, 11:58 AM
Result looks excellent for the amount of intelligence used in the algorithm.

This is a good hack for documents we can't reflow and resize.

nrapallo
04-22-2008, 12:13 PM
Very nice idea, indeed!

I may try this out in PDFRead, as an alternative for smaller screen devices like the EBW1150. Hopefully I can just 'call' your executable from within PDFRead and avoid having to recode your efforts in python. :rolleyes:

I remember that the original developer of PDFRead was going to allow some type of reflow of pdf documents, but never released his efforts.

One question though: Is the split at half the page width "fixed" or can it be changed to a user-inputted amount, like one-third or 25%?

caritas
04-22-2008, 10:18 PM
> Is the split at half the page width "fixed" or can it be changed to a user inputted amount, like one-third or 25%?

For now it is fixed. But why split the line at 1/3 or 1/4? One longer line and one short line will be produced for one original line.

The actual page width generated now is 1/2+1/6 = 2/3 of original page text width. The additional 1/6 is used for finding the space between words.

nrapallo
04-22-2008, 10:47 PM
> > Is the split at half the page width "fixed" or can it be changed to a user inputted amount, like one-third or 25%?
>
> Now it is fixed. But why split the line at 1/3 or 1/4? One longer line and one short line will be produced for one original line.
>
> The actual page width generated now is 1/2+1/6 = 2/3 of original page text width. The additional 1/6 is used for finding the space between words.

Sorry, what I meant by split at 1/3 is to have three equal portions of the line being split and then triple the page height to add those (two) additional lines beneath the line being split. Now the resulting page would be 1/3+1/6 = 1/2 of the original.

This is just like you do for 1/2 split (two equal halves with one line below the other).

By extension, 1/4 split would result in four lines of text from one and quadruple the height!

The reason this would be helpful would be to gain more clarity by rendering/cropping shorter lines for smaller screens.

When I looked at your code, I thought this would be easy to do. I think the 1/6 would be constant amongst these differing split methods.

Am I on the right track here?
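
The width arithmetic in this exchange can be checked with a few lines of Python. This is only a worked example under the assumption stated above, namely that the 1/6 flex stays constant for every split; the function name is made up for the illustration.

```python
from fractions import Fraction

def output_width_fraction(parts, flex=Fraction(1, 6)):
    """Output page width as a fraction of the original text width.

    Each line is cut into `parts` equal portions, and `flex` is the extra
    width kept for finding a space between words (assumed constant here).
    """
    return Fraction(1, parts) + flex

for parts in (2, 3, 4):
    # The page height grows by roughly the same factor as `parts`.
    print(parts, output_width_fraction(parts))
```

With parts=2 this gives the 2/3 quoted above and with parts=3 it gives 1/2; a split into quarters would come to 5/12 of the original width.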

caritas
04-22-2008, 11:58 PM
OK, I see. It is easy to add such a feature. And I think the 1/6 (flex) could be specified by the user or derived from the PDF file too (by analyzing the average number of characters per line).

IceHand
04-23-2008, 08:44 AM
Very nice! Could you maybe include an option to split the resulting image into more than one image? For example cut at around 33% of the height without cutting the letters. I attached the original image that your program made and three images how the page could have been split with the option I'm thinking of.
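
One way to pick cut positions that do not slice through letters is to cut only on rows that contain no ink, choosing the blank row nearest each ideal cut. The sketch below is a hypothetical illustration of that idea, not part of pi: the image is assumed to be a list of pixel rows (0 = black, 255 = white) and the function names and threshold are invented for the example.

```python
def blank_rows(image, ink_threshold=128):
    """Rows with no ink at all, i.e. rows where a cut cannot hit a letter."""
    return [y for y, row in enumerate(image)
            if all(pixel >= ink_threshold for pixel in row)]

def safe_cuts(image, pieces=3, ink_threshold=128):
    """For each ideal cut (1/pieces, 2/pieces, ...) pick the nearest blank row."""
    height = len(image)
    blanks = blank_rows(image, ink_threshold)
    if not blanks:
        return []                          # no safe place to cut
    cuts = []
    for k in range(1, pieces):
        target = k * height / pieces       # ideal cut position
        cuts.append(min(blanks, key=lambda y: abs(y - target)))
    return sorted(set(cuts))               # drop duplicates on sparse pages
```

The image would then be sliced at the returned rows, so each piece ends between two lines of text rather than at exactly 33% of the height.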

Azayzel
04-23-2008, 10:26 AM
Interesting take on getting the 'ol rasterized PDF's into your portable reader! Too bad the resolution isn't much better on these devices, I've been using my iPod Touch to read PDF's even though I have a Sony eReader. Still waiting on something better, but until then I might give this a shot. Guess it just chops pictures up in the mix, huh?

Thanks for the new slant on an old issue!

sealbeater
04-23-2008, 02:55 PM
Quick comment, this doesn't compile under linux/ppc. Looks good tho, can it be scripted?

sealbeater
04-23-2008, 02:56 PM
I take that back, I had to delete the pi.o file, compiles fine, will test out.

-Thomas-
04-23-2008, 07:36 PM
Hey, this is way cool, I'll give it a try on some of my PDFs!

vinniet
04-23-2008, 09:23 PM
If anyone compiles this under Windows, could they share it?

Thanks!

kentsin
04-24-2008, 04:51 AM
WOOW!

How about a version for old Chinese books whose lines run vertically?

alexxxm
04-24-2008, 04:57 AM
works flawlessly here (Linux Fedora FC8) - and much faster than I thought possible!

Next step I guess will be to reconstruct the document from the reformatted PGMs.
Do you know a way?

Alessandro

Crook
04-24-2008, 06:35 PM
I see a lot of potential in this idea. Some future improvements could be:

OCR of the generated images to reconstruct the PDF
Images (or otherwise unchoppable content) could be scaled down

Although the first one is not a trivial task...

nrapallo
04-25-2008, 08:50 AM
> If anyone compiles this under Windows, can they share it.
>
> Thanks!

Yes, please, my cygwin installation needs updating and is not working yet... :thanks:

nrapallo
04-25-2008, 12:26 PM
> Yes, please, my cygwin installation needs updating and is not working yet... :thanks:

OK, my update is now finished. The attached .zip is a Windows executable compiled under cygwin. It requires the two .dll files included therein to properly 'run' the program 'pi.exe'. They can be placed in your path instead of the directory where pi.exe is (BTW, pi = pdftoimage).

Attached is a sample (converted to .gif) that was produced by invoking:
pi page.pgm output.pgm

Notice how the indenting is retained, but this sometimes breaks down for bullet items. This is a great technique!

nrapallo
04-25-2008, 12:28 PM
Intentionally deleted.

p.s. I've always wanted to say that...

DaleDe
04-25-2008, 12:42 PM
Those using Windows who want to try this should get the compiled windows executable located here (http://www.mobileread.com/forums/showthread.php?p=174239#post174239)!

You need to collect all this stuff and make a wiki for it.

Dale

nrapallo
04-25-2008, 12:48 PM
> You need to collect all this stuff and make a wiki for it.
>
> Dale

Soon...

But that's like writing documentation and stuff... Brain freeze time. :alright:

Just ask tompe what his favourite task is NOT!

nrapallo
04-25-2008, 01:04 PM
> Notice how the indenting is retained, but this sometimes breaks down for bullet items.

caritas:

An alternative way to align the 'next line segment' with the 'previous line segment' could be employed.

I think the 'next line segment' should be re-aligned with its 'next line segment' if not blank. This may help the bullet items line up better and avoid two indented lines when the beginning line of a paragraph is split.

What do you think?
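
The re-alignment rule proposed here could look something like the sketch below. It is purely illustrative: the list of per-line indents, the use of None for a blank line and the function name are assumptions made for the example, not anything taken from pi.

```python
def segment_indent(indents, i):
    """Indent to use for the split-off second half of original line i.

    indents[k] is the left indent (in pixels) of original line k, or None
    if that line is blank. The second half follows the next line's indent
    when there is a non-blank next line, otherwise it keeps its own.
    """
    following = indents[i + 1] if i + 1 < len(indents) else None
    return indents[i] if following is None else following
```

For a paragraph whose first line is indented by 20 and whose later lines start at 0, the second half of the first line would be placed at 0, which avoids two indented lines in a row. For a bullet at 10 followed by continuation text at 30, the second half would line up at 30.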

IceHand
04-25-2008, 06:24 PM
Here's a little Bash script that will convert all PDFs in the current folder first to PGMs with pdftoppm, then run this algorithm on the PGMs, make a small white border around the text (I like to have a small border), crop every image into three overlapping images (it's not exactly what I had in mind in the above post, but it'll work well enough in most cases) and finally convert it back to a new PDF.

Requirements: pi, pdftoppm (part of Poppler and/or Xpdf), ImageMagick (v6.3.2 or newer is needed for the -extent option to work properly), libtiff

#!/bin/bash
set -e

for pdf in *.pdf; do
    if [ -f "$pdf" ]
    then
        echo "Converting file \"$pdf\". Please wait ..."
        PDFName="$(basename "$pdf" .pdf)"
        mkdir "Temp-$PDFName"
        cd "Temp-$PDFName"
        pdftoppm -r 180 -gray "../$pdf" "$PDFName"

        # Run the pi algorithm on each page image
        for pgm in *.pgm; do
            pi "$pgm" "New-$pgm"
            rm "$pgm"
        done

        # Add a small white border around the text
        for pgm in *.pgm; do
            convert "$pgm" +compress -gravity Center -extent "106%x101%" -gravity East -extent "104%x100%" "$(basename "$pgm" .pgm).tif"
            rm "$pgm"
        done

        # Crop every page into three overlapping images
        for tif in *.tif; do
            convert "$tif" +compress -gravity North -crop "100%x34%" +repage -depth 8 "$(basename "$tif" .tif)-1.tif"
            convert "$tif" +compress -gravity Center -crop "100%x34%" +repage -depth 8 "$(basename "$tif" .tif)-2.tif"
            convert "$tif" +compress -gravity South -crop "100%x34%" +repage -depth 8 "$(basename "$tif" .tif)-3.tif"
            rm "$tif"
        done

        # Merge the pieces and convert them back to a PDF
        tiffcp *.tif "New-$PDFName.tif"

        tiff2pdf -z "New-$PDFName.tif" -o "New-$PDFName.pdf" -t "$PDFName"

        rm *.tif

        mv "New-$PDFName.pdf" ../

        cd ..

        rmdir "Temp-$PDFName"
    else
        echo "ERROR: No PDF files found"
        exit 1
    fi
done

echo "Done."
exit 0

nrapallo
04-25-2008, 10:52 PM
> ...make a small white border around the text (I like to have a small border), crop every images to three overlapping images (it's not exactly what I had in mind in the above post, but it'll work good enough in most cases) and finally convert it back to a new PDF.

Just out of curiosity, when you split the resulting images (which are doubled in height) by three vertically, does it look good (aspect ratio wise) on your ereader? This split by three should be the best if the original pdf has little or no white margin.

I think if the original pdf has at least 20% white space on the side/margin (that gets cropped out anyway), then to retain the newer cropped image's aspect ratio, a split by four vertically should (theoretically) look better.

Any (practical) thoughts?

EDIT: the split by three or four refers to vertical cropping and not to the place to split horizontally the line in half (or in future third, quarters...).

IceHand
04-26-2008, 09:26 AM
> [...] I think if the original pdf has at least 20% white space on the side/margin (that gets cropped out anyways), then to retain the newer cropped image's aspect ratio, a split by four should (theoretically) look better.
>
> Any (practical) thoughts?

Well, it rather depends on the original PDF and how large you want the text to look. I attached an image of how a random page looks with the split by three on my Cybook (set to "Fit Height"). You can see there are rather large borders on the left and right side. If I set it to "Fit Width" there are only the small borders I added via my script, but the text size is too large for my taste and I would have to scroll down to see the whole page.

nrapallo
04-26-2008, 09:39 AM
> Well, it rather depends on the original PDF and how large you want the text size to look like. I attached an image of a random page how it would with the split by three look on my Cybook (set to "Fit Height"). You can see there are rather large borders on the left and right side. If I set it to "Fit Width" there are only the small borders I added via my script, but the text size is too large for my taste and I would have to scroll down to see the whole page.

That is exactly why a split by 'four' may be useful: to get rid of those wider than acceptable white margins.

The size of the resulting text is largely due to the page format of the original pdf. If starting with an A4/Letter sized pdf with 10/12pt text, then the resulting text should not be so large as to be unacceptable. If starting with a Sony ereader sized pdf then why bother, the resulting text will appear huge. :eek:

In the end this technique will work best when the original pdf, as viewed on the ereader hardware, is just too small to read comfortably.

I have only done a few tests, but think that in real world conversions, the better split ratio will be between three and four.

IceHand
04-26-2008, 01:44 PM
> Exactly, why a split by 'four' may be useful, so as to get rid of those wider than acceptable white margins. [...]

You're talking about the width split made by the algorithm, right? If yes, I agree. Having longer lines would definitely be a good thing.
The concept is nice, but far from perfect yet. I noticed that the program aborts with a segmentation fault error when processing some pages, mostly with images.

nrapallo
04-26-2008, 01:59 PM
> You're talking about the width split made by the algorithm right? If yes, I agree. Having longer lines would definitely be a good thing.
> The concept is nice, but far from perfect yet. I noticed that the program aborts with a segmentation fault error when processing some pages, mostly with images.

:blink: No, the height split!

After the pi processing, your bash script splits the resulting image in three vertically. I am speculating that, in general, splitting by three or four would be the optimal way to view on an ereader screen. Can you try splitting by four vertically for a directory of PDFs and compare with your first results (splitting by three vertically)? Is the large white margin issue lessened with the split by four?

I've run into many segmentation faults as well, primarily converting coloured pages. I think the routine that tries to identify the individual lines of text may not be as robust when there are no 'white line gaps' between the lines of text. It needs better bounds checking or defaults if something goes wrong.

IceHand
04-26-2008, 05:27 PM
> :blink: No, the height split!
>
> After the pi processing, your bash script splits the resulting image in three vertically. I am speculating that, in general, splitting by three or four would be the optimal way to view on an ereader screen. Can you try splitting by four vertically for a directory of pdf and compare with your first results (splitting by three vertically)? Is the large white margin issue lessened with the split by four?

I'm sorry, I was confused there for a second (and edited my post, but you'd already quoted me so I changed it back).
Anyway, yes, splitting by four gives very good results concerning the left and right border – the text size is rather large though (see attached image).

It would be great if the pi algorithm would have an option to specify the page width and height in pixels and optimise the line breaks accordingly. The size of the text could then be defined by changing the image density (dpi) with pdftoppm. Right now there's no difference when viewing it with my Cybook, because the number of characters/words per line will always be the same (well of course the text will look fuzzy if the density is set too low).

nrapallo
04-26-2008, 05:56 PM
> Anyway, yes, splitting by four gives very good results concerning the left and right border – the text size is rather large though (see attached image).
>
> It would be great if the pi algorithm would have an option to specify the page width and height in pixels and optimise the line breaks accordingly.

BTW, what was the size of the original pdf? A4/Letter size? That text does look uncomfortably large.

I would suspect that the text would be more reasonable (size-wise) if the original had more words per line.

It is good that we are 'flushing' out what would be nice, in case the original poster wants to improve his algorithm to incorporate:

Wish list
1) split lines in two equal halves or optional equal thirds or equal quarters
2) crop the resulting image by three or four to retain the ereader's aspect ratio (usually 0.75 = 480/640) instead of just having a doubled height page.
3) allow the 1/6 flex to be calculated by some words per line estimate or user-input.
4) allow coloured backgrounds/text input images instead of just grayscale (pgm vs ppm/png/pbm). Accommodate images somehow, perhaps shrink them down.
5) re-align the 'split line segment' (second half of line) to align with the next line's indenting if it's not blank. This will make the first line indent and bullet items line up better.
6) avoid segmentation faults :)
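
For item 2, the number of vertical pieces can be estimated from the aspect ratio. The following is a rough Python sketch, assuming a 480x640 screen as mentioned above; the function name and the example sizes are hypothetical.

```python
def pieces_for_screen(page_width, page_height, screen_aspect=480 / 640):
    """How many vertical pieces make each piece roughly match the screen.

    screen_aspect is width / height of the reader's screen (0.75 for
    480x640). A piece that fills the screen width should be about
    page_width / screen_aspect pixels tall.
    """
    piece_height = page_width / screen_aspect
    return max(1, round(page_height / piece_height))
```

A reformatted page of 480x1920 pixels would be cut into 3 pieces and one of 480x2560 into 4. Pages that fall in between get rounded to the nearer count, which may be why real documents end up between three and four.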

IceHand
04-26-2008, 06:09 PM
> BTW, what was the size of the original pdf? A4/Letter size? That text does look uncomfortably large.
>
> I would suspect that the text would be more reasonable (size-wise) if the original had more words per line.

152x225 mm with about 12-16 words per line. And yes, if the text had had more words per line it would have looked better size wise.

Wish list
1) split lines in two equal halves or optional equal thirds or equal quarters
2) crop the resulting image by three or four to retain the ereader's aspect ratio (usually 0.75 = 480/640) instead of just having a doubled height page.
3) allow the 1/6 flex to be calculated by some words per line estimate or user-input.
4) allow coloured backgrounds/text input images instead of just grayscale (pgm vs ppm/png/pbm). Accommodate images somehow, perhaps shrink them down.
5) avoid segmentation faults :)
6) fix the indentation algorithm. Right now the line that goes after the indented line will be indented as well.
EDIT: Ah, you've noticed that problem too :)

vinniet
04-27-2008, 09:45 PM
I took IceHand's script along with the pi executable compiled for Windows. I am using bash from cygwin to run this under Windows. I have been able to replace all Unix commands with Windows equivalents. The script runs until I get an error from pi.

Converting file "Lotus Domino Administrator 6.pdf". Please wait ...
Error: No display font for 'Symbol'
Error: No display font for 'ZapfDingbats'
pi: Error reading row. Short read of 1514 bytes instead of 1530

I have included the PDF that is just a simple 2 page tech manual. Please let me know what I am doing wrong. Maybe I am trying too hard to get this to work under windows.

Thanks!

IceHand
04-28-2008, 07:04 AM
I think it's because the fonts are not embedded in the PDF file and pdftoppm seems to be unable to find the missing fonts in the fonts folder. Try again with embedded fonts or try to install the missing fonts (the first would be easier I guess).
I've attached the converted PDF how it would look if you had the required fonts installed.

caritas
04-28-2008, 11:18 PM
Most items in the wish list are reasonable, although I may not have enough ability to finish all of them. I have a new version now and hope it fulfils some of the wishes. :)

nrapallo
04-29-2008, 12:56 AM
> Most items in wish list is reasonable. Although I may have not enough ability to finish all of them. I have a new version now and hopes it accomplish some wishes. :)

Thank you for all your efforts!

Wishlists are guides to implement changes. Please do start with the easy ones and (eventually) work up to the hard ones. ;)

No pressure to do so, as we all should be grateful for (free) software that works as advertised! :thumbsup:

I have been thinking about your methods without actually following your code. These are some ideas you may choose to incorporate:

1. When the (halfway) split point falls on a word, the decision to add the flex to find the end of that word should take into account 'how much of that word is to the right'. In particular, if more of the word falls to the right of the split point, then the split should occur at the *previous* word; not the word where the split falls.

2. If the main pdf text appears justified, then the (halfway) split point will usually fall at the same point on each subsequent line. However, if the text appears left-aligned, then the split point should be calculated on each line's actual width. This will avoid having the first line segment always longer than the split next line segment.
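
The two ideas above might be sketched as follows. This is an illustration only, not the code in pi: each word on a line is assumed to be known as a (start, end) pixel span, and the function names and the flex parameter are invented for the example.

```python
def line_split_x(words):
    """Idea 2: split at the middle of the line's actual extent.

    words is a left-to-right list of (start, end) pixel spans on one line.
    """
    return (words[0][0] + words[-1][1]) / 2

def choose_break(words, split_x, flex):
    """Idea 1: decide where to break when split_x lands inside a word."""
    for start, end in words:
        if start < split_x < end:
            left_part = split_x - start
            right_part = end - split_x
            # Break before the word if most of it lies to the right, or
            # if its end is further away than the available flex.
            if right_part > left_part or right_part > flex:
                return start
            return end
    return split_x  # the split already falls in a gap between words
```

With words at (0, 40), (50, 90) and (100, 140) and a flex of 30, a split at 60 breaks before the second word (at 50), while a split at 70 breaks after it (at 90).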

nrapallo
04-29-2008, 01:59 AM
> I have a new version now

Great sample .pdf (chap6.pdf). BTW, I was in the same university undergrad program as one of the authors (Alfred J. Menezes (http://www.cacr.math.uwaterloo.ca/hac/authors/ajm.html)), though I do not know him personally!

Yeah, Bachelor of Mathematics (1986) from the University of Waterloo!!! :party4:
How about that for school spirit! :yahoo:

IceHand
04-29-2008, 09:03 AM
Thanks a lot for your hard work, caritas!

The new version doesn't work on my laptop, any advice? When I try to run the pi executable directly I get the error:
bash: ./pi: cannot execute binary file

When I try the method described in the first post (. env.sh; pi_format.py chap6.conf) I get the error:
Traceback (most recent call last):
File "/home/icehand/Downloads/pi/bin/pi_format.py", line 16, in <module>
pi_lib.page_divide_all(tmpl_fn)
File "/home/icehand/Downloads/pi/bin/pi_lib.py", line 174, in page_divide_all
doc_conf.gen(file(tmpl_fn, 'r'))
File "/home/icehand/Downloads/pi/bin/pi_lib.py", line 38, in gen
page_param = get_page_param(self)
File "/home/icehand/Downloads/pi/bin/pi_lib.py", line 128, in get_page_param
page_info = get_page_info(pt_fn)
File "/home/icehand/Downloads/pi/bin/pi_lib.py", line 97, in get_page_info
p = Popen(['pi_page_info', fn], stdout = PIPE)
File "/usr/lib/python2.5/subprocess.py", line 594, in __init__
errread, errwrite)
File "/usr/lib/python2.5/subprocess.py", line 1091, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error

And finally when I try to compile the pi executable myself I get the error:
gcc -o pi -g pi.o -lnetpbm
pi.o: In function `pi_line_divide':
/home/icehand/Downloads/pi/pi/pi.c:546: undefined reference to `max'
/home/icehand/Downloads/pi/pi/pi.c:547: undefined reference to `min'
/home/icehand/Downloads/pi/pi/pi.c:561: undefined reference to `min'
pi.o: In function `pi_page_divide':
/home/icehand/Downloads/pi/pi/pi.c:824: undefined reference to `max'
/home/icehand/Downloads/pi/pi/pi.c:827: undefined reference to `min'
/home/icehand/Downloads/pi/pi/pi.c:828: undefined reference to `min'
/home/icehand/Downloads/pi/pi/pi.c:830: undefined reference to `min'
collect2: ld returned 1 exit status
make: *** [pi] Error 1

caritas
04-29-2008, 10:03 AM
> bash: ./pi: cannot execute binary file

This time I developed on an x86_64 machine, so the binary is for x86_64.

The compile error seems to come from libnetpbm. The max and min macros are defined in libnetpbm. Make sure /usr/include/pm.h is available. I use the libnetpbm10 package in Debian.

IceHand
04-29-2008, 11:43 AM
I have the equivalent of libnetpbm installed and /usr/include/pm.h is present (I use Arch Linux where packages are not split into a normal package and a lib package). Is a specific version of libnetpbm needed? I have version 10.35 installed.

Btw:
cat /usr/include/pm.h | grep max
pm_maxvaltobits(int const maxval);
pm_bitstomaxval(int const bits);
It doesn't look like "max" and "min" are defined in my version of pm.h

nrapallo
04-29-2008, 12:14 PM
> I have the equivalent of libnetpbm installed and /usr/include/pm.h is present (I use Arch Linux where packages are not split into a normal package and a lib package). Is a specific version of libnetpbm needed? I have version 10.35 installed.
>
> Btw:
>
> It doesn't look like "max" and "min" are defined in my version of pm.h

In pi.c, just add:
#define max(a,b) ((a) > (b) ? (a) : (b))
#define min(a,b) ((a) < (b) ? (a) : (b))


This addition allowed me to compile pi.c under cygwin/windows, but that is as far as I could get.

Sorry, I'm not on Linux so I'm adapting to your setup as best as I can.

nrapallo
05-01-2008, 01:49 PM
Basic Usage for version 0.2:

tar -xjf pi_0.2.tar.bz2
cd pi
. env.sh
cd test
pi_format.py chap.conf
# output goes in the out directory


Caritas:

The 'pi_0.2.tar.bz2' unarchives as incomplete. In the resulting bin directory I get zero length files for 'pi_image_bbox', 'pi_image_crop', 'pi_page_divide' and 'pi_page_info'. Is this normal? I have used three different unarchivers (including 7-zip) and got the same results.

Please fix version 0.2 so we can all start to use your (wonderful) program!

IceHand
05-01-2008, 02:59 PM
> The 'pi_0.2.tar.bz2' unarchives as incomplete. In the resulting bin directory I get zero length files for 'pi_image_bbox', 'pi_image_crop', 'pi_page_divide' and 'pi_page_info'. Is this normal?

It's normal, they are just links to the pi executable. I don't know why they exist though.

Btw, I was able to compile pi with the code you posted in your previous post, but it's not working as it should. When running the test chapter I get the error:
Error open pgm image file (null): Bad address.
pi_page_divide: pi.c:330: pi_image_save: Assertion `0' failed.

rmanasa
05-02-2008, 08:52 AM
Greetings -

Idiot in Da House time. First, thanks for all the effort to address this long standing, complex and frustrating issue. Greatly appreciated.

Now to business. I've extracted the Windows version, copied a pdf into that directory, opened a command prompt, cd'd over to that directory and typed "pi old.pdf new.pdf". As soon as I said ".pdf", most of you started shaking your heads, I know - pi is looking for pgm or pbm files.

I'm reasonably intelligent, but not a programmer. I haven't been able to figure out how to create either of these file types from reading the thread. Is there some kind soul out there who can show me how to learn and take advantage of this fine conversion program?

Looking forward to your reply. Thank you!

nrapallo
05-02-2008, 03:37 PM
> Greetings -
>
> Idiot in Da House time. First, thanks for all the effort to address this long standing, complex and frustrating issue. Greatly appreciated.

Don't worry, when it comes to using new (bleeding edge) software, we all are at a disadvantage and feel frustrated when things don't work, and that makes us feel a bit sheepish about asking for help (myself included). :)

> Now to business. I've extracted the Windows version, copied a pdf into that directory, opened a command prompt, cd'd over to that directory and typed "pi old.pdf new.pdf". As soon as I said ".pdf", most of you started shaking your heads, I know - pi is looking for pgm or pbm files.

:rohard: Good that you caught yourself on that one! ;)

I was able to find a compiled 'pdftoppm.exe' for windows users at http://www.foolabs.com/xpdf/download.html .

Just get Win32 (built with MSVC): xpdf-3.02pl2-win32.zip (ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl2-win32.zip) and extract it; therein you will find the .exe and help .txt.

I think you already have pi-exe.zip (http://www.mobileread.com/forums/attachment.php?attachmentid=12386&d=1209137093).

That should do it. Just issue:
pdftoppm -r 180 -f 1 -l 1 -gray -aa yes a.pdf a
where you can replace "-f 1 -l 1" with your first and last page numbers to convert, 'a.pdf' with your .pdf filename and the following 'a' with any prefix for the resulting .pgm filename you want.

p.s. don't venture onto version 0.2 just yet as no-one has been able to make that work yet on Linux/Mac or Windows. Stay tuned!

> I'm reasonably intelligent, but not a programmer. I haven't been able to figure out how to create either of these file types from reading the thread. Is there some kind soul out there who can show me how to learn and take advantage of this fine conversion program?
>
> Looking forward to your reply. Thank you!

You're welcome!

rmanasa
05-02-2008, 04:34 PM
Thanks for the consideration, Nick. I don't know why everyone says you're so mean. ;)


I extracted the pdftoppm program from your link, set it up as best I can, changed one default parameter - so it would create a pgm file instead of a ppm file - and gave it a go. That might be the source of subsequent problems with pi-exe, but it didn't prevent pdftoppm from doing its thing.

The program created 52 pgm files from the pdf I'm using for testing purposes. While that's a lot, it's the one type of pdf I know I'm gonna need to convert monthly, so I figured I'd see what happened.

Took one of those 52 files, and did the "pi X.pgm Y" thing, which produced the following messages:

C:\Documents and Settings\Rick\My Documents\Unzipped\pi-exe\pi> pi CT0408-000001 .pgm ct0408-01
6 [main] pi 1832 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
1524 [main] pi 1832 open_stackdumpfile: Dumping stack trace to pi.exe.stackdump
1029497 [main] pi 1832 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
1062966 [main] pi 1832 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)

The stackdump file looks like this:

Exception: STATUS_ACCESS_VIOLATION at eip=004014C5
eax=000000FF ebx=0000002B ecx=7FF13198 edx=00000262 esi=00000004 edi=0066423C
ebp=0023CC38 esp=0023CC20 program=C:\Documents and Settings\Rick\My Documents\Unzipped\pi-exe\pi\pi.exe, pid 3676, thread main
cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
Stack trace:
Frame Function Args
0023CC38 004014C5 (00661350, 00660210, 00000000, 000006D9)
0023CCB8 004026CD (00000003, 006601A8, 00660090, 610BEEB7)
0023CD98 610060D8 (00000000, 0023CDD0, 61005450, 0023CDD0)
61005450 61004416 (0000009C, A02404C7, E8611021, FFFFFF48)
1314401 [main] pi 3676 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
1356063 [main] pi 3676 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)


As far as I know, I'm not using the 0.2 version, that you cautioned about (though who knows? Anything's possible wit Da Idiot in Da House.):D

Looking forward to your reply. Thank you!

nrapallo
05-02-2008, 06:38 PM
Thanks for the consideration, Nick. I don't know why everyone says you're so mean. ;)
Hersay! :cool:

I extracted the pdftoppm program from your link, set it up as best I can, changed one default parameter

OK now this forewarns me that something bad is going to happen... :rolleyes:

- so it would create a pgm file instead of a ppm file

the '-gray' switch should take care of that, I think.

- and gave it a go. That might be the source of subsequent problems with pi-exe, but it didn't prevent pdftoppm from doing it's thing.

The program created 52 pgm files from the pdf I'm using for testing purposes. While that's a lot, it's the one type of pdf I know I'm gonna need to convert monthly, so I figured I'd see what happened.

Took one of those 52 files, and did the "pi X.pgm Y" thing,

you did mean 'pi X.pgm Y.pgm'?

which produced the following messages:

C:\Documents and Settings\Rick\My Documents\Unzipped\pi-exe\pi> pi CT0408-000001 .pgm ct0408-01
6 [main] pi 1832 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
1524 [main] pi 1832 open_stackdumpfile: Dumping stack trace to pi.exe.stackdump
1029497 [main] pi 1832 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
1062966 [main] pi 1832 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)

Yes, this is what is referred to in this thread as a segmentation fault. It basically is the program's way of crashing.

The stackdump file looks like this:

Exception: STATUS_ACCESS_VIOLATION at eip=004014C5
eax=000000FF ebx=0000002B ecx=7FF13198 edx=00000262 esi=00000004 edi=0066423C
ebp=0023CC38 esp=0023CC20 program=C:\Documents and Settings\Rick\My Documents\Unzipped\pi-exe\pi\pi.exe, pid 3676, thread main
cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
Stack trace:
Frame Function Args
0023CC38 004014C5 (00661350, 00660210, 00000000, 000006D9)
0023CCB8 004026CD (00000003, 006601A8, 00660090, 610BEEB7)
0023CD98 610060D8 (00000000, 0023CDD0, 61005450, 0023CDD0)
61005450 61004416 (0000009C, A02404C7, E8611021, FFFFFF48)
1314401 [main] pi 3676 _cygtls::handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
1356063 [main] pi 3676 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)

All this means is that the .pdf is too low in resolution, has a non-white background, or has 'white gaps' between lines that are not easily detectable. In other words, your .pdf cannot be converted by version 0.1 "as is".
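To illustrate why a non-white background defeats line detection, here is a minimal Python sketch of splitting a page into text lines by looking for blank rows. It is not the code from pi; the function name, the `white` threshold and the `min_gap` parameter are assumptions made up for this example.

```python
def find_text_lines(rows, white=200, min_gap=2):
    """Return (top, bottom) row ranges that contain ink.

    rows    -- list of pixel rows, each a list of gray values (0 = black, 255 = white)
    white   -- hypothetical threshold: pixels at or above it count as background
    min_gap -- hypothetical minimum number of blank rows separating two lines
    """
    lines = []
    start = None     # first inked row of the line being collected
    blank_run = 0    # consecutive blank rows seen since the last inked row
    for y, row in enumerate(rows):
        has_ink = any(p < white for p in row)
        if has_ink:
            if start is None:
                start = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # the last inked row is blank_run rows above the current one
                lines.append((start, y - blank_run))
                start = None
                blank_run = 0
    if start is not None:
        lines.append((start, len(rows) - 1 - blank_run))
    return lines


# Two text lines on a white page, then the same page on a gray background.
W, I = [255] * 4, [255, 0, 0, 255]
white_page = [W, I, I, W, W, W, I, I, W]
gray_page = [[min(p, 180) for p in row] for row in white_page]
```

On `white_page` the two lines are found separately; on `gray_page` every row looks inked, so the whole page collapses into a single "line", which is the kind of input the converter cannot divide.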

As far as I know, I'm not using the 0.2 version, that you cautioned about (though who knows? Anything's possible wit Da Idiot in Da House.):D

Looking forward to your reply. Thank you!

Yes, you don't have version 0.2 as no-one has that one working yet!

rmanasa
05-02-2008, 08:25 PM
Very good sir. It seems that, despite my proclivity for making my own trouble, nothing I did in this experiment was out of bounds, and the results were not unexpected (if I *could* have done something to enhance my chances of better results, please be specific and I'll try it.) This is indeed a complex pdf - it's an online magazine, with non-white background, columns that stop and start in artsy, irregular fashion, lots of advertising, etc. So, I get that it's a bit more challenging than a simple document that's been ported from Word to pdf format.

That being said, I have a pdf-to-Palm converter that came with my Treo 650 that has converted almost all content of all the issues of this monthly e-magazine, oddities and all. Only the very first issues of the mag had any problem converting, and those issues were mostly word breaks: no lost content, cracked image conversion, whatever.

I find it baffling that I can carry all these converted files on my Treo and none of them on my Sony Reader. How can this be so transparent on one device and so impossible on another? (Only the truly ignorant can ask such questions. Probably the one and only blessing I enjoy in this matter.) :)

nrapallo
05-02-2008, 08:40 PM
Since this program, pi.exe, is not working yet, and not being the author of it, I cannot improve on your chances to get your .pdf processed properly.

May I offer another route: Try PDFRead 1.8.2 as explained here (http://www.mobileread.com/forums/showthread.php?p=172748#post172748) to a user looking to convert .pdf to the Kindle. Just use the prs-505 Profile for the Sony .lrf output format. The 'default' is a landscape mode. Try also 'landscape-half' or 'landscape-full' layout modes.

Please note that the resulting ebook will be just 'images', but they can be rotated, dilated, sharpened, etc...

BTW, I'm the author of that software, so feel free to send me any questions you may have over at the PDFRead main thread here (http://www.mobileread.com/forums/showthread.php?t=21906).

rmanasa
05-02-2008, 09:52 PM
k. I will check it out over the weekend. The primary issue, as you know, is getting the text large enough to read without a magnifying glass. I can deal with anything else - lack of graphics, charts, etc. - but the older I get, the blinder I become. :)

caritas
05-03-2008, 03:39 AM
Version 0.3 is released. You can get the source code from the first post of the thread.

Moho
05-03-2008, 10:10 AM
Hi there,

I have a few questions about how to use this tool. First I used xpdf to get a whole bunch of pgms out of a pdf. Then I used pi.exe (based on pi 0.1, I think) to convert a pgm into a more readable pgm. I opened the pgm with gimp and it seems to be good.
My first question is how to convert all the pgms automatically (the ebook has over 400 pages), and how to convert all the pgms back into one pdf? Thanks in advance :)

IceHand
05-03-2008, 11:27 AM
Version 0.3 is released. You can get the source code from the first post of the thread.
Works like a charm with the PDFs I've tested, thank you :)

nrapallo
05-05-2008, 12:16 AM
Overall, pi version 0.3 works well, but I ran into some obstacles trying to 'windows-ize' it.

I succeeded in converting the sample .pdf using 'pi_format chap6.conf' on a Windows PC, but it was a brute-force finish that cannot be used in general. More testing/exploring is required to yield a windows only solution (in addition to the working linux based solution offered by the original poster).

In pi.py, I had to change the bold line to conform with pdftoppm.exe (from xpdf) output of the form "chap6-004-page-000004.pgm" i.e 6 digit page number prior to .pgm.
def get_img(self, dpi = 150, out_prefix = None):
    pdf_fn = self.doc.pdf_fn
    if out_prefix is None:
        out_prefix = '%spage' % (self.output_prefix,)
    spage = '%d' % (self.page_no,)
    sdpi = '%d' % (dpi,)
    ret = call(['pdftoppm', '-r', sdpi, '-f', spage, '-l', spage, '-gray',
                pdf_fn, out_prefix])
    assert(ret == 0)
    # changed line: six-digit page number, as written by pdftoppm.exe
    img_fn = '%s-%06d.pgm' % (out_prefix, self.page_no)
    return img_fn
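Since the amount of zero padding in the file name depends on which pdftoppm build is installed, one way to avoid hardcoding it is to look the file up after the call. The following is only a sketch, not part of pi; `find_page_image` is a hypothetical helper name.

```python
import glob
import os
import re


def find_page_image(out_prefix, page_no, ext='pgm'):
    """Return the file pdftoppm wrote for page_no, or None if there is none.

    Matches <out_prefix>-<digits>.<ext> where the digits equal page_no,
    whatever amount of zero padding was used.
    """
    base = os.path.basename(out_prefix)
    pattern = re.compile(re.escape(base) + r'-(\d+)\.' + re.escape(ext) + r'$')
    # glob.escape keeps special characters in the prefix from acting as wildcards
    for fn in sorted(glob.glob(glob.escape(out_prefix) + '-*.' + ext)):
        m = pattern.match(os.path.basename(fn))
        if m and int(m.group(1)) == page_no:
            return fn
    return None
```

In get_img this would replace the hardcoded format string, e.g. `img_fn = find_page_image(out_prefix, self.page_no)`, with a check for `None` afterwards.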

Also, pi.py was crashing when the bold line below was executed, hence the commenting out (but it leaves behind the .pgm since deleting doesn't work for some unknown reason).

Traceback (most recent call last):
  File "pi_format.py", line 29, in <module>
  File "pi_format.py", line 7, in test_all
  File "pi.pyc", line 667, in __init__
  File "pi.pyc", line 704, in get_avg_page_stat
  File "pi.pyc", line 337, in __init__
  File "pi.pyc", line 386, in parse
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'out/chap6-004-page-000004.pgm'

def parse(self, dpi = None):
    if dpi is None:
        dpi = self.dpi
    img_fn = self.get_img(dpi)
    p = Popen(['pi_page_parse', img_fn], stdout = PIPE)
    self.lines = []
    for l in p.stdout:
        ws = l.split()
        if ws[0] == 'char':
            pair = map(int, ws[1:])
            ch = Char(pair)
            ln.append_char(ch)
        elif ws[0] == 'line':
            bbox = map(int, ws[1:])
            ln = Line(self, bbox)
            self.append_line(ln)
        else:
            self.bbox = map(int, ws[1:])
    self.img = Image.open(img_fn)
    #os.unlink(img_fn)  # the line that crashed on Windows, now commented out
    self.set_space()
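One plausible cause (an assumption, not something confirmed in this thread) is that the image file is still open when os.unlink runs: PIL opens images lazily and keeps the file handle around, and Windows refuses to delete a file that is open. A sketch of a workaround is to read the file into memory, close it, and only then delete it. `read_and_remove` is a hypothetical helper, not part of pi.

```python
import io
import os


def read_and_remove(path):
    """Read a file fully into memory, delete it, and return an in-memory buffer."""
    with open(path, 'rb') as f:
        data = f.read()
    # The handle is closed at this point, so nothing in this process
    # keeps the file open when it is removed.
    os.unlink(path)
    return io.BytesIO(data)
```

Because PIL's Image.open accepts a file object as well as a file name, parse() could then use `self.img = Image.open(read_and_remove(img_fn))` in place of the open/unlink pair.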

But then when I thought everything was working, I was getting random aborts due to PIL .pgm reading/writing problems as shown below in bold:
page: 4
Error: No display font for 'Symbol'
Error: No display font for 'ZapfDingbats'
Traceback (most recent call last):
File "pi_format.py", line 29, in <module>
File "pi_format.py", line 8, in test_all
File "pi.pyc", line 722, in reformat
File "pi.pyc", line 605, in divide
File "pi.pyc", line 647, in put_seg
File "pi.pyc", line 109, in get_img
File "Image.pyc", line 737, in crop
File "ImageFile.pyc", line 192, in load
IOError: image file is truncated (1111 bytes not processed)

The odd thing is the .pgm image files appear ok even though I get the 'truncated' message. The only way I got it to finish was to generate all the .pgm first, protect them from overwriting by marking them as 'read-only' and then allow 'pi_format chap6.conf' to finish.

In the end, I was able to collect all the generated .gifs and create a 1150 .imp ebook (and the first 17 pages only for Kindle/Cybook .prc and Sony .lrf ebooks). The results are far from perfect, but promising.

caritas
05-10-2008, 09:20 AM
Version 0.4 is released.

ChangeLog:
- Some algorithms are configurable
- For text that may be problematic, present both the merged and the divided version

bazzargh
05-15-2008, 08:07 AM
BTW reading over this thread it seems worth mentioning some previous research into reflowing scanned text for small devices:
http://pubs.iupr.org/DATA/2002-breuel-wdabook.pdf

In that paper they first identify text line segments, images, and column boundaries, similarly to this program, but then the text is segmented into words. Once you've broken the document down into word-sized chunks and know how they aggregate into paragraphs and columns, there are numerous ways to reflow the document; one they describe is embedding all the images into html so the scanned document now reflows when you resize the browser window. They go on to talk about how to output this most compactly for a PDA. Interesting stuff.
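To give a rough idea of what reflowing word-sized chunks means, here is a small Python sketch of greedy line filling. It is not taken from the paper; the function name and the fixed `space` between words are assumptions for illustration only.

```python
def reflow(word_widths, page_width, space=4):
    """Greedily pack word images into lines no wider than page_width.

    word_widths -- pixel widths of the word images, in reading order
    space       -- hypothetical fixed gap inserted between words
    Returns a list of lines, each a list of word indices.
    """
    lines = []
    current = []   # indices of the words on the line being filled
    used = 0       # width used so far on the current line
    for i, w in enumerate(word_widths):
        needed = w if not current else used + space + w
        if current and needed > page_width:
            # the word does not fit: close the line and start a new one
            lines.append(current)
            current = [i]
            used = w
        else:
            current.append(i)
            used = needed
    if current:
        lines.append(current)
    return lines
```

For example, `reflow([30, 40, 20, 50, 10], 80)` puts words 0 and 1 on the first line, 2 and 3 on the second, and word 4 alone on the third; with a wider page the same words fit on one line.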

caritas
05-15-2008, 09:52 AM
Thank you very much for your information.

caritas
05-17-2008, 01:27 AM
Version 0.5 is released.

ChangeLog:

* pi.py: Detect word, and break lines at word end when possible.

* pi.py: Re-align the 'split line segment' (second half of line)
to align with the next line's indenting when appropriate. This
will make the first line indent and bullet items line up better.

* img_dir_to_pdf.sh: Added to convert from images to pdf.
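The word-end line breaking mentioned in the ChangeLog can be sketched roughly as follows. This is not the implementation in pi.py, just an illustration under simple assumptions: a line is given as a list of booleans (one per pixel column, True if the column contains ink), and any blank run at least `min_gap` columns wide counts as a gap between words. Both function names and `min_gap` are made up here.

```python
def find_gaps(column_has_ink, min_gap=3):
    """Return (start, end) column ranges of blank runs wide enough to be word gaps."""
    gaps = []
    run_start = None
    for x, ink in enumerate(column_has_ink):
        if not ink:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            # skip the left margin (run starting at column 0) and narrow letter gaps
            if run_start > 0 and x - run_start >= min_gap:
                gaps.append((run_start, x - 1))
            run_start = None
    return gaps  # a trailing blank run is the right margin and is ignored


def choose_split(column_has_ink, min_gap=3):
    """Return the column at which to cut the line into two segments."""
    middle = len(column_has_ink) // 2
    gaps = find_gaps(column_has_ink, min_gap)
    if not gaps:
        return middle  # no word gap found: fall back to a hard cut in the middle
    centers = [(s + e) // 2 for s, e in gaps]
    return min(centers, key=lambda c: abs(c - middle))


# Columns 0-4 and 6-10 are one word (1-column letter gap), then two more words.
line = ([True] * 5 + [False] + [True] * 5 + [False] * 4 +
        [True] * 6 + [False] * 4 + [True] * 3)
```

With this `line`, the cut lands in the first word gap (column 12) instead of the geometric middle (column 14).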

ashkulz
05-18-2008, 08:01 AM
BTW reading over this thread it seems worth mentioning some previous research into reflowing scanned text for small devices:
http://pubs.iupr.org/DATA/2002-breuel-wdabook.pdf

Actually, those are the very algorithms I was trying to work on (I'm the original author of PDFRead) but then gave up on, as it required too much effort to implement them from scratch.

Happily, one of the authors of these publications (Thomas Breuel) is now leading the development of Ocropus (http://code.google.com/p/ocropus/) at Google, which is a document analysis and OCR system. Browsing through the code, most of the algorithms already seem to be implemented (and some advances from that, too): I plan to integrate it into PDFRead soon. (I've already contributed some patches to get it compiling under windows). The library interface can be scripted via Lua, so I'm currently trying to put together the bits and pieces to get that approach working.

J_A
05-18-2008, 08:05 PM
Nick,

Thanks for turning me on to PDFRead. One question. I have a secure pdf document purchased from Wiley a while back and I can only print it out or read it in Adobe Digital Editions. Is there any way to get the document into PDFRead to convert it? It's spitting out blank pages now, I gather because of the security on the file.

JA

nrapallo
05-18-2008, 08:38 PM
Try printing your secure pdf using a pdf printer driver like the free PrimoPDF (http://www.primopdf.com/) printer driver.

bazzargh
05-20-2008, 12:20 PM
Actually, those are the very algorithms I was trying to work on (I'm the original author of PDFRead) ... I plan to integrate it sometime into PDFRead soon.

Excellent! And thanks for the link to OCRopus. I'm going to have to have a play with this stuff.

kergoth
06-03-2008, 04:48 PM
I'm getting an error trying to use this:


clarson@foul mine home#mv!187caf1$ pi_format.py mine.conf
Traceback (most recent call last):
File "/home/clarson/Desktop/pi/pi/bin/pi_format.py", line 59, in <module>
test_all(sys.argv[1])
File "/home/clarson/Desktop/pi/pi/bin/pi_format.py", line 15, in test_all
doc = pi.Doc(conf)
File "/home/clarson/Desktop/pi/pi/bin/pi.py", line 1091, in __init__
self.get_avg_page_stat()
File "/home/clarson/Desktop/pi/pi/bin/pi.py", line 1144, in get_avg_page_stat
self.avg_lh = middle(avg_lhs)
File "/home/clarson/Desktop/pi/pi/bin/pi.py", line 58, in middle
return sl[(len(sl) + 1) / 2]
IndexError: list index out of range

caritas
06-05-2008, 02:09 AM
Yes. This is a bug.

You can fix this by changing line 58 of pi.py from

return sl[(len(sl)+1)/2]

to

return sl[len(sl)/2]

I will fix this in next version.
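For reference, the old index (len(sl)+1)/2 runs past the end of a one-element list, which matches the IndexError above. The one-line fix still fails on an empty list, so a slightly more defensive variant might look like the sketch below. It is written for Python 3 (note the `//`), and both the `default` parameter and the sorting are assumptions of this example, since the rest of pi.py is not shown here.

```python
def middle(values, default=None):
    """Return the (upper) median of values, or default if values is empty."""
    if not values:
        return default
    ordered = sorted(values)
    # integer division, so the index stays an int under Python 3 as well
    return ordered[len(ordered) // 2]
```

With this, `middle([7])` gives 7 and `middle([])` gives `None` instead of raising.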

nclark
06-09-2008, 09:41 PM
has anyone had success compiling this on leopard?

nrapallo
06-09-2008, 10:40 PM
has anyone had success compiling this on leopard?

or windows? If so, how?

nclark
06-09-2008, 10:48 PM
well i don't know how you'd compile it on windows, but the author provided a link to a cygwin binary for it further up:

http://www.mobileread.com/forums/showpost.php?p=174239&postcount=17

my issue is compiling pi.c... i get a slew of errors when i try to build it against netpbm from macports.

nrapallo
06-09-2008, 11:51 PM
Oops, that's me you are referring to and a previous version's windows executable.

I am not the author and since version 0.2 I have not been able to use this code in windows due to some strange behaviour of python using pipes to external programs. It's not the author's fault, just a defect/deficiency of my windows installation! :(

nclark
06-10-2008, 12:08 AM
Oops, that's me you are referring to

ah, so i am.

anyone out there, luck on OS X or windows?

IceHand
06-10-2008, 08:47 AM
No, sorry. On Linux it works fine ...

serpentium
07-04-2008, 06:16 PM
has anyone had success compiling this on leopard?

why doesnt work in osX??? i want it!!! :help:

and... what about to put the code in the software (http://www.mobileread.com/forums/showthread.php?p=209021#post209021) to make lrf from cbz? I will really like to convert cbz to img, render them with this program and compile a lrf, with just one click :) (and maybe another click to send to my prs505 from calibre :))

DaleDe
07-04-2008, 06:27 PM
It works on OSX if you have Python installed.

Dale

nclark
07-11-2008, 01:35 AM
OS X ships with python. it always has.

/System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/

i haven't had any luck getting it to work (w/ leopard)

hansl
07-22-2008, 11:34 AM
Hi,
I tried this on Windows XP SP2 and found that the Windows version cuts images in two. Then I fiddled around to get the original package running on XP without success.
Depressed I started what I wanted to avoid and tried a Linux live CD, namely Puppy Linux 4.0 together with its development environment since I had to compile and install some packages missing in the distribution like the Python Image Library and others.
This was not too difficult since you always get a good hint where you are currently stuck :-)
In the end it worked out fine and I think this might be an alternative for Windows and MacOS users. If anybody is interested I could add the details in another post.

Only one thing still itches me in the Linux Version: I couldn't get pdftoppm to not ignore my crop box inside the input pdf file. Funny enough, pdftoppm treats crop boxes right in the Windows version??

Greetings,
hansl

hansl
07-23-2008, 02:28 AM
I forgot to praise caritas in my first post. pi is by far the best pdf-preparation-for-conversion-to-ebook tool I know, and the least time consuming regarding interaction and document preparation.
I only fell over this thread by chance and think this tool should be placed in a more prominent place and get a more telling name like ebookpdf (pdf2ebookpdf is probably too long).

hansl

nrapallo
07-23-2008, 10:05 AM
I forgot to praise caritas in my first post. pi is by far the best pdf-preparation-for-conversion-to-ebook tool I know, and the least time consuming regarding interaction and document preparation.

Can you give us a short description of what you needed to get it to compile and your experiences with your Live CD? i.e., which libraries were added (and from where), where did you store the compiled executable, how do you (re-)use the program with your Live CD setup, can you produce a 'windows binary' from that setup, etc...

I only fell over this thread by chance and think this tool should be placed in a more prominent place and get a more telling name like ebookpdf (pdf2ebookpdf is probably too long).

hansl

BTW, from a previous version, it appears that pi is short for pdftoimage. You know those Linux/Unix types, always shortening their typing experience...

hansl
07-28-2008, 05:05 PM
Can you give us a short description of what you needed to get it to compile and your experiences with your Live CD? i.e., which libraries were added (and from where), where did you store the compiled executable, how do you (re-)use the program with your Live CD setup, can you produce a 'windows binary' from that setup, etc...

Hi Nick and thanks for asking. I downloaded and installed the following:

- Puppy Linux 4.00 "Dingo" ISO CD-ROM image from puppylinux.org (burn the CD and follow installation instructions from puppylinux.org)

- devx_400.sfs from ibiblio.org - contains development environment for compiling and installation of additional source packages (no install, just save the file in the same directory as the next one below)

- pup_save.2fs created on C: drive by Puppy for additional persistent packages, virtual RAM and faster bootup (you may choose the size, I recommend the maximum which is around 1.3 GB)

- pup_400.sfs and zdrv_400.sfs copied to C: drive by Puppy for faster bootup

- tiff-3.8.2 from www.remotesensing.org (source installation)
- libpng-1.2.29 from www.libpng.org (source installation)
- zlib-1.2.3 from www.zlib.net (source installation)
All source installations went flawlessly using ./configure, make, make install

- xpdf-3.0.2 from ibiblio.org (binary installation with puppy package manager)

- imagemagick 6.0.6.2-2.7 from dotpups.de
(binary installation, newer version available but I got "convert" only running with this version)

- Python Image Library 1.1.6 from www.pythonware.com (built with python commands)

and finally
- pi_0.5 where I had to make a change in img_dir_to_pdf.sh:
The call to tiff2pdf with arguments -z -o ... always produced a barely readable pdf with white characters on a black background.
tiff2pdf -n -z -o ... solves the problem and produces black chars on white bg.

So that's it and no, I didn't even try to produce a Windows binary. I preferred to let success shine on me :cool: . Also, I had no trouble with Puppy Linux besides finding the right Imagemagick version.

BTW document exchange between Puppy and Windows is easy, you can mount the Windows filesystem in Puppy. And don't worry, the desktop environment is very newbie friendly. So, most installations and mounts are point and click.

Now I can admit that I have a Solaris past and feel happy to dive a little into the Unix feeling again.
:thanks:

hansl

Edit: I forgot to mention that like all 32 bit dinosaurs I had to delete pi.o in order to get pi compiled. All files I added to the live CD have been saved in pup_save.2fs on my hard disk, i.e. I didn't really change the CD but what it loads on bootup from my HD (I have to leave for backup now ...)

Hanselda
08-06-2008, 06:59 AM
That is really excellent work. In fact I made something very similar. But I did not go as far as to analyze the image!

Some ideas:
1. Try to use the command pdfimage from pdflib, this can compile all the png images directly into a single PDF. It is much faster than using convert again.

2. Try to quantize the color of the png file. This will reduce the image file size significantly. For e-ink screen the color depth is only 4 - 16, compared to standard 8 bit channel with 256 colors.

3. This method in fact can also work for djvu file. With ddjvu command one can convert certain page into pgm:
'ddjvu -page=%i -scale=%i -format=pgm %s %s' %(pageno, dpi, inputfile, outputfile)
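Ideas 2 and 3 can be sketched in Python as below. This is only an illustration: both function names are made up, the ddjvu options are the ones from the command line above (with the dpi value passed to -scale exactly as in that line), and 16 gray levels is just an example value for an e-ink screen.

```python
def ddjvu_command(input_file, output_file, page_no, dpi):
    """Build the ddjvu argument list for rendering one page to a .pgm file."""
    # A list (instead of one formatted string) avoids quoting problems
    # with file names that contain spaces.
    return ['ddjvu', '-page=%d' % page_no, '-scale=%d' % dpi,
            '-format=pgm', input_file, output_file]


def quantize_gray(value, levels=16):
    """Map an 8-bit gray value (0..255) to the nearest of `levels` evenly spaced grays."""
    step = 255.0 / (levels - 1)
    return int(round(round(value / step) * step))
```

The command list can be handed to `subprocess.call(...)`, and `quantize_gray` can be applied to every pixel of a page, for example `[quantize_gray(p) for p in row]`, before the image is saved.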

kentsin
08-10-2008, 04:36 AM
I use another pdf and got

page: 1
page: 2
Traceback (most recent call last):
File "/home/kentsin/pi/bin/pi_format.py", line 59, in <module>
test_all(sys.argv[1])
File "/home/kentsin/pi/bin/pi_format.py", line 16, in test_all
doc.reformat()
File "/home/kentsin/pi/bin/pi.py", line 1154, in reformat
page = Page(self, pn)
File "/home/kentsin/pi/bin/pi.py", line 690, in __init__
BasicPage.__init__(self, doc, page_no, dpi)
File "/home/kentsin/pi/bin/pi.py", line 569, in __init__
self.dpi = self.get_dpi()
File "/home/kentsin/pi/bin/pi.py", line 695, in get_dpi
dpi = self.doc.target_width * 50 / width
ZeroDivisionError: float division

caritas
08-11-2008, 09:59 AM
Version 0.6 is released. Binary and source can be downloaded from the first post of thread.

ChangeLog:

2008-08-11 Huang Ying <ying.huang.caritas@gmail.com>

* Version: 0.6

* pi.py: Initial implementation of embolden.

* pi.py: Use norm coordinate in class Page and Line.

* pi.py: Add edge trimming support.

* pi.py: Add run pages mode.

* pi.py: Add page range support.

* pi.py: Re-work ImageOutput, split multi-page image.

* pi.py: Rotate during scale if appropriate.

* img_dir_to_pdf.sh: Add color reduction support.

Gianfranco
08-11-2008, 07:03 PM
I used v0.5 to merge all files into a pdf, but the result was negated. The text was white and the page was black, what could have caused this?

Am I the only one who has experienced it?

Best regards
Gianfranco Alongi

PS: Great tool :)!

hansl
08-12-2008, 05:14 AM
I had the same problem and it went away with this fix:

in img_dir_to_pdf.sh line 27 change
tiff2pdf -z -o $cwd/$pdf_fn pdf-$pdf_fn.tiff
to
tiff2pdf -n -z -o $cwd/$pdf_fn pdf-$pdf_fn.tiff

I have not tried but in v0.6 caritas changed that to
tiff2pdf -nz ... so I guess it will work with 0.6 natively

hansl

Gianfranco
08-12-2008, 09:27 AM
Okay. Nice.
I'll try v 0.6 directly once I come home from work :)
And once again;;; what a great tool :)

Maybe you should consider releasing a howto and tutorial on the tool caritas?

Gianfranco
08-12-2008, 06:23 PM
I used the new release and I am pleased :)
I wrote about this a little in my blog (http://writert.blogspot.com)

xiblack
08-20-2008, 02:45 AM
Overall, pi version 0.3 works well, but I ran into some obstacles trying to 'windows-ize' it.

I succeeded in converting the sample .pdf using 'pi_format chap6.conf' on a Windows PC, but it was a brute-force finish that cannot be used in general. More testing/exploring is required to yield a windows only solution (in addition to the working linux based solution offered by the original poster).

In pi.py, I had to change the bold line to conform with pdftoppm.exe (from xpdf) output of the form "chap6-004-page-000004.pgm" i.e 6 digit page number prior to .pgm.
def get_img(self, dpi = 150, out_prefix = None):
    pdf_fn = self.doc.pdf_fn
    if out_prefix is None:
        out_prefix = '%spage' % (self.output_prefix,)
    spage = '%d' % (self.page_no,)
    sdpi = '%d' % (dpi,)
    ret = call(['pdftoppm', '-r', sdpi, '-f', spage, '-l', spage, '-gray',
                pdf_fn, out_prefix])
    assert(ret == 0)
    # changed line: six-digit page number, as written by pdftoppm.exe
    img_fn = '%s-%06d.pgm' % (out_prefix, self.page_no)
    return img_fn


Hi,

I tried the latest pi_06 on my SuSE OSS 10.0.0; it didn't work until I tried the fix above.

After the fix, pi_06 works well, but I encounter this error after some pages have been generated:


...
page: 30
page: 31
page: 32
Traceback (most recent call last):
File "/home/name/download/pi/bin/pi_format.py", line 67, in ?
test_all(sys.argv[1])
File "/home/name/download/pi/bin/pi_format.py", line 16, in test_all
doc.reformat()
File "/home/name/download/pi/bin/pi.py", line 1495, in reformat
page.rend()
File "/home/name/download/pi/bin/pi.py", line 761, in rend
self.img = self.img.filter(ImageFilter.MinFilter(3))
File "/usr/lib/python2.4/site-packages/PIL/Image.py", line 715, in filter
self.load()
File "/usr/lib/python2.4/site-packages/PIL/ImageFile.py", line 148, in load
self.im = Image.core.map_buffer(
ValueError: buffer is not large enough


I wonder where I can make the buffer larger, or is this some kind of limit?

ashkulz
08-25-2008, 11:09 AM
I've attached a working version of pi-0.6 which will work under Windows. I had to make a few changes in the code, which have been attached as a diff. Probably caritas could apply them in the next release (they're generic).

Usage: Unzip pi-0.6-win32.zip somewhere and run as instructed above by caritas (You'll need a working Python (http://www.python.org) with PIL (http://www.pythonware.com/products/pil/) installation). In case you want the proper fonts, unzip xpdf-fonts.zip in the same directory and adjust the paths in bin/xpdfrc (right now it's hardcoded to C:\pi).

Enjoy!

nrapallo
08-25-2008, 11:46 AM
Well done Ashish! :2thumbsup

Now I FINALLY can get to try this out (in WinXP) and perhaps incorporate it into PDFRead. Or do you want to do that as I'm at a disadvantage not knowing python as well as you (and it is your original creation)?

Thank you for doing this; I had given up trying to get my windows implementation to work.

BTW, I got a proxy server working for the REB1200 if you are interested. It's in the Fictionwise forum and called Linreb.

Regards,

caritas
08-30-2008, 04:33 AM
I've attached a working version of pi-0.6 which will work under Windows. I had to make a few changes in the code, which have been attached as a diff. Probably caritas could apply them in the next release (they're generic).



Thank you very much!

I will add it to the next version.

caritas
08-30-2008, 05:50 AM
Version 0.7 is released. The ChangeLog is as follows:

2008-08-30 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.7

* pi.py: Add LRF output support.

* pi.py: Add TOC support for LRF output format

* pi.py: Add output rotate support.

* pdfminfo: Add pdfminfo to extract PDF information such as TOC,
title, author, etc.

* overall: Add initial windows support, thanks ashkulz of
mobileread forum.

ashkulz
08-30-2008, 08:44 PM
I'm attaching pi_page_parse 0.7 compiled for windows. The usage should be similar to the 0.6 version (if you want, install 0.6 first and then replace all *.py in the bin folder from the 0.7 version).

nrapallo
09-10-2008, 01:21 PM
I tried ashkulz's win32 executable (v0.6) and obtained great results trying to convert the sample chap6.pdf into my reader's native .imp format using PDFRead on the resulting .png in the out folder.

The optimal PDFRead settings used were:

1. In Format 'imgdir'
2. Out Format 'imp2' for EBW1150 or 'imp1' for REB1200. Just substitute your reader's format here instead.
3. Use a 'portrait-p' profile and 'portrait-full' layout mode
4. Check the 'no dilation' box (I tried dilation and since the pi .png's are in a lower resolution it looks terrible!)
5. Click 'Convert' :)

Looks promising, now only to get that pi algorithm incorporated into PDFRead (with GUI)!

nrapallo
09-10-2008, 06:12 PM
Just a note that I attached .lrf and .prc versions of the above sample chap6.pdf here (http://www.mobileread.com/forums/showthread.php?p=250704#post250704).

This is for the other (popular) small screened ebook readers, Sony PRS-500/505 and Kindle. :)

caritas
09-20-2008, 09:16 AM
2008-09-20 Huang Ying <huang.ying.caritas@gmail.com>

* Version: 0.8

* overall: Reorganize program in a more modular way.

* pi.image: Add unpaper support for scanned book

* pi.image: Add column compress support for scanned book

* pi.divide: Add simple divider for divide = 1

inew
10-08-2008, 03:14 AM
Following the discussions in http://www.mobileread.com/forums/showthread.php?t=30261, I came to this thread.

I really appreciate the idea and effort of you guys. I just have a tiny question. Could you please please please consolidate a package that would be easier to use?

I read through the discussion and noticed that the program "should" be able to convert a pdf to a lrf according to the described algorithm. However, I just do not have the know-how to test the program.

Is it possible for some of you experts to develop a binary package/script with a simple interface that can be run by a dummy like me? I understand that you experts' major interest is in improving the image processing algorithm. But if the package can be used with simple clicks/commands, it will attract more people's interest and suggestions.

Thanks

massyah
11-10-2008, 06:12 AM
Hi everyone !
First of all, thanks caritas for your work and for sharing it ! You really had a good idea !

I've managed to run your code on OS X 10.5.5, though only after a lot of hassle and trial and error. It now works flawlessly! (Still, I think there's some room for optimization!)

If anyone else is interested in running this program on a Mac OS X box, I'd be glad to share my experience with them!
(Either in this thread, another thread or by PM).

Sam.

DaleDe
11-10-2008, 02:36 PM
How about building a wiki page?

Dale

Taesoo Kwon
11-12-2008, 03:33 AM
Hi Caritas,
I really like your idea of PDF-reflow. I also developed a program for a similar purpose (rendering multi-column PDFs in small devices - I read lots of two-column papers these days.) I wonder if it is okay to adopt your idea into my program?

Taesoo Kwon.

caritas
11-14-2008, 03:45 AM
I wonder if it is okay to adopt your idea into my program?


I'm glad this program is helpful to you. You can adopt the idea. But if you want to use my source code directly, please release your source code as open source too.

BTW: Can you give me a link to your program?

Taesoo Kwon
11-14-2008, 08:10 PM
Thank you. I will. The link is:

http://jupiter.kaist.ac.kr/~taesoo/projects/paperCrop/index_eng.html

dsimunic
11-29-2008, 01:36 PM
Sam,

it would be great if you could share how you made it work on Leopard. I've spent all day adapting and fixing things and finally got stuck compiling pi_page_parse: the compiler could not find the pgm.h file to include.

It would also help if you could outline the steps and dependencies for compiling.

Thanks,

damir

marcusgennaroz
12-10-2008, 08:20 AM
Sam, please could you give us some documentation on how to run it on Mac OSX?

Thanks!
Marco

Artair
12-13-2008, 03:15 AM
Nice algorithm. Many, many thanks.

daesdaemar
12-13-2008, 10:49 AM
OK, I'll be the first to admit I'm a dummy and need some help. Please, I need BASIC instructions on how to run this.

I have extracted the "pi" folder to my C: drive, opened a cmd prompt window and changed to the pi directory, and then I am stuck and can get nothing else to work.

Note that I am trainable. I asked for this type of basic help for mobidedrm and can now use that script quite nicely. I just need to learn how to get it to run.

Thanks in advance.

EDIT: OK, I think I have figured out that the original script from caritas does not run under Windows. I did download the Windows version by ashkulz, but am still stuck. I do have python 2.6 and PIL installed.

EDIT yet again: here is a pic of the error message I am getting.

daesdaemar
12-15-2008, 05:24 PM
Bump my post immediately prior to this one. Don't think I ever had a "no-response" for over two days before???

daesdaemar
12-20-2008, 09:33 AM
Bump again... Does anyone still use this script or follow its discussion?

ashkulz
12-20-2008, 11:23 AM
Bump again... Does anyone still use this script or follow its discussion?

I guess everyone is away for the holidays... Anyway, your problem is that you have PIL installed incorrectly. Either get a 2.6 version of PIL, or install Python 2.5.2 and the 2.5 version of PIL [the _imaging module is the C code which implements PIL, so it has to match the Python version it was built for].
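
One way to confirm this kind of mismatch is to check whether the C imaging module imports at all under the Python you are running. The following is only a minimal sketch: the helper name first_importable and the candidate list are illustrative assumptions, not part of pi or PIL. It assumes classic PIL exposes the extension as a top-level `_imaging` module, while the later Pillow fork ships it as `PIL._imaging`.

```python
import sys


def first_importable(candidates):
    """Return the first module name in `candidates` that imports cleanly, else None.

    Hypothetical helper for illustration; it is not part of pi or PIL.
    """
    for name in candidates:
        try:
            __import__(name)
        except ImportError:
            # Covers both "module not installed" and "C extension built
            # for a different Python version" (e.g. a DLL load failure).
            continue
        return name
    return None


# Assumed module names: classic PIL uses top-level `_imaging`,
# the Pillow fork uses `PIL._imaging`.
IMAGING_CANDIDATES = ["_imaging", "PIL._imaging"]

if __name__ == "__main__":
    found = first_importable(IMAGING_CANDIDATES)
    print("Python %d.%d" % sys.version_info[:2])
    if found:
        print("C imaging module loaded as: " + found)
    else:
        print("No C imaging module could be imported; "
              "the installed PIL probably does not match this Python.")
```

If the script reports that no module could be imported, reinstalling a PIL build made for the Python version it prints is the likely fix.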

nrapallo
12-20-2008, 02:00 PM
I guess everyone is away for the holidays...

:bigwave:

Ashish:

Would it be possible to update your pi-0.6 for Win32 to the more recent 0.8 version? I too cannot use this latest version and have never been able to use pi in windows until I used your port. :thumbsup:

ashkulz
12-22-2008, 11:28 AM
pi-0.8 isn't as easy to port to win32, as it uses bundled programs (yapdfinfo and yapdftoxml) which need to be compiled with poppler. I'll see what I can do this coming weekend...

caritas
12-23-2008, 08:07 AM
I am now working on ibsuite (image book suite), which combines pi and some other e-book image tools. The intention is to make it easier to compile and install. The main target is Unix-like platforms, though I think it may also work with Cygwin under Windows.

Bob Russell
12-23-2008, 10:21 AM
That's awesome! This is a tool I'm anxious to try, but is a bit more than I want to bite off in current form (because I'm lazy, not because it's not nice as it is).

But to see this in a simple to use tool, even if only in Linux or cygwin, would be awesome!

In addition, are there still any thoughts about adding it to PDFRead?

ashkulz
12-23-2008, 01:03 PM
Actually, caritas was all ready for it (he even released 0.8 by implementing some of the ideas we discussed) but I got caught up with a lot of work in real life and couldn't make progress on anything (as you might have noticed by my absence on the forum).

I'm thinking of restarting work on PDFRead, but don't know when it'll be done [no commitments considering my tardy track record]. I plan to do it as a single C++ executable and have got the necessary libraries to cross-compile for Windows (developing on Linux) -- the algorithm has still to be ported.

Either way, there have been lots of good improvements on the PDF front lately -- soPDF, PaperCrop, pi and PDFRead are now available and are good in certain aspects. We should also give thanks to alex_d (for PDFRasterFarian) and cacapee (for pdflrf) as they have served as inspiration for these tools, although they are no longer legally available.

Bob Russell
12-23-2008, 02:09 PM
...I'm thinking of restarting work on PDFRead, but don't know when it'll be done...

Either way, there have been lots of good improvements on the PDF front lately -- soPDF, PaperCrop, pi and PDFRead are now available and are good in certain aspects. We should also give thanks to alex_d (for PDFRasterFarian) and cacapee (for pdflrf) as they have served as inspiration for these tools, although they are no longer legally available.

That's awesome news in any timeframe. I haven't checked out the newer tools yet, but am delighted with the simplicity of PDFRead, so that's what I've been using. I guess I'll have to take another look now that some time has passed and more work has been done.

I don't think I'll ever stop being impressed with the amazing community achievements of software developers whose collective work produces such useful applications made available freely. I think I echo the sentiments of many when I say "Thanks, thanks and more thanks!"

caritas
04-05-2009, 11:50 AM
Finally, there is a successor to this program, renamed IBSuite.

http://www.mobileread.com/forums/showthread.php?t=44247

Shyne
05-18-2010, 11:50 AM
Is this only for Linux?