An algorithm to render PDF in small devices - Page 4

rmanasa · 05-02-2008, 07:25 PM

Very good sir. It seems that, despite my proclivity for making my own trouble, nothing I did in this experiment was out of bounds, and the results were not unexpected (if I *could* have done something to enhance my chances of better results, please be specific and I'll try it.) This is indeed a complex pdf - it's an online magazine, with non-white background, columns that stop and start in artsy, irregular fashion, lots of advertising, etc. So, I get that it's a bit more challenging than a simple document that's been ported from Word to pdf format.

That being said, I have a pdf-to-Palm converter that came with my Treo 650 that has converted almost all content of all the issues of this monthly e-magazine, oddities and all. Only the very first issues of the mag had any problem converting, and those issues were mostly word breaks: no lost content, cracked image conversion, whatever.

I find it baffling that I can carry all these converted files on my Treo and none of them on my Sony Reader. How can this be so transparent on one device and so impossible on another? (Only the truly ignorant can ask such questions. Probably the one and only blessing I enjoy in this matter.)

nrapallo · 05-02-2008, 07:40 PM

Since this program, pi.exe, is not working yet, and not being the author of it, I cannot improve on your chances to get your .pdf processed properly.

May I offer another route: Try PDFRead 1.8.2 as explained here to a user looking to convert .pdf to the Kindle. Just use the prs-505 Profile for the Sony .lrf output format. The 'default' is a landscape mode. Try also 'landscape-half' or 'landscape-full' layout modes.

Please note that the resulting ebook will be just 'images', but they can be rotated, dilated, sharpened, etc...

BTW, I'm the author of that software, so feel free to send me any questions you may have over at the PDFRead main thread here.

rmanasa · 05-02-2008, 08:52 PM

k. I will check it out over the weekend. The primary issue, as you know, is getting the text large enough to read without a magnifying glass. I can deal with anything else - lack of graphics, charts, etc. - but the older I get, the blinder I become.

caritas · 05-03-2008, 02:39 AM

Version 0.3 is released. You can get the source code from the first post of the thread.

Moho · 05-03-2008, 09:10 AM

Hi there,

I have a few questions about how to use this tool. First I used xpdf to get a whole bunch of pgms out of a pdf. Then I used pi.exe (based on pi 01, I think) to convert a pgm into a more readable pgm. I opened the result in GIMP and it seems to be good.
My first question is how to convert all the pgms automatically (the ebook has over 400 pages), and then how to combine all the converted pgms back into one pdf. Thanks in advance.
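
For the batch half of that question, a minimal sketch of a per-page loop follows. The pi.exe command-line arguments are not documented in this thread, so the command is passed in as a template; `convert_all` and the `{src}`/`{dst}` placeholders are hypothetical names used only for this illustration.

```python
import subprocess
from pathlib import Path


def convert_all(src_dir, dst_dir, command):
    """Run `command` once per .pgm file in src_dir, in sorted (page) order.

    `command` is a list of arguments in which "{src}" and "{dst}" are
    replaced by the input and output paths. The real converter arguments
    are an assumption the caller has to supply.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(src_dir.glob("*.pgm")):
        dst = dst_dir / src.name
        args = [a.replace("{src}", str(src)).replace("{dst}", str(dst))
                for a in command]
        subprocess.run(args, check=True)  # stop at the first failing page
        outputs.append(dst)
    return outputs
```

A call might look like `convert_all("pages", "converted", ["pi.exe", "{src}", "{dst}"])`, where the argument order shown for pi.exe is a guess and needs checking against the actual tool.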

IceHand · 05-03-2008, 10:27 AM

Quote:

Originally Posted by caritas

Version 0.3 is released. You can get the source code from the first post of the thread.

Works like a charm with the PDFs I've tested, thank you

nrapallo · 05-04-2008, 11:16 PM

Overall, pi version 0.3 works well, but I ran into some obstacles trying to 'windows-ize' it.

I succeeded in converting the sample .pdf using 'pi_format chap6.conf' on a Windows PC, but only through a brute-force workaround that cannot be used in general. More testing and exploring is required to arrive at a Windows-only solution (in addition to the working Linux-based solution offered by the original poster).

In pi.py, I had to change the line that builds img_fn to match the output of pdftoppm.exe (from xpdf), which has the form "chap6-004-page-000004.pgm", i.e. a 6-digit page number before .pgm.

Code:

    def get_img(self, dpi = 150, out_prefix = None):
        pdf_fn = self.doc.pdf_fn
        if out_prefix is None:
            out_prefix = '%spage' % (self.output_prefix,)
        spage = '%d' % (self.page_no,)
        sdpi = '%d' % (dpi,)
        # Render just this one page as a grayscale .pgm
        ret = call(['pdftoppm', '-r', sdpi, '-f', spage, '-l', spage, '-gray',
                    pdf_fn, out_prefix])
        assert(ret == 0)
        # Changed line: pdftoppm.exe (xpdf) pads the page number to 6 digits
        img_fn = '%s-%06d.pgm' % (out_prefix, self.page_no)
        return img_fn
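
Since the number of digits evidently differs between pdftoppm builds, one alternative to hard-coding the width is to look for whatever file was actually written. This is only a sketch, not part of pi.py: `find_page_image` is a hypothetical helper, and it assumes nothing beyond a name of the form '<out_prefix>-<digits>.pgm'.

```python
import glob
import os
import re


def find_page_image(out_prefix, page_no, ext=".pgm"):
    """Return the image written for page_no, whatever the zero padding."""
    base = os.path.basename(out_prefix)
    suffix = re.compile(r"-(\d+)" + re.escape(ext))
    for fn in sorted(glob.glob(glob.escape(out_prefix) + "-*" + ext)):
        # What follows the prefix must be exactly '-<digits><ext>'
        m = suffix.fullmatch(os.path.basename(fn)[len(base):])
        if m and int(m.group(1)) == page_no:
            return fn
    raise FileNotFoundError("no image found for page %d" % page_no)
```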

Also, pi.py was crashing when the os.unlink(img_fn) line below was executed, so I commented it out (which leaves the .pgm behind, since deleting it fails for some unknown reason).

Traceback (most recent call last):
  File "pi_format.py", line 29, in <module>
  File "pi_format.py", line 7, in test_all
  File "pi.pyc", line 667, in __init__
  File "pi.pyc", line 704, in get_avg_page_stat
  File "pi.pyc", line 337, in __init__
  File "pi.pyc", line 386, in parse
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'out/chap6-004-page-000004.pgm'

Code:

    def parse(self, dpi = None):
        if dpi is None:
            dpi = self.dpi
        img_fn = self.get_img(dpi)
        p = Popen(['pi_page_parse', img_fn], stdout = PIPE)
        self.lines = []
        # Each output line is 'char ...', 'line ...' or the page bounding box
        for l in p.stdout:
            ws = l.split()
            if ws[0] == 'char':
                pair = map(int, ws[1:])
                ch = Char(pair)
                ln.append_char(ch)
            elif ws[0] == 'line':
                bbox = map(int, ws[1:])
                ln = Line(self, bbox)
                self.append_line(ln)
            else:
                self.bbox = map(int, ws[1:])
        self.img = Image.open(img_fn)
        # Commented out: raises WindowsError 32 on Windows
        #os.unlink(img_fn)
        self.set_space()
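
One plausible cause, though it is not confirmed here, is that Windows refuses to delete a file that is still open, whether by the pi_page_parse child process or by the image handle in this process. A sketch under that assumption waits for the child to exit and reads the image fully into memory before deleting it; `run_then_slurp` is a hypothetical helper, not part of pi.py.

```python
import io
import os
import subprocess


def run_then_slurp(cmd, img_fn):
    """Run cmd to completion, then return (stdout lines, image bytes).

    The image is read fully into memory and its handle is closed before
    os.unlink, so nothing in this process still holds the file open.
    """
    done = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    lines = done.stdout.decode().splitlines()
    with open(img_fn, "rb") as f:
        data = io.BytesIO(f.read())
    os.unlink(img_fn)
    return lines, data
```

PIL's Image.open accepts a file-like object, so the returned buffer could be passed to it in place of the file name.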

But then, when I thought everything was working, I was getting random aborts due to PIL .pgm reading/writing problems, as shown by the final IOError line below:

Code:

page: 4
Error: No display font for 'Symbol'
Error: No display font for 'ZapfDingbats'
Traceback (most recent call last):
  File "pi_format.py", line 29, in <module>
  File "pi_format.py", line 8, in test_all
  File "pi.pyc", line 722, in reformat
  File "pi.pyc", line 605, in divide
  File "pi.pyc", line 647, in put_seg
  File "pi.pyc", line 109, in get_img
  File "Image.pyc", line 737, in crop
  File "ImageFile.pyc", line 192, in load
IOError: image file is truncated (1111 bytes not processed)

The odd thing is the .pgm image files appear ok even though I get the 'truncated' message. The only way I got it to finish was to generate all the .pgm first, protect them from overwriting by marking them as 'read-only' and then allow 'pi_format chap6.conf' to finish.
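
The read-only workaround hints (this is a guess, not a diagnosis) that a page image was being regenerated while an earlier copy was still being read. If so, generating each image only once would have the same effect without touching file attributes. In the sketch below, `ensure_image` and its `generate` callback are hypothetical names; the callback stands in for the pdftoppm call.

```python
import os


def ensure_image(img_fn, generate):
    """Call generate(img_fn) only if img_fn is missing or empty.

    Returns True when the file was generated, False when it was reused.
    """
    if os.path.exists(img_fn) and os.path.getsize(img_fn) > 0:
        return False
    generate(img_fn)
    return True
```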

In the end, I was able to collect all the generated .gifs and create a 1150 .imp ebook (and the first 17 pages only for Kindle/Cybook .prc and Sony .lrf ebooks). The results are far from perfect, but promising.

caritas · 05-10-2008, 08:20 AM

Version 0.4 is released.

ChangeLog:
- Some algorithms are now configurable
- For text that may be problematic, both the merged and the divided versions are presented

bazzargh · 05-15-2008, 07:07 AM

BTW reading over this thread it seems worth mentioning some previous research into reflowing scanned text for small devices:
http://pubs.iupr.org/DATA/2002-breuel-wdabook.pdf

In that paper they first identify text line segments, images, and column boundaries, similarly to this program, but then the text is segmented into words. Once you've broken the document down into word-sized chunks and know how they aggregate into paragraphs and columns, there are numerous ways to reflow the document; one they describe is embedding all the images into html so the scanned document now reflows when you resize the browser window. They go on to talk about how to output this most compactly for a PDA. Interesting stuff.
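
To make the reflow idea concrete, here is a minimal sketch of greedy line filling over word-sized image chunks. It is not the paper's algorithm, only the simplest scheme of that kind; `reflow`, the pixel widths, and the `gap` value are assumptions for illustration.

```python
def reflow(word_widths, page_width, gap=4):
    """Greedy line filling: return lines as lists of word indices.

    A word that is wider than the page still gets a line of its own.
    """
    lines, current, used = [], [], 0
    for i, w in enumerate(word_widths):
        needed = w if not current else used + gap + w
        if current and needed > page_width:
            lines.append(current)  # close the full line
            current, needed = [], w
        current.append(i)
        used = needed
    if current:
        lines.append(current)
    return lines
```

Calling it again with a different `page_width` gives the new layout, which is the same effect as the browser-resize example in the paper.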

caritas · 05-15-2008, 08:52 AM

Thank you very much for your information.

caritas · 05-17-2008, 12:27 AM

Version 0.5 is released.

ChangeLog:

* pi.py: Detect word, and break lines at word end when possible.

* pi.py: Re-align the 'split line segment' (second half of line)
to align with the next line's indenting when appropriate. This
will make the first line indent and bullet items line up better.

* img_dir_to_pdf.sh: Added to convert from images to pdf.
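
As an illustration of the first ChangeLog item, word detection can be sketched as grouping character boxes by the horizontal gap between them, then breaking a line at the last word that fits. This is not the code in pi.py; `split_words`, `break_at_word`, and the `min_gap` threshold are hypothetical.

```python
def split_words(char_boxes, min_gap):
    """Group (left, right) character extents on one line into words.

    A new word starts wherever the gap to the previous character is
    at least min_gap pixels.
    """
    words = []
    for left, right in sorted(char_boxes):
        if words and left - words[-1][-1][1] < min_gap:
            words[-1].append((left, right))
        else:
            words.append([(left, right)])
    return words


def break_at_word(words, max_right):
    """Index of the first word that extends past max_right."""
    for i, word in enumerate(words):
        if word[-1][1] > max_right:
            return i
    return len(words)
```

A result of 0 means not even the first word fits, which is the case where a break at a word end is not possible.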

ashkulz · 05-18-2008, 07:01 AM

Quote:

Originally Posted by bazzargh

BTW reading over this thread it seems worth mentioning some previous research into reflowing scanned text for small devices:
http://pubs.iupr.org/DATA/2002-breuel-wdabook.pdf

Actually, those are the very algorithms I was trying to work on (I'm the original author of PDFRead) but then gave up on, as it required too much effort to implement them from scratch.

Happily, one of the authors of these publications (Thomas Breuel) is now leading the development of OCRopus at Google, which is a document analysis and OCR system. Browsing through the code, most of the algorithms already seem to be implemented (along with some advances beyond them); I plan to integrate it into PDFRead sometime soon. (I've already contributed some patches to get it compiling under Windows.) The library interface can be scripted via Lua, so I'm currently trying to put together the bits and pieces to get that approach working.

J_A · 05-18-2008, 07:05 PM

Nick,

Thanks for turning me on to PDFRead. One question. I have a secure pdf document purchased from Wiley a while back, and I can only print it out or read it in Adobe Digital Editions. Is there any way to get the document into PDFRead to convert it? It's spitting out blank pages now, I gather because of the security on the file.

JA

nrapallo · 05-18-2008, 07:38 PM

Try printing your secure pdf using a pdf printer driver like the free PrimoPDF printer driver.

bazzargh · 05-20-2008, 11:20 AM

Quote:

Originally Posted by ashkulz

Actually, those are the very algorithms I was trying to work on (I'm the original author of PDFRead) ... I plan to integrate it into PDFRead sometime soon.

Excellent! And thanks for the link to OCRopus. I'm going to have to have a play with this stuff.
