Old 05-02-2008, 07:25 PM   #46
rmanasa
Junior Member
rmanasa began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Oct 2007
Device: PRS-505
Very good sir. It seems that, despite my proclivity for making my own trouble, nothing I did in this experiment was out of bounds, and the results were not unexpected (if I *could* have done something to enhance my chances of better results, please be specific and I'll try it.) This is indeed a complex pdf - it's an online magazine, with non-white background, columns that stop and start in artsy, irregular fashion, lots of advertising, etc. So, I get that it's a bit more challenging than a simple document that's been ported from Word to pdf format.

That being said, I have a pdf-to-Palm converter that came with my Treo 650 that has converted almost all content of all the issues of this monthly e-magazine, oddities and all. Only the very first issues of the mag had any problem converting, and those issues were mostly word breaks: no lost content, cracked image conversion, whatever.

I find it baffling that I can carry all these converted files on my Treo and none of them on my Sony Reader. How can this be so transparent on one device and so impossible on another? (Only the truly ignorant can ask such questions. Probably the one and only blessing I enjoy in this matter.)
Old 05-02-2008, 07:40 PM   #47
nrapallo
GuteBook/Mobi2IMP Creator
nrapallo ought to be getting tired of karma fortunes by now.
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
Since this program, pi.exe, is not working yet, and I am not its author, I cannot improve your chances of getting your .pdf processed properly.

May I offer another route: Try PDFRead 1.8.2 as explained here to a user looking to convert .pdf to the Kindle. Just use the prs-505 Profile for the Sony .lrf output format. The 'default' is a landscape mode. Try also 'landscape-half' or 'landscape-full' layout modes.

Please note that the resulting ebook will be just 'images', but they can be rotated, dilated, sharpened, etc...

BTW, I'm the author of that software, so feel free to send me any questions you may have over at the PDFRead main thread here.

Last edited by nrapallo; 05-02-2008 at 07:43 PM.
Old 05-02-2008, 08:52 PM   #48
rmanasa
Junior Member
rmanasa began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Oct 2007
Device: PRS-505
k. I will check it out over the weekend. The primary issue, as you know, is getting the text large enough to read without a magnifying glass. I can deal with anything else - lack of graphics, charts, etc. - but the older I get, the blinder I become.
Old 05-03-2008, 02:39 AM   #49
caritas
Enthusiast
caritas doesn't litter
 
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
Version 0.3 is released. You can get the source code from the first post of the thread.
Old 05-03-2008, 09:10 AM   #50
Moho
Junior Member
Moho began at the beginning.
 
Posts: 1
Karma: 10
Join Date: May 2008
Device: Sony 505
Hi there,

I have a few questions about how to use this tool. First I used xpdf to get a whole bunch of pgms out of a pdf. Then I used pi.exe (based on pi 01, I think) to convert each pgm into a more readable pgm. I opened the pgm with GIMP and it seems to be good.
My first question is how to convert all the pgms automatically (the ebook has over 400 pages), and how to convert all the pgms back into one pdf? Thanks in advance.
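A minimal sketch of the batch step asked about here, assuming the converter can be run once per image with the image path as its last argument. The thread does not document pi.exe's command line, so the command is passed in by the caller; `build_commands` and `run_all` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_commands(image_dir, command_prefix, pattern="*.pgm"):
    """Return one argv list per matching image, in sorted filename order.

    command_prefix is whatever the caller uses to start the converter;
    the image path is appended as the final argument (an assumption).
    """
    images = sorted(Path(image_dir).glob(pattern))
    return [list(command_prefix) + [str(img)] for img in images]


def run_all(commands):
    """Run each command in turn, raising on the first non-zero exit code."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Sorting by filename keeps zero-padded page images such as page-000001.pgm and page-000002.pgm in page order.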
Old 05-03-2008, 10:27 AM   #51
IceHand
Linux User
IceHand is a rising star in the heavens
Posts: 323
Karma: 13682
Join Date: Aug 2007
Location: Germany
Device: Kindle 3
Quote:
Originally Posted by caritas View Post
Version 0.3 is released. You can get the source code from the first post of the thread.
Works like a charm with the PDFs I've tested, thank you
Old 05-04-2008, 11:16 PM   #52
nrapallo
GuteBook/Mobi2IMP Creator
nrapallo ought to be getting tired of karma fortunes by now.
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
Overall, pi version 0.3 works well, but I ran into some obstacles trying to 'windows-ize' it.

I succeeded in converting the sample .pdf using 'pi_format chap6.conf' on a Windows PC, but it was a brute-force finish that cannot be used in general. More testing/exploring is required to yield a Windows-only solution (in addition to the working Linux-based solution offered by the original poster).

In pi.py, I had to change the img_fn line (marked with a comment below) to match the output of pdftoppm.exe (from xpdf), which names files like "chap6-004-page-000004.pgm", i.e. with a 6-digit page number before .pgm.
Code:
def get_img(self, dpi=150, out_prefix=None):
    pdf_fn = self.doc.pdf_fn
    if out_prefix is None:
        out_prefix = '%spage' % (self.output_prefix,)
    spage = '%d' % (self.page_no,)
    sdpi = '%d' % (dpi,)
    # Render only this page (-f/-l) as a grayscale .pgm at the given dpi
    ret = call(['pdftoppm', '-r', sdpi, '-f', spage, '-l', spage, '-gray',
                pdf_fn, out_prefix])
    assert ret == 0
    # CHANGED LINE: 6-digit, zero-padded page number in the file name
    img_fn = '%s-%06d.pgm' % (out_prefix, self.page_no)
    return img_fn
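Since the change above suggests that the number of digits in the file name depends on which pdftoppm build is used, a more tolerant lookup could search for the page image instead of hard-coding the width. This is only a sketch: `find_page_image` is a hypothetical helper, not part of pi.py, and it assumes the output is always named `<out_prefix>-<digits>.pgm`.

```python
import glob
import re
from pathlib import Path


def find_page_image(out_prefix, page_no, ext=".pgm"):
    """Return the image path for page_no, or None if no file matches.

    The digits after the last '-' are compared as an integer, so
    '-4', '-004' and '-000004' are all accepted for page 4.
    """
    prefix = Path(out_prefix)
    name_re = re.compile(
        re.escape(prefix.name) + r"-(\d+)" + re.escape(ext) + r"$")
    wildcard = glob.escape(prefix.name) + "-*" + ext
    for candidate in sorted(prefix.parent.glob(wildcard)):
        match = name_re.match(candidate.name)
        if match and int(match.group(1)) == page_no:
            return candidate
    return None
```

In get_img, the last two lines could then return the result of `find_page_image(out_prefix, self.page_no)` after checking that it is not None.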
Also, pi.py was crashing on the os.unlink(img_fn) call near the end of parse(), so I commented that line out (see the code below). This leaves the .pgm behind, since deleting it doesn't work for some unknown reason.
Traceback (most recent call last):
  File "pi_format.py", line 29, in <module>
  File "pi_format.py", line 7, in test_all
  File "pi.pyc", line 667, in __init__
  File "pi.pyc", line 704, in get_avg_page_stat
  File "pi.pyc", line 337, in __init__
  File "pi.pyc", line 386, in parse
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'out/chap6-004-page-000004.pgm'
Code:
def parse(self, dpi=None):
    if dpi is None:
        dpi = self.dpi
    img_fn = self.get_img(dpi)
    # pi_page_parse prints one record per line: 'char', 'line', or a bounding box
    p = Popen(['pi_page_parse', img_fn], stdout=PIPE)
    self.lines = []
    for l in p.stdout:
        ws = l.split()
        if ws[0] == 'char':
            # a character record is attached to the most recent line
            pair = map(int, ws[1:])
            ch = Char(pair)
            ln.append_char(ch)
        elif ws[0] == 'line':
            bbox = map(int, ws[1:])
            ln = Line(self, bbox)
            self.append_line(ln)
        else:
            # any other record is stored as self.bbox
            self.bbox = map(int, ws[1:])
    self.img = Image.open(img_fn)
    # COMMENTED OUT: raised WindowsError [Error 32] on Windows
    #os.unlink(img_fn)
    self.set_space()
But then, when I thought everything was working, I was getting random aborts due to PIL .pgm reading/writing problems (see the IOError at the end of the output below):
Code:
page: 4
Error: No display font for 'Symbol'
Error: No display font for 'ZapfDingbats'
Traceback (most recent call last):
  File "pi_format.py", line 29, in <module>
  File "pi_format.py", line 8, in test_all
  File "pi.pyc", line 722, in reformat
  File "pi.pyc", line 605, in divide
  File "pi.pyc", line 647, in put_seg
  File "pi.pyc", line 109, in get_img
  File "Image.pyc", line 737, in crop
  File "ImageFile.pyc", line 192, in load
IOError: image file is truncated (1111 bytes not processed)
The odd thing is the .pgm image files appear ok even though I get the 'truncated' message. The only way I got it to finish was to generate all the .pgm first, protect them from overwriting by marking them as 'read-only' and then allow 'pi_format chap6.conf' to finish.
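Both errors above would be consistent with the .pgm file still being open when it is deleted or regenerated. That is an assumption, not something the thread confirms: PIL's Image.open reads lazily from the file it is given, and Windows refuses to delete a file that is still open. A minimal sketch of one possible workaround is to read the whole file into memory, close it, and only then delete it. The helper name `read_then_delete` is hypothetical and is not part of pi.py.

```python
import io
import os


def read_then_delete(path):
    """Return the file's contents as an in-memory buffer, then delete the file."""
    with open(path, "rb") as handle:
        data = handle.read()
    # The handle is closed at this point, so this process no longer
    # holds the file open when it is removed.
    os.unlink(path)
    return io.BytesIO(data)
```

The returned buffer can be passed to Image.open in place of the file name, since Image.open accepts a file-like object. If the pi_page_parse process itself still holds the file, waiting for that process to exit before deleting would also be needed.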

In the end, I was able to collect all the generated .gifs and create a 1150 .imp ebook (and the first 17 pages only for Kindle/Cybook .prc and Sony .lrf ebooks). The results are far from perfect, but promising.
Attached Files
File Type: imp chap6-001-0.imp (2.42 MB, 503 views)
File Type: prc chap6-001-0-pages1-17.prc (883.1 KB, 2834 views)
File Type: lrf chap6-001-0-pages1-17.lrf (554.6 KB, 483 views)
File Type: zip gif-pages1-17.zip (1.21 MB, 500 views)

Last edited by nrapallo; 05-04-2008 at 11:23 PM. Reason: added resulting .gif for first 17 pages in ebook for viewing
Old 05-10-2008, 08:20 AM   #53
caritas
Enthusiast
caritas doesn't litter
 
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
Version 0.4 is released.

ChangeLog:
- Some algorithms are configurable
- For text that may be problematic, both merged and divided versions are presented
Old 05-15-2008, 07:07 AM   #54
bazzargh
Junior Member
bazzargh began at the beginning.
 
Posts: 7
Karma: 10
Join Date: May 2008
Device: iliad
BTW reading over this thread it seems worth mentioning some previous research into reflowing scanned text for small devices:
http://pubs.iupr.org/DATA/2002-breuel-wdabook.pdf

In that paper they first identify text line segments, images, and column boundaries, similarly to this program, but then the text is segmented into words. Once you've broken the document down into word-sized chunks and know how they aggregate into paragraphs and columns, there are numerous ways to reflow the document; one they describe is embedding all the images into HTML so the scanned document now reflows when you resize the browser window. They go on to talk about how to output this most compactly for a PDA. Interesting stuff.
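To illustrate the reflow idea in its simplest form, here is a sketch of greedy line filling over word images. It is not the algorithm from the paper; it only assumes that each word has already been cut out and that its pixel width is known. The function name `reflow` and the `gap` parameter are illustrative.

```python
def reflow(word_widths, page_width, gap=4):
    """Greedily pack words into lines no wider than page_width.

    word_widths: word image widths in pixels, in reading order.
    gap: horizontal space in pixels inserted between adjacent words.
    Returns a list of lines, each a list of indices into word_widths.
    A word wider than page_width is placed alone on its own line.
    """
    lines, current, used = [], [], 0
    for index, width in enumerate(word_widths):
        needed = width if not current else used + gap + width
        if current and needed > page_width:
            # The word does not fit: close this line and start a new one.
            lines.append(current)
            current, needed = [], width
        current.append(index)
        used = needed
    if current:
        lines.append(current)
    return lines
```

For example, three 30-pixel words on a 70-pixel-wide screen end up as two words on the first line and one on the second.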
Old 05-15-2008, 08:52 AM   #55
caritas
Enthusiast
caritas doesn't litter
 
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
Thank you very much for your information.
Old 05-17-2008, 12:27 AM   #56
caritas
Enthusiast
caritas doesn't litter
 
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
Version 0.5 is released.

ChangeLog:

* pi.py: Detect word, and break lines at word end when possible.

* pi.py: Re-align the 'split line segment' (second half of line)
to align with the next line's indenting when appropriate. This
will make the first line indent and bullet items line up better.

* img_dir_to_pdf.sh: Added to convert from images to pdf.
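The word detection mentioned in this ChangeLog is not shown in the thread. One common way to do it, sketched here purely as an assumption about the general technique, is to group character bounding boxes on a line by the size of the horizontal gap between them. `group_words` and `min_gap` are hypothetical names, not taken from pi.py.

```python
def group_words(char_boxes, min_gap):
    """Group character boxes on one text line into words.

    char_boxes: (left, right) pixel extents, sorted left to right.
    min_gap: a gap of at least this many pixels starts a new word.
    Returns a list of (left, right) extents, one per word.
    """
    words = []
    for left, right in char_boxes:
        if words and left - words[-1][1] < min_gap:
            # Small gap: extend the current word to cover this character.
            words[-1] = (words[-1][0], max(words[-1][1], right))
        else:
            words.append((left, right))
    return words
```

A line can then be broken at the right edge of any word extent rather than in the middle of a word.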
Old 05-18-2008, 07:01 AM   #57
ashkulz
Addict
ashkulz will become famous soon enough
Posts: 350
Karma: 705
Join Date: Dec 2006
Location: Mumbai, India
Device: Kindle 1/REB 1200
Quote:
Originally Posted by bazzargh View Post
BTW reading over this thread it seems worth mentioning some previous research into reflowing scanned text for small devices:
http://pubs.iupr.org/DATA/2002-breuel-wdabook.pdf
Actually, those are the very algorithms I was trying to work on (I'm the original author of PDFRead) but then gave up on, as it required too much effort to implement them from scratch.

Happily, one of the authors of these publications (Thomas Breuel) is now leading the development of OCRopus at Google, which is a document analysis and OCR system. Browsing through the code, most of the algorithms already seem to be implemented (and some advances on them, too): I plan to integrate it into PDFRead soon. (I've already contributed some patches to get it compiling under Windows.) The library interface can be scripted via Lua, so I'm currently trying to put together the bits and pieces to get that approach working.
Old 05-18-2008, 07:05 PM   #58
J_A
Junior Member
J_A began at the beginning.
 
Posts: 1
Karma: 10
Join Date: May 2008
Device: Kindle
Nick,

Thanks for turning me on to PDFRead. One question. I have a secure pdf document purchased from Wiley a while back and I can only print it out or read it in Adobe Digital Editions. Is there any way to get the document into PDFRead to convert it? It's spitting out blank pages now, I gather because of the security on the file.

JA
Old 05-18-2008, 07:38 PM   #59
nrapallo
GuteBook/Mobi2IMP Creator
nrapallo ought to be getting tired of karma fortunes by now.
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
Try printing your secure pdf using a pdf printer driver like the free PrimoPDF printer driver.

Last edited by nrapallo; 05-19-2008 at 10:48 PM. Reason: added link to free PrimoPDF download
Old 05-20-2008, 11:20 AM   #60
bazzargh
Junior Member
bazzargh began at the beginning.
 
Posts: 7
Karma: 10
Join Date: May 2008
Device: iliad
Quote:
Originally Posted by ashkulz View Post
Actually, those are the very algorithms I was trying to work on (I'm the original author of PDFRead) ... I plan to integrate it sometime into PDFRead soon.
Excellent! And thanks for the link to OCRopus. I'm going to have to have a play with this stuff.