05-02-2008, 07:25 PM | #46 |
Junior Member
Posts: 8
Karma: 10
Join Date: Oct 2007
Device: PRS-505
|
Very good sir. It seems that, despite my proclivity for making my own trouble, nothing I did in this experiment was out of bounds, and the results were not unexpected (if I *could* have done something to enhance my chances of better results, please be specific and I'll try it.) This is indeed a complex pdf - it's an online magazine, with non-white background, columns that stop and start in artsy, irregular fashion, lots of advertising, etc. So, I get that it's a bit more challenging than a simple document that's been ported from Word to pdf format.
That being said, I have a pdf-to-Palm converter that came with my Treo 650 that has converted almost all content of all the issues of this monthly e-magazine, oddities and all. Only the very first issues of the mag had any problem converting, and those issues were mostly word breaks: no lost content, cracked image conversion, whatever. I find it baffling that I can carry all these converted files on my Treo and none of them on my Sony Reader. How can this be so transparent on one device and so impossible on another? (Only the truly ignorant can ask such questions. Probably the one and only blessing I enjoy in this matter.) |
05-02-2008, 07:40 PM | #47 | |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Quote:
May I offer another route: Try PDFRead 1.8.2 as explained here to a user looking to convert .pdf to the Kindle. Just use the prs-505 Profile for the Sony .lrf output format. The 'default' is a landscape mode. Try also 'landscape-half' or 'landscape-full' layout modes. Please note that the resulting ebook will be just 'images', but they can be rotated, dilated, sharpened, etc... BTW, I'm the author of that software, so feel free to send me any questions you may have over at the PDFRead main thread here. Last edited by nrapallo; 05-02-2008 at 07:43 PM. |
|
|
05-02-2008, 08:52 PM | #48 |
Junior Member
Posts: 8
Karma: 10
Join Date: Oct 2007
Device: PRS-505
|
k. I will check it out over the weekend. The primary issue, as you know, is getting the text large enough to read without a magnifying glass. I can deal with anything else - lack of graphics, charts, etc. - but the older I get, the blinder I become.
|
05-03-2008, 02:39 AM | #49 |
Enthusiast
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
|
Version 0.3 is released. You can get the source code from the first post of the thread.
|
05-03-2008, 09:10 AM | #50 |
Junior Member
Posts: 1
Karma: 10
Join Date: May 2008
Device: Sony 505
|
Hi there,
I have a few questions about how to use this tool. First I used xpdf to get a whole bunch of pgms out of a pdf. Then I used pi.exe (based on pi 01, I think) to convert a pgm into a more readable pgm. I opened the result in GIMP and it seems to be good. My first question is how to convert all the pgms automatically (the ebook has over 400 pages), and how to combine all the pgms back into one pdf. Thanks in advance |
|
05-03-2008, 10:27 AM | #51 |
Linux User
Posts: 323
Karma: 13682
Join Date: Aug 2007
Location: Germany
Device: Kindle 3
|
|
05-04-2008, 11:16 PM | #52 |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Overall, pi version 0.3 works well, but I ran into some obstacles trying to 'windows-ize' it.
I succeeded in converting the sample .pdf using 'pi_format chap6.conf' on a Windows PC, but it was a brute-force finish that cannot be used in general. More testing/exploring is required to yield a Windows-only solution (in addition to the working Linux-based solution offered by the original poster). In pi.py, I had to change the bold line (the one that builds img_fn) to conform with pdftoppm.exe (from xpdf) output of the form "chap6-004-page-000004.pgm", i.e. a 6-digit page number prior to .pgm. Code:
def get_img(self, dpi = 150, out_prefix = None):
    pdf_fn = self.doc.pdf_fn
    if out_prefix is None:
        out_prefix = '%spage' % (self.output_prefix,)
    spage = '%d' % (self.page_no,)
    sdpi = '%d' % (dpi,)
    ret = call(['pdftoppm', '-r', sdpi, '-f', spage, '-l', spage, '-gray', pdf_fn, out_prefix])
    assert(ret == 0)
    img_fn = '%s-%06d.pgm' % (out_prefix, self.page_no)
    return img_fn

Traceback (most recent call last):
  File "pi_format.py", line 29, in <module>
  File "pi_format.py", line 7, in test_all
  File "pi.pyc", line 667, in __init__
  File "pi.pyc", line 704, in get_avg_page_stat
  File "pi.pyc", line 337, in __init__
  File "pi.pyc", line 386, in parse
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'out/chap6-004-page-000004.pgm'
Code:
def parse(self, dpi = None):
    if dpi is None:
        dpi = self.dpi
    img_fn = self.get_img(dpi)
    p = Popen(['pi_page_parse', img_fn], stdout = PIPE)
    self.lines = []
    for l in p.stdout:
        ws = l.split()
        if ws[0] == 'char':
            pair = map(int, ws[1:])
            ch = Char(pair)
            ln.append_char(ch)
        elif ws[0] == 'line':
            bbox = map(int, ws[1:])
            ln = Line(self, bbox)
            self.append_line(ln)
        else:
            self.bbox = map(int, ws[1:])
    self.img = Image.open(img_fn)
    #os.unlink(img_fn)
    self.set_space()
Code:
page: 4
Error: No display font for 'Symbol'
Error: No display font for 'ZapfDingbats'
Traceback (most recent call last):
  File "pi_format.py", line 29, in <module>
  File "pi_format.py", line 8, in test_all
  File "pi.pyc", line 722, in reformat
  File "pi.pyc", line 605, in divide
  File "pi.pyc", line 647, in put_seg
  File "pi.pyc", line 109, in get_img
  File "Image.pyc", line 737, in crop
  File "ImageFile.pyc", line 192, in load
IOError: image file is truncated (1111 bytes not processed)
In the end, I was able to collect all the generated .gifs and create a 1150 .imp ebook (and the first 17 pages only for Kindle/Cybook .prc and Sony .lrf ebooks). The results are far from perfect, but promising.
Last edited by nrapallo; 05-04-2008 at 11:23 PM. Reason: added resulting .gif for first 17 pages in ebook for viewing |
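The page-number padding in the image filename evidently differs between pdftoppm builds, so hard-coding '%06d' only moves the problem around. A possible alternative, offered as a rough sketch and not as what pi.py does, is to look for the output file by page number whatever the padding. The helper name find_page_image is hypothetical, and the '<out_prefix>-<digits>.pgm' naming is an assumption taken from the example filename above.

```python
import glob
import re


def find_page_image(out_prefix, page_no, ext='.pgm'):
    """Return the image written for page_no, whatever zero-padding was used.

    Assumption: the converter names its output '<out_prefix>-<digits><ext>',
    as in 'chap6-004-page-000004.pgm'.
    """
    pattern = re.compile(r'-(\d+)' + re.escape(ext) + r'$')
    for path in sorted(glob.glob(glob.escape(out_prefix) + '-*' + ext)):
        match = pattern.search(path)
        # Compare numerically so '4', '004' and '000004' all match page 4.
        if match and int(match.group(1)) == page_no:
            return path
    raise IOError('no image found for page %d with prefix %r'
                  % (page_no, out_prefix))
```

In get_img, the img_fn assignment could then become a call such as find_page_image(out_prefix, self.page_no), which should behave the same on both platforms as long as the naming assumption holds.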
05-10-2008, 08:20 AM | #53 |
Enthusiast
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
|
Version 0.4 is released.
ChangeLog:
- Some algorithms are now configurable.
- For text that may be problematic, both the merged and the divided versions are presented. |
05-15-2008, 07:07 AM | #54 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2008
Device: iliad
|
BTW reading over this thread it seems worth mentioning some previous research into reflowing scanned text for small devices:
http://pubs.iupr.org/DATA/2002-breuel-wdabook.pdf
In that paper they first identify text line segments, images, and column boundaries, similarly to this program, but then the text is segmented into words. Once you've broken the document down into word-sized chunks and know how they aggregate into paragraphs and columns, there are numerous ways to reflow the document; one they describe is embedding all the images into HTML so that the scanned document reflows when you resize the browser window. They go on to talk about how to output this most compactly for a PDA. Interesting stuff. |
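To make the reflow idea concrete, here is a minimal sketch of greedy word-box layout. It is an illustration only, not the method from the paper or from pi.py: the Word type, the reflow function, and the space and leading values are all assumptions made up for the example.

```python
from collections import namedtuple

# A word image cut from the scan; only its pixel width and height matter here.
Word = namedtuple('Word', 'width height')


def reflow(words, screen_width, space=8, leading=4):
    """Greedily place word boxes left to right, wrapping at screen_width.

    Returns a list of (x, y) top-left positions, one per word.
    `space` and `leading` are illustrative pixel gaps, not values from the paper.
    """
    positions = []
    x = y = 0
    line_height = 0
    for word in words:
        # Wrap when the word would overflow, unless the line is still empty.
        if x > 0 and x + word.width > screen_width:
            x = 0
            y += line_height + leading
            line_height = 0
        positions.append((x, y))
        x += word.width + space
        line_height = max(line_height, word.height)
    return positions
```

Calling reflow with the same word list and a different screen_width gives a new layout without touching the word images, which is the same effect as resizing the browser window in the HTML approach.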
05-15-2008, 08:52 AM | #55 |
Enthusiast
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
|
Thank you very much for your information.
|
05-17-2008, 12:27 AM | #56 |
Enthusiast
Posts: 26
Karma: 161
Join Date: Feb 2008
Device: Sony PRS505
|
Version 0.5 is released.
ChangeLog:
* pi.py: Detect word, and break lines at word end when possible.
* pi.py: Re-align the 'split line segment' (second half of line) to align with the next line's indenting when appropriate. This will make the first line indent and bullet items line up better.
* img_dir_to_pdf.sh: Added to convert from images to pdf. |
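For readers wondering what breaking a line at a word end can look like, here is a rough sketch. It is not taken from pi.py: the functions group_words and split_point, the (x0, x1) span format, and the gap threshold are all assumptions for illustration. Characters are grouped into words by the gap between them, and the cut is placed at the end of the last word that still fits.

```python
def group_words(chars, gap_threshold):
    """Group character spans (x0, x1), sorted left to right, into words.

    A new word starts whenever the gap to the previous character exceeds
    gap_threshold. Returns a list of (x0, x1) word spans.
    """
    words = []
    for x0, x1 in chars:
        if words and x0 - words[-1][1] <= gap_threshold:
            words[-1] = (words[-1][0], x1)   # extend the current word
        else:
            words.append((x0, x1))           # start a new word
    return words


def split_point(words, max_width):
    """Return the x coordinate at which to cut a non-empty line of words.

    Prefers the end of the last word that fits within max_width (measured
    from the start of the line); falls back to a hard cut at max_width when
    even the first word is too wide.
    """
    start = words[0][0]
    cut = None
    for x0, x1 in words:
        if x1 - start <= max_width:
            cut = x1
        else:
            break
    return cut if cut is not None else start + max_width
```

The threshold would presumably need tuning per document, for example as a fraction of the average character width; that choice is left open here.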
05-18-2008, 07:01 AM | #57 | |
Addict
Posts: 350
Karma: 705
Join Date: Dec 2006
Location: Mumbai, India
Device: Kindle 1/REB 1200
|
Quote:
Happily, one of the authors of these publications (Thomas Breuel) is now leading the development of Ocropus at Google, which is a document analysis and OCR system. Browsing through the code, most of the algorithms already seem to be implemented (with some advances beyond them, too). I plan to integrate it into PDFRead soon. (I've already contributed some patches to get it compiling under Windows.) The library interface can be scripted via Lua, so I'm currently trying to put together the bits and pieces to get that approach working. |
|
05-18-2008, 07:05 PM | #58 |
Junior Member
Posts: 1
Karma: 10
Join Date: May 2008
Device: Kindle
|
Nick,
Thanks for turning me on to PDFRead. One question: I have a secure pdf document purchased from Wiley a while back, and I can only print it out or read it in Adobe Digital Editions. Is there any way to get the document into PDFRead to convert it? It's spitting out blank pages now, I gather because of the security on the file. JA |
05-18-2008, 07:38 PM | #59 | |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Quote:
Last edited by nrapallo; 05-19-2008 at 10:48 PM. Reason: added link to free PrimoPDF download |
|
05-20-2008, 11:20 AM | #60 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2008
Device: iliad
|
|
|