Create reflowable content for the Sony Reader with deskUNPDF
sammykrupa 05-12-2007, 03:50 PM Docudesk's new program is out, and it is excellent (on Mac at least!):
http://labs.docudesk.com/latest-technologies/2007/5/8/create-reflowable-content-for-the-sony-reader-with-deskunpdf.html
I've tested its Windows version. For PDF files based on images, the lrf output is not what I'd hoped for; obviously the conversion depends entirely on the program's OCR capability. In this respect the program does not have much advantage over other OCR software.
For text based pdf documents, this program does a wonderful job. Its speed of conversion is fast. Batch file processing is great. It makes me wonder whether there could be a program that can reflow the image-based pdf to lrf without OCR.
nekokami 05-12-2007, 11:45 PM I wish it had an output other than lrf, so we iLiad users could use it. But I guess that's what PDFtoHTML is for -- now that we have fbreader to read html. :)
jimmyzou 05-13-2007, 12:24 PM This is a really wonderful tool for Sony Reader users. I tried it and immediately made it my first priority, ahead of the ScanSoft PDF converter I used before.
tsgreer 05-13-2007, 06:53 PM This thing is awesome so far. Not sure if I can create a linked Table of Contents yet since I just downloaded it, but I like it better than Libriate for creating .lrf files. I can finally have italics and some formatting when I make books. I can also do illustrated versions now too. Yay!
kovidgoyal 05-13-2007, 08:00 PM You'd get more features with pdftohtml + html2lrf/BookDesigner
tsgreer 05-13-2007, 08:20 PM Well, I'm on a basic non-Intel Mac with no Windows, so my options were pretty limited until this came out. I don't know any programming code and I don't use Terminal, so I am the guy that has to wait for the nice and easy GUIs to come out. I may try to figure out the programming stuff, but I just don't have enough time in the day...
kovidgoyal 05-13-2007, 09:11 PM Ah, that would explain your reluctance. The hard part is really installing the tools, not using them. A simple use case would look like:
pdftohtml my.pdf
html2lrf my.html
But yeah, until you can get past the installation hurdle, you're better off with the GUI.
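For anyone who does get the tools installed, here is a minimal sketch of that two-step pipeline wrapped in Python. It assumes both pdftohtml and html2lrf are on the PATH and that pdftohtml writes my.html next to my.pdf, as in the example above. The helper names build_commands and convert are made up for illustration and are not part of either tool.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the two command lines from the example above for one PDF."""
    pdf = Path(pdf_path)
    # Assumption: pdftohtml writes <name>.html next to the input file.
    html = pdf.with_suffix(".html")
    return [["pdftohtml", str(pdf)], ["html2lrf", str(html)]]


def convert(pdf_path):
    """Run both steps in order and stop at the first failure."""
    for cmd in build_commands(pdf_path):
        # Fail early with a readable message if a tool is not installed.
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
        subprocess.run(cmd, check=True)
```

Calling convert("my.pdf") would run the same two commands as typed by hand above; looping over several file names gives a crude batch mode.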
ddesk 05-14-2007, 06:46 PM The final release version of deskUNPDF Professional is spec'd to perform PDF-HTML conversion and to handle pdf-BBeB TOC conversions and internal links. The OCR engine will be enabled for extracting text from images and fixing text from PDFs with non-standard font encodings (all of this is detailed in the readme file). On the pdftohtml->html2lrf solution, I can tell you that deskUNPDF will outperform pdftohtml hands down in creating structured text, paragraphs etc. from PDFs. Besides this, doing an extra conversion (pdf-html-lrf vs pdf-lrf) is always going to be more lossy.
kovidgoyal 05-14-2007, 07:02 PM That's great. Are you going to release the pdf->html converter as a standalone app/library as well? What's it written in?
dsyzling 05-15-2007, 05:23 AM Re pdftohtml: does this extract embedded images? Last time I tried the 0.39 Windows command line tool it only extracted text (in simple mode). Complex mode converted to png, but for final conversion to lrf that wasn't too useful for me. All formatting, headings and document structure were lost as well.
Darren
nekokami 05-15-2007, 09:35 AM "... fixing text from PDFs with non-standard font encodings"
This is particularly interesting to me. I've had a couple of PDFs that I wasn't able to convert using other tools because of non-standard encodings.
ddesk 05-16-2007, 01:51 PM "That's great. Are you going to release the pdf->html converter as a standalone app/library as well? What's it written in?"
The entire conversion engine, PDF to all formats, will be available as an API. It is written in Java, which is compiled to native code for various platforms using GCJ (incidentally, we have an article on our labs site about building an OS X cross compiler for GCJ). For the initial release, the library will be available as a Java class (linked via JNI) and a COM component for Windows. That said, our main focus is on creating simple-to-use end-user applications. It's nice to have different tools available, especially open source ones. Have you thought of creating simple installer packages for your Python prs500-gui app? It was a pretty high bar to get all of the needed dependencies installed (at least on OS X), too much for the average user. With all of the features it offers, I know it would be a welcome contribution.
kovidgoyal 05-16-2007, 03:22 PM There is an installer for Windows, and for Linux it's just a couple of commands. However, I don't have convenient access to an OSX machine, so I can't maintain an OSX installer. It's a pity...
A cross platform text extraction engine for PDF is a really useful thing. I'm looking forward to it.
How can I do html->lrf conversion with Docudesk's software?
I'd prefer not to convert the html to pdf... that loses the structure and adds noise such as page headers and footers.
I tried using your PDF virtual printer and the results are acceptable... maybe it just needs some more configuration, i.e. turning these things off.
JSWolf 05-16-2007, 05:14 PM HTML2LRF would do what you want.
http://www.mobileread.com/forums/showthread.php?t=10582
Enjoy!
I tried a couple of converters.
HTML2LRF does not preserve formatting.
It did create a good TOC though...
kovidgoyal 05-16-2007, 08:10 PM Umm, which html2lrf are you talking about?
JSWolf 05-17-2007, 12:54 AM Did you use the HTML2LRF that came with LIBPRS500? Or did you use some other HTML2LRF? I did link you to the proper version. However, if you used some other version, then it's no wonder it did not work.
Jon