#1 |
Grand Sorcerer
Posts: 6,245
Karma: 16537488
Join Date: Sep 2009
Location: UK
Device: ClaraHD, Forma, Libra2, Clara2E, LibraCol, PBTouchHD3
|
Calibre PDF conversions - LRF/EPUB vs RTF
I have been experimenting with Calibre's PDF conversions with varying results.
The thing that struck me most is how much the success of the PDF conversion depends on the choice of output format. LRF, EPUB and LIT are much better than RTF. See my 2nd post for details.

For many books you would just read the LRF/EPUB once and move on to the next book. However, sometimes you have a favourite book you will want to read many times, and you are prepared to put some effort into making it look as good as possible. In these cases being able to convert the PDF to something easily editable, like RTF, is a real bonus. You can then re-upload your finished labour-of-love to Calibre for posterity.

My question is this: are the problems converting to RTF a Calibre defect or some limitation of the RTF format? Or could it be some conversion parameter I haven't set correctly?

Whilst I'm on the subject, it would also be nice to have HTML as an output format, since it is easily editable. I know you can convert LIT to HTML with 3rd party products (with variable results) but doing everything in Calibre would be great. Are there any plans for this?

I'd also like to say that, despite anything I may have said in this post, Calibre's conversion of PDF (for the novels I've tried) is still way better than most of the other PDF-to-Word converter programs I've been able to try.

-- Jackie
#2 |
creator of calibre
Posts: 45,129
Karma: 27110892
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
calibre's RTF output is rather new and could use improvement. Note that you can edit EPUB files using, for example, Sigil, a free EPUB editor.
You can also get the HTML from calibre's PDF conversion by using the debug settings; it will output the HTML to the specified directory. Oh, and calibre's PDF conversion is going to get even better.
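The same debug output can also be requested from the command line. A minimal sketch, assuming calibre's `ebook-convert` tool is installed and on the PATH and using its `--debug-pipeline` option; the helper function and all file and folder names are placeholders invented for illustration:

```python
def debug_convert_command(source, target, debug_dir):
    """Build an ebook-convert command line that also dumps intermediate HTML.

    source, target and debug_dir are placeholder paths chosen by the caller.
    """
    return ["ebook-convert", source, target, "--debug-pipeline", debug_dir]


# Example: the list can be handed to subprocess.run(cmd, check=True)
# on a machine where calibre's command-line tools are available.
cmd = debug_convert_command("book.pdf", "book.epub", "debug_out")
print(" ".join(cmd))
```

The command is only built here, not executed, so the sketch can be tried without calibre installed.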
#3 |
Grand Sorcerer
|
(... continued )
For anyone who's interested in the detailed findings…
PDF source was created from MSWord using CutePDF, one of those pseudo-print utilities that direct your print file to a PDF. I got the following results when converting to LRF, EPUB or LIT using Calibre v6.13 (Look&Feel fields Input-char-encoding and Transliterate-unicode-to-ASCII, both set to blank):

Very good:
- smart quotes (double and single)
- italics, bold, bold italics
- m-dash, ellipsis
- currency symbols (dollar, cents, GB Pound, Euro)
- Western European accented chars (e-acute, cedilla etc). Sorry, no experience with Eastern European languages so didn't try them.
- fractions

OK:
- Small Caps (converted to standard caps)
- Ordinals (converted to standard lowercase)
- Subscript, superscript (converted to standard size in EPUB/LRF; sort-of-OK, looked better in LIT)

Not so good:
- Strikethrough and underline (just the underlying standard text)
- Graphics (appear in the converted file but not in the right place)

Using exactly the same PDF input and Calibre conversion parameters, these were the results when converting to RTF:

All of the following special chars were replaced with a question-mark (?):
- smart quotes (double and single)
- m-dash, ellipsis
- currency symbols (except the dollar)
- fractions
- accented chars (e-acute, cedilla etc)

All the other things gave the same result as for LRF, EPUB, LIT.

I tried each of the following in the Look&Feel Input-char-encoding field but none of them made any improvement: UTF8, UTF-8, latin1, iso-8859-1, windows-1252, cp1252

Switching on the Look&Feel Transliterate-unicode-to-ASCII before converting to RTF did improve things a lot:
- Smart quotes, double & single, became standard quotes, double & single. Easy enough to edit back to smart quotes in MSWord if you want to put in the effort.
- m-dash became -- (again, easy to edit)
- Currency symbols were converted to acceptable text equivalents, e.g. euro became EU (although I would have thought that GBP might be better than PS for British pound sterling)
- Fractions were converted to standard size chars, e.g. 1/2 and 3/4
- Accented chars were converted to their standard base chars, e.g. e-acute became e. Not really a problem for me, as a Brit, but not so good for French, German etc.

Regards, Jackie
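The two behaviours described above can be sketched in a few lines. This is an illustration only, not calibre's code: `to_ascii` mimics the reported transliteration results with a small hand-written table (the ellipsis entry is an assumption, the others come from the post), and `rtf_escape` shows the `\uN` control word that the RTF format provides for non-ASCII characters. The thread does not establish what calibre's RTF output did internally at the time.

```python
import unicodedata

# Replacements as reported in the post; the ellipsis entry is an assumption.
REPLACEMENTS = {
    "\u201c": '"', "\u201d": '"',      # smart double quotes
    "\u2018": "'", "\u2019": "'",      # smart single quotes
    "\u2014": "--",                    # m-dash
    "\u2026": "...",                   # ellipsis (assumed)
    "\u20ac": "EU", "\u00a3": "PS",    # euro, pound sterling
    "\u00bd": "1/2", "\u00be": "3/4",  # fractions
}


def to_ascii(text):
    """Transliterate text to plain ASCII, dropping accents."""
    out = []
    for ch in text:
        if ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
            continue
        # Split an accented letter into base letter + combining mark,
        # then keep only the ASCII part; unknown symbols become "?".
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if ord(c) < 128) or "?")
    return "".join(out)


def rtf_escape(text):
    """Escape text for an RTF body using the \\uN control word."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF takes UTF-16 code units as signed 16-bit decimals, each
            # followed by a fallback character for readers without Unicode.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)
```

So an e-acute can survive in RTF as an escape sequence, whereas transliteration throws the accent away before the output stage.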
#4 | |
Grand Sorcerer
|
I didn't know about the HTML output via debug. I shall have to give that a try. Re: Sigil - I don't know anything about EPUB other than reading them, so I might need to psych myself up before looking at that. I'm glad to hear you've got long-term plans for PDF conversions. In my brief experience, the paragraph reconstruction could use some refinement. I found some paragraphs were being combined for no apparent reason, although it wouldn't seriously affect the reading experience. On a plus note, I'd just like to say that Calibre's recognition of italics was better than anything else I tried.
|
#5 |
creator of calibre
|
Just look at the debug section of the conversion options; it's pretty self-explanatory.
|
#6 |
Grand Sorcerer
|
Debug
Thanks, Kovid.
I've now tried it. It looks very promising. I need to give it a go with a 'proper' PDF and try to decide which of the 4 different versions of HTML (input, parsed, processed, structure) will be the best candidate for beautifying.

Do you think it's just me, or do you think that perhaps others may not know that DEBUG produces useable HTML? Perhaps an explicit reference in your Help Text in the DEBUG box may make it more obvious. After all, there's plenty of room in there.

--Jackie
#7 |
creator of calibre
|
Sure, will be in next release.
|
#8 |
Grand Sorcerer
|
Follow-up
Kovid,
Following your advice yesterday, I have been experimenting with the various stages of HTML output from the convert-ebook debug option. I have tried 2 different PDFs and found something strange with both of them.

The Parsed, Processed and Structure HTML versions all had far too much italic and bold when viewed in my browser (Firefox). I finally tracked the problem down to a few strange HTML tags, namely <b/> and/or <i/>, which appeared at the Parsed stage and still remained at the Structure stage. They were not present in the Input-stage HTML.

Both these tags caused problems with the text following. If viewed in the browser, all the remaining text following an <i/> was italic. Similarly, all the text following a <b/> was bold. So by the time there had been one of each, the remaining text to the end of file was bold and italic.

The good news is that when I manually deleted the strange tags, all the text became correct again in the browser, i.e. no intended bold and italic were lost, so I was able to carry on experimenting.

Are these strange tags meant to be there?
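Tracking down where such tags first appear can be automated. A rough sketch, assuming the debug folder contains one subfolder per stage named `input`, `parsed`, `structure` and `processed` (the stage names come from this thread; the folder layout and file extensions are assumptions, and `count_empty_tags` is a made-up helper):

```python
import re
from pathlib import Path

STAGES = ("input", "parsed", "structure", "processed")
# Matches <b/> and <i/>, with or without attributes; leaves <br/> alone.
EMPTY_TAG = re.compile(r"<(b|i)(\s[^<>]*)?/>")


def count_empty_tags(debug_dir):
    """Return {stage: number of self-closed <b/> or <i/> tags found}."""
    counts = {}
    for stage in STAGES:
        total = 0
        stage_dir = Path(debug_dir) / stage
        if stage_dir.is_dir():
            for path in stage_dir.rglob("*"):
                if path.suffix.lower() in {".html", ".htm", ".xhtml"}:
                    text = path.read_text(encoding="utf-8", errors="replace")
                    total += len(EMPTY_TAG.findall(text))
        counts[stage] = total
    return counts
```

A stage whose count jumps from zero is where the tags are being introduced; missing stage folders simply report zero.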
#9 |
creator of calibre
|
Yeah, they are removed in the final stage of creating the EPUB. Your browser is interpreting the files as HTML when they are really XHTML.
An <i/> tag is a self-closed italic tag, which is the same as <i></i>.
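The XHTML reading of `<i/>` can be checked with any XML parser. A small sketch using Python's standard library (the sample markup and the helper name are invented for illustration): in XML the self-closed element is empty and the text after it lies outside it, whereas an HTML parser ignores the trailing slash on a non-void element such as `<i>` and treats it as an opening tag.

```python
import xml.etree.ElementTree as ET


def italic_contents(xhtml):
    """Return the direct text content of every <i> element, parsed as XML."""
    root = ET.fromstring(xhtml)
    return [el.text or "" for el in root.iter("i")]


sample = "<p>plain <i/>still plain <i>italic</i></p>"
# The first <i/> is empty; only the second one actually wraps any text.
print(italic_contents(sample))
```

Parsed as XML, "still plain" is the tail of the empty element rather than its content, which is why the text only turns italic when the file is read as HTML.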
#10 | |
Grand Sorcerer
|
However, are you sure that they are removed? I've just had a closer look at the resulting EPUB.
An <i/> tag is located at page 21 (Chapter 1); the EPUB text stays italic until page 105 (beginning of Chapter 4). Similarly, a <b/> tag is located at page 1016 (Chapter 26); the EPUB text stays bold until page 1095 (Chapter 28).
For completeness, I created an LRF with all the same settings. The LRF has bold and italics in all the right places. The <i/> and <b/> tags do still exist in the debug HTML. Think I'll stick with reading LRFs for the time being.
|
#11 |
Grand Sorcerer
|
... and there's more ...
Since I wrote the above I've found out how to look at the HTML inside the EPUB and can see that it's split into pieces.
The <i/> tags have become <i class="calibre4"/> and the <b/> tags have become <b class="calibre5"/>. So the formatting is correcting itself at the HTML split points, where each piece starts afresh.
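The manual fix described earlier in the thread, deleting the empty tags, can be scripted for both the bare and the class-carrying forms. A sketch only, using a regular expression rather than a real XML parser; `strip_empty_inline_tags` is a made-up helper, not part of calibre:

```python
import re

# Self-closed <b/> or <i/>, optionally with attributes such as class="calibre4".
EMPTY_INLINE = re.compile(r"<(?:b|i)(?:\s[^<>]*)?/>")


def strip_empty_inline_tags(markup):
    """Remove empty self-closed bold/italic tags; all other markup is kept."""
    return EMPTY_INLINE.sub("", markup)


piece = '<p>one<i class="calibre4"/> two<b/> three<br/><i>four</i></p>'
print(strip_empty_inline_tags(piece))
```

Because the removed elements are empty, no intended bold or italic text is lost, which matches what was seen when the tags were deleted by hand.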
#12 |
creator of calibre
|
I don't see this with my PDF files. Open a ticket and attach one of your PDF files to it.
|
#13 |
Grand Sorcerer
|
Kovid, I'm happy to open a ticket but I can't really attach the PDF as it's a commercial one with the DRM removed (by someone who knew what they were doing). It's also large - about 7Mb.
Could you do your tests with a small selection from one or more of the debug stages and/or one of the epub sections? There were about 10 of these, only 3 with the problem tags. If so, perhaps you could specify a combo which would suffice and I'll do my best.
In the meantime I'll see if I can find a way to extract one of the PDF's problem pages without destroying the problem. Any suggestions for free software which will do this gratefully accepted.
-- jackie
#14 |
Grand Sorcerer
|
I've found a PDF Splitter so ignore the above. I should be able to add something useful to the ticket.
|
#15 |
Grand Sorcerer
|
Ticket #3564 created
|