11-27-2011, 02:07 PM | #1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Nov 2010
Device: iPad
|
Overlapping text when converting html to mobi/epub
I had a PDF that I scanned, initially without OCR. Today I decided to use ABBYY to OCR the book and export it to HTML, which I assumed would be easier to convert to .mobi. However, the exported HTML copy of the book has huge gaps at the borders between the text on sequential pages. This causes Calibre to overlap mounds of text when converting to mobi or epub. I tried playing around with the conversion options, but haven't stumbled upon a way to have it ignore or remove that white space. Is there a way to do that in Calibre?
[Attached images: Before, After]
Last edited by TopCat; 11-27-2011 at 02:09 PM. |
11-28-2011, 12:42 AM | #2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Enable heuristics. ABBYY produces some ugly markup that tries to look exactly like the original book, and this markup is often quite screwy and fails to work as designed. The huge gap is a page break; you can also enable line unwrapping with heuristics. The heuristics routines try to preserve as much of the original formatting as possible while cleaning out the garbage.
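For anyone who prefers the command line over the GUI conversion dialog, here is a minimal sketch of the same idea using calibre's `ebook-convert` tool. The file names are placeholders and the helper functions are made up for illustration; the `--enable-heuristics` and `--html-unwrap-factor` options are the command-line counterparts of the heuristics settings, and 0.4 is calibre's documented default unwrap factor, not a tuned value.

```python
import shutil
import subprocess


def build_convert_command(src, dest, unwrap_factor=0.4):
    """Return an ebook-convert argument list with heuristic processing on.

    src and dest are placeholder paths; unwrap_factor must be between 0 and 1.
    """
    return [
        "ebook-convert", src, dest,
        "--enable-heuristics",                       # turn on heuristic processing
        "--html-unwrap-factor", str(unwrap_factor),  # line-unwrap threshold
    ]


def convert(src, dest, unwrap_factor=0.4):
    """Run the conversion; requires calibre's ebook-convert on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH")
    subprocess.run(build_convert_command(src, dest, unwrap_factor), check=True)
```

Calling `convert("book.html", "book.mobi")` would run the conversion with heuristics enabled. Whether that is enough depends on the markup ABBYY produced.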
|
11-28-2011, 02:53 AM | #3 |
Junior Member
Posts: 4
Karma: 10
Join Date: Nov 2010
Device: iPad
|
Heuristics didn't do anything to fix the problem, unfortunately. Given that I'm using a trial of ABBYY, do you have any suggestions on a neater OCR app that can help with the transition from textual PDF to mobi/epub?
|
11-28-2011, 03:19 AM | #4 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I think ABBYY is probably the best in terms of maintaining the original formatting, unfortunately. If you want, you could open a bug, attach the original ABBYY-generated HTML book, mark it private, and have Kovid assign it to me. I maintain the function that attempts to clean up ABBYY markup, but to be honest it's had limited testing, just a few of my own docs that I'd converted, and it's possible that markup generated by different versions of ABBYY won't work with the function.
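As a stopgap while a bug report is pending, one option is to pre-clean the HTML before handing it to Calibre. The sketch below is not calibre's actual cleanup function. It assumes (without having seen the file) that the gaps and overlaps come from fixed-layout CSS in inline `style` attributes, and the property list is a guess to adjust after inspecting the real export.

```python
import re

# Assumed culprits: inline CSS properties that pin text to a fixed page layout.
# This list is a guess, not taken from any ABBYY documentation.
LAYOUT_PROPS = {"position", "top", "left", "height", "margin-top"}

# Match style="..." attributes, but not names like data-style="...".
STYLE_ATTR = re.compile(r'(?<![\w-])style="([^"]*)"', re.IGNORECASE)


def strip_layout_styles(html):
    """Drop layout-related declarations from inline style attributes."""
    def clean(match):
        kept = []
        for decl in match.group(1).split(";"):
            name = decl.split(":", 1)[0].strip().lower()
            if decl.strip() and name not in LAYOUT_PROPS:
                kept.append(decl.strip())
        return 'style="%s"' % "; ".join(kept)

    return STYLE_ATTR.sub(clean, html)
```

Run the exported file's text through `strip_layout_styles`, save the result as a new HTML file, and convert that copy so the original stays untouched.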
Last edited by ldolse; 11-28-2011 at 03:23 AM. |
11-28-2011, 06:13 AM | #5 |
Comparer of the Ephemeris
Posts: 1,496
Karma: 424697
Join Date: Mar 2009
Device: iPad
|
I have had some better results using ABBYY to convert to RTF, then to the destination format. It's not perfect, but it does seem to do a better job with page breaks.
G |