|  01-20-2014, 11:07 PM | #1 | 
| Evangelist            Posts: 467 Karma: 369018 Join Date: Nov 2010 Device: BL Alita/Mimas/Ares, OB Note2/Note, KA One/H2O/HD, S PRS T2/T1, PB 902 | 
				
				Can Writer2ePub merge paragraphs?
			 
			
			The problem I have is as follows. When I run OCR in FineReader on a scanned book, it splits a paragraph in two if the paragraph begins on one page and ends on the next. When I export the book as html or epub, I get those annoying wrong paragraph splits, sometimes in the middle of a sentence or even in the middle of a word. I don't see an option in FineReader to switch off that behavior, although it clearly has enough information to handle paragraphs correctly when exporting to html, epub or fb2. Maybe I am missing something very obvious. It is hard to believe FineReader 11 cannot handle page breaks correctly.

So I tried exporting to odt to see if OpenOffice would save the file correctly as html, without those wrong splits. Unfortunately, no. My last hope now is Writer2ePub. But it also does not merge those wrongly split paragraphs...

What it could do is this:
1. If the split occurs in the middle of a word: merge the paragraphs (a hyphen may have to be removed).
2. If the split occurs in the middle of a sentence: merge the paragraphs (a space between words may have to be inserted instead).
3. If the split occurs between sentences: the simplest thing is not to merge. #1 and #2 are already good enough, since they fix the most annoying problem. But perhaps it is possible to decide whether to merge or not:
- by the presence or absence of indentation in the text on the new page?
- by examining exactly which characters the first paragraph ends with and the next begins with. For example, direct speech should not be followed by direct speech in the same paragraph, such as (in English): “I know.” “I know you know.” So ” “ (if it occurs after merging) would indicate a required split (i.e., not to merge).
And so on. Can something be done about that?
I do something similar, at least partially, by regex search/replace in the html of the epub. For example, to eliminate a split in the middle of a word:
search: (?s)-</p>\s*<p>([a-z])
replace: \1
Or, when the new paragraph begins with a lowercase letter, it clearly always has to be merged with the previous one:
search: (?s)(.)</p>\s*<p>([a-z])
replace: \1 \2
This should be done after the word splits are already fixed. And so on. There are more cases, and often the html is much more complicated than just <p>...</p>. So the best would be to have the merging done automatically when saving or exporting to epub. At that stage, additional information about page breaks in the original odt document could perhaps also be used (in the epub it is no longer available), as well as the indentation (or lack of it) of the text right after the page break (also no longer available).
Last edited by parkher; 01-21-2014 at 12:35 AM. | 
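For anyone who prefers a script to an editor's search/replace dialog, the two patterns from this post can be applied in the same order with Python's re module. This is only a minimal sketch: the function name merge_split_paragraphs is made up for illustration, and it assumes bare <p>...</p> markup with no attributes or nested tags, which real OCR output often does not have.

```python
import re

# Pattern 1 from the post: a hyphen right before </p>, and a lowercase
# letter starting the next <p>. The hyphen and the break are both removed.
WORD_SPLIT = re.compile(r"(?s)-</p>\s*<p>([a-z])")

# Pattern 2 from the post: any character before </p>, and a lowercase
# letter starting the next <p>. The break is replaced by a single space.
SENTENCE_SPLIT = re.compile(r"(?s)(.)</p>\s*<p>([a-z])")


def merge_split_paragraphs(html: str) -> str:
    """Merge wrongly split paragraphs; word splits are fixed first."""
    html = WORD_SPLIT.sub(r"\1", html)
    html = SENTENCE_SPLIT.sub(r"\1 \2", html)
    return html


sample = "<p>A para-</p>\n<p>graph that was</p>\n<p>split twice.</p>\n<p>New one.</p>"
print(merge_split_paragraphs(sample))
```

Note that the first pattern also joins genuine hyphenated compounds that happen to break across pages (a "well-" / "known" split becomes "wellknown"), so the result still needs a read-through.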
|   |   | 
|  01-21-2014, 12:13 AM | #2 | 
| Writer2ePub creator            Posts: 354 Karma: 121129 Join Date: Sep 2009 Location: Genova, Italy Device: Cybook Bebook iLiad Kindle HanlinV2 Readius SonyPRS500 SonyPRS700 etc | 
			
			Try to use PerfectEpub to clean the OCR errors. It solves all your problems: http://lukesblog.it/ebooks/ebook-tools/perfectepub/ Luke | 
|   |   | 
|  01-21-2014, 08:17 AM | #3 | |
| Evangelist            Posts: 467 Karma: 369018 Join Date: Nov 2010 Device: BL Alita/Mimas/Ares, OB Note2/Note, KA One/H2O/HD, S PRS T2/T1, PB 902 | Quote: 
 It really does all those things that I usually do with regex. Splitting " ", for example - I do this too. Not sure why it is called PerfectEpub, though. It is more than that: PerfectHTML too, etc. It fixes the text in OpenOffice, and then you can do whatever you want with it: save it as html, or save it as odt and convert the odt to epub with the Calibre converter, for example. What is the best strategy for using it on an epub I already have, though? Probably: convert the epub to htmlz (with the Calibre converter, for example), unpack the htmlz, and then open the html in OpenOffice. With this approach all the pictures show up in OpenOffice too. Or do you have a stand-alone version, perhaps a PerfectEpub tool that can be launched from SIGIL with "Open with"? | |
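The "unpack htmlz" step can be scripted, since an htmlz file is an ordinary ZIP archive. A rough sketch follows: the helper name unpack_htmlz and the file names are placeholders, and it assumes the htmlz was already produced by the Calibre converter (for example with ebook-convert book.epub book.htmlz on the command line).

```python
import zipfile
from pathlib import Path


def unpack_htmlz(htmlz_path, out_dir):
    """Extract an HTMLZ archive and return the HTML files found inside.

    The images are extracted next to the HTML, so they still show up
    when the HTML file is opened in OpenOffice.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(htmlz_path) as archive:
        archive.extractall(out)
    return sorted(out.rglob("*.html"))


# Hypothetical usage:
# html_files = unpack_htmlz("book.htmlz", "book_unpacked")
```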
|   |   | 
|  01-21-2014, 11:55 AM | #4 | |
| Writer2ePub creator            Posts: 354 Karma: 121129 Join Date: Sep 2009 Location: Genova, Italy Device: Cybook Bebook iLiad Kindle HanlinV2 Readius SonyPRS500 SonyPRS700 etc | Quote: 
 About the name: the author chose it, and I retained it. I agree, it could be called PerfectEbook instead. To get the best out of it, perform one change at a time. Think about which corrections have priority, e.g.: are there a lot of leading spaces? How many dashes are there? Luke | |
|   |   | 
|  01-21-2014, 09:46 PM | #5 | 
| Evangelist            Posts: 467 Karma: 369018 Join Date: Nov 2010 Device: BL Alita/Mimas/Ares, OB Note2/Note, KA One/H2O/HD, S PRS T2/T1, PB 902 | 
			
			Yes, I am doing exactly that: one change at a time. BTW, there are many, many messages that have to be ignored, at least in English-language books:
"2077 lines that end without punctuation"
There is nothing to fix here. Direct speech usually ends this way:
"Hey!"
"What?"
"Nothing."
This would give three such messages. But in many other languages, it is a useful message. Perhaps PerfectEpub should look one character back beyond the final ", ”, ', ’, », etc.? So that both of these cases are found to be correct: "Nothing big". - Nothing "big". However, just " without a . on either side is wrong. The thing is, because this message now has to be ignored, some legitimate cases where punctuation is really missing might be skipped as well.
Last edited by parkher; 01-21-2014 at 10:03 PM. | 
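The "look one character back" idea could work roughly like this. It is a sketch of the suggestion only, not of how PerfectEpub actually implements its check: the function name and the two character sets are assumptions chosen for illustration.

```python
# Assumed character sets; adjust for the language of the book.
CLOSING_QUOTES = "\"”'’»"
TERMINATORS = ".!?…"


def missing_final_punctuation(line: str) -> bool:
    """Return True if the line should be reported as ending without punctuation."""
    text = line.rstrip()
    if not text:
        return False  # nothing to report for empty lines
    if text[-1] in TERMINATORS:
        return False  # covers: "Nothing big".  and  Nothing "big".
    # Look one character back beyond a final closing quote.
    if text[-1] in CLOSING_QUOTES and len(text) >= 2 and text[-2] in TERMINATORS:
        return False  # covers: "Hey!"  "What?"  "Nothing."
    return True


for line in ['"Hey!"', '"Nothing big".', 'Nothing "big"', "No punctuation here"]:
    print(repr(line), missing_final_punctuation(line))
```

With a check like this, the three direct-speech lines above would no longer be reported, while a closing quote with no punctuation on either side still would be.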
|   |   | 
|  01-22-2014, 01:26 AM | #6 | 
| Writer2ePub creator            Posts: 354 Karma: 121129 Join Date: Sep 2009 Location: Genova, Italy Device: Cybook Bebook iLiad Kindle HanlinV2 Readius SonyPRS500 SonyPRS700 etc | |
|   |   | 
|  01-22-2014, 01:06 PM | #7 | |
| Guru            Posts: 691 Karma: 3026110 Join Date: Dec 2008 Location: Lancashire, U.K. Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro + | Quote: 
 That is a great add-on. In the past I've used a lot of batch jobs using the Alternative Searching add-on for many repetitive tasks. PerfectEPUB makes a lot of these much quicker and easier. BobC | |
|   |   | 