MobileRead Forums - View Single Post

parkher · 01-20-2014, 11:07 PM

The problem I have is as follows:
When I run OCR in FineReader on a scanned book, it splits a paragraph into two if a paragraph begins on one page and ends on the next.
When I then export the book as html or epub, I get those annoying wrong paragraph splits, sometimes in the middle of a sentence or even in the middle of a word. I don't see an option in FineReader to switch off this behavior, even though it clearly has enough information to handle paragraphs correctly when exporting to html, epub or fb2.
Maybe I am missing something very obvious. It is hard to believe FineReader 11 cannot handle page breaks correctly.

So I tried to export to odt to see if OpenOffice would save the file correctly as html - without those wrong splits. Unfortunately, no.

My last hope now is Writer2ePub. But it also does not merge those wrongly split paragraphs...
What it could do is this:
1. If the split occurs in the middle of a word, merge the paragraphs (a hyphen may have to be removed).
2. If the split occurs in the middle of a sentence, merge the paragraphs (a space between the words may have to be inserted).
3. If the split occurs between sentences, the simplest thing is not to merge. #1 and #2 are already good enough, since they fix the most annoying problem. But perhaps it is possible to decide whether to merge or not:
- by the presence/absence of indentation in the text on the new page?
- by examining exactly which characters the first paragraph ends with and the next begins with. For example, direct speech should not be followed by direct speech in the same paragraph, such as (in English):
“I know.” “I know you know.”
So ” “ (if it occurs after merging) would indicate a required split (i.e., do not merge).
And so on.
Can something be done about that?
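
The three cases above could be sketched roughly as follows in Python. This is only an illustration of the decision logic, not anything Writer2ePub actually does: the function name `merge_across_page_breaks`, the `starts_page` flag and the `SENTENCE_END` pattern are all assumptions made for the example, and it presumes the page-break positions are still known at this stage.

```python
import re

# Assumed set of sentence-ending characters, optionally followed by closing
# quotes or brackets. Extend this for other languages or quote styles.
SENTENCE_END = re.compile(r'[.!?…]["”’)\]]*$')

def merge_across_page_breaks(paragraphs):
    """paragraphs: list of (text, starts_page) tuples, where starts_page is
    True when the paragraph is the first one on a new page (hypothetical
    input format). Returns the list of paragraph strings after merging."""
    merged = []
    for text, starts_page in paragraphs:
        text = text.strip()
        # Only paragraph boundaries that coincide with a page break are candidates.
        if not merged or not merged[-1] or not starts_page or not text:
            merged.append(text)
            continue
        prev = merged[-1]
        if prev.endswith("-") and text[0].islower():
            # Case 1: split inside a word; drop the hyphen and join directly.
            merged[-1] = prev[:-1] + text
        elif not SENTENCE_END.search(prev) or text[0].islower():
            # Case 2: split inside a sentence; join with a single space.
            merged[-1] = prev + " " + text
        else:
            # Case 3: split between sentences; the conservative choice is to keep it.
            merged.append(text)
    return merged

book = [
    ("The quick brown fox jum-", False),
    ("ped over the lazy", True),
    ("dog. It ran away.", True),
    ("“I know.”", False),
    ("“I know you know.”", True),
]
result = merge_across_page_breaks(book)
```

Here the first three fragments end up as one paragraph, while the two lines of direct speech stay separate because the first one ends with sentence punctuation and the second starts with a quote mark. Case 1 removes the hyphen unconditionally, so a genuinely hyphenated compound broken at the page end would lose its hyphen.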

I do something similar, at least partially, with a regex search/replace in the html of the epub. For example, this eliminates a split in the middle of a word:

(?s)-</p>\s*<p>([a-z])
replace:
\1

Or when the new paragraph begins with a lowercase letter, it clearly always has to be merged with the previous one:

(?s)(.)</p>\s*<p>([a-z])
replace:
\1 \2

This should be done after the word splits are already fixed.
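
For anyone who prefers a script over an editor's search/replace, the two passes could be run with Python's `re` module roughly like this. It is a minimal sketch that assumes paragraphs are bare `<p>` and `</p>` tags with no attributes; the function name `fix_html_splits` is made up for the example.

```python
import re

def fix_html_splits(html):
    # Pass 1: paragraph break inside a hyphenated word.
    # Drop the hyphen and the </p><p> pair, keep the lowercase letter.
    html = re.sub(r"(?s)-</p>\s*<p>([a-z])", r"\1", html)
    # Pass 2: next paragraph starts with a lowercase letter.
    # Replace the </p><p> pair with a single space.
    html = re.sub(r"(?s)(.)</p>\s*<p>([a-z])", r"\1 \2", html)
    return html

sample = (
    "<p>The fox jum-</p>\n"
    "<p>ped over the lazy</p>\n"
    "<p>dog.</p>\n"
    "<p>Next paragraph.</p>"
)
fixed = fix_html_splits(sample)
```

The order matters: if the second pattern ran first, it would match the hyphen as the preceding character and leave "jum- ped" behind.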

And so on. There are more cases, and often html is much more complicated than just <p>...</p>.
So the best solution would be to have the merging done automatically when saving or exporting to epub.
At that stage, additional information could perhaps also be used: the page breaks in the original odt document, and the indentation (or lack of it) of the text right after each page break. Neither is available any longer in the epub.

Thread: Can Writer2ePub merge paragraphs? (post #1)
Last edited by parkher; 01-21-2014 at 12:35 AM.