Old 01-20-2014, 11:07 PM   #1
parkher
Evangelist
Posts: 467
Karma: 369018
Join Date: Nov 2010
Device: BL Alita/Mimas/Ares, OB Note2/Note, KA One/H2O/HD, S PRS T2/T1, PB 902
Can Writer2ePub merge paragraphs?

The problem I have is as follows:
When I run OCR in FineReader on a scanned book, it splits a paragraph into two if a paragraph begins on one page and ends on the next.
When I try to export a book as html or epub - I have those annoying wrong paragraph splits, sometimes in the middle of a sentence or even in the middle of a word. I don't see an option in FineReader to switch off that wrong behavior - it clearly has enough information to handle paragraphs correctly when exporting to html, epub or fb2.
Maybe I am missing something very obvious. It is hard to believe FineReader 11 cannot handle page breaks correctly.

So I tried to export to odt to see if OpenOffice would save the file correctly as html - without those wrong splits. Unfortunately, no.

My last hope now is Writer2ePub. But it also does not merge those wrongly split paragraphs...
What it could do is this:
1. if the split occurs in the middle of a word - to merge paragraphs together (a hyphen may have to be removed).
2. if the split occurs in the middle of a sentence - to merge paragraphs together (a space between words may have to be inserted instead).
3. if the split occurs between sentences - the simplest thing is not to merge at all; #1 and #2 are already good enough and fix the most annoying problem. But perhaps it is possible to try to determine whether to merge or not:
- by the presence/absence of indentation in the text on the new page?
- by examining exactly with what characters the first paragraph ends and the next begins, for example, there should not be direct speech followed by direct speech in the same paragraph, such as (in English):
“I know.” “I know you know.”
So ” “ (if occurring after merging) would indicate a required split (i.e., not to merge).
And so on.
Can something be done about that?
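
The three cases above could be sketched roughly like this in Python. This is only an illustration of the proposed rules, not how Writer2ePub or PerfectEpub work; the function name join_paragraphs, the merge_between_sentences flag and the quote character sets are assumptions made up for the example.

```python
# Hypothetical helper: decides whether two text pieces that were split at a
# page break should be joined, following cases 1-3 from the post above.
CLOSING_QUOTES = "\"”’'»"   # assumed set of closing quote characters
OPENING_QUOTES = "\"“‘'«"   # assumed set of opening quote characters


def join_paragraphs(first, second, merge_between_sentences=False):
    """Return the merged text, or None if the pieces should stay separate."""
    first, second = first.rstrip(), second.lstrip()
    if not first or not second:
        return None
    # Case 1: split inside a word -> drop the hyphen and join directly.
    # (A genuine hyphenated compound would lose its hyphen here.)
    if first.endswith("-") and second[0].islower():
        return first[:-1] + second
    # Case 2: split inside a sentence -> join with a single space.
    if second[0].islower():
        return first + " " + second
    # Case 3: split between sentences -> by default do not merge.
    if not merge_between_sentences:
        return None
    # Direct speech followed by direct speech must stay split.
    if first[-1] in CLOSING_QUOTES and second[0] in OPENING_QUOTES:
        return None
    return first + " " + second
```

With the default settings only the "safe" cases 1 and 2 are merged; the quote check only matters when merge_between_sentences is switched on.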

I do something similar, at least partially, by regex search/replace in the html of the epub. For example, this eliminates a split in the middle of a word:

(?s)-</p>\s*<p>([a-z])
replace:
\1

Or when the new paragraph begins with a lowercase letter, it clearly always has to be merged with the previous one:

(?s)(.)</p>\s*<p>([a-z])
replace:
\1 \2

This should be done after the word splits are already fixed.
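
For anyone who prefers a script, the same two replacements can be applied in that order with Python's re module. This is a minimal sketch; the function name merge_split_paragraphs is made up for the example, and like the patterns above it only handles plain <p>...</p> without attributes.

```python
import re

# Pattern 1: a hyphen right before </p>, and the next <p> starts lowercase.
# (The (?s) flag is not needed here because the pattern contains no dot.)
WORD_SPLIT = re.compile(r"-</p>\s*<p>([a-z])")
# Pattern 2: any character before </p>, and the next <p> starts lowercase.
# re.DOTALL is the equivalent of (?s).
SENTENCE_SPLIT = re.compile(r"(.)</p>\s*<p>([a-z])", re.DOTALL)


def merge_split_paragraphs(html):
    """Apply the word-split fix first, then the sentence-split fix."""
    html = WORD_SPLIT.sub(r"\1", html)
    html = SENTENCE_SPLIT.sub(r"\1 \2", html)
    return html


sample = (
    "<p>The para-</p>\n<p>graph ends here and the</p>\n"
    "<p>next one too.</p>\n<p>New paragraph.</p>"
)
print(merge_split_paragraphs(sample))
```

The last paragraph in the sample starts with an uppercase letter, so it is left alone.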

And so on. There are more cases, and often html is much more complicated than just <p>...</p>.
So the best would be to have the merging done automatically when saving or exporting to epub.
At that stage additional information about page breaks in the original odt document perhaps can also be used (in epub it is no longer available), and about the indentation (or no indentation) of the text right after the page break (also no longer available).

Last edited by parkher; 01-21-2014 at 12:35 AM.
Old 01-21-2014, 12:13 AM   #2
eBookLuke
Writer2ePub creator
Posts: 354
Karma: 121129
Join Date: Sep 2009
Location: Genova, Italy
Device: Cybook Bebook iLiad Kindle HanlinV2 Readius SonyPRS500 SonyPRS700 etc
Try to use PerfectEpub to clean the OCR errors. It solves all your problems:
http://lukesblog.it/ebooks/ebook-tools/perfectepub/

Luke
Old 01-21-2014, 08:17 AM   #3
parkher
Evangelist
Quote:
Originally Posted by eBookLuke View Post
Try to use PerfectEpub to clean the OCR errors. It solves all your problems:
http://lukesblog.it/ebooks/ebook-tools/perfectepub/

Luke
Thanks!
It really does all those things that I usually do with regex.
Splitting " ", for example - I do this too

Not sure why it is called PerfectEpub, though. It is more than that. PerfectHTML too, etc.
It fixes the text in OpenOffice and then you can do whatever you want with it: to save as html or as odt and convert odt to epub with the Calibre converter, for example.

What is the best strategy to work with it on an epub I already have, though?

Probably: convert the epub to htmlz (with the Calibre converter, for example), unpack the htmlz and then open the html in OpenOffice. With this approach all the pictures show up in OpenOffice too.
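
That round trip could be scripted roughly as follows. This is a sketch under assumptions: it expects Calibre's ebook-convert command to be on the PATH (it picks the output format from the file extension), it relies on an htmlz file being an ordinary zip archive, and the function names are made up for the example.

```python
import subprocess
import zipfile
from pathlib import Path


def epub_to_htmlz(epub_path, htmlz_path):
    """Convert an epub to htmlz with Calibre's command-line converter."""
    subprocess.run(["ebook-convert", str(epub_path), str(htmlz_path)], check=True)


def unpack_htmlz(htmlz_path, dest_dir):
    """Unzip the htmlz and return the html files found inside it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(htmlz_path) as archive:
        archive.extractall(dest)
    return sorted(dest.rglob("*.html"))
```

For example, epub_to_htmlz("book.epub", "book.htmlz") followed by unpack_htmlz("book.htmlz", "book_unpacked") leaves the html and its pictures in one folder, ready to open in OpenOffice.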

Or do you have a stand-alone version, perhaps a PerfectEpub tool that can be launched from SIGIL with "Open with"?
Old 01-21-2014, 11:55 AM   #4
eBookLuke
Writer2ePub creator
Quote:
Originally Posted by parkher View Post
Or do you have a stand-alone version, perhaps a PerfectEpub tool that can be launched from SIGIL with "Open with"?
Sorry, this is a tool that uses the power of *Office to do most of the work. Any document readable by *Office can be cleaned by PE.

As for the name, the original author chose it, and I kept it. I agree, it could be called PerfectEbook instead

To get the best out of it, make one change at a time. First think about which corrections have priority, e.g.: are there a lot of leading spaces? How many dashes are there?

Luke
Old 01-21-2014, 09:46 PM   #5
parkher
Evangelist
Yes, I am doing exactly that - one change at a time.

BTW, there are many many messages that have to be ignored, at least in English language books:

"2077 lines that end without punctuation"

There is nothing to fix here - direct speech usually ends this way:

"Hey!"
"What?"
"Nothing."

This would give three such messages.
But in many other languages, it is a useful message.

Perhaps PerfectEpub should look one character back beyond the final ",”,', ’,», etc.?
So that both these cases are found to be correct:
"Nothing big".
- Nothing "big".
However, a bare " with no . on either side of it would be wrong.

The thing is, because this message now has to be ignored, some legitimate cases where punctuation is really missing might be skipped as well.
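
A rough Python sketch of the check suggested above (this is not PerfectEpub's actual code; the function name and the character sets are assumptions): look past any closing quotes at the end of the line and test the character before them.

```python
CLOSING_QUOTES = "\"”'’»"   # assumed set of closing quote characters
TERMINAL = ".!?…"           # assumed set of sentence-ending punctuation


def ends_with_punctuation(line):
    """True if the line ends with punctuation, before or after a closing quote."""
    text = line.rstrip()
    # Drop trailing closing quotes (more than one if they are nested),
    # then look at the last remaining character.
    stripped = text.rstrip(CLOSING_QUOTES)
    return bool(stripped) and stripped[-1] in TERMINAL


for line in ['"Hey!"', '"Nothing big".', '- Nothing "big".', '"Nothing big"']:
    print(repr(line), ends_with_punctuation(line))
```

Only the last sample line, a closing quote with no punctuation on either side, is reported as ending without punctuation.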

Last edited by parkher; 01-21-2014 at 10:03 PM.
Old 01-22-2014, 01:26 AM   #6
eBookLuke
Writer2ePub creator
Quote:
Originally Posted by parkher View Post
"2077 lines that end without punctuation"
Yes, I know about it… It will be fixed in the next release, please stay tuned

Luke
Old 01-22-2014, 01:06 PM   #7
BobC
Guru
Posts: 691
Karma: 3026110
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro +
Quote:
Originally Posted by eBookLuke View Post
Try to use PerfectEpub to clean the OCR errors. It solves all your problems:
http://lukesblog.it/ebooks/ebook-tools/perfectepub/

Luke
Luke,

That is a great add-on. In the past I've used a lot of batch jobs using the Alternative Searching add-on for many repetitive tasks. PerfectEPUB makes a lot of these much quicker and easier.

BobC