04-15-2015, 11:59 PM | #16 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Could easily be a PDF conversion or some such. On that note, the line unwrapping factor is a very important thing to consider when converting from PDF!!
|
04-16-2015, 01:20 AM | #17 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
One pattern to rule them all would be good, except that you can construct so many pathological borderline cases of mid-sentence breaks, involving speech and punctuation marks, that it's probably impossible.
The use of full stops for things other than sentence ends doesn't help either, e.g. I doubt any one automated rule could detect and fix a mid-sentence break in something like "Is that Mr. Smith?" she said.
It's been a while since I've wanted to fix up one of these (nowadays I just read well-formed books), but my preference was to use 2 or 3 passes with different simple rules, e.g. one that focuses on paragraph-start checks, one that focuses on paragraph ends, and maybe another one for speech issues. I'm sure there's a big old thread somewhere with solutions for borderline cases. |
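For anyone wanting to try the multi-pass idea, here is a minimal Python sketch, not anyone's actual saved searches. It assumes the book has already been split into a list of plain-text paragraph strings; the rule names (`ends_unfinished`, `starts_lowercase`, `ends_with_abbrev`) and the short abbreviation list are made up for illustration, and each rule will produce false positives on headings, verse and the like.

```python
import re

# A paragraph "looks finished" if it ends in . ! or ?, optionally followed by closing quotes.
TERMINAL = re.compile(r'[.!?]["\'\u201d\u2019)]*\s*$')
# Hypothetical, deliberately short list of abbreviations that end in a full stop.
ABBREV = re.compile(r'\b(?:Mr|Mrs|Ms|Dr|St)\.\s*$')

def join_pass(paras, should_join):
    """One pass: merge a paragraph into the previous one when should_join says so."""
    out = []
    for p in paras:
        if out and should_join(out[-1], p):
            out[-1] = out[-1].rstrip() + " " + p.lstrip()
        else:
            out.append(p)
    return out

def ends_unfinished(prev, nxt):
    # Paragraph-end check: previous paragraph has no closing punctuation.
    return not TERMINAL.search(prev)

def starts_lowercase(prev, nxt):
    # Paragraph-start check: next paragraph begins with a lowercase letter.
    return nxt.lstrip()[:1].islower()

def ends_with_abbrev(prev, nxt):
    # Full stop used for something other than a sentence end.
    return bool(ABBREV.search(prev))

def unwrap(paras):
    """Run the simple rules one after another, each over the previous result."""
    for rule in (ends_unfinished, starts_lowercase, ends_with_abbrev):
        paras = join_pass(paras, rule)
    return paras
```

Calling `unwrap()` on a list of broken paragraphs returns the joined list. Keeping each rule as its own pass makes it easy to review the result after every step and to drop a rule that misfires on a particular book.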
04-16-2015, 10:18 AM | #18 |
Wizard
Posts: 1,078
Karma: 412718
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
I know it's way beyond my capabilities, but could an F&R Regex-Function be written to make the multiple passes needed to join the text in such malformed books?
I've moved the Sigil Saved Search Regex into Calibre, and it will catch a lot of the cases, but nothing like cybmole's |
04-16-2015, 11:24 AM | #19 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
cybmole -- that is no excuse not to catch as many as you can in one go.
phossler -- regex functions have to act on the result of a single search, so I'm not sure that would work. You could certainly test the match to see which case it looks like, though. |
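A rough sketch of "test the match to see which case it looks like", using plain Python `re.sub` with a callback rather than any editor's own function-mode signature. Everything here is an assumption for illustration: the pattern only handles bare `<p>` tags with no attributes, and the names `BREAK` and `fix_break` are hypothetical.

```python
import re

# Match the tail of one paragraph plus the break itself; peek at the head of the
# next paragraph with a lookahead so it stays available to the following match.
BREAK = re.compile(r'(?P<tail>[^<>]{1,12})</p>\s*<p>(?=(?P<head>[^<>]{1,12}))')

def fix_break(match):
    """Decide per match whether the paragraph break looks bogus."""
    tail, head = match.group('tail'), match.group('head')
    joined = tail.rstrip() + ' '          # drop the </p><p> and keep one space
    # Case 1: break after a title abbreviation such as "Mr."
    if re.search(r'\b(?:Mr|Mrs|Ms|Dr)\.\s*$', tail):
        return joined
    # Case 2: next paragraph starts in lowercase, e.g. "she said"
    if head.lstrip()[:1].islower():
        return joined
    # Case 3: previous paragraph has no closing punctuation at all
    if not re.search(r'[.!?]["\'\u201d\u2019]?\s*$', tail):
        return joined
    # Otherwise it looks like a genuine paragraph break: leave it alone.
    return match.group(0)

def join_broken_paragraphs(html):
    return BREAK.sub(fix_break, html)
```

This is still one search with one replacement function, so it respects the single-search limitation; the different cases are just branches inside the callback.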
04-16-2015, 12:38 PM | #20 |
Grand Sorcerer
Posts: 27,553
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I second the abandonment of source material that is littered with incorrect paragraph breaks. Life's too short for that kind of work.
|
04-16-2015, 03:42 PM | #21 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Yes, it was interesting learning how to do it, and learning regex in the process, but unless you want to read old stuff that is not sold in ebook form, and thus exists only as poor scans, it's not worth it. There are a few books and authors I like, 60s-70s sci-fi, that never got properly republished, e.g. much of Roger Zelazny.
Last edited by cybmole; 04-18-2015 at 09:39 AM. |
04-17-2015, 01:02 PM | #22 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Guys, I don't agree... It depends on how you do the post-OCR processing. My add-in tries to recreate the original paragraphs and has quite a good success rate. If a line ends with a full stop or other end sign, it actually checks whether the first word on the next line would have fitted (lengthwise) on the same line. If so, it is probably the end of one paragraph and the beginning of the next. If it wouldn't fit, it is probably a continuation of the same paragraph. It is a fairly fast process.
Of course you need Word for the add-in and the OCR source. |
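The "would the first word have fitted" test can be sketched in a few lines of Python. This is not the Word add-in's code; it is a guess at the idea, assuming the OCR output is available as a list of text lines with the original line breaks, and using character counts as a stand-in for real line widths. The function name `rebuild_paragraphs` and the `width` parameter are hypothetical.

```python
import re

# "End sign": . ! or ? optionally followed by closing quotes or a bracket.
END_SIGN = re.compile(r'[.!?]["\'\u201d\u2019)]*$')

def rebuild_paragraphs(lines, width=None):
    """Rejoin OCR lines into paragraphs using the 'would it have fitted' test."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    if width is None:
        # Assumption: the longest line approximates the full text width.
        width = max(len(ln) for ln in lines)
    paras = [lines[0]]
    for prev, cur in zip(lines, lines[1:]):
        first_word = cur.split()[0]
        # Would the next word (plus a space) have fitted on the previous line?
        would_fit = len(prev) + 1 + len(first_word) <= width
        if END_SIGN.search(prev) and would_fit:
            paras.append(cur)           # the line was cut short on purpose: new paragraph
        else:
            paras[-1] += " " + cur      # the line was simply full: same paragraph
    return paras
```

With proportional fonts, character counts are only a crude proxy for width, which is one reason a sketch like this would not be expected to match the success rate of a tool that can measure the real layout.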
04-17-2015, 02:02 PM | #23 |
Grand Sorcerer
Posts: 27,553
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I'm sure it works wonderfully, Tox, but anything less than a 100% success rate would be unacceptable to me in this regard. I don't OCR, nor deal with OCR source, though, so line unwrapping just isn't something I need to deal with these days. I'll happily leave creating quality ebooks from bad to mediocre OCR to others.
Last edited by DiapDealer; 04-17-2015 at 02:07 PM. |
04-17-2015, 03:58 PM | #24 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
It is not 100%, but rather around 95% I think. Depends on the book itself of course. If there is no OCR source this method will not work of course. No way to automate it fully and correctly then.
|