Search regex problem - Page 2

eschwartz · 04-15-2015, 11:59 PM

Could easily be a PDF conversion or some such. On that note, the line unwrapping factor is a very important thing to consider when converting from PDF!!

cybmole · 04-16-2015, 01:20 AM

one pattern to rule them all would be good, except that you can construct so many pathological borderline cases of mid-sentence breaks, with speech and punctuation marks, that it's probably impossible.
the use of full stops for things other than sentence ends doesn't help either.
e.g. I doubt any one automated rule could detect and fix a mid-sentence break like

"Is that Mr.
Smith ? "
she said

It's been a while since I've wanted to fix up one of these - nowadays I just read well-formed books - but my preference was to use 2 or 3 passes with different simple rules
e.g. one that focuses on para start checks, one that focuses on para ends, and maybe another one for speech issues.
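
A minimal sketch of that multi-pass idea, assuming the book is XHTML with one `<p>` element per (possibly broken) paragraph. The three rules and the list of titles are illustrative assumptions, not the saved searches mentioned in this thread, and each rule is a heuristic that will misfire on some books (headings, for instance, usually end without a full stop).

```python
import re

# Each pass is one simple rule, applied over the whole text in turn.
# All three patterns are assumptions made for illustration.
PASSES = [
    # Pass 1, para-end check: the paragraph ends in a letter, comma or
    # semicolon, so it probably continues in the next paragraph.
    (re.compile(r'([A-Za-z,;])\s*</p>\s*<p>\s*'), r'\1 '),
    # Pass 2, para-start check: the next paragraph starts with a
    # lowercase letter, so it probably continues the previous one.
    (re.compile(r'\s*</p>\s*<p>\s*(?=[a-z])'), ' '),
    # Pass 3, titles: the paragraph ends in an abbreviation such as "Mr.",
    # where the full stop does not end a sentence.
    (re.compile(r'\b(Mr|Mrs|Ms|Dr)\.\s*</p>\s*<p>\s*'), r'\1. '),
]

def unwrap(html):
    """Run every pass in order and return the repaired markup."""
    for pattern, repl in PASSES:
        html = pattern.sub(repl, html)
    return html

sample = '<p>"Is that Mr.</p>\n<p>Smith ? "</p>\n<p>she said</p>'
print(unwrap(sample))
```

On the "Mr. Smith" example above, pass 2 joins the "she said" line and pass 3 joins the line after "Mr.", while a genuine break such as "He left." followed by "She stayed." is left alone.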

I'm sure there's a big old thread somewhere, for borderline case solutions

phossler · 04-16-2015, 10:18 AM

I know it's way beyond my capabilities, but could a F&R Regex-Function be written to make the multiple passes through to join the text in such malformed books?

I've moved the Sigil Saved Search Regex into Calibre, and it will catch a lot of the cases, but nothing like cybmole's example.

eschwartz · 04-16-2015, 11:24 AM

cybmole -- that is no excuse not to catch as many as you can in one go.

phossler -- regex functions have to act on the result of a single search, so I'm not sure that would work.
You could certainly test the match to see which case it looks like, though.
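
A rough sketch of what "test the match" could look like: one broad search finds every paragraph boundary, and a Python function inspects each match to decide which case it is. The pattern, the `TITLES` set and the helper `fix_breaks` are assumptions for illustration. As far as I understand, calibre's editor Function mode calls `replace` with the match as the first argument plus some extra arguments, which this sketch simply ignores so that it also runs standalone through `re.sub`.

```python
import re

# One broad search: the last word of a paragraph, the break itself, and
# (via lookahead, so it is not consumed) the first character that follows.
BREAK = re.compile(r'([^\s<>]+)\s*</p>\s*<p>\s*(?=(\S))')

TITLES = {'Mr.', 'Mrs.', 'Ms.', 'Dr.'}  # assumed list, extend as needed

def replace(match, *args, **kwargs):
    """Return the joined text for a false break, or the match unchanged."""
    last_word, first_char = match.group(1), match.group(2)
    joined = last_word + ' '
    if last_word in TITLES:
        return joined                 # case: title split from the name
    if first_char.islower():
        return joined                 # case: next line starts mid-sentence
    if last_word[-1].isalnum() or last_word[-1] in ',;':
        return joined                 # case: no terminal punctuation
    return match.group(0)             # looks like a real paragraph break

def fix_breaks(html):
    # Hypothetical standalone driver; inside an editor the tool itself
    # would run the search and call replace() for each match.
    return BREAK.sub(replace, html)

print(fix_breaks('<p>"Is that Mr.</p>\n<p>Smith ? "</p>\n<p>she said</p>'))
```

This still makes a single pass over the text, but each match is routed to the rule that fits it, which gets close to the multi-pass behaviour within one search.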

DiapDealer · 04-16-2015, 12:38 PM

I second the abandonment of source material that is littered with incorrect paragraph breaks. Life's too short for that kind of work.

cybmole · 04-16-2015, 03:42 PM

Yes, it was interesting learning how to do it, and learning regex in the process, but unless you want to read old stuff that is not sold in ebook form, and thus exists only as poor scans, it's not worth it. There are a few books / authors I like, 60s-70s sci-fi, that never got properly republished, e.g. much of Roger Zelazny.

Toxaris · 04-17-2015, 01:02 PM

Guys, I don't agree... It depends on how you do the post-OCR processing. My add-in tries to recreate the original paragraphs and has quite a good success rate. If a line ends with a full stop or other terminal punctuation mark, it checks whether the first word on the next line would have fitted (lengthwise) on the same line. If so, it is probably the end of one paragraph and the beginning of the next. If it wouldn't fit, it is probably a continuation of the same paragraph. It is a fairly fast process.
Of course you need Word for the add-in and the OCR source.
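
The line-fit test described above can be sketched roughly as follows. This is not the add-in's actual code (which is not shown in this thread): it assumes the OCR output is available as a list of physical lines, approximates width by character count, and takes the longest line as the page width. `END_MARKS` and `rebuild_paragraphs` are hypothetical names.

```python
# Characters treated as possible paragraph endings (an assumed list).
END_MARKS = ('.', '?', '!', '"', '\u201d')

def rebuild_paragraphs(lines, width=None):
    """Join OCR lines into paragraphs using the 'would it have fitted' test."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    if width is None:
        # Assumption: the longest line approximates the full line width.
        width = max(len(ln) for ln in lines)
    paragraphs = [lines[0]]
    for prev, cur in zip(lines, lines[1:]):
        first_word = cur.split()[0]
        ends_sentence = prev.endswith(END_MARKS)
        # Would the next word (plus a space) have fitted on the previous line?
        would_fit = len(prev) + 1 + len(first_word) <= width
        if ends_sentence and would_fit:
            paragraphs.append(cur)        # short line ending a sentence: new paragraph
        else:
            paragraphs[-1] += ' ' + cur   # otherwise: same paragraph continues
    return paragraphs

ocr_lines = [
    "The rain had stopped by the time we",
    "reached the station. Nobody was there.",
    "It was late.",
    "Morning came slowly over the hills and",
    "the town woke up.",
]
for para in rebuild_paragraphs(ocr_lines):
    print(para)
```

Here the line ending "Nobody was there." is full, so "It" could not have fitted and the text is treated as a continuation, whereas "It was late." leaves plenty of room, so "Morning" starts a new paragraph. A proportional font would need real width measurements rather than character counts.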

DiapDealer · 04-17-2015, 02:02 PM

I'm sure it works wonderfully, Tox, but anything less than a 100% success rate would be unacceptable to me in this regard. I don't OCR, nor deal with OCR source, though, so line unwrapping just isn't something I need to deal with these days. I'll happily leave creating quality ebooks from bad to mediocre OCR to others.

Toxaris · 04-17-2015, 03:58 PM

It is not 100%, but rather around 95% I think. Depends on the book itself of course. If there is no OCR source this method will not work of course. No way to automate it fully and correctly then.
