Quote:
Originally Posted by icearch
You said that I didn't capture the most common type of broken line which is when it ends with characters, that is what I'm going to achieve.
That's not quite what I said.
I said your regex wouldn't capture the most common type of broken line: namely, the kind of line that ends in a letter.
For example:
Code:
Some day, when we have enough money, we will go to Disney
World.
If what you are suggesting is going through and manually marking off every paragraph break one by one, that's incredibly inefficient. I honestly think it would take a full day. A pretty average 300-page paperback will have around 1800 paragraphs and around 150-200 erroneous breaks in your standard OCR conversion (it'll be many times more if it's a straight PDF conversion). So at an absolute minimum, you're talking about evaluating, one by one, around 2000 matches, and depending on the kind of conversion, it could be several times that.
This is simply not feasible. I don't just mean it's a lot of work. I mean it's impossible to do effectively. This is for the same reason airport security checkers become ineffective after about 20 minutes; your brain is just going to start "autocorrecting" based on expectations.
This is also why you need to find a way to automate the process as much as you can. In my experience (and I've done literally thousands of these), any single search that requires you to individually check more than 180 results is going to cause a noticeable quality drop-off in your final product. It's the difference between doing a readthrough and catching an error every 20-50 pages, and doing a readthrough and finding an error every two to five pages. We're talking an order of magnitude.
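To give a rough idea of what "automating as much as you can" might look like, here is a minimal Python sketch. The heuristic is my own assumption for illustration, not a tested recipe: treat a line break as erroneous when the line ends in a letter or a comma and the next line is not blank. The names `BROKEN_BREAK` and `rejoin_broken_lines` are made up for this example.

```python
import re

# Assumed heuristic (hypothetical, for illustration only): a newline that
# follows a letter or a comma and is followed by non-whitespace is treated
# as an erroneous break. [^\W\d_] matches any Unicode letter.
BROKEN_BREAK = re.compile(r"([^\W\d_]|,)\n(?=\S)")


def rejoin_broken_lines(text: str) -> str:
    """Replace each suspected erroneous break with a single space."""
    return BROKEN_BREAK.sub(r"\1 ", text)


sample = (
    "Some day, when we have enough money, we will go to Disney\n"
    "World.\n"
    "The next paragraph starts here."
)

fixed = rejoin_broken_lines(sample)
print(fixed)
```

On the Disney World example this rejoins the first two lines and leaves the break after "World." alone. It will not handle words hyphenated across a break, and it will wrongly join things like headings or verse that legitimately end in a letter, so those would still need their own passes.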
I'm not here to yuck your yum. You floated a proposed solution, asking for feedback as to whether it's possible. The answer is: Yes, your search is not an inherently broken search. It should return every single instance of a line that ends in the punctuation marks you listed between those brackets (although, like I said, \p{P} may be more comprehensive as a regex solution than listing each punctuation mark that you can think of individually).
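To make the \p{P} point concrete, here is a small Python sketch. Python's built-in `re` module does not understand \p{P} (the third-party `regex` module does), so as a stand-in this checks the Unicode general category of the last character, which is what \p{P} is defined in terms of. The function name `ends_in_punctuation` and the sample lines are just assumptions for the example.

```python
import unicodedata


def ends_in_punctuation(line: str) -> bool:
    """True if the last non-whitespace character is Unicode punctuation.

    Stand-in for a regex like \\p{P}$ : every punctuation general category
    (Po, Pf, Pe, Pd, ...) starts with "P".
    """
    stripped = line.rstrip()
    return bool(stripped) and unicodedata.category(stripped[-1]).startswith("P")


lines = [
    "Some day, when we have enough money, we will go to Disney",
    "World.",
    "\u201cAre we there yet?\u201d",  # ends in a curly closing quote
]

flagged = [line for line in lines if ends_in_punctuation(line)]
print(flagged)
```

The curly closing quote gets flagged even though it is easy to forget in a hand-typed bracket list, which is the advantage of the category-based approach. Note that the first line, the one that is actually broken, is not flagged at all, because it ends in a letter.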
But, as to whether this is an effective way to correct erroneously broken lines, I think that the originally proposed regex solution in the first post has real issues. If the ONLY thing you had to worry about in a given document were correctly reflecting paragraph breaks, I'd STILL say this approach would be problematic. But when you consider that most files that require you to rejoin erroneously broken lines also have a whole host of other issues (often related to the OCR process), it is, strictly in my opinion, a misallocation of mental (and temporal) resources to spot-check every single instance of a line ending in a punctuation mark.