09-12-2019, 03:36 PM | #1 |
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
|
Regex to count line wraps?
I'm finding that a lot of files that were converted from PDF have line wrap issues. Tons of line breaks in the middle of sentences.
The number of paragraphs that start with a lowercase letter would be a great indicator of PDF conversion linewrap issues. Is it possible to create a regex that counts those occurrences and saves the count in a column? This would be a great measure of quality. Perhaps even the ratio of lower/uppercase paragraph starts. Please help |
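A minimal Python sketch of the count being asked for, assuming each paragraph in the converted book is a `<p>` element. The names `PARA_START`, `paragraph_start_counts`, and `lowercase_share` are made up for illustration and are not part of calibre; saving the result to a column is not shown here.

```python
import re

# Assumption: a paragraph is a <p ...> element. Skip whitespace and inline
# tags (e.g. <i>, <span>) after the opening tag, then capture the first letter.
# \b keeps <pre> and similar tags from matching.
PARA_START = re.compile(r"<p\b[^>]*>(?:\s|<[^>]+>)*([^\W\d_])")


def paragraph_start_counts(html):
    """Return (lowercase_starts, uppercase_starts) for one HTML string."""
    lower = upper = 0
    for match in PARA_START.finditer(html):
        first = match.group(1)
        if first.islower():
            lower += 1
        elif first.isupper():
            upper += 1
    return lower, upper


def lowercase_share(html):
    """Fraction of letter-initial paragraphs that start in lowercase."""
    lower, upper = paragraph_start_counts(html)
    total = lower + upper
    return lower / total if total else 0.0
```

Paragraphs that open with a quote mark or a digit are ignored by this sketch, so the share only reflects paragraphs whose first character is a letter.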
09-12-2019, 05:31 PM | #2 |
null operator (he/him)
Posts: 20,550
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
I'm not 100% sure, but I think Sigil's ePubTidyTool plugin might have an option to fix broken paragraphs ==>> https://www.mobileread.com/forums/sh...d.php?t=264378
If you have access to a recent MS Word you could try opening the PDF in it, and then you can use Toxaris's eBook Tools MS Word add-in, which has specific tools to help with PDFs. This add-in can also create the epub. The code it generates is generally considered to be much 'cleaner' than that of other Word to epub converters; I believe some people use it for that feature alone. Another useful Word add-in is TransTools for Word / About (not free), which has some specific PDF cleanup features. It has a fair degree of overlap with the other one, but if you're doing a lot of PDF->EPUB it has a couple of things the other one doesn't. BR |
09-13-2019, 07:48 AM | #3 |
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
|
Thanks BetterRed,
I'll give it a look. It's helpful to repair these files, but if I had a regex tool to count these page breaks that start with lowercase I think I could just avoid them entirely and find a better version. |
09-13-2019, 08:04 AM | #4 |
null operator (he/him)
Posts: 20,550
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Have you tried googling for "regex to find lines in HTML that start with lower case"?
BR |
09-13-2019, 11:01 AM | #5 |
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
I don't know of a Calibre plugin that will LOG the counts of a REGEX term into the Library (DB)
Sigil's editor can give you a count of matches for the current search (scope) before you pull the trigger (and shoot yourself in the foot). (I thought Calibre's Edit had that feature, but I could not find it. Neither can LOG the result.) |
09-13-2019, 11:50 AM | #6 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The search menu in the editor lets you count the number of matches
|
09-13-2019, 11:58 AM | #7 |
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
|
09-13-2019, 01:01 PM | #8 |
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
|
Thanks all, much appreciated! Love this community. The goal is to identify the troubled files so they can be tuned up. Trying to figure out a batch way to search docs and either count or set a threshold for these specific line breaks that start with a lowercase letter.
Here's a simple example: (<p.*>[a-z]) But I have no idea how I could run that kind of HTML search across all the documents, or how I could tag or filter the results. |
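One way to run that kind of search in batch outside calibre, sketched with the Python standard library only. It assumes the books are EPUBs (ZIP archives of HTML files) sitting in a folder; `count_lowercase_starts`, `flag_suspect_books`, and the threshold of 50 are hypothetical names and values chosen for illustration. The pattern uses `[^>]*` rather than `.*` so that the match stays inside the opening `<p>` tag.

```python
import re
import zipfile
from pathlib import Path

# Tighter variant of (<p.*>[a-z]): stay within the opening tag, allow
# leading whitespace, and do not match <pre> or other tags starting with p.
PARA_LOWER = re.compile(r"<p\b[^>]*>\s*[a-z]")
HTML_SUFFIXES = (".html", ".htm", ".xhtml")


def count_lowercase_starts(epub_file):
    """Count <p> elements starting with a-z in the HTML files of an EPUB.

    epub_file may be a path or a binary file-like object.
    """
    total = 0
    with zipfile.ZipFile(epub_file) as book:
        for name in book.namelist():
            if name.lower().endswith(HTML_SUFFIXES):
                text = book.read(name).decode("utf-8", errors="replace")
                total += len(PARA_LOWER.findall(text))
    return total


def flag_suspect_books(folder, threshold=50):
    """Return {filename: count} for EPUBs at or above the threshold."""
    flagged = {}
    for path in sorted(Path(folder).rglob("*.epub")):
        try:
            count = count_lowercase_starts(path)
        except zipfile.BadZipFile:
            continue  # not a readable EPUB, skip it
        if count >= threshold:
            flagged[path.name] = count
    return flagged
```

Calling `flag_suspect_books` on a folder of EPUBs returns the files worth a closer look; tagging them inside the calibre library would still need a plugin or a manual step.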
09-13-2019, 11:05 PM | #9 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You would need to write a plugin such as the quality check plugin for this kind of thing.
|
09-15-2019, 01:30 PM | #10 |
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
|
Thanks Kovid, that sounds about right. The word count plugin has most of the moving parts, as it counts words and stores that count in a column. Just need to change the word it's looking for to represent the page breaks / lowercase scenario.
Best place for it would probably be the Quality Check plugin, but it doesn't have a unified search across document types unfortunately. Thanks again for the direction on this. Such a great platform and community. |
09-15-2019, 03:13 PM | #11 |
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
|
Alternatively, there are a couple of tweaks we could make to the unwrap lines feature in the heuristics section. Seems like the majority of issues are not length based, but fall into two scenarios.
1. A line break in the middle of a sentence, with no punctuation preceding it and a lowercase letter at the start of the next line.
2. A line break that falls between the opening and closing quotes of a sentence.
Options to drop line breaks within those two scenarios would have a major impact on the readability of these documents. |
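A rough Python sketch of the two scenarios, working on a plain list of paragraph strings. This is not calibre's heuristics code; `TERMINAL`, `has_open_quote`, `unwrap`, and the `join_open_quotes` flag are hypothetical names. The second rule is off by default because a break inside quotes can be intentional.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes/brackets.
TERMINAL = re.compile(r"""[.!?:;…]["'”’)\]]*\s*$""")


def has_open_quote(text):
    """True if text opens a double quote that it never closes."""
    curly = text.count("“") - text.count("”")
    straight = text.count('"') % 2
    return curly > 0 or straight == 1


def unwrap(paragraphs, join_open_quotes=False):
    """Merge paragraphs that look like wrapped lines of the previous one."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        if merged:
            prev = merged[-1]
            # Scenario 1: no closing punctuation, next line starts lowercase.
            mid_sentence = not TERMINAL.search(prev) and para[0].islower()
            # Scenario 2 (optional): previous text left a quote open.
            in_quote = join_open_quotes and has_open_quote(prev)
            if mid_sentence or in_quote:
                merged[-1] = prev + " " + para
                continue
        merged.append(para)
    return merged
```

Keeping the two rules as separate conditions makes it easy to expose each as its own option, which matches the idea of adding them as separate choices.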
09-15-2019, 05:29 PM | #12 |
null operator (he/him)
Posts: 20,550
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
The second case might be what the author intended, see ==>> punctuation - Why does the multi-paragraph quotation rule exist? - English Language & Usage Stack Exchange
BR |
09-15-2019, 09:12 PM | #13 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I'm certainly open to adding more options to heuristics, perhaps an option such as "Fix truncated lines" or similar. However, that is not my code, so I am not particularly eager to work on it myself. Patches welcome. The relevant code is in conversion/preprocess.py IIRC.
|
|