#1
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
Regex to count line wraps?
I'm finding that a lot of files that were converted from PDF have line wrap issues. Tons of line breaks in the middle of sentences.
The number of paragraphs that start with a lowercase letter would be a great indicator of PDF conversion line-wrap issues. Is it possible to create a regex that counts those occurrences and saves the count in a column? This would be a great measure of quality, perhaps even the ratio of lowercase to uppercase paragraph starts. Please help.
#2
null operator (he/him)
Posts: 22,018
Karma: 30277294
Join Date: Mar 2012
Location: Sydney Australia
Device: none
I'm not 100% sure, but I think Sigil's ePubTidyTool plugin might have an option to fix broken paragraphs ==>> https://www.mobileread.com/forums/sh...d.php?t=264378
If you have access to a recent MS Word, you could try opening the PDF in it and then using Toxaris's eBook Tools MS Word add-in, which has specific tools to help with PDFs. This add-in can also create the epub. The code it generates is generally considered to be much 'cleaner' than that of other Word-to-epub converters; I believe some people use it for that feature alone.
Another useful Word add-in is TransTools for Word (not free), which has some specific PDF cleanup features. It has a fair degree of overlap with the other one, but if you're doing a lot of PDF->EPUB it has a couple of things the other one doesn't.
BR
#3
Junior Member
Thanks BetterRed,
I'll give it a look. It's helpful to repair these files, but if I had a regex tool to count these line breaks that start with lowercase, I think I could just avoid the files entirely and find a better version.
#4
null operator (he/him)
Have you tried googling for "regex to find lines in HTML that start with lower case"?
BR
#5
Well trained by Cats
Posts: 31,269
Karma: 61916422
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2, Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
I don't know of a Calibre plugin that will LOG the counts of a REGEX term into the Library (DB).
Sigil's editor can give you a count of matches for the current search (scope) before you pull the trigger (and shoot yourself in the foot). I thought Calibre's editor had that feature, but I could not find it. Neither can LOG the result.
#6
creator of calibre
Posts: 45,609
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The search menu in the editor lets you count the number of matches.
#8
Junior Member
Thanks all, much appreciated! Love this community. The goal is to identify the troubled files so they can be tuned up. I'm trying to figure out a batch way to search docs and either count, or set a threshold for, these specific line breaks that start with a lowercase letter.
Here's a simple example: (<p.*>[a-z]). But I have no idea how I could run that kind of HTML search across all the documents, or how I could tag or filter the results.
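For anyone who wants to try this outside calibre, here is a minimal Python sketch of the counting idea, not a finished tool. It assumes the books are EPUBs (which are zip archives of HTML files) and uses only the standard library. The function names `count_paragraph_starts` and `count_in_epub` are made up for this example, and the pattern uses `[^>]*` instead of the greedy `.*` above so that a match cannot run past the end of the `<p>` tag. Writing the result into a calibre column is left out.

```python
import re
import zipfile

# First letter of each paragraph, skipping whitespace and any inline
# opening tags (such as <span> or <i>) that come before the text.
# \b keeps <pre> and similar tags from being mistaken for <p>.
PARA_START = re.compile(r"<p\b[^>]*>\s*(?:<[^/][^>]*>\s*)*([A-Za-z])")


def count_paragraph_starts(html):
    """Return (lowercase_starts, uppercase_starts) for one HTML string."""
    lower = upper = 0
    for match in PARA_START.finditer(html):
        if match.group(1).islower():
            lower += 1
        else:
            upper += 1
    return lower, upper


def count_in_epub(path):
    """Sum the counts over every HTML file inside one EPUB (a zip archive)."""
    lower = upper = 0
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                text = book.read(name).decode("utf-8", errors="replace")
                lo, up = count_paragraph_starts(text)
                lower += lo
                upper += up
    return lower, upper
```

Looping `count_in_epub` over a folder of books and comparing `lower / (lower + upper)` against a threshold you pick yourself would give a rough "probably badly wrapped" flag. Paragraphs that start with a quote mark or a digit are simply not counted in this sketch.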
#9
creator of calibre
You would need to write a plugin, such as the Quality Check plugin, for this kind of thing.
#10
Junior Member
Thanks Kovid, that sounds about right. The word count plugin has most of the moving parts, as it counts words and stores that count in a column. I'd just need to change what it's looking for so that it matches the line break / lowercase scenario.
The best place for it would probably be the Quality Check plugin, but unfortunately it doesn't have a unified search across document types. Thanks again for the direction on this. Such a great platform and community.
#11
Junior Member
Alternatively, there are a couple of tweaks we could make to the unwrap lines feature in the heuristics section. It seems like the majority of issues are not length-based, but fall into two scenarios:
1. A line break in the middle of a sentence, with no punctuation preceding it and a lowercase letter at the start of the next line.
2. Sentences that have a line break between the opening and closing quotes.
Options to drop line breaks in those two scenarios would have a major impact on the readability of these documents.
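To make the two scenarios concrete, here is a rough Python sketch that works on a list of paragraph strings. It is only an illustration of the rules described above, not calibre's heuristics code; the names `unwrap`, `has_open_quote` and `join_open_quotes` are invented for this example. The quote rule is off by default, since joining inside quotes is the more aggressive of the two.

```python
import re

# Previous paragraph ends a sentence: terminal punctuation, optionally
# followed by closing quotes or brackets and trailing whitespace.
SENTENCE_END = re.compile(r"""[.!?:;…]["'”’)\]]*\s*$""")


def has_open_quote(text):
    """True if the text opens a double quote that it never closes."""
    curly = text.count("“") - text.count("”")
    straight = text.count('"') % 2
    return curly > 0 or straight == 1


def unwrap(paragraphs, join_open_quotes=False):
    """Merge paragraphs that look like accidental line breaks."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        if merged:
            prev = merged[-1]
            # Scenario 1: no closing punctuation, next line starts lowercase.
            mid_sentence = not SENTENCE_END.search(prev) and para[0].islower()
            # Scenario 2 (optional): previous text is still inside a quote.
            in_quote = join_open_quotes and has_open_quote(prev)
            if mid_sentence or in_quote:
                merged[-1] = prev + " " + para
                continue
        merged.append(para)
    return merged
```

For example, `unwrap(["The cat sat on the", "mat and purred.", "Then it left."])` joins the first two entries and leaves the third alone.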
#12
null operator (he/him)
The second case might be what the author intended; see ==>> punctuation - Why does the multi-paragraph quotation rule exist? - English Language & Usage Stack Exchange
BR
#13
creator of calibre
I'm certainly open to adding more options to heuristics, perhaps an option such as "Fix truncated lines" or similar. However, that is not my code, so I am not particularly eager to work on it myself; patches welcome. The relevant code is in conversion/preprocess.py, IIRC.