Old 09-12-2019, 03:36 PM   #1
kboogie222
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
Regex to count line wraps?

I'm finding that a lot of files that were converted from PDF have line wrap issues. Tons of line breaks in the middle of sentences.

The number of paragraphs that start with a lowercase letter would be a great indicator of PDF conversion linewrap issues.

Is it possible to create a regex that counts those occurrences and saves the count in a column?

This would be a great measure of quality. Perhaps even the ratio of lower/uppercase paragraph starts.

Please help
Old 09-12-2019, 05:31 PM   #2
BetterRed
null operator (he/him)
Posts: 20,550
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
I'm not 100% sure, but I think Sigil's ePubTidyTool plugin might have an option to fix broken paragraphs ==>> https://www.mobileread.com/forums/sh...d.php?t=264378

If you have access to a recent version of MS Word you could try opening the PDF in it, and then you can use Toxaris's eBook Tools MS Word add-in, which has specific tools to help with PDFs. This add-in can also create the epub. The code it generates is generally considered to be much 'cleaner' than other Word to epub converters; I believe some people use it for that feature alone.

Another useful Word add-in is TransTools for Word (not free), which has some specific PDF cleanup features. It has a fair degree of overlap with the other one, but if you're doing a lot of PDF->EPUB it has a couple of things the other one doesn't.

BR
Old 09-13-2019, 07:48 AM   #3
kboogie222
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
Thanks BetterRed,
I'll give it a look. It's helpful to be able to repair these files, but if I had a regex tool to count the paragraph breaks that start with a lowercase letter, I think I could just avoid those files entirely and find a better version.
Old 09-13-2019, 08:04 AM   #4
BetterRed
null operator (he/him)
Posts: 20,550
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Have you tried googling for "regex to find lines in HTML that start with lower case"?

BR
Old 09-13-2019, 11:01 AM   #5
theducks
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
I don't know of a Calibre plugin that will LOG the counts of a REGEX term into the Library (DB).

Sigil's editor can give you a count of matches for the current search (scope) before you pull the trigger (and shoot yourself in the foot).

(I thought Calibre's Edit had that feature, but I could not find it. Neither can LOG the result.)
Old 09-13-2019, 11:50 AM   #6
kovidgoyal
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The search menu in the editor lets you count the number of matches.
Old 09-13-2019, 11:58 AM   #7
theducks
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by kovidgoyal View Post
The search menu in the editor lets you count the number of matches.

The one place I did not look
Old 09-13-2019, 01:01 PM   #8
kboogie222
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
Thanks all, much appreciated! Love this community. The goal is to identify the troubled files so they can be tuned up. I'm trying to figure out a way to batch-search documents and either count, or set a threshold for, paragraphs that start with a lowercase letter.
Here's a simple example:
(<p.*>[a-z])

But I have no idea how you could run that kind of html search across all the documents or how I could tag or filter the results.
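For illustration only, here is a minimal Python sketch of the counting idea, including the lowercase/uppercase ratio mentioned in the first post. It is not taken from any calibre or Sigil code; the function names are made up for this example. It also assumes each paragraph begins with plain text right after the <p> tag, so paragraphs that open with an inline tag or a quotation mark are not counted either way.

```python
import re

# Opening <p> tag, optional whitespace, then the first letter of the text.
# [^>]* keeps the match inside a single tag, unlike the greedy .* in the
# pattern above, and \b stops <pre> or <param> from being treated as <p>.
PARA_START = re.compile(r"<p\b[^>]*>\s*([A-Za-z])")


def count_paragraph_starts(html):
    """Return (lowercase_starts, uppercase_starts) for one HTML string."""
    lower = upper = 0
    for match in PARA_START.finditer(html):
        if match.group(1).islower():
            lower += 1
        else:
            upper += 1
    return lower, upper


def lowercase_ratio(html):
    """Share of counted paragraphs that start with a lowercase letter."""
    lower, upper = count_paragraph_starts(html)
    total = lower + upper
    return lower / total if total else 0.0


sample = (
    '<p class="a">The quick brown fox</p><p>jumped over the</p>'
    "<p>lazy dog. It slept.</p><p>Then it woke.</p>"
)
print(count_paragraph_starts(sample), lowercase_ratio(sample))
```

A high ratio would only be a hint of bad line wrapping, not proof, since some books legitimately start paragraphs in lowercase.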
Old 09-13-2019, 11:05 PM   #9
kovidgoyal
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You would need to write a plugin such as the quality check plugin for this kind of thing.
Old 09-15-2019, 01:30 PM   #10
kboogie222
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
Thanks Kovid, that sounds about right. The word count plugin has most of the moving parts, as it counts words and stores that count in a column. It would just need to look for the line break / lowercase scenario instead of words.

The best place for it would probably be the Quality Check plugin, but unfortunately it doesn't have a unified search across document types.

Thanks again for the direction on this. Such a great platform and community.
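Until something like that exists as a plugin, a rough standalone script can approximate the batch part outside calibre. The sketch below is an assumption-laden example, not plugin code: it only handles EPUB files (which are zip archives), it does not write anything into a calibre column, and the function names and the default threshold of 50 are invented for illustration.

```python
import re
import zipfile
from pathlib import Path

# Paragraphs whose text starts with a lowercase letter.
PARA_LOWER = re.compile(r"<p\b[^>]*>\s*[a-z]")
HTML_SUFFIXES = (".html", ".htm", ".xhtml")


def count_in_epub(epub_path):
    """Count lowercase paragraph starts across the HTML files in one EPUB."""
    total = 0
    with zipfile.ZipFile(epub_path) as book:
        for name in book.namelist():
            if name.lower().endswith(HTML_SUFFIXES):
                text = book.read(name).decode("utf-8", errors="replace")
                total += len(PARA_LOWER.findall(text))
    return total


def flag_suspect_books(folder, threshold=50):
    """Return {filename: count} for EPUBs at or above the threshold."""
    results = {}
    # rglob walks subfolders too, so nested library folders are covered.
    for epub_path in sorted(Path(folder).rglob("*.epub")):
        count = count_in_epub(epub_path)
        if count >= threshold:
            results[epub_path.name] = count
    return results
```

Calling flag_suspect_books on a folder of books returns the files worth a closer look; what threshold makes sense would need to be tuned against a few known good and known bad conversions.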
Old 09-15-2019, 03:13 PM   #11
kboogie222
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
Alternatively, there are a couple of tweaks we could make to the unwrap lines feature in the heuristics section. It seems like the majority of issues are not length based, but fall into two scenarios:
1. A line break in the middle of a sentence, with no punctuation preceding it and a lowercase letter at the start of the next line.
2. A line break that falls between an opening and a closing quote.

Options to drop line breaks within those two scenarios would have a major impact on the readability of these documents.
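To make the first scenario concrete, here is a small Python sketch of the rule. It is not calibre's heuristics code and makes no claim about how the existing unwrap feature works; the function name and the set of sentence-ending characters are assumptions for the example. The second scenario is left out on purpose.

```python
import re

# Each <p>...</p> block; DOTALL lets a paragraph span several source lines.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL)

# Characters treated as a proper end of paragraph (an assumption).
SENTENCE_END = (".", "!", "?", ":", ";", '"', "'", "”", "’")


def merge_broken_paragraphs(html):
    """Return paragraph texts with mid-sentence breaks joined back together."""
    merged = []
    for match in PARAGRAPH.finditer(html):
        text = match.group(1).strip()
        if not text:
            continue  # skip empty paragraphs
        starts_lower = text[0].islower()
        if merged and starts_lower and not merged[-1].endswith(SENTENCE_END):
            # Previous paragraph has no closing punctuation and this one
            # starts in lowercase: treat the break as a bad line wrap.
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


html = (
    "<p>It was a dark and</p><p>stormy night.</p>"
    "<p>The rain fell.</p><p>and then stopped.</p>"
)
print(merge_broken_paragraphs(html))
```

Requiring both conditions (no closing punctuation before the break and lowercase after it) keeps the last paragraph in the example untouched, which is the conservative choice.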
Old 09-15-2019, 05:29 PM   #12
BetterRed
null operator (he/him)
Posts: 20,550
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
The second case might be what the author intended, see ==>> punctuation - Why does the multi-paragraph quotation rule exist? - English Language & Usage Stack Exchange

BR
Old 09-15-2019, 09:12 PM   #13
kovidgoyal
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I'm certainly open to adding more options to heuristics, perhaps an option such as "Fix truncated lines" or similar. However, that is not my code, so I am not particularly eager to work on it myself; patches welcome. The relevant code is in conversion/preprocess.py, IIRC.