04-26-2016, 12:22 PM | #1 |
Head of lunatic asylum
Posts: 349
Karma: 77620
Join Date: Jun 2012
Location: UTC +1
Device: Tolino Vision 3HD
|
Delete paragraphs in scanned books (S & R with regexes)
Scanned books often show unwanted paragraph breaks in the viewer or on the e-reader at the places where the printed book's page numbers were. The patterns look different from book to book, but they appear en masse, so eliminating them with S & R and regexes would be a real advantage. The syntax marked in red should be deleted. Note that the book page numbers always differ (of course). Where are our great regex masters!? Some examples from different books: Example 1 Code:
keine Anzeichen für körperliche Mängel zu erkennen. </p>
<p class="calibre2">Normal? Der US-Geheimdienst OSS (Office of Strategic 169</p>
<p class="calibre2"></p>
<p class="calibre2">Studies, Vorläufer der CIA), oder genauer, der von ihm
Example 2 Note the hyphen, which should also be deleted. Code:
derartigen Mangel hingewiesen hätten, aber die ärztlichen Feststel-170</p>
<p class="calibre2"></p>
<p class="calibre2">lungen lauteten nach dem Krieg nicht anders als
Example 3 Code:
die natürlich ihre Blöße nicht deckten, denn es war </p>
<p class="calibre2">17</p>
<p class="calibre2"></p>
<p class="calibre2">keiner anwesend (außer mir), der nicht mindestens seine
Example 3a Code:
das viel zu herb und zu modisch für sie ist, irgendein <b class="calibre3">19</b></p>
<p class="calibre2"></p>
<p class="calibre2">Zeug, das, glaube ich, Taiga heißt, noch in der Wohnung
Example 4 Note Roman rather than Arabic numerals! Code:
bewundernden Kommentare von westlichen Besuchern in Maos China, XVI </p>
<p class="calibre2"></p>
<p class="calibre2">dass Chinesen außerordentliche Menschen seien, die es
Example 5 Code:
ihr Büro war für die [306] Sicherheit eines Parkabschnitts zuständig.
Last edited by chaot; 06-02-2016 at 02:27 PM. Reason: add Interna, Example 3a |
04-26-2016, 12:53 PM | #2 |
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
I guess you didn't touch the conversion parameters before you started your conversion work, so everything is on the standard setup. For PDF conversion this is not the best choice.
You need to set a better line unwrapping factor for PDF input files. The standard is .45; try something between .15 and .25 |
04-26-2016, 01:19 PM | #3 |
Head of lunatic asylum
Posts: 349
Karma: 77620
Join Date: Jun 2012
Location: UTC +1
Device: Tolino Vision 3HD
|
Sorry, this has nothing to do with PDF; these are all excerpts from what are (now) EPUBs, which were previously scanned from real books.
Last edited by chaot; 04-26-2016 at 02:09 PM. Reason: add: from (now) EPUBs, were before real books scanned. |
04-26-2016, 05:44 PM | #4 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Question: Is there an actual space before the final closing </p>? And can it actually be relied upon?
In my experience, I wouldn't trust this at all, and would have to check each one on a case-by-case basis. I definitely wouldn't rely completely on a Replace All.
Regex Solutions
I would handle this specific cleanup in a few passes. First, make sure that you SAVE A COPY before you do anything. Then make sure you don't press Replace All unless you know exactly what you are doing (and have tested a few matches to make sure the Regex is working properly). Even then, make sure you do a code comparison of the Before/After to make sure you didn't delete key parts of the text.
Before Examples
I would just do a simple Search and Replace to strip out all:
<p class="calibre2"></p>
and
<p class="calibre2"/>
Example #1-3
If you run the above Search/Replaces, then examples #1-3 can be condensed into this:
Search: [0-9]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*
Note: In these examples, Red denotes the part of the Regex that matches the page numbers (here [0-9]+), and Blue denotes the whitespace part (\s+).
Note: In English, the Red portion says "look for 1 or more digits in a row". The Blue portion says "look for 1 or more whitespace characters".
Note: There can be legitimate usages of numbers (for example, years/dates/ages). Be careful.
Example #4
Search: [IXVL]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*
Note: In English, Red ([IXVL]+) says "look for 1 or more of the characters 'I', 'X', 'V', 'L' in a row". This should match Roman numerals like "IX", "XIII", "XXIV".
Note: "I" is used very often in English, so be careful.
Note: Make sure you have the "Case-sensitive" button turned on.
Example #5
Search: \[[0-9]+\]
Replace: *BLANK OR A SPACE*
Note: In English, Red says "look for a left bracket" + "look for 1 or more digits in a row" + "look for a right bracket".
After Examples
Beyond that point, you stated that hyphens should be removed... I would strongly recommend against this. Each one of these has to be checked on a case-by-case basis. The hyphen may actually be a hard hyphen (for example, a word like "all-purpose" might have been broken across pages).
For checking hyphens at the end of paragraphs, I personally run this regex:
Search: -</p>\s+<p>
Replace: *BLANK*
It shouldn't be too bad manually correcting these. In reality, you only have to check a handful of hyphens that were at the end of pages.
I would highly recommend learning at least the basics of Regex: http://www.regular-expressions.info/quickstart.html
There is also a huge "Regex examples" thread in the Sigil section of the forums: https://www.mobileread.com/forums/sho...d.php?t=167971
These examples you posted are relatively easy.
Side Note: Thanks for saving your example images as PNG. Vastly superior compared to people who post screenshots as JPG.
Last edited by Tex2002ans; 04-26-2016 at 06:12 PM. |
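The passes described in the post above can be tried outside the editor with Python's standard `re` module. This is only a sketch under stated assumptions: the sample markup is abbreviated from Examples 1, 4 and 5, the names `SAMPLE`, `EMPTY_PARAGRAPHS`, `REGEX_PASSES` and `clean` are hypothetical, the `\s*` before `</p>` is an added assumption to tolerate the trailing space seen in Example 4, and the leading space in the bracket pattern follows the variant reported to work later in the thread. It does not replace checking each match in the editor.

```python
import re

# Sample markup abbreviated from Examples 1, 4 and 5 of the first post.
SAMPLE = (
    '<p class="calibre2">Normal? Der US-Geheimdienst OSS (Office of Strategic 169</p>\n'
    '<p class="calibre2"></p>\n'
    '<p class="calibre2">Studies, Vorläufer der CIA)</p>\n'
    '<p class="calibre2">Besuchern in Maos China, XVI </p>\n'
    '<p class="calibre2"></p>\n'
    '<p class="calibre2">dass Chinesen</p>\n'
    '<p class="calibre2">ihr Büro war für die [306] Sicherheit</p>\n'
)

# Pass 0: plain (non-regex) strings that are stripped first.
EMPTY_PARAGRAPHS = ('<p class="calibre2"></p>', '<p class="calibre2"/>')

# Later passes, in order: (search pattern, replacement).
REGEX_PASSES = [
    # Examples 1-3: Arabic page number at the end of a paragraph.
    (r'[0-9]+\s*</p>\s+<p class="calibre2">', ''),
    # Example 4: Roman page number (Python's re is case-sensitive by default).
    (r'[IXVL]+\s*</p>\s+<p class="calibre2">', ''),
    # Example 5: bracketed page number, including the space in front of it.
    (r' \[[0-9]+\]', ''),
]


def clean(html):
    """Run the literal pass first, then each regex pass in order."""
    for literal in EMPTY_PARAGRAPHS:
        html = html.replace(literal, '')
    for pattern, replacement in REGEX_PASSES:
        html = re.sub(pattern, replacement, html)
    return html


cleaned = clean(SAMPLE)
print(cleaned)
```

Running the passes on a copy like this makes it easy to inspect the result before pressing Replace All in the real book.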
04-27-2016, 01:40 PM | #5 |
Head of lunatic asylum
Posts: 349
Karma: 77620
Join Date: Jun 2012
Location: UTC +1
Device: Tolino Vision 3HD
|
@Tex2002ans, thank you very much!
That looks like a lot of work - and you will probably be able and willing to help with other related questions. Quote:
Quote:
Note: I am adding an Example 3a to Post #1 (same book as Example 3). I often lack Internet access, so I cannot treat the whole catalog of problems at once; that means proceeding selectively. Simple things first. Quote:
Probably you mean key parts of the code!? Quote:
Code:
ihr Büro war für die [306] Sicherheit
Code:
ihr Büro war für die Sicherheit [2 blank spaces]
Code:
ihr Büro war für die Sicherheit [4 blank spaces]
Don't be angry, I'm fairly sure the solution (for eliminating the extra space) can be found somewhere - only I would like a quick little sense of achievement right now. What's the difference in S&R between the settings Regex and Regex-Function? Quote:
Quote:
It might be worth creating something like a (regex examples) library out of all these examples - you know, cataloged and without the blah-blah. Last edited by chaot; 04-27-2016 at 01:45 PM. |
04-27-2016, 02:45 PM | #6 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
There is also Calibre's built-in compare: "File" -> "Compare to another book".
The code (HTML tags) or the text (words/sentences). Both of these might get broken if you make a mistake when typing your Regex! You might make a typo and accidentally change: Code:
<p>This is a sample sentence. 192</p>
<p>This is a sample sentence too.</p>
Code:
<p>This is a sample sentence. This is a sample sentence too.</p>
Code:
<p>This is a sampleThis is a sample sentence too.</p>
I just did this a few days ago... I accidentally typed an extra period in my Regex, and the second character of words was deleted ("Then" -> "Ten", "Suing" -> "Sing"). I didn't notice until later in the day that I had made the mistake, and I had to manually correct many of the words.
Nothing is special about Example 3a.
Search: <b class="calibre3">[0-9]+</b></p>\s+<p class="calibre2">
All that was added was the Blue code (the <b class="calibre3"> and </b> tags around the number).
Note: If it was up to me, I would strip out all the crap/useless code FIRST... then I could treat Example 3a just like Example 3.
Quote:
You could also just add spaces in the Regex to match your specific book. For instance, Example #5 can turn into:
Search: *SPACE*\[[0-9]+\]*SPACE*
Also, you can just do a normal Search/Replace after everything to manually fix the "lots of spaces in a row" problem:
Search: *SPACE**SPACE*
Replace: *SPACE*
Quote:
I never used it before... but Regex-Function seems to allow you to use Python code for more powerful Search/Replace.
Yes, I believe Sigil/Calibre use the same Regex Engine. At least all of the Regexes I have tested work in both Sigil and Calibre.
Quote:
As you can see, a book might have:
It would make no sense to create a giant list of Regex for each of those... because they all follow the same basic rules! I would just visit regular-expressions.info and follow along with the tutorials. It has lots of examples to learn from! Last edited by Tex2002ans; 04-27-2016 at 02:53 PM. |
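The before/after check recommended in this post can also be approximated with a few lines of Python. The sketch below is an assumption-laden illustration, not a calibre feature: `visible_words` and `lost_words` are hypothetical helper names, and the "typo" pattern is an invented over-eager regex that reproduces the broken sample sentence shown in the post. It reports every word that disappeared from the visible text, ignoring pure digit strings such as the page numbers that were meant to go.

```python
import re
from collections import Counter

BEFORE = (
    '<p>This is a sample sentence. 192</p>\n'
    '<p>This is a sample sentence too.</p>'
)


def visible_words(html):
    """Count the words of the visible text, with tags replaced by spaces."""
    text = re.sub(r'<[^>]+>', ' ', html)
    return Counter(re.findall(r'\w+', text))


def lost_words(before, after):
    """Return words that vanished between before and after, except numbers."""
    removed = visible_words(before) - visible_words(after)
    return sorted(word for word in removed if not word.isdigit())


# Intended regex: removes only the page number and the paragraph break.
good = re.sub(r'[0-9]+</p>\s+<p>', '', BEFORE)

# Hypothetical typo: an extra ".{11}" also swallows the text before the number.
bad = re.sub(r'.{11}[0-9]+</p>\s+<p>', '', BEFORE)

print(lost_words(BEFORE, good))
print(lost_words(BEFORE, bad))
```

An empty list for the intended regex and a non-empty one for the faulty regex is the signal to look for. Roman numerals and legitimately removed text would also be reported, so this is a review aid rather than a verdict.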
04-27-2016, 03:12 PM | #7 |
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
IMHO what is also important is the ORDER in which you fix them. If you don't get it right, the next fix (or join) will be more difficult.
I remove all Page Header types (Section/Title or Author) with a page number first (this takes more than 1 template, as there are right/left side variations).
I believe the text paragraph that includes the page # is near the last thing I fix (I just look and write the needed REGEX now).
Learn basic REGEX, |
|
04-27-2016, 04:23 PM | #8 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
For example, here is some hideous code right out of an InDesign EPUB: Quote:
Quote:
Diap's Editing Toolbag is great for cleaning up code: https://www.mobileread.com/forums/sho....php?p=2980740 It is also great for helping get rid of a ton of the useless classes (<span class="no-style-override-5">), or changing certain tags into other tags (<span class="no-style-override-4"> -> <i>). Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z. And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:
Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know Finereader does a pretty great job at ignoring Headers/Footers, and never exporting them in the first place. |
04-27-2016, 05:24 PM | #9 |
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
It's for personal use, so I am not dropping big $ on a better OCR that would only get occasional use |
|
04-27-2016, 06:05 PM | #10 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Depending on how much time you waste on having to clean up the headers/footers in the OCR, it might be best to preprocess those images (with Scan Tailor) and crop the headers/footers right out, so that the OCR program can just focus on the body text. [Images: Original Scan, Scan Tailor, Cropping]
2 column source... Luckily, I rarely come across that either, although I would probably do something similar (come up with an Imagemagick way to split the pages in half). I may be contacting you via PM for some examples soon (or you could always contact me). |
|
04-27-2016, 06:10 PM | #11 |
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
|
|
04-27-2016, 06:10 PM | #12 |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Tex2002ans,
Sigil uses the PCRE library, whereas calibre uses Matthew Barnett's enhanced python regex module. The difference is that PCRE supports a couple of extensions the python module doesn't... but for the most part they provide the same features. (You cannot capitalize captured text in a calibre regex, but you can use a function replace instead. There are always multiple ways to fix the same problem.) |
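The idea of a "function replace" can be illustrated with Python's standard `re.sub`, which accepts a function in place of the replacement string. This is a standalone sketch, not calibre's Regex-function mode (which expects its own function signature, not reproduced here); the `<h2>` markup and the name `uppercase_heading` are made up for the illustration.

```python
import re


def uppercase_heading(match):
    """Rebuild the match with the captured heading text in upper case."""
    # group(1) = opening tag, group(2) = heading text, group(3) = closing tag
    return match.group(1) + match.group(2).upper() + match.group(3)


html = '<h2>the central coast</h2>\n<p>body text stays as it is</p>'

# The function is called once per match and its return value is substituted.
fixed = re.sub(r'(<h2>)(.*?)(</h2>)', uppercase_heading, html)
print(fixed)
```

A plain replacement string can only reinsert captured groups unchanged, whereas the function can transform them, which is the gap the post above refers to.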
04-28-2016, 01:29 PM | #13 |
Head of lunatic asylum
Posts: 349
Karma: 77620
Join Date: Jun 2012
Location: UTC +1
Device: Tolino Vision 3HD
|
@Tex2002ans: Example #5 works fine with Search: *SPACE*\[[0-9]+\]
It is strange: I knew that, it just did not occur to me yesterday. Are these the unmistakable signs!? Some of you know: my access to the Internet is very limited, and for days or weeks nonexistent. That is when I read the books that I optimize with your help. Now I am taking time out again! Please, don't get off the track too much. I have to read all that stuff and then understand it, you know!? And please, not so many foreign words, technical terms etc., and don't forget the samples, photos ... well, that's an old story. Some of you already do very well. No names mentioned. |
|