Old 04-26-2016, 12:22 PM   #1
chaot
Head of lunatic asylum
Posts: 349
Karma: 77620
Join Date: Jun 2012
Location: UTC +1
Device: Tolino Vision 3HD
Delete paragraphs in scanned books (S & R with regexes)



Scanned books often show unwanted paragraph breaks in the viewer or on the e-reader at the places where the printed book's page numbers were.

The patterns vary, but they appear en masse, so removing them with S & R and regexes would be a big help.
The marked code (red in the screenshots) should be deleted. Note that the page numbers differ every time, of course.

Where are our great regex masters!?

Some examples from different books:

[Attachment 148233: Example 1.png]
Example 1
Code:
keine Anzeichen für körperliche Mängel zu erkennen. </p>

  <p class="calibre2">Normal? Der US-Geheimdienst OSS (Office of Strategic 169</p>

  <p class="calibre2"></p>

  <p class="calibre2">Studies, Vorläufer der CIA), oder genauer, der von ihm
[Attachment 148237: Example 2.png]
Example 2. Note the hyphen, which should also be deleted.
Code:
derartigen Mangel hingewiesen hätten, aber die ärztlichen Feststel-170</p>

  <p class="calibre2"></p>

  <p class="calibre2">lungen lauteten nach dem Krieg nicht anders als
[Attachment 148236: Example 3.png]
Example 3
Code:
die natürlich ihre Blöße nicht deckten, denn es war </p>

  <p class="calibre2">17</p>

  <p class="calibre2"></p>

  <p class="calibre2">keiner anwesend (außer mir), der nicht mindestens seine
[Attachment 148272: Example 3a.png]
Example 3a
Code:
das viel zu herb und zu modisch für sie ist, irgendein <b class="calibre3">19</b></p>

  <p class="calibre2"></p>

  <p class="calibre2">Zeug, das, glaube ich, Taiga heißt, noch in der Wohnung
[Attachment 148238: Example 4.png]
Example 4 Note Roman rather than Arabic numerals!
Code:
bewundernden Kommentare von westlichen Besuchern in Maos China, XVI </p>

  <p class="calibre2"></p>

  <p class="calibre2">dass Chinesen außerordentliche Menschen seien, die es
[Attachment 148239: Example 5.png]
Example 5
Code:
ihr Büro war für die [306] Sicherheit eines Parkabschnitts zuständig.
Interna: Ex1&2 Bedürftig (AHdAb), Ex3&3a Böll (AeC), Ex4&5 Chang (WS)

Last edited by chaot; 06-02-2016 at 02:27 PM. Reason: add Interna, Example 3a
Old 04-26-2016, 12:53 PM   #2
Divingduck
Wizard
Posts: 1,161
Karma: 1404241
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
I guess you didn't touch the conversion parameters before converting, so everything is still on the default settings. For PDF conversion that is not the best choice.

You need to set a better line unwrapping factor for PDF input files. The default is 0.45; try something between 0.15 and 0.25.
Old 04-26-2016, 01:19 PM   #3
chaot
Sorry, this has nothing to do with PDF; these are all excerpts from what are now EPUBs, which were scanned from real printed books.

Last edited by chaot; 04-26-2016 at 02:09 PM. Reason: add: from (now) EPUBs, were before real books scanned.
Old 04-26-2016, 05:44 PM   #4
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Question: Is there an actual space before the final closing </p>? And can it actually be relied upon?

In my experience, I wouldn't trust this with a ten foot pole, and would have to check each one on a case-by-case basis. I definitely wouldn't completely rely on a Replace All.

Regex Solutions

I would handle this specific cleanup in a few passes.

First, make sure that you SAVE A COPY before you do anything. Then make sure you don't press Replace All unless you know exactly what you are doing (and have tested a few to make sure the Regex is working properly). Even then, make sure you do a code comparison of the Before/After to make sure you didn't delete key parts of the text.

Before Examples

I would just do a simple Search and Replace to strip out all:

<p class="calibre2"></p>

and

<p class="calibre2"/>
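
For anyone who wants to try this first pass outside the editor, here is a minimal Python sketch. It is not from the original post: the function name strip_empty_paragraphs and the sample string are made up for illustration, and only the class name calibre2 comes from the examples in post #1.

```python
import re

# Matches <p class="calibre2"></p> (optionally with whitespace inside)
# and the self-closing form <p class="calibre2"/>.
EMPTY_P = re.compile(r'<p class="calibre2">\s*</p>|<p class="calibre2"\s*/>')


def strip_empty_paragraphs(html):
    """Remove empty calibre2 paragraphs; everything else is left alone."""
    return EMPTY_P.sub("", html)


# Illustrative sample modelled on Example 1.
sample = (
    'Strategic 169</p>\n\n'
    '  <p class="calibre2"></p>\n\n'
    '  <p class="calibre2">Studies, Vorläufer der CIA)'
)
cleaned = strip_empty_paragraphs(sample)
print(cleaned)
```

The leftover blank lines are harmless in HTML; they only matter if a later search expects an exact amount of whitespace, which is why the searches below use \s+.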

Example #1-3

If you run the above Search/Replaces, then example #1-3 can be condensed into this:

Search: [0-9]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In these examples, Red denotes the part of the Regex that matches the page numbers.

Note: In English, the Red portion, [0-9]+, says "look for 1 or more digits in a row".

The Blue portion, \s+, says "look for 1 or more whitespace characters".

Note: There can be legitimate usages of numbers (for example, years/dates/ages). Be careful.
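
To check what this search would touch before committing to it, something like the following Python sketch may help. It assumes the empty paragraphs have already been stripped; the helper name preview and the shortened sample strings are illustrative, not part of the original post.

```python
import re

# The search from above: digits, closing </p>, whitespace, next calibre2 paragraph.
PAGE_BREAK = re.compile(r'[0-9]+</p>\s+<p class="calibre2">')


def preview(html, context=20):
    """Return each match with a little preceding text, for manual review."""
    hits = []
    for m in PAGE_BREAK.finditer(html):
        start = max(0, m.start() - context)
        hits.append(html[start:m.end()])
    return hits


# Shortened versions of Examples 1 to 3, with the empty paragraph removed.
example_1 = 'Office of Strategic 169</p>\n\n  <p class="calibre2">Studies, Vorläufer'
example_2 = 'ärztlichen Feststel-170</p>\n\n  <p class="calibre2">lungen lauteten'
example_3 = 'denn es war </p>\n\n  <p class="calibre2">17</p>\n\n  <p class="calibre2">keiner'

for text in (example_1, example_2, example_3):
    print(preview(text))
    print(repr(PAGE_BREAK.sub("", text)))
```

In this sketch Examples 1 and 2 are joined into running text. In Example 3 the number sits in a paragraph of its own, so the number disappears but the two text paragraphs stay separate; that case may need its own pass.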

Example #4

Search: [IXVL]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for 1 or more of the characters 'I', 'X', 'L', 'V' in a row". This should match roman numerals like "IX", "XIII", "XXIV".

Note: "I" is used very often in English, so be careful.

Note: Make sure you have the "Case-sensitive" button turned on.
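
A small Python sketch of the same idea follows. It differs from the search above in two ways, both of which are assumptions rather than part of the original post: \s* tolerates the space that Example 4 shows before the closing tag, and \b keeps the match from starting in the middle of a word.

```python
import re

# Roman numeral (I, V, X, L only), optional whitespace, then the paragraph break.
ROMAN_BREAK = re.compile(r'\b[IVXL]+\s*</p>\s+<p class="calibre2">')

# Shortened version of Example 4, with the empty paragraph already removed.
example_4 = 'Besuchern in Maos China, XVI </p>\n\n  <p class="calibre2">dass Chinesen'
print(ROMAN_BREAK.sub("", example_4))

# A legitimate capital "I" at the end of a paragraph is matched too,
# which is why every hit should be checked by hand.
risky = 'So did I</p>\n\n  <p class="calibre2">Next paragraph'
print(bool(ROMAN_BREAK.search(risky)))

# Python's re module is case-sensitive by default, so lowercase "xvi" is ignored.
lower = 'China, xvi </p>\n\n  <p class="calibre2">dass'
print(bool(ROMAN_BREAK.search(lower)))
```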

Example #5

Search: \[[0-9]+\]
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for a left bracket" + "look for 1 or more numbers in a row" + "look for a right bracket".

After Examples

Beyond that point, you stated that hyphens should be removed... I would strongly recommend against this. Each one of these has to be checked on a case-by-case basis. The hyphen may actually be a hard hyphen (for example, the word "all-purpose" might have been broken across pages).

For checking hyphens at the end of paragraphs, I personally run this regex:

Search: -</p>\s+<p>
Replace: *BLANK*

It shouldn't be too bad manually correcting these. In reality, you only have to check a handful of hyphens that were at the end of pages.
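
One way to make that manual check easier is to list the candidates instead of replacing them. The Python sketch below is only an illustration: the helper name hyphen_candidates and the sample text are invented, and the pattern accepts any attributes on the opening p tag, which is a small widening of the search shown above.

```python
import re

# Word fragment + hyphen at the end of a paragraph, then the first word
# of the next paragraph. <p\b[^>]*> allows attributes such as class="calibre2".
HYPHEN_BREAK = re.compile(r'(\w+)-</p>\s+<p\b[^>]*>(\w+)')


def hyphen_candidates(html):
    """Return (left, right, joined) for every paragraph-final hyphen."""
    return [
        (m.group(1), m.group(2), m.group(1) + m.group(2))
        for m in HYPHEN_BREAK.finditer(html)
    ]


sample = (
    'die ärztlichen Feststel-</p>\n\n  <p class="calibre2">lungen lauteten\n'
    'an all-</p>\n<p>purpose tool'
)
for left, right, joined in hyphen_candidates(sample):
    print(f'{left}-{right}  ->  {joined} ?')
```

Here "Feststellungen" is a correct join, while "allpurpose" is not; only a human (or a dictionary check) can tell the two apart.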

I would highly recommend learning at least the basics of Regex:

http://www.regular-expressions.info/quickstart.html

There is also a huge "Regex examples" thread in the Sigil section of the forums:

https://www.mobileread.com/forums/sho...d.php?t=167971

These examples you posted are relatively easy.

Side Note: Thanks for saving your example images as PNG. Vastly superior compared to people who post screenshots as JPG.

Last edited by Tex2002ans; 04-26-2016 at 06:12 PM.
Old 04-27-2016, 01:40 PM   #5
chaot
@Tex2002ans, thank you very much!

That looks like a lot of work, and I hope you will be able and willing to help with other related questions.

Quote:
Originally Posted by Tex2002ans View Post
Question: Is there an actual space before the final closing </p>? And can it actually be relied upon?
At the moment I can't answer that; I have to investigate first.

Quote:
In my experience, I wouldn't trust this with a ten foot pole, and would have to check each one on a case-by-case basis. I definitely wouldn't completely rely on a Replace All.
Nothing beats experience! The examples shown come from only 3 books; hundreds or thousands more are waiting. So even if it is very time-consuming, I will take your advice and be careful with Replace All.
Note: Adding an Example 3a in Post #1 (same book as Example 3)

I often lack Internet access, so I cannot tackle the whole catalog of problems at once and have to be selective. Simple things first.

Quote:
Regex Solutions

I would handle this specific cleanup in a few passes.

First, make sure that you SAVE A COPY before you do anything. Then make sure you don't press Replace All unless you know exactly what you are doing (and have tested a few to make sure the Regex is working properly). Even then, make sure you do a code comparison of the Before/After to make sure you didn't delete key parts of the text.
Code comparison: which tool does the job best? I have Beyond Compare, Meld and KDiff3.
Probably you mean key parts of the code!?

Quote:
Example #5

Search: \[[0-9]+\]
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for a left bracket" + "look for 1 or more numbers in a row" + "look for a right bracket".
Replace: *BLANK OR A SPACE* will transform
Code:
ihr Büro war für die [306] Sicherheit
into
Code:
ihr Büro war für die  Sicherheit [2 blank spaces]
respectively into
Code:
ihr Büro war für die   Sicherheit [3 blank spaces]
The regex here should, however, remove one blank space along with [306].
Don't be angry; I'm fairly sure the solution (for removing the space) can be found
somewhere, but I would like a quick little sense of achievement right now.

What's the difference in S&R between the Regex and Regex-function settings?

Quote:
I would highly recommend learning at least the basics of Regex
Will do! I am very interested.

Quote:
There is also a huge "Regex examples" thread in the Sigil section of the forums:

https://www.mobileread.com/forums/sho...d.php?t=167971
Stupid question!? Do these regexes also work in calibre?
It might be worth turning all those examples into something like a regex examples library: you know, cataloged and without the blah-blah.

Last edited by chaot; 04-27-2016 at 01:45 PM.
Old 04-27-2016, 02:45 PM   #6
Tex2002ans
Quote:
Originally Posted by chaot View Post
Code comparison: which tool does the job best? I have Beyond Compare, Meld and KDiff3.
I personally use Beyond Compare (I found it was more accurate compared to the other programs I tested).

There is also Calibre's built-in compare: "File" -> "Compare to another book".

Quote:
Originally Posted by chaot View Post
Probably you mean key parts of the code!?
The code (HTML tags) or the text (words/sentences). Both of these might get broken if you made a mistake when typing your Regex!

You might have made a typo and accidentally changed:

Code:
<p>This is a sample sentence. 192</p>

<p>This is a sample sentence too.</p>
into:

Code:
<p>This is a sample sentence.

This is a sample sentence too.</p>
or:

Code:
<p>This is a sampleThis is a sample sentence too.</p>
Sometimes it is very hard to spot the error, and you don't see it until hours later when it is too late (you already made hundreds of other changes and corrections).

I just did this a few days ago... I accidentally typed an extra period in my Regex, and the second character of many words was deleted ("Then" -> "Ten", "Suing" -> "Sing"). I didn't notice until later in the day that I had made the mistake, and I had to manually correct many of the words.
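
If no dedicated diff tool is at hand, a before/after comparison can also be sketched with Python's standard difflib module. The helper name show_changes is made up, and the strings are the sample sentences from above; this is an illustration, not a replacement for a proper compare tool.

```python
import difflib


def show_changes(before_text, after_text):
    """Return unified-diff lines between two versions of the same file."""
    return list(
        difflib.unified_diff(
            before_text.splitlines(),
            after_text.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


before = (
    '<p>This is a sample sentence. 192</p>\n'
    '\n'
    '<p>This is a sample sentence too.</p>'
)
# Result of a faulty replace that swallowed part of the text.
after_bad = '<p>This is a sampleThis is a sample sentence too.</p>'

for line in show_changes(before, after_bad):
    print(line)
```

Lines starting with "-" were removed and lines starting with "+" were added, so a replace that ate real text shows up immediately.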

Quote:
Originally Posted by chaot View Post
Note: Adding an Example 3a in Post #1 (same book as Example 3)
Nothing is special about Example 3a.

Search: <b class="calibre3">[0-9]+</b></p>\s+<p class="calibre2">

All that was added was the Blue code: the <b class="calibre3"> and </b> tags around the number.

Note: If it was up to me, I would strip out all the crap/useless code FIRST... then I could treat Example 3a just like Example 3.

Quote:
Originally Posted by chaot View Post
The regex here should, however, remove one blank space along with [306].
Don't be angry; I'm fairly sure the solution (for removing the space) can be found
somewhere, but I would like a quick little sense of achievement right now.
I personally just run "Prettify Code" and that fixes the multi-space issues.

You could also just add spaces in the Regex to match your specific book.

Like Example #5 can turn into:

Search: *SPACE*\[[0-9]+\]*SPACE*

Also, you can just do a normal Search/Replace after everything to manually fix the "lots of spaces in a row" problem:

Search: *SPACE**SPACE*
Replace: *SPACE*
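
The effect of the extra space can be tried out in a few lines of Python. This is only a sketch on the sentence from Example 5; the pattern " {2,}" (two or more spaces) is a variation on the double-space search above, not something from the original post.

```python
import re

text = "ihr Büro war für die [306] Sicherheit eines Parkabschnitts"

# Removing only the bracketed number leaves two spaces behind.
only_number = re.sub(r"\[[0-9]+\]", "", text)
print(repr(only_number))

# Including one neighbouring space in the match avoids that.
with_space = re.sub(r" \[[0-9]+\]", "", text)
print(repr(with_space))

# Alternatively, collapse any run of two or more spaces afterwards.
collapsed = re.sub(r" {2,}", " ", only_number)
print(repr(collapsed))
```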

Quote:
Originally Posted by chaot View Post
What's the difference in S&R between the Regex and Regex-function settings?
https://manual.calibre-ebook.com/function_mode.html

I never used it before... but Regex-Function seems to allow you to use Python code for more powerful Search/Replace.
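
As a rough idea of what such a function can look like, here is a sketch using the replace() signature described on the manual page linked above. Everything else is an assumption made for illustration: the search expression \[([0-9]+)\], the constant LAST_PAGE, and the helper run(), which only exists to exercise the function outside calibre.

```python
import re

LAST_PAGE = 400  # hypothetical: last page number of the printed book


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # match is the match object for the search expression,
    # assumed here to be  \[([0-9]+)\]
    page = int(match.group(1))
    if 1 <= page <= LAST_PAGE:
        return ''         # plausible page number: delete it
    return match.group()  # anything else (e.g. a year) is left untouched


def run(text):
    """Exercise replace() with re.sub, passing placeholders for calibre's arguments."""
    return re.sub(
        r'\[([0-9]+)\]',
        lambda m: replace(m, 0, 'test.html', None, None, {}, None),
        text,
    )


print(repr(run('für die [306] Sicherheit, siehe [1984]')))
```

The point of the function is that it can make a decision per match, which a plain replacement string cannot.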

Quote:
Originally Posted by chaot View Post
Stupid question!? Do these regexes also work in calibre?
Yes, I believe Sigil/Calibre use the same Regex Engine. At least all of the Regexes I have tested all work between Sigil/Calibre.

Quote:
Originally Posted by chaot View Post
It might be worth turning all those examples into something like a regex examples library: you know, cataloged and without the blah-blah.
I don't believe there is a collection like that. Once you really learn the basics (by reading regular-expressions.info), you could really come up with all the Regex by yourself. That is what I do for the most part, I just create them on-the-fly as I need them... because each book's code comes with its own problems.

As you can see, a book might have:
  • a bold page number
  • an italic page number
  • a page number on its own line
  • a page number in the middle of text.
  • a bold+italic page number
  • [###]
  • (###)
  • <b class="calibre#">###</b>
  • <b class="block#">###</b>
  • <span class="pagenumber">###</span>
  • <sup>###</sup>
  • [...]

It would make no sense to create a giant list of Regex for each of those... because they all follow the same basic rules!
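
To illustrate the "same basic rules" point, the variants in the list above differ only in the markup around the digits. A hedged Python sketch (the helper page_number_pattern is invented for this example) could build each search from its wrapper:

```python
import re


def page_number_pattern(open_markup="", close_markup=""):
    """Build a regex for 1+ digits wrapped in the given literal markup."""
    return re.compile(re.escape(open_markup) + r"[0-9]+" + re.escape(close_markup))


bracketed = page_number_pattern("[", "]")
parenthesized = page_number_pattern("(", ")")
superscript = page_number_pattern("<sup>", "</sup>")
span = page_number_pattern('<span class="pagenumber">', "</span>")

print(repr(bracketed.sub("", "die [306] Sicherheit")))
print(repr(superscript.sub("", "Text<sup>12</sup> more")))
print(repr(span.sub("", 'a<span class="pagenumber">7</span>b')))
```

re.escape takes care of characters such as [ and ( that have a special meaning in a regex, which is the part that is easy to get wrong when typing these by hand.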

I would just visit regular-expressions.info and follow along with the tutorials. It has lots of examples to learn from!

Last edited by Tex2002ans; 04-27-2016 at 02:53 PM.
Old 04-27-2016, 03:12 PM   #7
theducks
Well trained by Cats
Posts: 29,779
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by Tex2002ans View Post
As you can see, a book might have:
  • a bold page number
  • an italic page number
  • a page number on its own line
  • a page number in the middle of text.
  • a bold+italic page number
  • [###]
  • (###)
  • <b class="calibre#">###</b>
  • <b class="block#">###</b>
  • <span class="pagenumber">###</span>
  • <sup>###</sup>
  • [...]

And all of the above in the same book (OCR of scan)

IMHO what is also important is the ORDER in which you fix them. If you don't get it right, the next fix (or join) will be more difficult.

I first remove all page-header lines (section/title or author) that carry a page number; this takes more than one pattern, as there are left-page and right-page variations.
I believe the text paragraphs that include the page number are among the last things I fix.
(I just look and write the needed regex now.)

Learn basic regex.
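
The idea of a fixed order can be sketched in Python as a list of passes that always run in the same sequence and report how many replacements each one made. The three passes below are illustrative and based on Example 1 from post #1; they are not the actual patterns used for header cleanup.

```python
import re

# (description, pattern, replacement), applied strictly in this order.
PASSES = [
    ("empty paragraphs", r'\s*<p class="calibre2">\s*</p>', ""),
    ("page number at paragraph end", r'[0-9]+\s*</p>\s+<p class="calibre2">', ""),
    ("runs of spaces", r" {2,}", " "),
]


def clean(html):
    """Apply every pass in order; return the result and a per-pass count."""
    report = []
    for name, pattern, replacement in PASSES:
        html, count = re.subn(pattern, replacement, html)
        report.append((name, count))
    return html, report


sample = (
    'Strategic 169</p>\n\n'
    '  <p class="calibre2"></p>\n\n'
    '  <p class="calibre2">Studies'
)
result, report = clean(sample)
print(repr(result))
print(report)
```

If the second pass ran before the first, it would stop at the empty paragraph and the text would not be joined, which is the kind of ordering problem described above.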
Old 04-27-2016, 04:23 PM   #8
Tex2002ans
Quote:
Originally Posted by theducks View Post
IMHO what is also important is the ORDER in which you fix them. If you don't get it right, the next fix (or join) will be more difficult.
Yep! And this is why you should try to normalize/clean the code as much as possible FIRST.

For example, here is some hideous code right out of an InDesign EPUB:

Quote:
<p class="body-text" xml:lang="en-us"><span class="no-style-override-5">The point is, as we can readily see, the ability to</span> <span class="no-style-override-4">foresee</span> <span class="no-style-override-5">an event is not at all equivalent to</span> <span class="no-style-override-4">agreeing</span> <span class="no-style-override-5">to it. Yes, I can full well</span> <span class="no-style-override-4">predict</span> <span class="no-style-override-5">that if I move to the South Bronx, I’ll likely be victimized by street crime. But this is not at</span> <span class="no-style-override-4">all</span> <span class="no-style-override-5">the same thing as</span> <span class="no-style-override-4">acquiescing</span> <span class="no-style-override-5">in such nefarious activities. Yet, according to the “libertarian” argument we are considering, the two are indistinguishable.</span></p>
First thing I do is go through the code and strip it down to this:

Quote:
<p>The point is, as we can readily see, the ability to <i>foresee</i> an event is not at all equivalent to <i>agreeing</i> to it. Yes, I can full well <i>predict</i> that if I move to the South Bronx, I’ll likely be victimized by street crime. But this is not at <i>all</i> the same thing as <i>acquiescing</i> in such nefarious activities. Yet, according to the “libertarian” argument we are considering, the two are indistinguishable.</p>
and then it makes it much easier to do later fixes.
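
For the InDesign snippet above, a regex-only version of that cleanup might look like the Python sketch below. It assumes the spans are never nested and never contain line breaks, and that no-style-override-4 really means italics in this particular book; the function name simplify_spans is made up.

```python
import re


def simplify_spans(html):
    # The class used for emphasis in this book becomes <i>...</i>.
    html = re.sub(r'<span class="no-style-override-4">(.*?)</span>', r'<i>\1</i>', html)
    # The class that carries no visible styling is unwrapped.
    html = re.sub(r'<span class="no-style-override-5">(.*?)</span>', r'\1', html)
    # Drop the attributes on the paragraph tag.
    html = html.replace('<p class="body-text" xml:lang="en-us">', '<p>')
    return html


messy = (
    '<p class="body-text" xml:lang="en-us">'
    '<span class="no-style-override-5">the ability to</span> '
    '<span class="no-style-override-4">foresee</span> '
    '<span class="no-style-override-5">an event</span></p>'
)
print(simplify_spans(messy))
```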

Diap's Editing Toolbag is great for cleaning up code:

https://www.mobileread.com/forums/sho....php?p=2980740

It is also great for helping get rid of a ton of the useless classes (<span class="no-style-override-5">), or changing certain tags into other tags (<span class="no-style-override-4"> -> <i>).

Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z.

And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:
  • calibre2 in Book A might be the page numbers
  • calibre2 in Book B might be italics
  • [...]
  • calibre2 in Book Z might be headings
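
Because the meaning of each calibre# class changes from book to book, it can help to count which tag and class combinations a file actually contains before writing any regex. A small Python sketch (the helper class_usage and the sample markup are illustrative):

```python
import re
from collections import Counter


def class_usage(html):
    """Count how often each (tag, class) pair occurs in the markup."""
    return Counter(re.findall(r'<(\w+)[^>]*\bclass="([^"]+)"', html))


sample = (
    '<p class="calibre2">irgendein <b class="calibre3">19</b></p>'
    '<p class="calibre2"></p>'
    '<p class="calibre2">Zeug, das, glaube ich, Taiga heißt</p>'
)
usage = class_usage(sample)
for (tag, cls), count in usage.most_common():
    print(tag, cls, count)
```

A class that appears once per printed page is a good candidate for page numbers; one that appears thousands of times is probably body text.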

Quote:
Originally Posted by theducks View Post
And all of the above in the same book (OCR of scan)

[...]

I first remove all page-header lines (section/title or author) that carry a page number; this takes more than one pattern, as there are left-page and right-page variations.
Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know Finereader does a pretty great job at ignoring Headers/Footers, and never exporting them in the first place.
Old 04-27-2016, 05:24 PM   #9
theducks
Quote:
Originally Posted by Tex2002ans View Post
Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know Finereader does a pretty great job at ignoring Headers/Footers, and never exporting them in the first place.
PaperPort, the FREE OCR that came with my scanner. Whatever you scan is what it tries to OCR. Two-column source is a pain. Lucky me, I rarely see it.
This is for personal use, so I am not dropping big $ on a better OCR that would get little use.
Old 04-27-2016, 06:05 PM   #10
Tex2002ans
Quote:
Originally Posted by theducks View Post
PaperPort, the FREE OCR that came with my scanner. Whatever you scan is what it tries to OCR. Two-column source is a pain. Lucky me, I rarely see it.
Side Note: Hmmmmm... I have been writing a Scan Tailor tutorial. Maybe I could toss in some semi-related extra pre/postprocessing in the tutorial.

Depending on how much time you waste on having to clean up the headers/footers in the OCR, perhaps it might be best to preprocess those images (with Scan Tailor), and then crop the headers/footers right out, so that the OCR program can just focus on the body text:

Original Scan: [Attachment 148279: OriginalScan.png]
Scan Tailor: [Attachment 148280: ScanTailor.png]
Cropping: [Attachment 148281: Stripped.png]

2 column source... I luckily rarely come across that either. Although I would probably do something similar (come up with an ImageMagick way to split the pages in half). I may be contacting you via PM for some examples soon (or you could always contact me).
Old 04-27-2016, 06:10 PM   #11
theducks
Quote:
Originally Posted by Tex2002ans View Post
2 column source... I luckily rarely come across that either. Although I would probably do something similar (come up with Imagemagick way to split the pages in half). I may be contacting you via PM for some examples soon (or you could always contact me).
Old Analog magazines are fun. It is almost always magazines that have the two-column problem.
Old 04-27-2016, 06:10 PM   #12
eschwartz
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Tex2002ans,

Sigil uses the PCRE library, whereas calibre uses Matthew Barnett's enhanced python regex module.

The difference is that PCRE supports a couple extensions the python module doesn't... but for the most part they provide the same features.

(You cannot capitalize captured text in calibre regex, but you can use a function replace instead. There's always multiple ways to fix the same problem. )
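
A function replace for the capitalization case might look roughly like this. The replace() signature is the one documented for calibre's function mode; the h2 heading, the search expression (<h2>)(.+?)(</h2>) and the helper run() are assumptions made up for the example.

```python
import re


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Search expression assumed to be  (<h2>)(.+?)(</h2>)
    # Upper-case the captured heading text and keep the tags around it.
    return match.group(1) + match.group(2).upper() + match.group(3)


def run(text):
    """Exercise replace() outside calibre, with placeholder arguments."""
    return re.sub(
        r'(<h2>)(.+?)(</h2>)',
        lambda m: replace(m, 0, 'test.html', None, None, {}, None),
        text,
    )


print(run('<h2>erstes kapitel</h2><p>text</p>'))
```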
Old 04-28-2016, 01:29 PM   #13
chaot
@Tex2002ans: Example # 5 works fine with Search: *SPACE*\[[0-9]+\]
It is strange: I knew that, it just had not occurred to me yesterday.
Are these the unmistakable signs!?

Some of you know: my access to the Internet is very limited, and for days or weeks nonexistent. That is when I read the books that I optimize with your help.

Now I'm taking time out again!

Please don't get too far off track. I have to read all that stuff and then understand it, you know!? And please, not so many foreign words, technical terms etc., and don't forget the samples, photos ... well, that's an old story. Some of you already do very well. No names mentioned.