Old 07-18-2023, 03:31 PM   #1
WV-Mike
Connoisseur
Posts: 66
Karma: 10
Join Date: Jul 2023
Device: None
From print to ePub - how I did it.

Greetings,
I have a knack for posting in the wrong forum so I hope I got this one right.

I recently made my first eBook using Sigil.
The eBook was based on an HTML version of a book.

Getting to the HTML version was time-consuming, but I could not think of another way.
If I had OCR software I could have eliminated some of the steps.

1.) Scan all the pages of the book to images
2.) Save all the page images to a PDF
3.) Open and run the OCR tool in Acrobat 6
4.) Convert the PDFs to text using PDF24
5.) Run a script on the text file which added <p> tags to the text blocks
6.) Copy the text blocks into an HTML document.
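
For anyone curious what step 5 might look like, here is a minimal sketch in Python. The actual script is not shown in this thread, so the function name `text_to_paragraphs` and the rule "a blank line separates paragraphs" are assumptions for illustration only.

```python
import html
import re


def text_to_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated block of text in <p> tags."""
    # Assumption: paragraphs in the OCR text are separated by blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Join the hard line breaks inside a block into single spaces.
        joined = " ".join(
            line.strip() for line in block.splitlines() if line.strip()
        )
        if joined:
            # Escape &, < and > so the output stays valid HTML.
            paragraphs.append(f"<p>{html.escape(joined, quote=False)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line of a paragraph\nthat continues here.\n\nTom & Jerry."
    print(text_to_paragraphs(sample))
```

The output can be pasted straight into the body of an HTML file or a Sigil chapter. Headings, italics and similar formatting are not handled here and would still need manual work.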

I know, I know - this is not an HTML forum.

After I had the HTML ready, I copied and pasted each chapter into a Sigil document.
I then added a few scanned images to the Sigil project.
The results are here:

https://www.EpicRoadTrips.us/epub/

Questions:
What would make this process simpler and more efficient?

Thanks,
WV-Mike
Old 07-18-2023, 03:41 PM   #2
JSWolf
Resident Curmudgeon
Posts: 74,662
Karma: 130140792
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
For one, do not use PDF as an intermediary format. It will add all kinds of errors and Acrobat 6 is an old version and may not OCR all that well. Get a good OCR program and use that instead.

Last edited by JSWolf; 07-19-2023 at 04:06 AM.
Old 07-18-2023, 11:43 PM   #3
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by WV-Mike View Post
I recently made my first eBook using Sigil.
Fantastic! Congrats.

And welcome to the forum.

Quote:
Originally Posted by WV-Mike View Post
From print to ePub - how I did it.

[...]

What would make this process simpler and more efficient?
Boy, oh boy... Well, you've come to the right place.

I've been writing about this stuff extensively since 2012.

For some of the most recent topics, see:

and, just last week, I wrote an even bigger summary here which linked to even more of the previous threads:

That should hold you over on all OCRing + PDF->EPUB + DOCX->EPUB info for... oh, about 100 years.

Quote:
Originally Posted by JSWolf View Post
It will add all kinds of errors and Acrobat 6 is an old version and may not OCR all that well. Get a good OCR program and use that instead.
Yes, exactly.

I looked up the date, and it looks like Adobe Acrobat 6 was from 2003! My gods, there have been multiple GENERATIONAL leaps in OCR quality since then.

Getting much more accurate OCR is one of the biggest and most important steps you can take, because EVERY further stage will be based on how clean your initial text is.

You can see the post I wrote about how important accurate OCR is:

When you're creating ebooks... it's not JUST the raw text you have to worry about, but correctly recognizing all the formatting too:
  • Bold / Italics
  • Superscripts / Subscripts
  • Lists
  • Tables
  • Images
  • Headers / Footers
  • [...]

Last edited by Tex2002ans; 07-18-2023 at 11:56 PM.
Old 07-19-2023, 12:49 AM   #4
Karellen
Wizard
Posts: 1,170
Karma: 4949904
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
@Tex2002ans

In your 2020: "OCRing + EPUBing my first book: Tips?" link, you mention Scan Tailor Advanced.

The only release I could find that has an install file is v0.9.11.1 from 2014.
https://github.com/scantailor/scantailor/releases

Is this the same software you are referring to? It seems quite old, and you mention generational leaps to the OP, so I wonder if the same applies to this software.

There is a v0.9.12.1 from 2016, but there does not seem to be any install file associated with that release.
Old 07-19-2023, 03:41 AM   #5
Tex2002ans
Quote:
Originally Posted by Karellen View Post
In your 2020: "OCRing + EPUBing my first book: Tips?" link, you mention Scan Tailor Advanced.

The only release I could find that has an install file is v0.9.11.1 from 2014.
https://github.com/scantailor/scantailor/releases
The exact version of Scan Tailor Advanced I use is by 4lex4:

v1.0.16 was the latest (in 2018).

- - -

Side Note: In September 2019 there was an "Early Access" version, and then it seems like there hasn't been much activity since.

I think, since the 2019 stall, some other person created another fork of it here:

but I have no idea about that fork or what sorts of bugs/fixes have been done since.

- - -

Side Note #2: Looks like you linked to the original "Scan Tailor".

"Scan Tailor Advanced" took all the forks, pulled out all the best features, and combined them all into one super version.

The biggest features for me were:
  • multi-core support (so it runs MUCH faster than the original)
  • image formats besides TIFF

+ lots of other helpful things all listed on their Github.

- - -

Quote:
Originally Posted by Karellen View Post
It seems quite old, [...]
Doesn't matter. It's only used as a middle, pre-processing stage where you are cleaning up the raw images.

I don't foresee too much changing on that front any time soon.
  • You feed it the raw photos/scans.
  • It helps crop + fix the warping + normalize the B&W/grayscale/colors.
  • Then you shove those images into OCR.

You can see me apply it in:

where I quickly:
  • took an Archive.org PDF
  • ran it through Scan Tailor Advanced
  • OCRed it in FineReader
  • exported it as EPUB + ran my regex on it.

You can compare my quickly-generated EPUB vs. the auto-generated Archive.org "EPUB".

Still, nowhere near as good as a manually corrected version, but WAY better quality than just spitting out raw text right out of the PDF.

Quote:
Originally Posted by Karellen View Post
[...] and you mention generational leaps to the OP, so I wonder if the same applies to this software.
Yes, generational leaps in the OCR.

Even on the free/open-source front, there's been a lot of action, but I haven't been following that too closely... Because those tools tended to:
  • focus on generating only the raw plaintext, ditching all the important formatting!
  • be commandline only.
    • (Or have really crappy GUIs.)
    • Okay if you are working on a small amount... but when you have to manually tweak/correct/mark pages, a great GUI is key. It will save you so much pain further down the line.

Last edited by Tex2002ans; 07-19-2023 at 04:06 AM.
Old 07-19-2023, 06:03 AM   #6
WV-Mike
Quote:
Originally Posted by JSWolf View Post
For one, do not use PDF as an intermediary format. It will add all kinds of errors and Acrobat 6 is an old version and may not OCR all that well. Get a good OCR program and use that instead.
Greetings,
I detected few, if any, errors when using Acrobat 6.
I have been looking on eBay for a newer version but haven't purchased one yet.

Thanks,
WV-Mike
Old 07-19-2023, 06:15 AM   #7
WV-Mike

Whew! This is all a bit overwhelming.
I looked at 4lex4 / scantailor-advanced but I cannot see how to use it.
I am used to downloading a .msi or .exe for the installation.
I don't have a clue how to install scantailor-advanced or then use it.

I looked at https://github.com/4lex4/scantailor-advanced#readme
However, I saw no instructions for installing it.

Thanks to everyone for all this info.
As you say: I came to the right place.
WV-Mike

Old 07-19-2023, 07:02 AM   #8
WV-Mike

Quote:
Originally Posted by Tex2002ans View Post

The exact version of Scan Tailor Advanced I use is by 4lex4:
where I quickly:
  • took an Archive.org PDF
  • ran it through Scan Tailor Advanced
  • OCRed it in FineReader
  • exported it as EPUB + ran my regex on it.
FineReader is now by subscription only. It seems they all are now.
I am still looking for a standalone program I can install from a .exe or CD.
Thanks,
WV-Mike
Old 07-19-2023, 08:48 AM   #9
Quoth
the rook, bossing Never.
Posts: 11,690
Karma: 87654321
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Plenty of free and pay once OCR that's good. PDF is a terrible step to include. Better to have TIFF or PNG.
I use Tesseract OCR.
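
As a rough sketch of what that can look like without a GUI, the Python below runs the `tesseract` command-line program over a folder of PNG page images. It assumes Tesseract is already installed and on the PATH. The helper names and the folder layout are made up for illustration, and this is not necessarily how anyone in this thread runs it.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image: Path, out_base: Path,
                            lang: str = "eng", hocr: bool = False) -> list[str]:
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG [hocr]."""
    cmd = ["tesseract", str(image), str(out_base), "-l", lang]
    if hocr:
        # The "hocr" config asks for HTML-like output that keeps layout info.
        cmd.append("hocr")
    return cmd


def ocr_folder(folder: Path, lang: str = "eng") -> list[Path]:
    """OCR every PNG in a folder; Tesseract writes OUTPUTBASE.txt per page."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    outputs = []
    for image in sorted(folder.glob("*.png")):
        out_base = image.with_suffix("")  # page_001.png -> page_001
        subprocess.run(build_tesseract_command(image, out_base, lang), check=True)
        outputs.append(Path(str(out_base) + ".txt"))
    return outputs
```

Calling `ocr_folder(Path("scans"))` on a hypothetical `scans` folder would leave one `.txt` file next to each page image. Plain-text output drops bold, italics and tables, so it only covers the raw-text part of the job.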
Old 07-19-2023, 09:38 AM   #10
WV-Mike

Quote:
Originally Posted by Quoth View Post
Plenty of free and pay once OCR that's good. PDF is a terrible step to include. Better to have TIFF or png.
I use Tesseract OCR.
Can you name a few others which are "free and pay once"?
I have no idea how to use the github.com offerings.

Thanks,
WV-Mike
Old 07-19-2023, 12:23 PM   #11
jmurphy
Zealot
Posts: 100
Karma: 1133068
Join Date: Sep 2007
Device: ipaq
Quote:
Originally Posted by WV-Mike View Post
I looked at https://github.com/4lex4/scantailor-advanced#readme
However, I saw no instructions for installing it.

WV-Mike
Assuming you are using Windows, just download the .exe and double-click it.

https://github.com/4lex4/scantailor-advanced/releases

Under the heading "2019.8.16 Early Access", click "Assets" and download the installer.
Old 07-19-2023, 02:43 PM   #12
WV-Mike

Quote:
Originally Posted by jmurphy View Post
Assuming you are using Windows, just download the .exe and double-click it.

https://github.com/4lex4/scantailor-advanced/releases

Under the heading "2019.8.16 Early Access", click "Assets" and download the installer.
Thanks.
To be clear, this software preps the images prior to running the OCR software.
Is that correct?

WV-Mike
Old 07-19-2023, 03:21 PM   #13
Turtle91
A Hairy Wizard
Posts: 3,119
Karma: 18727091
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Yes.

It helps straighten an image if the capture/camera was slightly off-axis, or de-warp an image if there was any skew. That helps set the characters to the correct orientation and consistent sizing...which makes OCR much better.

Some OCR software will do this a little bit, with differing levels of success.

It is much better to get a very accurate image in the first place. Scan Tailor was originally designed for just that deskewing purpose...although it sounds like they have added more functionality. I'll have to check it out again!
Old 07-19-2023, 03:37 PM   #14
Karellen
Quote:
Originally Posted by Tex2002ans View Post
The exact version of Scan Tailor Advanced I use is by 4lex4:

v1.0.16 was the latest (in 2018).
That's great. Thank you @Tex2002ans
I've installed it and did a quick trial run on an image I was previously getting poor results from, and it OCR'd almost perfectly. In the few minutes I fiddled around with it, it seemed pretty easy to use. But I'll spend some time understanding it better.
I just learnt that images OCR better when using a non-compressed / lossless format.
Old 07-19-2023, 03:39 PM   #15
Karellen
Quote:
Originally Posted by WV-Mike View Post
Can you name a few others which are "free and pay once"?
I have no idea how to use the github.com offerings.

Thanks,
WV-Mike
Try this OCR package... https://github.com/manisandro/gImageReader

As with all things Github, along the right side of the page you will see Releases. Click on that, look for the latest version which is usually at the top or second one down, expand the Assets button and download the appropriate installer.
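
For those who would rather script that lookup, here is a small sketch of the same idea in Python, using GitHub's public releases API. The helper names are hypothetical. It assumes installers end in `.exe` or `.msi` and that the API lists the newest releases first, which is usually but not always the case.

```python
import json
import urllib.request


def pick_installer(releases, suffixes=(".exe", ".msi")):
    """Return (tag, asset name, download URL) for the first release
    in the list that has an asset ending in one of the suffixes."""
    for release in releases:
        for asset in release.get("assets", []):
            if asset["name"].lower().endswith(tuple(suffixes)):
                return (release["tag_name"], asset["name"],
                        asset["browser_download_url"])
    return None  # no release carries a matching installer


def fetch_releases(owner: str, repo: str):
    """Download the release list for a repository as parsed JSON."""
    url = f"https://api.github.com/repos/{owner}/{repo}/releases"
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

Something like `pick_installer(fetch_releases("manisandro", "gImageReader"))` should then give the download link for the first release that ships an installer, which mirrors the "top or second one down" rule above.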