Old 07-19-2020, 06:45 PM   #31
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
pdftotext did the job. Thanks!
Old 07-20-2020, 01:23 PM   #32
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
As a way to re-add them after a PDF was OCRed, do you know of a PDF reader for Windows that can find all words in italics?

SumatraPDF: NO
Foxit Reader: NO
Acrobat Reader: NO
XpdfReader: NO
Old 07-20-2020, 05:36 PM   #33
BetterRed
null operator (he/him)
Posts: 20,457
Karma: 26645808
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Word or Writer can read some PDFs, and find strings with specific formats.

BR
Old 07-20-2020, 05:55 PM   #34
retiredbiker
Addict
Posts: 378
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Ubuntu, Jutoh,Kobo Forma
Quote:
Originally Posted by j.p.s View Post
pdftotext:
https://en.wikipedia.org/wiki/Pdftotext

Also, k2pdfopt, documented in the PDF forum at mobileread.
And pdftotext has an option, -layout, that gives the text file an approximation of the book's indented paragraphs, using spaces. Then, if you use Calibre to convert the text file to EPUB, there is a styling option somewhere (I think the selection is "print") that uses those indent spaces to unwrap lines into paragraphs. I haven't done this for a while, but it worked very well for me with a couple of books.
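To illustrate the idea of using indent spaces to unwrap lines, here is a minimal Python sketch. It assumes the text was produced with something like `pdftotext -layout book.pdf book.txt` (the file names are placeholders) and that each paragraph starts with an indented line. The function name `unwrap_paragraphs` and the `min_indent` threshold are made up for this example; this is not how Calibre does it internally, just the same general heuristic.

```python
def unwrap_paragraphs(text: str, min_indent: int = 2) -> list[str]:
    """Join wrapped lines into paragraphs.

    Heuristic (an assumption, not Calibre's algorithm): a blank line or a
    line indented by at least `min_indent` spaces starts a new paragraph.
    """
    paragraphs: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip(" "))
        if not stripped:
            # Blank line: close the paragraph in progress, if any.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if indent >= min_indent and current:
            # Indented line: the previous paragraph ends here.
            paragraphs.append(" ".join(current))
            current = []
        current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = (
    "    It was a dark night and the\n"
    "rain fell in sheets.\n"
    "    Nobody was out.\n"
)
for paragraph in unwrap_paragraphs(sample):
    print(paragraph)
```

Centered headings, tables and multi-column pages are also padded with spaces by -layout, so expect false paragraph breaks there; as always, it depends on what is in the PDF.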
Old 07-21-2020, 09:05 AM   #35
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
How does pdftotext show italics or bold?

There's no difference in the output when using "-layout". No extra spaces I could use to find those.
[Attached thumbnail: B3BB8DFD-6951-4E4C-A859-2D2F99C29B50.png]

Last edited by Shohreh; 07-21-2020 at 09:08 AM.
Old 07-21-2020, 11:50 AM   #36
retiredbiker
Addict
Posts: 378
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Ubuntu, Jutoh,Kobo Forma
Quote:
Originally Posted by Shohreh View Post
How does pdftotext show italics or bold?

There's no difference in the output when using "-layout". No extras spaces I could use to find those.
It only indents if the original is indented. As always, it depends on what is in the pdf to start with. Every new pdf is an exploration in how to handle it.

And no italics or bold--text only. No magic, just another tool.
Old 07-21-2020, 03:38 PM   #37
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Shohreh View Post
As a way to re-add them after a PDF was OCRed, do you know of a PDF reader for Windows that can find all words in italics?

[...]
Didn't you already say in Post #28 that you ran this PDF through Finereader?

Finereader should have carried over italics and other formatting for you.

Quote:
Originally Posted by Shohreh View Post
How does pdftotext show italics or bold?

There's no difference in the output when using "-layout". No extras spaces I could use to find those.
pdftotext is plaintext only... as are most "get the text out of PDF" tools.

Again, there's a reason why PDF is the absolute worst input format. I even wrote a lot about this back in 2013: "Best way to copy text from a PDF or MOBI?".

Quote:
Originally Posted by Shohreh View Post
That's why it looked like scanned pages, but the text is still selectable like text PDF.
PDFs potentially have two layers:
  • Frontend
    • Bitmap/Image
      • Like a scanned document.
      • You can zoom in and see speckles/defects/lower-resolution.
    • Vector
      • Like a purely digital document (DOCX, InDesign, [...]).
      • You can zoom in and the text/graphics are perfectly crisp.
  • Backend (Text) (Optional)
    • This is the invisible layer you search/copy/paste from.
    • An OCR program is going to create this.
    • Note: There is also such a thing as a "Tagged PDF" file, which does carry over information like headings/italics/bold, but it's rare that people even create these types (let alone tagged properly).

Adobe's ClearScan only messes with that Frontend layer. It takes a Bitmap/Scanned image, then creates "custom fonts" based on the shapes themselves.

So you might have dozens of scanned 'g'-looking shapes:

https://blogs.adobe.com/acrolaw/file...law/003b_G.GIF

It will replace every "scanned g" with a "digital g":

https://blogs.adobe.com/acrolaw/file...law/003a_G.GIF

Next, it'll run across a tilted g (italics), etc. It does this for thousands of unique shapes, and assigns them to digital/vector fonts.

This is why I said it's still a scanned document. It doesn't change the nature of the PDF. It looks digital, like a purely vector document, but it isn't.

In many cases, it's even worse than just having the original scan, because ClearScan may botch the document even worse than expected.

Here's an 'm' scanned at 300dpi, then run through ClearScan:

https://blogs.adobe.com/acrolaw/file..._300_dpi_m.PNG

Potential distortions add up, and other serious errors can creep in.

I don't have a ClearScan document on hand (and I don't use Adobe Acrobat), but here's an example of what I'm talking about when scan->digital goes awry:

[Attached image: Scanned.to.Digital.Distortions.png]

You can see:
  • "of" is squished
  • the em dashes plus period were inconsistently recognized
  • weird random bolding
  • and kerning is especially awful around italics

Last edited by Tex2002ans; 07-21-2020 at 03:57 PM.
Old 07-24-2020, 03:38 AM   #38
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
Thanks very much for the info about layers.

Quote:
Originally Posted by Tex2002ans View Post
Didn't you already say in Post #28 that you ran this PDF through Finereader? Finereader should have carried over italics and other formatting for you.
Because FineReader did not carry formatting, I wanted to try other tools, especially since the PDF contained two layers, so it made sense to extract the "text" layer and see how it compared with running the PDF through FineReader.

Turns out it's still a bit of work to…
  • Re-add formatting (bold, italics, etc.)
  • Fix the hyphenated words that FineReader didn't correct (still much better than starting from the raw pdftotext output, since FineReader uses a dictionary to fix most of those)
  • Re-add footnotes
  • Take pictures of tables and… pictures, and insert them
  • Build a ToC
Old 07-24-2020, 03:49 PM   #39
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Shohreh View Post
Because FineReader did not carry formatting,
What did you export as?

You should see italics/bold showing up in the right half of Finereader:

[Attached image: Finereader.Left.Right.Halves.png]

Left should display the original document, and the Right half should show all the actual OCRed text.

Did you select Document Layout: "Formatted Text"? In the dropdown, you can also select DOCX:

[Attached image: Finereader.Formatted.Text.png]

(Personally, I keep everything on "Exact Copy" until I'm ready to export the document. This makes the Left/Right halves match much more closely, making it easier to make corrections.)

Quote:
Originally Posted by Shohreh View Post
I wanted to try other tools, especially since the PDF contained two layers, so it made sense to extract the "text" layer and see how it compared with running the PDF through FineReader.
I'm going to let you know right now, almost always the text layer is a garbled mess.

It's almost always better to re-OCR and work from scratch (see the PDF+OCR topics I previously linked to).

Quote:
Originally Posted by Shohreh View Post
Re-add formatting (bold, italics, etc.)
Yes, exactly, which is why you want the computer doing that.

Finereader does a better job than any other tool at carrying over this (along with superscript/subscript/tables, etc. etc.).

Quote:
Originally Posted by Shohreh View Post
Some hyphenated words weren't corrected by FineReader (but much better than starting from raw text from pdftotext, since FineReader uses a dictionary to fix most of those)
Yes, the soft/hard hyphen is a problem for anything, but again, Finereader seems to handle these the best.

Still a lot of manual correction needs to be done though, and that's where you use some of the tricks I listed in Post #15.

Spellcheck Lists are a fantastic way to catch/correct these things, along with Regex.
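As a rough illustration of combining a word list with Regex for the hyphen problem, here is a minimal Python sketch. The names `KNOWN_WORDS`, `HYPHEN_BREAK` and `rejoin_hyphenated` are hypothetical, and the tiny set stands in for a real dictionary or spellcheck list. This is not what Finereader does internally; it only shows the general approach of rejoining a word split at a line end when the joined form is a known word.

```python
import re

# Hypothetical stand-in for a real dictionary / spellcheck word list.
KNOWN_WORDS = {"information", "formatting", "well-known"}

# A word fragment, a hyphen at the end of a line, then the next fragment.
HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")


def rejoin_hyphenated(text: str, known_words: set[str]) -> str:
    """Rejoin words split across lines when the result is a known word."""

    def fix(match: re.Match) -> str:
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in known_words:
            # Soft hyphen: drop it and keep the line break after the word.
            return joined + "\n"
        hyphenated = f"{head}-{tail}"
        if hyphenated.lower() in known_words:
            # Hard hyphen: keep it, but pull the word onto one line.
            return hyphenated + "\n"
        # Unknown either way: leave untouched for manual review.
        return match.group(0)

    return HYPHEN_BREAK.sub(fix, text)


sample = "the infor-\nmation is well-\nknown but xyz-\nabc stays"
print(rejoin_hyphenated(sample, KNOWN_WORDS))
```

Anything the word list does not recognise is left as-is, so the leftovers can still be found later by searching for a hyphen followed by a line break.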

Quote:
Originally Posted by Shohreh View Post
Re-add footnotes
Yep, that one's a pain, but there are methods.

Quote:
Originally Posted by Shohreh View Post
Takes pictures of tables and… pictures, and insert them
Finereader should detect all that, and if not, you adjust the recognition boxes.

I explained some of this back in 2014: Post #5 in "Problems converting K2PDF Opt files to EPUB".

Quote:
Originally Posted by Shohreh View Post
Build a ToC
As long as your headings are marked fine (<h1> <h2> <h3> ...), you regenerate that from Sigil.
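To show why properly marked headings are enough to rebuild a ToC, here is a minimal Python sketch that collects <h1> to <h3> headings from a chunk of XHTML. The class `HeadingCollector` and the function `build_toc` are invented for this example and have nothing to do with Sigil's actual implementation; it only assumes the headings are not nested inside each other.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, title) pairs for h1-h3 headings."""

    def __init__(self) -> None:
        super().__init__()
        self.headings = []   # list of (level, title) tuples
        self._level = 0      # 0 means "not inside a heading"
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a heading (including inline tags).
        if self._level:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = 0


def build_toc(xhtml: str):
    parser = HeadingCollector()
    parser.feed(xhtml)
    return parser.headings


sample = "<h1>Part One</h1><p>Text</p><h2>Chapter <i>1</i></h2><h3>Scene</h3>"
for level, title in build_toc(sample):
    print("  " * (level - 1) + title)
```

If the OCR export marks chapter titles as plain bold paragraphs instead of headings, nothing like this can find them, which is why fixing the heading tags comes first.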

Last edited by Tex2002ans; 07-24-2020 at 03:58 PM.
Old 07-26-2020, 07:50 AM   #40
marvin_2
Member
Posts: 24
Karma: 4472
Join Date: Jan 2011
Device: Kindle
Recommended Sigil plugin: Epub Tidy Tool

The Sigil plugin Epub Tidy Tool does a decent job of fixing incorrect line breaks. If you install the text file "IncorrectWords.txt" provided by the author, it will also fix a lot of common OCR errors.

Best to use early in the process, before the thorough proofreading.

Other Tips:

- Think about what quality you want/need in the end. 80/20 applies to OCRing: you can spend way more than 80% of your time finding the last spelling or formatting errors that don't really make a big difference to the reader. For books that I might read more than once, I tend to go with a fairly rough first version, highlight problems on my Kindle (and fix them later in Sigil), then do another iteration before reading the book again a few years later.

- Finereader works well for me. It's worth exploring the options; good settings (e.g. remove headers/footers) save a lot of fixing later.

- Think about what formatting you'd want to keep. OCR does a pretty lousy job if asked to preserve all formatting. You'll end up with lots of text boxes, italics, and superscripts that should not be there and that make a mess of conversions.

- For fiction with no footnotes and little or no bold and italic, you could even consider converting to .txt, formatting chapter headings in Word or LibreOffice, and fine-tuning the ToC and page breaks in Sigil after a conversion in Calibre. And be done in a few hours.
Old 07-28-2020, 12:45 AM   #41
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Above, I shared information about Tesseract. These are the latest (1.7.2020) data I am using. Same for English.


Last edited by roger64; 07-28-2020 at 08:22 PM. Reason: English
Old 08-03-2020, 10:59 AM   #42
Hitch
Bookmaker & Cat Slave
Posts: 11,447
Karma: 157030631
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by Tex2002ans View Post
What did you export as?

You should see italics/bold showing up in the right half of Finereader:

Attachment 180911

Left should display the original document, and the Right half should show all the actual OCRed text.

Did you select Document Layout: "Formatted Text". In the dropdown, you can also select DOCX:

Attachment 180910

(Personally, I keep everything on "Exact Copy" until I'm ready to export the document. This makes the Left/Right halves match much more closely, making it easier to make corrections.)


(snippage for brevity)

I explained some of this back in 2014: Post #5 in "Problems converting K2PDF Opt files to EPUB".



As long as your headings are marked fine (<h1> <h2> <h3> ...), you regenerate that from Sigil.
One trick that you can use, to make your life a bit less horrible, is to take the exported Word file, in the original format/layout and then turn right around and export it to PDF--and then run a COMPARE, for the original PDF versus the new. Now...that only works worth a damn if you already have a text layer in the original pdf, but if you do, this can save you a crapload of braindamage. Take the compare, make the edits.

Take new Word file, cleaned-up, export to PDF, lather-rinse-repeat.

Yes, it's tedious and all that, but it's a shedload less tedious than trying to find all the OCR errors yourself manually. Does it find everything? Oh, hells, no, but it's an option that most people overlook.
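A text-only variant of the same compare idea can be sketched in Python. It assumes you have already dumped both versions to plain text (for example the original text layer and the re-exported document) and it is not the PDF compare feature described above; `word_differences` is a made-up helper built on the standard `difflib` module.

```python
import difflib


def word_differences(original: str, ocr: str) -> list[tuple[str, str]]:
    """Return (original words, OCR words) pairs wherever the texts differ."""
    a, b = original.split(), ocr.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Covers replaced, inserted and deleted runs of words.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


original = "The modern reader turns the page"
ocr = "The modem reader tums the page"
for before, after in word_differences(original, ocr):
    print(f"{before!r} -> {after!r}")
```

Like the PDF compare, this only helps when the original text layer is reasonably clean, and it will not find everything.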

Offered FWIW.

Hitch
Old 08-07-2020, 07:35 AM   #43
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
Thanks for the info.