Old 10-20-2021, 12:43 PM   #1
eduard93
Help with PDF conversion

Hello.

I'm unable to convert this PDF file to mobi (or any other ebook format).

I tried calibre, its underlying ebook-convert, and several online converters; they all skip the text. I can see the text in PDF readers, and pdfinfo reads the file without complaint, but opening it in a PDF editor like LibreOffice shows the cover and 1300 empty pages. pdftotext also returns nothing.

Any idea how to convert this file into an ebook?
Old 10-20-2021, 05:17 PM   #2
Tex2002ans
That PDF has 1707 different "fonts".

If you copy/paste the text out, or try to search it, you can tell it's all complete gibberish.

They ran it through some sort of program that substitutes every character. So on the surface it may LOOK like a "C" + "h" + "a" + "p", but the characters actually stored in the file are nonsense.
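
To make the substitution trick concrete, here is a toy Python model. It is only a sketch and not how any real PDF tool works internally: it assumes a made-up "font" that is just a lookup table from the stored character to the glyph drawn on the page. The names `make_scrambled_font`, `encode`, and `render` are hypothetical, and the fixed rotation stands in for whatever scrambling the real program used.

```python
import string

LETTERS = string.ascii_letters  # a-z followed by A-Z


def make_scrambled_font(shift):
    """Toy 'font': maps each stored character to the glyph that gets drawn.

    Assumption: a simple rotation of the alphabet. The real PDF could use
    any mapping, and a different one for each of its many fonts.
    """
    n = len(LETTERS)
    return {LETTERS[i]: LETTERS[(i + shift) % n] for i in range(n)}


def encode(visible_text, font):
    """What the scrambling program writes into the file for a given page text."""
    inverse = {glyph: stored for stored, glyph in font.items()}
    return "".join(inverse.get(ch, ch) for ch in visible_text)


def render(stored_text, font):
    """What a PDF reader draws: each stored character looked up in the font."""
    return "".join(font.get(ch, ch) for ch in stored_text)


font = make_scrambled_font(shift=7)
stored = encode("Chapter 1", font)

print("drawn on the page:", render(stored, font))
print("copy/paste result:", stored)
```

Text extraction and search only ever see the stored characters, which is why the page looks fine while copy/paste gives garbage.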

You'll have to rerun that entire PDF through actual OCR.

You can use any OCR program you want, like:
  • tesseract (free)
  • ABBYY FineReader (paid)

The accuracy should be quite good, since the text is still vector (you can fully zoom in and it stays perfectly crisp).
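
For the free route, a minimal Python sketch could look like the following. It assumes the `pdftoppm` (from poppler-utils) and `tesseract` command-line tools are installed and on the PATH; the function names, the 300 dpi default, and the `page` file prefix are my own choices for illustration, not anything from the original workflow.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders every page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command; 'stdout' as the output base prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, dpi=300, lang="eng"):
    """Render the PDF to images, OCR each image, and return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: turn the (still crisp, vector) pages into bitmaps.
    subprocess.run(rasterize_cmd(pdf_path, work_dir / "page", dpi), check=True)

    # Step 2: OCR the images in file-name order.
    pages = []
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            ocr_cmd(image, lang), check=True, capture_output=True, text=True
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

Usage would be something like `text = ocr_pdf("book.pdf", "ocr_work")`, where both names are placeholders. With around 1300 pages the PNGs take a fair amount of disk space, so a lower `dpi` may be worth trying first.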

Here's the OCR I got out of Chapter 1 using FineReader 12:

Quote:
Chapter 1

The story of an orphan adopted into a wealthy noble family - What a romantic setting, especially for a girl. If it were a novel or television drama she would be the heroine of her own Cinderella story. The reality was nothing like the stories. Real life isn't a novel or a drama.

When my mother died my estranged father, a wealthy businessman, adopted me. For the crime of suddenly appearing in their lives, my two older half-brothers bullied and harassed me from day one. They were cruel. They insulted me and even pulled pranks with my food. My half-brothers' torment became my new normal. Any hope of reprieve at school was guickly dashed. [...]

[...]

"Father! Look! I was accepted! I was accepted!" I nearly shouted withjoy.

Looks like:
  • the 'q' might accidentally be read as a 'g' in some words
  • a few hyphens/quotation marks were missing
  • the 'j' would sometimes get joined to the previous word

but besides that, looks extremely accurate.

Nothing a little elbow grease couldn't fix up.
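
Part of that elbow grease can be scripted. Below is a rough Python sketch that flags words missing from a word list and proposes the two fixes seen above: swapping a 'g' for a 'q', and splitting a run-together word. The tiny `VOCAB` set is a stand-in I made up for a real dictionary file, and `suggest_fixes` and `review` are hypothetical helper names; every suggestion would still need a human look.

```python
import re


def suggest_fixes(word, vocabulary):
    """Return candidate corrections for one word, or [] if it is already known."""
    lower = word.lower()
    if lower in vocabulary:
        return []

    suggestions = []
    # Fix 1: a 'q' misread as 'g' (e.g. "guickly").
    for i, ch in enumerate(lower):
        if ch == "g":
            candidate = lower[:i] + "q" + lower[i + 1:]
            if candidate in vocabulary:
                suggestions.append(candidate)

    # Fix 2: two words run together (e.g. "withjoy").
    for i in range(1, len(lower)):
        left, right = lower[:i], lower[i:]
        if left in vocabulary and right in vocabulary:
            suggestions.append(left + " " + right)
    return suggestions


def review(text, vocabulary):
    """Map each unknown word that has at least one suggestion to its suggestions."""
    report = {}
    for word in re.findall(r"[A-Za-z']+", text):
        fixes = suggest_fixes(word, vocabulary)
        if fixes:
            report[word] = fixes
    return report


# Assumption: a toy vocabulary; in practice load a full word list instead.
VOCAB = {"was", "quickly", "dashed", "i", "nearly", "shouted", "with", "joy"}

sample = "was guickly dashed. I nearly shouted withjoy."
print(review(sample, VOCAB))
```

The missing hyphens and quotation marks are not covered here; those are easier to catch by proofreading against the page images.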

Last edited by Tex2002ans; 10-20-2021 at 05:20 PM.
Old 10-21-2021, 05:17 AM   #3
eduard93
@Tex2002ans thank you.

You're right, it looks like OCR is the only way.
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.