MobileRead Forums - View Single Post

Tex2002ans · 10-20-2021, 06:17 PM

That PDF has 1707 different "fonts".

If you copy/paste the text out, or try to search, you can tell it's all complete gibberish.

They ran it through some sort of program that substitutes every character. So on the surface, it may LOOK like a "C" + "h" + "a" + "p", but in reality, the text underneath is nonsense.

You'll have to rerun that entire PDF through actual OCR.

You can use any OCR program you want, like:

Tesseract (free)
ABBYY FineReader (paid)

The accuracy should be quite good, since the text is still vector (you can fully zoom in and it stays perfectly crisp).
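If you go the free route, here's a rough sketch of how the "rerun it through OCR" step could be scripted. It only builds the command lines: pdftoppm (from poppler-utils) renders each page to a PNG, then tesseract reads that image. The function name, the page-0001 naming scheme, and the 300 DPI default are my own assumptions, not anything required by either tool. Since the text is vector, you can raise the DPI if the results look rough.

```python
from pathlib import Path


def build_ocr_commands(pdf_path, work_dir, page_count, dpi=300, lang="eng"):
    """Build the per-page command lines for rasterizing and OCRing a PDF.

    Hypothetical helper: it returns argument lists and runs nothing itself.
    Assumes pdftoppm (poppler-utils) and tesseract are installed.
    """
    commands = []
    for page in range(1, page_count + 1):
        # One output base name per page, e.g. work_dir/page-0001
        base = str(Path(work_dir) / f"page-{page:04d}")

        # Render just this page to <base>.png at the chosen resolution.
        # -singlefile stops pdftoppm from appending its own page digits.
        commands.append([
            "pdftoppm", "-r", str(dpi), "-png",
            "-f", str(page), "-l", str(page), "-singlefile",
            str(pdf_path), base,
        ])

        # OCR the rendered image; tesseract writes the text to <base>.txt.
        commands.append(["tesseract", base + ".png", base, "-l", lang])
    return commands
```

To actually run it, loop over the result and call subprocess.run(cmd, check=True) on each entry, then stitch the .txt files together in page order.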

Here's the OCR I got out of Chapter 1 using FineReader 12:

Quote:

Chapter 1

The story of an orphan adopted into a wealthy noble family - What a romantic setting, especially for a girl. If it were a novel or television drama she would be the heroine of her own Cinderella story. The reality was nothing like the stories. Real life isn't a novel or a drama.

When my mother died my estranged father, a wealthy businessman, adopted me. For the crime of suddenly appearing in their lives, my two older half-brothers bullied and harassed me from day one. They were cruel. They insulted me and even pulled pranks with my food. My half-brothers' torment became my new normal. Any hope of reprieve at school was guickly dashed. [...]

[...]

"Father! Look! I was accepted! I was accepted!" I nearly shouted withjoy.

Looks like:

the 'q' might accidentally be read as a 'g' in some words (like "guickly")
a few hyphens/quotation marks were missing
the 'j' would sometimes get fused with the previous word (like "withjoy")

but besides that, looks extremely accurate.

Nothing a little elbow grease couldn't fix up.
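Some of that elbow grease can be semi-automated. Here's a minimal sketch that flags the two letter-level errors listed above (q read as g, and a word fused to a following word that starts with j). The function name and the vocabulary argument are hypothetical: you'd have to supply your own word list, and every suggestion should still be checked by eye rather than applied blindly.

```python
import re


def suggest_fixes(text, vocabulary):
    """Return (word, suggestion) pairs for words that match two OCR error patterns.

    Hypothetical helper: 'vocabulary' is any iterable of known-good words.
    """
    vocab = {w.lower() for w in vocabulary}
    suggestions = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in vocab:
            continue  # known word, nothing to fix

        # Pattern 1: 'q' misread as 'g'. Swap each 'g' for 'q' in turn
        # and keep the first swap that produces a known word.
        for i, ch in enumerate(lower):
            if ch == "g":
                candidate = lower[:i] + "q" + lower[i + 1:]
                if candidate in vocab:
                    suggestions.append((word, candidate))
                    break
        else:
            # Pattern 2: two words fused together where the second starts
            # with 'j'. Split before each 'j' and check both halves.
            for i in range(1, len(lower)):
                if lower[i] == "j" and lower[:i] in vocab and lower[i:] in vocab:
                    suggestions.append((word, lower[:i] + " " + lower[i:]))
                    break
    return suggestions
```

Suggestions come back lowercased, so restore capitalization by hand when applying them. Missing hyphens and quotation marks aren't covered here; those still need a manual pass against the PDF.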