Old 02-18-2014, 01:03 PM   #4
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by lol.systema
Now, they want me to find out a way to maximize the speed of text-to-image processing as well as minimizing damages in texts.
On the hardware side (getting the books into images), definitely follow PeterT's advice and read up at DIY Bookscanner. Lots of people have come up with crazy/amazing/genius contraptions to take pictures of books. Be wary, though: if the lighting is not good enough and/or the DPI is not high enough, the image could LOOK fine for a human to read, but if you try to digitize the text by running it through OCR, the result might be highly inaccurate.

So perfect your method and run it through OCR to see if your setup works, before you start mass digitizing.
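
One rough way to put a number on "does my setup work" is to hand-correct a page or two and compare the raw OCR output against that corrected text. The Python sketch below is only an illustration: the function name `ocr_similarity` is made up for this example, the two sample strings are one sentence from the Jevons excerpt quoted later in this post, and it uses the standard library's `difflib` rather than any OCR program's own accuracy report.

```python
import difflib


def ocr_similarity(ocr_text: str, reference_text: str) -> float:
    """Return a 0..1 similarity ratio between OCR output and a hand-checked reference.

    Hypothetical helper for illustration; 1.0 means the texts are identical
    once whitespace is normalised.
    """
    # Collapse all runs of whitespace so line wrapping does not count as an error.
    ocr_norm = " ".join(ocr_text.split())
    ref_norm = " ".join(reference_text.split())
    matcher = difflib.SequenceMatcher(None, ref_norm, ocr_norm, autojunk=False)
    return matcher.ratio()


# One sentence from the Jevons example: hand-corrected vs. raw OCR.
reference = "This fallacy is in fact the great resource of those who have to support a weak case."
ocr_output = "This fallacy is in fact the great resource of those who have to support a v;-eak case."

score = ocr_similarity(ocr_output, reference)
print(f"similarity: {score:.3f}")
```

If the score on a few test pages stays low after changing lighting or DPI, that is a hint the capture setup needs more work before scanning in bulk. The ratio is a coarse measure, so treat it as a comparison between setups and not as an exact error rate.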

Also, the people on DIY Book Scanner promote a program called ScanTailor, to help automate some of the cleaning up of scans/images:

http://www.diybookscanner.org/forum/...df3d08a4b4fe20

Quote:
Originally Posted by lol.systema
At the beginning they even thought I was talking about employing people to manually do the paper-to-text work processing.. cheesus parmesan rice are they old-fashioned..
Now, onto the actual image/scan -> OCR -> Digital Text. I do a large amount of (scanned) PDF -> EPUB conversion, and I explained a lot of my conversion method in this topic:

https://www.mobileread.com/forums/sho...d.php?t=223817

If you just want a (crappy) OCR text backend on the PDFs that you release, then running them through Finereader should be ok (this would require minimal manpower). This also gives you an "ok" search option through the text of the book.

This is what archive.org does with all of their scans, but you can see the horrors if you have ever downloaded one of their "EPUBs".

Manual Checking of OCR: Depending on the quality of the scan/image, you can probably go at a pace of ~15-200+ pages an hour.

Slowest: HORRIBLE quality/badly marked, very dense text, lots of footnotes/figures/formulas
Fastest: Purely clean, no markings, no writing inside, crisp text, simple (like a novel).

Your pace might be EXTREMELY slow when you first start out, though (when I first started, it used to take me a week or two to fully digitize one book).
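
To turn that pace range into a time budget, the arithmetic is just pages divided by pages per hour. The sketch below is purely illustrative: `proofreading_hours` is a made-up helper, and the 350-page book is a hypothetical figure, not a number from any real project.

```python
def proofreading_hours(pages: int, pages_per_hour: float) -> float:
    """Estimate hours of manual OCR checking for a book (hypothetical helper)."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages / pages_per_hour


book_pages = 350  # assumed page count, for illustration only

# The two ends of the ~15-200 pages/hour range mentioned above.
for label, pace in (("slowest", 15), ("fastest", 200)):
    hours = proofreading_hours(book_pages, pace)
    print(f"{label}: {hours:.1f} hours at {pace} pages/hour")
```

The gap between the two ends of the range is more than tenfold, which is why scan quality matters so much when planning manpower.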

Example: One of the books that I worked on converting was "Elementary Lessons in Logic" by William Stanley Jevons (cleanest Archive.org PDF here):

https://archive.org/details/elementarylesson01jevo

If you download the Archive.org EPUB edition, you can see the OCRed text backend:

https://archive.org/download/element...son01jevo.epub

Take, for example, pages 178-179:

https://archive.org/stream/elementar...e/178/mode/2up

Attached images: Jevonspg178.png, Jevonspg179.png

Attached images: ArchiveEPUBExample.png, MyEPUBExample.png

Side Note: Keep in mind, this PDF is an example of an old book that is QUITE clean. There are HORRORS out there.

Here is the section in the Archive.org EPUB (this is the OCRed text backend):

Quote:
<p>The Third Material Fallacy is that of the IiTelevant Conclusion, technically called the Ignoratio Elenchi^ or literally Ignorance of the Refutation. It consists in arguing to the wrong point, or proving one thing in such a manner that it is supposed to be something else that is proved. Here again it would be difficult to adduce concise examples, because the fallacy usually occurs in the course of long harangues, where the multitude of words and figures leaves room for confusion of thought and forgetfulness. This fallacy is in fact the great resource of those who have to support a v;-eak case. It is not unknown in the legal profession, and an attorney for the defendant in a lawsuit is said to have handed to the barrister his brief marked, "No case; abuse the plaintiff^s attorney." Whoever thus uses what is known as argumentuni ad homine^n^ that is an argument which rests, not upon the merit of the case, but the character or position of those engaged in it, commits this fallacy. If a man is accused of a crime it is no answer to say that</p>

<div class="newpage" id="page-179"></div>

<p>the prosecutor is as bad. If a great change in the law is proposed in Parhament, it is an Irrelevant Conclusion to argue that the proposer is not the right man to bring it forward. Everyone who gives advice lays himself open to the retort that he who preaches ought to practise, or that those who live in glass houses ought not to throw stones. Nevertheless there is no necessary connection between the character of the person giving advice and the goodness of the advice.</p>

<p>The argumentum ad popuhirn is another form of Irrelevant Conclusion, and consists in addressing arguments to a body of people calculated to excite their feelings and prevent them from forming a dispassionate judgment upon the matter in hand. It is the great weapon of rhetoricians and demagogues.</p>
As you can see, the text is "ok" (the formatting is nonexistent, and it will be riddled with page numbers/headers/symbols/junk), and I definitely wouldn't want to read an entire book full of those OCR errors!
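
Some of the errors in the excerpt above follow patterns that a script can flag for a human to check. The Python sketch below is a heuristic illustration only: `suspicious_tokens` and its two rules (stray symbols inside a word, a capital letter in the middle of a word) are assumptions made up for this example, not a feature of Finereader or any other tool. It deliberately does not catch misreadings that still look like words, such as "argumentuni" or "Parhament"; those need a spellchecker or a human.

```python
import re

# Punctuation that can legitimately surround a word in running text.
EDGE_PUNCT = ".,;:!?\"'()[]"


def suspicious_tokens(text: str) -> list[str]:
    """Return tokens that look like OCR damage (hypothetical heuristic)."""
    flagged = []
    for raw in text.split():
        token = raw.strip(EDGE_PUNCT)
        if not token:
            continue
        # Rule 1: a character other than letters, digits, apostrophe or hyphen,
        # e.g. "v;-eak" or "Elenchi^".
        has_odd_char = re.search(r"[^A-Za-z0-9'\-]", token) is not None
        # Rule 2: a capital letter directly after a lowercase one, e.g. "IiTelevant".
        has_mid_capital = re.search(r"[a-z][A-Z]", token) is not None
        if has_odd_char or has_mid_capital:
            flagged.append(token)
    return flagged


# Fragments copied from the Archive.org OCR excerpt above.
sample = (
    "the IiTelevant Conclusion, technically called the Ignoratio Elenchi^ "
    "or literally Ignorance of the Refutation. "
    "a v;-eak case. the plaintiff^s attorney. "
    "argumentuni ad homine^n^ that is"
)

for token in suspicious_tokens(sample):
    print(token)
```

A list like this only narrows down where to look; every flagged word still has to be compared against the scan.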

Here is the same section in my EPUB after manual cleaning (EPUB is attached at the bottom of this post):

Quote:
<p>The Third Material Fallacy is that of the <span class="bold">Irrelevant Conclusion</span>, technically called the <span class="italics">Ignoratio Elenchi</span>, or literally Ignorance of the Refutation. It consists in arguing to the wrong point, or proving one thing in such a manner that it is supposed to be something else that is proved. Here again it would be difficult to adduce concise examples, because the fallacy usually occurs in the course of long harangues, where the multitude of words and figures leaves room for confusion of thought and forgetfulness. This fallacy is in fact the great resource of those who have to support a weak case. It is not unknown in the legal profession, and an attorney for the defendant in a lawsuit is said to have handed to the barrister his brief marked, “No case; abuse the plaintiff’s attorney.” Whoever thus uses what is known as <span class="italics">argumentum ad hominem</span>, that is an argument which rests, not upon the merit of the case, but the character or position of those engaged in it, commits this fallacy. If a man is accused of a crime it is no answer to say that the prosecutor is as bad. If a great change in the law is proposed in Parliament, it is an Irrelevant Conclusion to argue that the proposer is not the right man to bring it forward. Everyone who gives advice lays himself open to the retort that he who preaches ought to practise, or that those who live in glass houses ought not to throw stones. Nevertheless there is no necessary connection between the character of the person giving advice and the goodness of the advice.</p>

<p>The <span class="italics">argumentum ad populum</span> is another form of Irrelevant Conclusion, and consists in addressing arguments to a body of people calculated to excite their feelings and prevent them from forming a dispassionate judgment upon the matter in hand. It is the great weapon of rhetoricians and demagogues.</p>
If you have Microsoft Office, you can export from Finereader -> DOC -> use Toxaris's tools, which will most likely speed up a lot of this manual cleaning step:

Word Macro: https://www.mobileread.com/forums/sho...d.php?t=142530
ebook Tools: https://www.mobileread.com/forums/sho...d.php?t=213372

I personally just use Finereader (A/B compare PDF/OCR) -> Export as EPUB -> Clean the code + add formatting.
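
Part of "cleaning the code" in the example above is rejoining the paragraph that the Archive.org EPUB splits at the page 178/179 boundary. As a hedged sketch of how that one step could be scripted in Python: the marker pattern is taken from the quoted excerpt (`<div class="newpage" id="page-179"></div>`) and is assumed to look the same throughout a file, `merge_split_paragraphs` is a made-up name, and the rule "no sentence-ending punctuation before the break means the paragraph continues" is an assumption that will misfire on some pages.

```python
import re

# A paragraph that ends without sentence punctuation, then a page marker,
# then a new paragraph: treat this as one paragraph split across two pages.
SPLIT_PARAGRAPH = re.compile(
    r'(?<![.!?"”])</p>\s*<div class="newpage" id="page-\d+"></div>\s*<p>'
)
# Any remaining page marker (the paragraph really did end on that page).
PAGE_MARKER = re.compile(r'<div class="newpage" id="page-\d+"></div>\s*')


def merge_split_paragraphs(html: str) -> str:
    """Join paragraphs split at page markers, then drop the markers (sketch)."""
    merged = SPLIT_PARAGRAPH.sub(" ", html)
    return PAGE_MARKER.sub("", merged)


# Shortened version of the page 178/179 boundary from the excerpt above.
ocr_html = (
    "<p>it is no answer to say that</p>\n\n"
    '<div class="newpage" id="page-179"></div>\n\n'
    "<p>the prosecutor is as bad.</p>"
)
print(merge_split_paragraphs(ocr_html))
```

Regular expressions are fragile on real-world markup, so this only suits files where the markers are as regular as in the excerpt. Words hyphenated across a page break would still need fixing by hand.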

Quote:
Originally Posted by lol.systema
-(most importantly) FULL, UNRESTRICTED, UNLIMITED ACCESS TO EVERYTHING IN THE LIBRARY AND EVERYTHING "LIBRARY" RELATED.. even the old text, rich in history, details and my god, the aroma of aged paper. SWEET BABY JESUS I CAN ALMOST SMELL IT RIGHT NOW

[...]

It's a pretty big deal for me as you can see
Any suggestions, ideas, tips, recommendations, corrections, superior knowledge of any sort are greatly appreciated. I gotta go hit the weights for a bit and keep working on ideas.
This is a FANTASTIC project. Just be wary of copyright, though I believe every single text should be digitized and up online.

Some of these tomes are just locked away in the library somewhere. Nobody ever goes there and reads them, they are not searchable, and the information/knowledge is just LOST. Getting them up and online will definitely get them more exposure than they currently get (nearly none). And instead of getting tossed in the trash, you are bringing life to them by digitizing... then the entire INTERNET can benefit from the knowledge inside.
Attached Files
File Type: epub Jevons,W.Stanley.-.Elementary.Lessons.In.Logic[v.4].epub (735.2 KB, 435 views)

Last edited by Tex2002ans; 02-18-2014 at 01:11 PM.