Thread: OCR engine
Old 05-04-2014, 06:39 PM   #58
Tex2002ans
Quote:
Originally Posted by cadele
I am going to start to keep some stats - you have inspired me!
Glad to hear I have inspired someone else to start keeping stats. I love keeping stats on things that I do. You can get cool things like this:

I liberated 18,172,166 words from PDF -> EPUB since October 2012. (Although I haven't updated the stats in about a month).

And for EPUBs that I read for pleasure and cleaned as I went along: 2,870,128 words.
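
For anyone who wants to start collecting similar stats, here is a rough sketch of one way to count the words in an EPUB with only the Python standard library. This is not necessarily how the numbers above were produced. The function names (`count_words_in_html`, `count_words_in_epub`) and the definition of a "word" (a run of letters/digits, optionally joined by apostrophes or hyphens) are assumptions made for illustration.

```python
import re
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <head>, <script>, and <style> content."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words_in_html(markup: str) -> int:
    """Count words in the visible text of one HTML/XHTML document."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumption: "it's" and "well-known" each count as a single word.
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))


def count_words_in_epub(path) -> int:
    """Sum the word counts of every (X)HTML file inside an EPUB (a ZIP archive)."""
    total = 0
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                markup = book.read(name).decode("utf-8", errors="replace")
                total += count_words_in_html(markup)
    return total
```

Note that this counts every HTML file in the archive, so a table of contents or copyright page adds to the total. Different tools split words differently, so expect the numbers to differ a little from what other programs report.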

I will have to go through and add a Page Count to all of the books as well... that might also lead to some decent stats/graphs. (Although in my opinion, pages are a horrible way to measure. A page of non-fiction =/= a page of fiction =/= a page out of a journal/newspaper =/= a page in a different font/font-size/margins.) And how would you go about measuring "Pages" of text from an HTML source?
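
One possible workaround for sources with no real pages (such as HTML) is to report a nominal page count derived from the word count. The sketch below assumes a fixed words-per-page figure; the default of 250 is only an illustrative assumption, not a standard, and the function name `estimate_pages` is made up for this example.

```python
import math


def estimate_pages(word_count: int, words_per_page: int = 250) -> int:
    """Convert a word count into a nominal page count.

    words_per_page is an assumed constant; pick one value and use it for
    every book so the resulting numbers stay comparable with each other.
    """
    if word_count < 0 or words_per_page <= 0:
        raise ValueError("word_count must be >= 0 and words_per_page must be > 0")
    return math.ceil(word_count / words_per_page)


print(estimate_pages(2_870_128))  # nominal pages at the assumed 250 words/page
```

This sidesteps the font/margin problem, but it is really just the word count in different units, so it adds nothing that the word count does not already tell you.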

Quote:
Originally Posted by cadele
Then I open the file in Abbyy Finereader 12 and verify the text. This is slow but worth it. I then convert it to a Word document. Following that I set up my page size and layout. I usually try to match the book's general layout without being too OCD about it.
Sounds ok. I guess different workflows for different people.

I personally just do all the fixing in minimalist HTML (EPUB), and then I can go back to other formats if needed.

DOC is really a horrible/bloated "source" format. Too much cruft and too many inconsistencies get added because of the WYSIWYG editing.

And speaking of trying to "match page size/layout"... Here is a sample of some of my latest ventures into working backwards from EPUB -> LaTeX -> PDF:

[Attached images, in before/after pairs: pg022Before.png / pg022LaTeX.png, pg093Before.png / pg093LaTeX.png, pg119Before.png / pg119LaTeX.png, pg209Before.png / pg209LaTeX.png]

I still have to iron out a few kinks... but I have the basics of the workflow going... now I just have a lot more to learn/absorb/code.
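
To give a feel for what the HTML -> LaTeX step of such a workflow can involve, here is a minimal sketch in Python. It is not my actual code, just an illustration under stated assumptions: it only handles a handful of tags (`p`, `h1`, `h2`, `em`/`i`, `strong`/`b`), the mapping of `h1` to `\chapter` and `h2` to `\section` is an arbitrary choice, and the names `HtmlToLatex`, `html_to_latex`, and `escape_latex` are made up for this example.

```python
import re
from html.parser import HTMLParser

# Characters that have special meaning in LaTeX and must be escaped in body text.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}

# Assumed tag-to-command mapping; a real book would need many more cases.
_OPEN = {
    "em": r"\emph{",
    "i": r"\emph{",
    "strong": r"\textbf{",
    "b": r"\textbf{",
    "h1": r"\chapter{",
    "h2": r"\section{",
}
_BLOCK = {"p", "h1", "h2"}


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters one character at a time."""
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)


class HtmlToLatex(HTMLParser):
    """Turns minimalist HTML into LaTeX body text (no preamble)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in _OPEN:
            self.out.append(_OPEN[tag])

    def handle_endtag(self, tag):
        if tag in _OPEN:
            self.out.append("}")
        if tag in _BLOCK:
            self.out.append("\n\n")  # a blank line ends a paragraph in LaTeX

    def handle_data(self, data):
        # Collapse runs of whitespace but keep single spaces around inline tags.
        self.out.append(escape_latex(re.sub(r"\s+", " ", data)))


def html_to_latex(markup: str) -> str:
    parser = HtmlToLatex()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out).strip()


print(html_to_latex("<h1>Title</h1><p>50% of <em>all</em> books &amp; more</p>"))
```

The output is only the document body. It would still need to be wrapped in a preamble (a class that provides `\chapter`, such as `book`, plus the page size, fonts, and margins that do the real work of matching the original layout) before running it through a LaTeX engine to get the PDF.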

Quote:
Originally Posted by cadele
Finally I add the book to Calibre, download the metadata and add the cover, then convert it to EPub and Mobi (both types of Mobi).
Hmmm... so a DOC -> Calibre -> EPUB/MOBI conversion? Does that give you the cleanest output?

I probably sound like a broken record, but why not use Toxaris's Word Macro?

Last edited by Tex2002ans; 05-04-2014 at 06:46 PM.