Quote:
Originally Posted by lol.systema
Gotcha. I'm currently gonna go test lighting and start toying around with my SLR and find out on standard and high quality pics. I was told size of each pic doesn't matter, as long as the OCR and quality go smoothly, it's alright. I'll keep it simple though, don't want 100MB+ pics
|
I wish I had more info on the hardware/scanning side... the only thing I have dabbled with is a destructive method: cutting the binding off and feeding the loose pages through a sheet-fed scanner. It went quite fast, but the disadvantage is that you have to destroy the book... and I do not know how well very old paper would handle that method (probably not well at all).
Quote:
Originally Posted by lol.systema
well, I actually missed adding that part... I did mention what OCR does, but I also mentioned that ABBYY FineReader leaves a lot of mistakes (especially O turned into 0, m turned into rn, and missing tildes)... since all university-related text is in Spanish, tildes are in every text, every paragraph, almost every line. I did mention the need for manual correction, just not manual transcription from a tabula rasa.
|
Typically with OCR, the further you move away from English, the worse the OCR accuracy becomes. I don't have too much experience with Spanish (typically the books that I convert have lots of French/German names/references).
The only book that I recall working on that had a massive amount of Spanish was, "The Socialist Empire: The Incas of Peru" by Louis Baudin:
Original PDF:
https://mises.org/document/4336/A-So...-Incas-of-Peru
EPUB version on my site:
http://misesbooks.blogspot.com/2012/...-by-louis.html
The OCR from FineReader turned out fine with Spanish, but if I recall, I still had to do a lot of manual checking. Accents and tildes are especially rough (it seems highly dependent on the font used in the original book as well... sometimes FineReader recognizes accents perfectly; other times, it misses even the simplest/clearest cases!), so you typically need even higher quality source material (compared to a purely English book).
I must admit though, I don't have too much experience on the Spanish side of things (maybe someone here knows a lot more about digitizing Spanish).
Just don't forget in FineReader, at the very top, to set the Language to "Spanish" (or maybe "English; Spanish")... I have run through too many books with the wrong language selected, and by the time I notice all the missing accents, it is too late.
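If the correction crew ends up fixing the same O/0 and rn/m confusions over and over, part of that work can be automated before the manual pass. Here is a minimal Python sketch of the idea; the substitution rules are my own illustrative guesses (not anything FineReader produces), and every change it makes still needs human review:

```python
import re

# Hypothetical post-OCR cleanup rules for the confusions mentioned above
# (O <-> 0, rn <-> m). Illustrative only: real rules must be tuned against
# your actual scans, and every change still needs a human eye.

def fix_common_ocr_errors(text: str) -> str:
    # A digit 0 sandwiched between letters is almost always a capital O.
    text = re.sub(r"(?<=[A-Za-z])0(?=[A-Za-z])", "O", text)
    # A capital O sandwiched between digits is almost always a zero.
    text = re.sub(r"(?<=\d)O(?=\d)", "0", text)
    return text

def flag_suspect_words(text: str, wordlist: set) -> list:
    # "rn" vs "m" cannot be fixed blindly; instead, flag any word containing
    # "rn" whose "m" variant exists in a known word list, for manual review.
    suspects = []
    for word in re.findall(r"[a-záéíóúüñ]+", text.lower()):
        if "rn" in word and word.replace("rn", "m") in wordlist:
            suspects.append(word)
    return suspects
```

For example, `fix_common_ocr_errors("C0RD0BA, 19O5")` repairs both directions at once, and `flag_suspect_words` run against a Spanish word list would flag "corno" as a possible "como".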
Quote:
Originally Posted by lol.systema
They agreed. So far they'll let me get the text, OCR it in my comp (they're not gonna spend 160$ on good ol'e Abby) and send the OCR'd text so they can work on it. The staff assigned to do the digitalization will be reassigned to do the correction.. if the project is approved.
|
Toxaris recommends exporting from FineReader as an "Editable Copy" DOC (which is what I assume his tool works best with).
If a third party is going to manually go over the OCR, that might make it easiest to A/B compare against the PDF. I believe that method still keeps all the pages: it throws out a good amount of formatting overhead and doesn't try to place text EXACTLY where it appeared, but it still inserts page breaks, so each "page" in the DOC matches its page in the PDF.
Again, sorry I don't have more information, I don't export to DOC, or use Toxaris's macro (since I don't use Microsoft Word).
Quote:
Originally Posted by lol.systema
You normally work with EPUB. What would you recommend for basic image-to-text, no specific format required? Only HTML?
To be honest I never worked with HTML. All the old books I scanned were passed only to DOC and worked from there. It seems to work just fine. However, that's just me being a total amateur.
I know they have a crew that will work on OCR'd text, however they did not mention HTML nor the intentions of touching HTML. Do you know of a tutorial on HTML for text processing? I'm pretty clueless on that /:
|
Heh, not one that is to my standards. (You can see the outline I have written for PDF -> EPUB method in that previous topic I linked to... I have yet to flesh it out/expand on it). Maybe someone else can point out some tutorials (with a focus on digitizing text).
I find the HTML output from FineReader to be quite dreadful, but the EPUB output (added in FineReader 11) is some pretty minimalist/clean HTML (it only leaves in the basics: italic, bold, underline, sub/superscript, headings, ...). All the other font/layout junk code is completely nonexistent.
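If you do end up stuck with that dreadful exported HTML, a crude cleanup pass toward the same "basics only" output might look like the sketch below. The tag whitelist is my own guess at "the basics", and regex is not a real HTML parser; this is only to show the idea:

```python
import re

# Crude sketch: strip layout junk from exported HTML, keeping only basic
# tags (italic, bold, underline, sub/superscript, headings, paragraphs).
# Real-world HTML wants a proper parser; this only illustrates the idea.

KEEP = {"i", "em", "b", "strong", "u", "sub", "sup", "p",
        "h1", "h2", "h3", "h4", "h5", "h6", "br"}

def strip_junk(html: str) -> str:
    def repl(match):
        closing, name = match.group(1), match.group(2).lower()
        if name in KEEP:
            # Keep the tag but drop all attributes (style=, class=, fonts...).
            return f"<{closing}{name}>"
        return ""  # Drop the tag itself (<font>, <span>, ...), keep its text.
    return re.sub(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>", repl, html)
```

Running it over something like `<p class="c1"><font size="2">Hola <i style="x">mundo</i></font></p>` leaves just `<p>Hola <i>mundo</i></p>`.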
For someone who doesn't know their way around HTML... DOC output MIGHT be your best bet. (Especially if this is going to be handed off to others to check/clean... I doubt they will know much HTML either).
Explore Toxaris's stuff... From what I gather, his tools really can help clean up any DOCs (and DOCs exported from Finereader), and his tools can be used to do quite a good job at exporting a very clean EPUB.
If you are using Libre or Open Office, you can use Writer2EPUB:
https://www.mobileread.com/forums/forumdisplay.php?f=230
and/or his other tool, PerfectEPUB:
http://lukesblog.it/ebooks/ebook-tools/perfectepub/
Quote:
Originally Posted by lol.systema
Can't say I'm doing so; this still needs full approval. If it goes through, I can immerse myself into university-based-texts. Even so, I would not be authorized to release them outside of the uni.
Furthermore, the restricted access section (which has the locked books) is not taken into consideration when it comes to the en masse digitalization. Therefore I cannot scan them books nor bring them home (since they're totally locked). If this works, though, I can definitely present a proposal to digitalize the old books that are locked from the public.. Once their priority on university-stuff is OCR'd and fully digitalized, I'm sure they'll have space for a fully student-handled project; and my hands will be all over the library by then so they'll know how much of an efficient tool in the shed I am.
|
Bah, some digitization (for sharing within the university) is better than nothing... but just think of all the duplicated manpower across different universities! (Each school wasting time imaging the same exact books + manually converting/checking the OCR.)
There might be some sort of system in place to share digital texts between universities, but I have no clue (I am not in academe). Typically getting access to those things is insanely expensive (just like many of these academic journals... don't get me started on that racket).
And another reminder, since a lot of this might be older theses: in my experience, FineReader does a HORRIBLE job on typewriter text (maybe there is a setting I have missed somewhere). The few books I converted that were typed on typewriters came out HORRIBLY inaccurate (and conversion was on the very slow end).
Quote:
Originally Posted by lol.systema
Thank you so much, Tex. I seriously appreciate the intense amount of detail that you've added on your post. Thanks a bunch
|
May this information help you, and all other future digitizers!!!
And I think we should change "intense amount of detail" into "a TEX amount of detail".
Quote:
Originally Posted by mrmikel
Since you are in an academic environment, you are going to find some problems if you try to venture outside of PDF for heavily formatted books. Epubs are based on reflowing text, which means NO fixed page numbers, unless you leave visible page numbers in the text.
|
I don't see TOO much of a problem if the original PDF is released right alongside. IF someone absolutely must reference something formally using AMA/APA/MLA/[ZZZ is what that makes me want to do] based on the page numbers, they can always look back at the PDF. If they want to read it for the knowledge, they can choose their preferred format.
You CAN spend your time and create a page-map (specifically for EPUB)... but tools to create a page-map automatically are, to my knowledge, non-existent (it is a giant pain in the butt). And the number of readers who actually know how to look through the code, figure out the page-map, or even realize that a specific book uses one (instead of the typical ADE/Calibre/whatever numbering schemes) is, I would guess, abysmally small. Plus, who knows whether whatever format comes after EPUB, or any of these conversion programs, will carry a page-map over properly.
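For what it's worth, EPUB 3 does define page-list navigation built on pagebreak anchors, so if your OCR export marks page boundaries with some token, generating the anchors can at least be scripted. A sketch (the `[[PAGE n]]` marker here is a made-up convention for this example, not anything FineReader emits):

```python
import re

def insert_pagebreak_anchors(text: str) -> str:
    # EPUB 3 page-list navigation points at anchors like the span below.
    # We assume page boundaries were marked with a token like "[[PAGE 72]]".
    return re.sub(
        r"\[\[PAGE (\d+)\]\]",
        r'<span epub:type="pagebreak" id="page\1" title="\1"></span>',
        text,
    )
```

You would still have to hand-build the page-list nav that points at these anchors, which is exactly the "giant pain in the butt" part.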
The same problems occur when placing text in HTML form on a website... there is no such thing as "pages" on a site. You can split the text at logical locations based on chapters, and SOME sites have a paragraph-numbering system in place... but these digitization methods abandon the entire "page" system (which makes ZERO sense in the digital realm).
I believe as long as the original scan/images/PDF is accessible alongside the HTML/EPUB version... that should be good enough.
Quote:
Originally Posted by mrmikel
It also means that text can be hard to pin down along side images and that tables are the work of the devil. This variable page size also means that footnotes can end up some distance from the original text on the page. Many go to notes at the end of a chapter to solve this. Small text like footnote citations can be hard to see on a small device.
|
Indeed... Non-fiction works with Tables/Figures/Formulas/Footnotes/Images... these are a HUGE slowdown in the digitization process.
Tables: Some people/companies take a "snapshot" of the table and include it as an image. I digitize them completely (I believe it is much better for the long-run of the book, and it allows it to be copy/pastable/scalable/readable by the blind). I explained some of my table ideas in this topic:
https://www.mobileread.com/forums/sho...d.php?t=223062
Warning With Images of Tables: If you insist on taking a dreaded snapshot of a table, USE PNG. AVOID JPG LIKE THE PLAGUE.
Footnotes: I explained my Footnote ideas in this topic (the real fun begins around post #16 hahaha):
https://www.mobileread.com/forums/sho...d.php?t=225045
Formulas: There is no good way to do this in EPUB/MOBI at the moment... perhaps this will be better in future formats (although it will still require a MASSIVE amount of manpower). I explained a lot of the ideas in this topic:
https://www.mobileread.com/forums/sho...d.php?t=228413
I also explained how I handle generating higher resolution PNGs of formulas (and having the formulas saved in a more easily convertible form) in my "Formulas to PNG Tutorial":
https://www.mobileread.com/forums/sho...d.php?t=223254
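That tutorial covers my actual workflow; for anyone who just wants a quick-and-dirty route, here is one possible way (NOT the tutorial's method, just my sketch) to render a formula to a high-resolution PNG using matplotlib's built-in mathtext:

```python
# One possible way (not necessarily the tutorial's method) to render a
# formula to a high-resolution PNG, using matplotlib's built-in mathtext.
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt

def formula_to_png(tex: str, path: str, dpi: int = 300) -> None:
    fig = plt.figure()
    fig.text(0.5, 0.5, f"${tex}$", ha="center", va="center", fontsize=20)
    # bbox_inches="tight" crops the output down to just the formula.
    fig.savefig(path, dpi=dpi, bbox_inches="tight", transparent=True)
    plt.close(fig)

formula_to_png(r"E = mc^2", "einstein.png")
```

The nice side effect is that the formula survives in source form (the TeX string), so it stays convertible if some future format ever supports real math.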
Figures: Many non-fiction books tend to have figures that "flow" around the text. My method is almost always to push the figure "down" to the end of the split paragraph (for example, "Fig. 1" on Page 72 of that Jevons PDF I linked earlier).
Images: If the images are "artificial" (charts, graphs, text), go PNG! If they are "natural" (photographs), an argument can be made for JPG (and if it is a grayscale image, please save it as a grayscale JPG).
I explained why JPG = junk for artificial images up in the Tables topic I linked above, Post #8:
https://www.mobileread.com/forums/sho...54&postcount=8
I explained some of my PNG compression methods here (and reasoning to go PNG over JPG in the case of "artificial"/"few color" images):
https://www.mobileread.com/forums/sho...5&postcount=26
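If you want to sanity-check the PNG-vs-JPG advice on your own scans, a quick comparison is easy to script. This sketch uses Pillow (my library choice for the example, not a requirement) and a fake "artificial" page of ruled lines:

```python
# Quick sanity check of the PNG-vs-JPG advice, using Pillow.
# Artificial images (flat colors, text, table rules) compress smaller -- and
# losslessly -- as PNG; JPG adds ringing artifacts around their sharp edges.
import io
from PIL import Image, ImageDraw

def compare_formats(img: Image.Image) -> dict:
    """Return the encoded size in bytes of the image as PNG and as JPEG."""
    sizes = {}
    for fmt in ("PNG", "JPEG"):
        buf = io.BytesIO()
        img.save(buf, format=fmt)
        sizes[fmt] = buf.tell()
    return sizes

if __name__ == "__main__":
    # A fake "artificial" image: white grayscale page with black rules,
    # roughly like a snapshot of a table.
    page = Image.new("L", (600, 800), 255)   # "L" = 8-bit grayscale
    draw = ImageDraw.Draw(page)
    for y in range(0, 800, 40):
        draw.line([(0, y), (600, y)], fill=0, width=2)
    print(compare_formats(page))
```

On content like this, the PNG comes out far smaller than the JPEG, on top of being pixel-perfect; run it on one of your own table snapshots to see where the crossover is.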
Quote:
Originally Posted by mrmikel
You will need to check the death dates of any author you propose to digitize, 70 years ago being a good average, but check the laws of your country. Copyright can be a gigantic headache.
|
Indeed... Gigantic is not the correct word for this... there must be a larger word.