MobileRead Forums - View Single Post - Delicate text digitalizing + scanning issues

Tex2002ans · 02-19-2014, 09:59 PM

Quote:

Originally Posted by lol.systema

Yea. So I've seen. Epubs are a bit of a pain when it comes to situations like the one I'm gonna go through. You couldn't have said it better: work of the devil.

I do most of my work for a non-profit economics website, and we release a huge amount of academic material (non-fiction economics/history books mostly), and it works for us just releasing all of the scans instantly as PDFs, and then when EPUBs are converted, just releasing those side by side.

Everyone who reads the books can just do a search on the site and go reference the PDFs that are right next door. Seems to work out well for the hundreds/thousands of academics (and non-academics) who use our resources.

I don't see any reason why it has to be any different in a formal academic setting.

Quote:

Originally Posted by lol.systema

I gave a quick glance at the text I'm working. It's heavy on footnotes. Maybe I can lure the crew into working in another way; possibly doc.... they prefer Epub style. They also don't want to pack themselves with too much work. Maybe I can get to them that way: Epub's more work than what they can chew on.

Don't want to "pack themselves with too much work"... I work at this stuff full-time... just finished converting my ~210th book. Working from images/PDFs is about as painful as it can get, and converting from non-fiction is even more painful.... The only thing that is probably worse is converting math.

After working at this stuff full-time for about a year and a half, it takes me ~8-15 hours on average to convert a scanned (non-fiction economics) book -> OCR -> completed EPUB. So that's a pace of around one book every one or two days.... (Of course, some only take a few hours, and some take MUCH longer (30+ hours)).

As I said, when you first start book conversion... it will be SLOOOOOOWWWWWW (took me a week or two, so I assume my pace used to be ~40-80 hours per book).

I assume some sort of distributed system would bring about even more overhead in actual manhours. And this is not taking into account the manpower it takes to initially get the books into images/PDFs.

Book Digitization = time consuming.
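To put the hour figures above into workdays, here is a back-of-the-envelope sketch in Python. The 8-hour workday is an assumption for illustration, not a number from my actual schedule, and `workdays` is just a made-up helper name.

```python
# Rough effort estimate. The 8-hour workday is an ASSUMPTION, not a measured figure.
HOURS_PER_WORKDAY = 8


def workdays(hours: float, hours_per_day: float = HOURS_PER_WORKDAY) -> float:
    """Convert hours of conversion work into workdays."""
    return hours / hours_per_day


# Experienced pace quoted above: ~8-15 hours per scanned book.
experienced = (workdays(8), workdays(15))

# Beginner pace quoted above: ~40-80 hours per book.
beginner = (workdays(40), workdays(80))

print(f"experienced: {experienced[0]:.1f}-{experienced[1]:.1f} workdays per book")
print(f"beginner:    {beginner[0]:.1f}-{beginner[1]:.1f} workdays per book")
```

Under that assumption, the ~40-80 hour beginner figure is the same thing as "a week or two" of full-time work, and the experienced figure comes out to roughly one to two workdays per book.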

Quote:

Originally Posted by lol.systema

German and French... OOH I thank my sweet baby Jesus that I ain't touching any of those. I asked if any text had any other languages. They said "just Spanish/English". I left quite relieved.

Don't believe one word they say!!! Sure, the books you just have are "just Spanish/English"... the books I work on are "just English"!!! But there are a lot of names/references that will have German/French accents, and a lot of quotations that may be in different languages, or single French/German/Spanish words that are in italics and accented.

Finereader Tip: Setting the "Language" up top activates the OCR to look for certain characters. This was something I learned after too many headaches (that stupid cedilla below the 'c' in "François" was what finally pushed me over the "English" Language edge). So now I just set Finereader to convert all books as "English; French; German".

Side Note: Selecting the "Language" in Finereader also activates dictionaries for those languages as well.... I found that when I activated "Spanish" as a language, sure, the OCR might catch a few tildes/accents the other languages would have missed, but then Finereader started producing WAY too many false positives (markings in the PDF were considered accents), AND, the Spanish dictionary started to interfere with the actual words (so it was telling me things were spelled wrong when they weren't). This might not affect you so much though if you were just doing your editing using an outside program (like Microsoft Word).
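If you want to check the "just Spanish/English" claim yourself before picking OCR languages, one cheap option is to run a rough first OCR pass, export the text, and count the accented letters that show up. The snippet below is only a sketch in Python and has nothing to do with Finereader itself; the function name `non_ascii_letters` and the sample sentence are made up for illustration.

```python
import unicodedata
from collections import Counter


def non_ascii_letters(text: str) -> Counter:
    """Count each non-ASCII letter in the text (e.g. 'ç', 'ö', 'ñ')."""
    # NFC merges "c" + combining cedilla into the single character "ç",
    # so both encodings of the same letter are counted together.
    normalized = unicodedata.normalize("NFC", text)
    return Counter(
        ch
        for ch in normalized
        if ord(ch) > 127 and unicodedata.category(ch).startswith("L")
    )


# Illustrative sample only; in practice, read the exported OCR text from a file.
sample = "François Quesnay, Eugen von Böhm-Bawerk, and a stray señor in italics."

for ch, count in sorted(non_ascii_letters(sample).items()):
    print(f"{ch} U+{ord(ch):04X} {unicodedata.name(ch)}: {count}")
```

A handful of hits for cedillas or umlauts in an "English" book is a decent hint that French/German should be switched on for the real OCR run. It will not catch accents that the first pass dropped completely, so treat an empty result with suspicion.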

Quote:

Originally Posted by lol.systema

Ah! El Imperio Socialista de los Incas de Louis Baudin.. REALLY interesting read. Left it halfway through since I had to read other stuff but I might try again and finish it sometime.

Well I can guarantee you that is the greatest EPUB that exists of the book! A faithful reproduction if I do say so myself.

Quote:

Originally Posted by lol.systema

ACTUALLY, now that you mention it, an editable copy is the best one. It keeps some headers/footers; if the font in word is the same font as the book, then the page formatting leaves each page 99% identical with the original book page. Titles and headings are also identified (although badly and sometimes it doesn't but that can be worked with easily).

The thing that is horrible though is that you cannot rely on Finereader marking things properly (headers/pagenumbers, footers/footnotes).

The actual CODE in the backend making the DOC look close to the actual page is ABSOLUTELY DREADFUL.

You may potentially dig yourself into a hole where you will have to waste lots more future manpower converting from a HORRIBLY designed DOC -> HTML (or whatever other format you want).

Which is why I personally just jump from OCR -> EPUB (barebones HTML), and do my fixing directly. HTML + CSS is not going anywhere... and I keep the code extremely minimal/consistent throughout all my books, which makes it easy as pie to just copy/paste to sites/anywhere.

Although again, Toxaris's tools... huge time saver if you use Microsoft Office.
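To give an idea of what "barebones HTML" can mean in the OCR -> HTML route mentioned above, here is a minimal sketch in Python that wraps plain-text paragraphs in nothing but `<p>` tags. It is not my actual workflow or any tool's output; the function name `paragraphs_to_html` is hypothetical, and a real EPUB needs proper XHTML files plus packaging on top of this.

```python
from html import escape


def paragraphs_to_html(text: str, title: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> tags inside a bare HTML page."""
    # Collapse line breaks and runs of spaces left over from the OCR export.
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]
    # Escape &, <, > so stray characters in the text cannot break the markup.
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>\n"
    )


ocr_text = "First paragraph, broken\nacross two lines.\n\nSupply & demand."
print(paragraphs_to_html(ocr_text, "Sample chapter"))
```

The point is only that the markup stays tiny and identical from book to book, with all the looks pushed into CSS, so the same files can be pasted into a site or an EPUB without cleanup.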

Quote:

Originally Posted by lol.systema

I'll also keep record of digitalization, do some experimenting and see what goes and comes around each playthrough. Maybe I can work on providing better insight on digitalization and share it in here.

Can't wait to hear more info from you... the hardware side of digitizing books is interesting.

Quote:

Originally Posted by lol.systema

In here it's more of "think of all the universities that don't give a ship about digitalization. You might have it good out there in your country but in here, getting physical books is a pain. Don't even get me started on digitals, ebooks, pdf... ANYTHING because it's mostly non-existent.

There SHOULD be communication between universities and institutions, but thing is there's not even the slightest shadow of interest... only this uni I'm studying in shows any, and it has just a barebones idea of how to get things done.

Indeed indeed... academe is always living in the stone ages and moves glacially slow.

I jumped ship from physical books once I stumbled upon the treasure trove of all PDFs/EPUBs for free. Now I will NEVER touch a physical book again (unless I have to digitize it).

I dedicate all my time now towards getting books into EPUB (VASTLY SUPERIOR to reading some crappy pictures/scanned PDF).

Most of the books that we work on went out of print, got lost in time, etc. etc. Now, ANYONE around the world can have access to them within a minute of searching/downloading.

Having them up in digital form is ALSO fantastic when you yourself are needing to use them for reference. You can quickly look up the PDF version, pull out what you need, and move on with typing your paper.

Stone Ages:
- Go to the library, they don't have it.
- They search around... only one library across the country has it.
- Weeks later, they get some dusty tome shipped to them.
  - Or better yet, it is locked up, and you have to spend a whole day traveling to get it.
- Only one person can use the book at a time.

Now:
- Search in your browser
- Download PDF/EPUB/XYZ format
- Copy/Paste into your paper
- Move on without ever having to leave your desk.
- Everyone can use the book at the same time.

Quote:

Originally Posted by lol.systema

After checking on some text, most of the things I'll be working with are footnotes, tons of footnotes; few tables, few images, few figures, tons of formulas. I was informed that someone in the architecture department can work on setting up the formulas as long as we send the images attached on each request. I'll go check out the PNG/Formulas thread. That'll serve as good reference.

The cheapest way is to just leave the original formulas as snapshots right out of the PDF.

I would not recommend fully digitizing the formulas if you are doing archival. It is not worth the amount of time/money AT ALL.

I personally do it because I want the highest quality in my EPUBs, and if we ever DO reprint one of these older books with a new edition, a horrible scanned formula would look QUITE out of place. So you want the stuff in some sort of vector form that can easily be scaled.

But since you are not in the business of publishing.... I wouldn't recommend it.

Quote:

Originally Posted by lol.systema

Thanks for the TEX amount of detail (wink wink see what I did there? lol)

Side tip that seems obvious: Start off with the EASY stuff. Work on small material first. Articles (maybe up to 30 pages), small journals. Then tackle much harder works later. You feel like you are making much more progress when you fully digitize 30 articles instead of ONE 600 page book with millions of footnotes/tables/diagrams.

Quote:

Originally Posted by PeterT

You might like to check out the Distributed Proofreading project http://www.pgdp.net/c/ . I seem to recall that there was a way of installing a copy of this on your own server, which would allow you to get multiple bodies involved in the validation part of the OCR work.

http://www.pgdp.net/phpBB2/viewtopic.php?t=21864 seems to cover installation

Definitely read a lot of the other material in their forums too, there is lots of good stuff.