Quote:
Originally Posted by Tex2002ans
heavenly snip
|
Dear God that's beautiful.
I seriously, from the seriousness of my seriously sirius heart, thank you for your detailed post. This is by far the best crash-course on anything I've ever seen in a forum.
Quote:
be wary... if the lighting is not good enough, and/or the DPI is not high enough, the image could LOOK fine for a human to read, but if you try to digitize the text by running it through OCR, it might be highly inaccurate.
|
Gotcha. I'm going to test the lighting and play around with my SLR to compare standard and high-quality pics. I was told the size of each pic doesn't matter as long as the quality is good enough for the OCR to go smoothly. I'll keep it simple though; I don't want 100MB+ pics.
Checking it out. Thank you very much
Quote:
If you just want to have a (crappy) OCR text backend on PDFs that you release, then running it through Finereader should be ok
|
Well, I actually missed adding that part... I did explain what OCR does, and I also mentioned that ABBYY FineReader leaves a lot of mistakes (especially O turned into 0, m turned into rn, and missing tildes). Since all the university-related text is in Spanish, tildes (accent marks) are in every text, every paragraph, almost every line. I did mention the need for manual correction, just not manual translation as a tabula rasa.
They agreed. So far they'll let me get the text, OCR it on my computer (they're not gonna spend $160 on good ol' Abby) and send them the OCR'd text so they can work on it. The staff assigned to do the digitization will be reassigned to do the correction... if the project is approved.
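For the correction crew, here is a rough sketch of how those typical slips (0 for O, rn for m, missing accent marks) could be pre-flagged before anyone reads the text by hand. This is only my assumption of how it might be done, not something FineReader offers: the function name `flag_suspects` and the `known_words` set are made up for illustration, and a real run would need a proper lowercase Spanish word list.

```python
import re
import unicodedata


def strip_accents(word):
    """Remove accent marks, e.g. 'digitalización' -> 'digitalizacion'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def flag_suspects(text, known_words):
    """Return (token, reason) pairs for tokens that look like OCR slips.

    known_words is a hypothetical set of correctly spelled lowercase words.
    """
    # Map each accent-stripped spelling back to its accented dictionary form.
    unaccented = {strip_accents(w): w for w in known_words if strip_accents(w) != w}
    suspects = []
    for token in re.findall(r"\w+", text):
        lower = token.lower()
        has_digit = any(c.isdigit() for c in token)
        has_letter = any(c.isalpha() for c in token)
        if has_digit and has_letter:
            # A digit inside a word, e.g. 'l0s' for 'los'.
            suspects.append((token, "digit inside a word"))
        elif lower in known_words:
            continue
        elif "rn" in lower and lower.replace("rn", "m") in known_words:
            # 'corno' is probably 'como' with the m read as rn.
            suspects.append((token, "rn may be m"))
        elif lower in unaccented:
            suspects.append((token, "missing accent, maybe " + unaccented[lower]))
    return suspects
```

Something like `flag_suspects(page_text, spanish_words)` would then give the crew a short list of words to check first, instead of hunting for them by eye. It only flags; the actual fix still has to be a human decision.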
More detailed info. Wow, man. Thanks A LOT.
I was going to give them a presentation on OCR and how it works (since one or two of them were still clueless about it). I'm going to delay it until tomorrow. Your information is definitely worth adding; it'll give the presentation a boost.
Quote:
If you have Microsoft Office, you can export from Finereader -> DOC -> use Toxaris's tools, which will most likely speed up a lot of this manual cleaning step:
|
You normally work with EPUB. What would you recommend for basic image-to-text, no specific format required? Only HTML?
To be honest, I've never worked with HTML. All the old books I scanned were exported only to DOC and I worked from there. It seems to work just fine. However, that's just me being a total amateur.
I know they have a crew that will work on the OCR'd text, but they didn't mention HTML or any intention of touching it. Do you know of a tutorial on HTML for text processing? I'm pretty clueless on that /:
Checking. Thank you very much for the links.
Quote:
you are bringing life to them by digitizing
|
Can't say I'm doing so yet; this still needs full approval. If it goes through, I can immerse myself in university texts. Even so, I would not be authorized to release them outside of the uni.
Furthermore, the restricted-access section (which has the locked books) is not being considered for the en masse digitization, so I can't scan those books or bring them home (since they're totally locked). If this works, though, I can definitely present a proposal to digitize the old books that are locked away from the public. Once their priority university material is OCR'd and fully digitized, I'm sure they'll have room for a fully student-handled project; and my hands will be all over the library by then, so they'll know how efficient a tool in the shed I am.
All of the text is in Spanish (mostly unfortunate, since the internet and its userbase are largely English-speaking), but this will definitely boost Spanish books on the internet, which are quite an oasis in the desert if you ask me.
Most of the books in the restricted area are in the public domain (or so the librarian assigned to that section told me). I just came back from checking out the restricted section and yeah, some of the books I checked are as old as 90 years. I asked the librarian for the contact info of the copyright holders to see which books can be released online. She sent me an email with the contact details and also told me that she can do the talking herself, as long as I have authorization from the uni to scan the books.
That's something I'll be pushing for starting next year. This year is for getting the university texts done and the project approved.
Thank you so much, Tex. I seriously appreciate the intense amount of detail you put into your post. Thanks a bunch.
Cheers