Delicate text digitalizing + scanning issues

lol.systema · 02-17-2014, 09:20 PM

So last week I presented a project for en masse digitalization to my university, covering their old texts (theses, simple-format essays and college texts, both student- and teacher-made) for the digital library campaign that they've just recently started. They've already digitalized a hefty amount of text; however, I realized their process is a tad rudimentary and radical, and it has already permanently damaged some texts.. actually it sucks big time.. They've destroyed a lot of texts already; a sad scene to say the least.

I initially recommended adding OCR for cheaper reproduction of theses and depending less on rudimentary photocopying, to which the peeps at the university agreed (with a huge grin on their faces, since it seems that, as at any other university, money-saving and rubbing their elbows matters A LOT). Now I'm no engineer nor an erudite of any sort, but for some darn reason they seemed impressed by my OCR process. Some of the higher-ups thought that OCR was a self-invented bogus name just to make things look fancier. Them dudes didn't know about my pin-up sweetheart Abby, so they ended up liking the project even more. At the beginning they even thought I was talking about employing people to manually do the paper-to-text work processing.. cheesus parmesan rice are they old-fashioned..

Now, they want me to find out a way to maximize the speed of text-to-image processing as well as minimizing damages in texts.

To say the least, as much as they are worried a lot about the $$, these nuggets are the only university in my chihuahua-looking third world country that actually CARE about books... largely the reason I decided to spend my study funds on them... (aside from that).. Now my main issue is the OCR processing of old, rusty text and, most importantly, how to get images of texts bigger than a scanner. That's my main concern.

As I said, I'm no enlightened engineer of any sort so I worked with a few ideas already..

Taking into consideration the size of texts, I came up with the following:

Big texts are damaged because of two things: they're old as frozen hell and are being handled like a feisty toddler... also because they're being forced to fit on a small scanner. Having said this, they are turned upside down, opened up like the suggestive example given above; that's where the damage begins.
Based on the aforementioned, scanners are totally obsolete. Big texts need to be handled and moved as little as possible; keeping them as steady as possible is a priority. Keeping them facing upwards would be best, since we avoid the weight and the issues of adjusting the text to the scanner.
Scanners by themselves are slow and tedious. Minimizing the processing time of each page is required, without losing high quality.
SCANNERS ARE OBSOLETE. That's the main conclusion. They're slow, they damage the texts and are thus counterproductive.

The best option is a camera.. an SLR to be precise. Lighting is also required to replace the scanner's big shiny string of slowness. I've already measured the lighting on a scanner and compared it with some of the photo studio gear this university's Art Department has. Perfectly "doable".

I also thought of setting the tripod with the book directly below (90 degree angle), setting the pages at a certain angle (still experimenting and finding out the best camera/text distance vs. text angle)... If any of you are good at numbers and can help me out with getting a nice pic angle without distorting the text, I would greatly appreciate it.
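A rough way to put numbers on the angle question is sketched below. It assumes the camera looks straight down and is far enough from the page that perspective effects can be ignored (an orthographic approximation, which is an assumption and not something measured here). Under that assumption a page tilted away from the horizontal appears squashed by the cosine of the tilt angle. The function name `foreshortening` is just illustrative.

```python
import math

def foreshortening(tilt_deg: float) -> float:
    """Fraction of the true page height the camera sees when the page
    is tilted tilt_deg away from horizontal (orthographic approximation)."""
    return math.cos(math.radians(tilt_deg))

# Compare a few candidate page angles.
for tilt in (0, 10, 20, 30, 45):
    print(f"{tilt:>2} deg tilt -> text appears at {foreshortening(tilt):.1%} of true height")
```

Small tilts cost very little under this approximation, while large tilts squash the text noticeably, so a real setup would still need test shots (and probably software dewarping) to confirm.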

Also, do any of you have an idea on how to mount something like this? I can definitely go full Chuck Norris 4x4 Rambo Ranger Mambo Tango style on this and just set up a tripod, studio lighting, and the book on a simple mount and start taking HQ pictures... but would there be any way of setting up a mount that helps hold the camera and the text (and if possible, the lighting as well)? This is what I came up with... sorry for the crappy paintjob:

IF.... in any case this works in the slightest I will get the project approved and do one of 2:

1-keep on training this puppy and be part of the leads (since I started up with this whole zip zoopity bip bop) and ensure this project works out (even if no tips or suggestions are provided, I'm sure I can find something that'll make things work out)
2-provide a sound, well-founded project and hand it to ACTUAL engineers that can work on this matter (I'm studying Education Sciences, Elementary School Teacher... nothing to do over there..)

Either way if I do any of two I will have the following as the booty (since they are by no means going to pay me for this):
-a good reputation among the students and future alumni which will grant more access to any wicked projects I have in mind
-possible school funding and backup depending on how radical the project is
-(most importantly) FULL, UNRESTRICTED, UNLIMITED ACCESS TO EVERYTHING IN THE LIBRARY AND EVERYTHING "LIBRARY" RELATED.. even the old text, rich in history, details and my god, the aroma of aged paper. SWEET BABY JESUS I CAN ALMOST SMELL IT RIGHT NOW
-lesser restriction in terms of book lending; normally it's a 14 day timeframe, to me it could be times2.. also lesser restrictions on the amount of books.. as long as I return them without any scribblings, torn pages or damages of any sort... which I GLADLY oblige to and agree.
-be one of the few that are first in line for book dumping (every year or two the university dumps old, obsolete books).. I don't care if they're old, they'll be MINE.

hubba hubba

It's a pretty big deal for me as you can see
Any suggestions, ideas, tips, recommendations, corrections, superior knowledge of any sort are greatly appreciated. I gotta go hit the weights for a bit and keep working on ideas.

Enjoy your reading

Cheers,

PeterT · 02-17-2014, 10:24 PM

Have you looked at http://www.diybookscanner.org/ ?

doubleshuffle · 02-18-2014, 12:01 AM

Or this: http://www.instructables.com/id/Barg...Cardboard-Box/

Tex2002ans · 02-18-2014, 01:03 PM

Quote:

Originally Posted by lol.systema

Now, they want me to find out a way to maximize the speed of text-to-image processing as well as minimizing damages in texts.

On the hardware/getting the books into images, definitely follow PeterT's advice and read up at DIY Bookscanner. Lots of people have come up with crazy/amazing/genius contraptions to take pictures of books. Although be wary... if the lighting is not good enough, and/or the DPI is not high enough, the image could LOOK fine for a human to read, but if you try to digitize the text by running it through OCR, it might be highly inaccurate.

So perfect your method and run it through OCR to see if your setup works, before you start mass digitizing.
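To get a feel for the DPI side before shooting anything, here is a minimal sketch that estimates the effective DPI of a camera shot from the pixel count and the physical page size. The camera resolution, page size and `fill_fraction` below are hypothetical example values, and the 300 DPI threshold is only a commonly quoted rule of thumb for OCR, not a figure from this thread.

```python
def effective_dpi(pixels_long_side: int, page_long_side_cm: float,
                  fill_fraction: float = 0.9) -> float:
    """Approximate dots per inch on the page.

    pixels_long_side  -- pixels along the long side of the photo
    page_long_side_cm -- long side of the page in centimetres
    fill_fraction     -- share of the frame the page actually fills (assumed)
    """
    page_inches = page_long_side_cm / 2.54
    return pixels_long_side * fill_fraction / page_inches

# Hypothetical example: a 5184-pixel-wide shot of an A4 page (29.7 cm).
dpi = effective_dpi(5184, 29.7)
print(f"Effective resolution: about {dpi:.0f} DPI")
if dpi < 300:  # rule-of-thumb threshold, an assumption
    print("Probably too low for reliable OCR; move closer or use a bigger sensor.")
```

Oversized books lower the number quickly, since the same pixels get spread over more paper, so it is worth running this for the largest volume in the collection.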

Also, the people on DIY Book Scanner promote a program called ScanTailor, to help automate some of the cleaning up of scans/images:

http://www.diybookscanner.org/forum/...df3d08a4b4fe20

Quote:

Originally Posted by lol.systema

At the beginning they even thought I was talking about employing people to manually do the paper-to-text work processing.. cheesus parmesan rice are they old-fashioned..

Now, onto the actual image/scan -> OCR -> Digital Text. I do a large amount of (scanned) PDF -> EPUB conversion, and I explained a lot of my conversion method in this topic:

https://www.mobileread.com/forums/sho...d.php?t=223817

If you just want to have a (crappy) OCR text backend on PDFs that you release, then running it through Finereader should be ok (this would require minimal manpower). This can also give you an "ok" search option through the text of the book.

This is what archive.org does with all of their scans, but you can see the horrors if you ever downloaded one of their "EPUBs".

Manual Checking of OCR: Depending on the quality of the scan/image, you can probably go at a pace of ~15-200+ pages an hour.

Slowest: HORRIBLE quality/badly marked, very dense text, lots of footnotes/figures/formulas
Fastest: Purely clean, no markings, no writing inside, crisp text, simple (like a novel).

Although your pace might be EXTREMELY slow when you first start out (when I first started, it used to take me a week or two to fully digitize one book).
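The pace range above translates into very different manpower budgets. A minimal sketch of the arithmetic, using the 15 and 200 pages/hour extremes from this post and a hypothetical 300-page book:

```python
def proofreading_hours(total_pages: int, pages_per_hour: float) -> float:
    """Hours of manual OCR checking for a given page count and pace."""
    return total_pages / pages_per_hour

pages = 300  # hypothetical book length
for label, pace in (("slowest (dense, marked-up)", 15), ("fastest (clean novel)", 200)):
    hours = proofreading_hours(pages, pace)
    print(f"{label}: {hours:.1f} hours for {pages} pages")
```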

Example: One of the books that I worked on converting was "Elementary Lessons in Logic" by William Stanley Jevons (cleanest Archive.org PDF here):

https://archive.org/details/elementarylesson01jevo

If you download the Archive.org EPUB edition, you can see the OCRed text backend:

https://archive.org/download/element...son01jevo.epub

Take, for example, Page 178-179:

https://archive.org/stream/elementar...e/178/mode/2up

[Attached images: Jevonspg178.png, Jevonspg179.png, ArchiveEPUBExample.png, MyEPUBExample.png]

Here is the section in the Archive.org EPUB (this is the OCRed text backend):

Side Note: Keep in mind, this PDF is an example of an old book that is QUITE clean. There are HORRORS out there.

Quote:

The Third Material Fallacy is that of the IiTelevant Conclusion, technically called the Ignoratio Elenchi^ or literally Ignorance of the Refutation. It consists in arguing to the wrong point, or proving one thing in such a manner that it is supposed to be something else that is proved. Here again it would be difficult to adduce concise examples, because the fallacy usually occurs in the course of long harangues, where the multitude of words and figures leaves room for confusion of thought and forgetfulness. This fallacy is in fact the great resource of those who have to support a v;-eak case. It is not unknown in the legal profession, and an attorney for the defendant in a lawsuit is said to have handed to the barrister his brief marked, "No case; abuse the plaintiff^s attorney." Whoever thus uses what is known as argumentuni ad homine^n^ that is an argument which rests, not upon the merit of the case, but the character or position of those engaged in it, commits this fallacy. If a man is accused of a crime it is no answer to say that

<div class="newpage" id="page-179"></div>

the prosecutor is as bad. If a great change in the law is proposed in Parhament, it is an Irrelevant Conclusion to argue that the proposer is not the right man to bring it forward. Everyone who gives advice lays himself open to the retort that he who preaches ought to practise, or that those who live in glass houses ought not to throw stones. Nevertheless there is no necessary connection between the character of the person giving advice and the goodness of the advice.

The argumentum ad popuhirn is another form of Irrelevant Conclusion, and consists in addressing arguments to a body of people calculated to excite their feelings and prevent them from forming a dispassionate judgment upon the matter in hand. It is the great weapon of rhetoricians and demagogues.

As you can see, the text is "ok" (the formatting is nonexistent, it will be riddled with page numbers/headers/symbols/junk), and I definitely wouldn't want to read an entire book full of those OCR errors!

Here is the same section in my EPUB after manual cleaning (EPUB is attached at the bottom of this post):

Quote:

The Third Material Fallacy is that of the Irrelevant Conclusion, technically called the Ignoratio Elenchi, or literally Ignorance of the Refutation. It consists in arguing to the wrong point, or proving one thing in such a manner that it is supposed to be something else that is proved. Here again it would be difficult to adduce concise examples, because the fallacy usually occurs in the course of long harangues, where the multitude of words and figures leaves room for confusion of thought and forgetfulness. This fallacy is in fact the great resource of those who have to support a weak case. It is not unknown in the legal profession, and an attorney for the defendant in a lawsuit is said to have handed to the barrister his brief marked, “No case; abuse the plaintiff’s attorney.” Whoever thus uses what is known as argumentum ad hominem, that is an argument which rests, not upon the merit of the case, but the character or position of those engaged in it, commits this fallacy. If a man is accused of a crime it is no answer to say that the prosecutor is as bad. If a great change in the law is proposed in Parliament, it is an Irrelevant Conclusion to argue that the proposer is not the right man to bring it forward. Everyone who gives advice lays himself open to the retort that he who preaches ought to practise, or that those who live in glass houses ought not to throw stones. Nevertheless there is no necessary connection between the character of the person giving advice and the goodness of the advice.

The argumentum ad populum is another form of Irrelevant Conclusion, and consists in addressing arguments to a body of people calculated to excite their feelings and prevent them from forming a dispassionate judgment upon the matter in hand. It is the great weapon of rhetoricians and demagogues.
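One way to put a rough number on the difference between the two versions is to diff them. The sketch below uses Python's standard `difflib` on one sentence taken from the two quotes above; the helper names `similarity` and `differing_spans` are illustrative only, and a character-level ratio like this is just a crude proxy for OCR accuracy.

```python
import difflib

# One sentence from the raw OCR quote and the same sentence after cleaning.
ocr = "This fallacy is in fact the great resource of those who have to support a v;-eak case."
clean = "This fallacy is in fact the great resource of those who have to support a weak case."

def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

def differing_spans(a: str, b: str):
    """Pairs of (text in a, text in b) wherever the two strings disagree."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

print(f"similarity: {similarity(ocr, clean):.3f}")
for wrong, right in differing_spans(ocr, clean):
    print(f"OCR has {wrong!r}, cleaned text has {right!r}")
```

Running a test page from a new camera setup through OCR and comparing it against a hand-corrected copy this way gives a quick check on whether lighting or DPI changes actually helped.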

If you have Microsoft Office, you can export from Finereader -> DOC -> use Toxaris's tools, which will most likely speed up a lot of this manual cleaning step:

Word Macro: https://www.mobileread.com/forums/sho...d.php?t=142530
ebook Tools: https://www.mobileread.com/forums/sho...d.php?t=213372

I personally just use Finereader (A/B compare PDF/OCR) -> Export as EPUB -> Clean the code + add formatting.

Quote:

Originally Posted by lol.systema

-(most importantly) FULL, UNRESTRICTED, UNLIMITED ACCESS TO EVERYTHING IN THE LIBRARY AND EVERYTHING "LIBRARY" RELATED.. even the old text, rich in history, details and my god, the aroma of aged paper. SWEET BABY JESUS I CAN ALMOST SMELL IT RIGHT NOW

[...]

It's a pretty big deal for me as you can see
Any suggestions, ideas, tips, recommendations, corrections, superior knowledge of any sort are greatly appreciated. I gotta go hit the weights for a bit and keep working on ideas.

This is a FANTASTIC project. Just be wary of copyright, but I believe every single text should be up and digitized online.

Some of these tomes are just locked away in the library somewhere, nobody ever goes there and reads them, they are not searchable, the information/knowledge is just LOST. Getting them up and online will definitely get more exposure than they currently get (nearly none). And instead of getting tossed in the trash, you are bringing life to them by digitizing... then the entire INTERNET can benefit from the knowledge inside.

lol.systema · 02-18-2014, 03:57 PM

Quote:

Originally Posted by PeterT

Have you looked at http://www.diybookscanner.org/ ?

Just saw it. Totally amazed

Still though, I wouldn't be able to wait and purchase a DIY kit. These peeps at the uni actually want a made-by-me DIY setup, and they want the concept done now in terms of "TBA: NOW". Now this looks quite interesting. The concept is small, almost compact-like; if done correctly I can set up 2 or 3 of those in one room and quicken the processing of stuff. Hell.. I was calculating 2 months of processing text into image; with something as little as this multiplied by 3 the whole text-to-image would take weeks.

THANK YOU SO MUCH, MAN (: much appreciated

Quote:

Originally Posted by doubleshuffle

Or this: http://www.instructables.com/id/Barg...Cardboard-Box/

NOW THIS IS MORE LIKE IT
hubba hubba

A cheap design that can be done even cheaper.

Now that you mention it, I googled "DIY Bookscanner" and found some pretty neat, easy designs. In some threads they reach the same conclusion: flatbed scanners suck.

AWESOME, it seems this concept was already exploited. I don't need to start from scratch. Thanks, people.

Cheers,

lol.systema · 02-18-2014, 05:25 PM

Quote:

Originally Posted by Tex2002ans

heavenly snip

Dear God that's beautiful.

I seriously, from the seriousness of my seriously sirius heart, thank you for your detailed post. This is by far the best crash-course on anything I've ever seen in a forum.

Quote:

be wary... if the lighting is not good enough, and/or the DPI is not high enough, the image could LOOK fine for a human to read, but if you try to digitize the text by running it through OCR, it might be highly inaccurate.

Gotcha. I'm currently gonna go test lighting and start toying around with my SLR and find out on standard and high quality pics. I was told size of each pic doesn't matter, as long as the OCR and quality go smoothly, it's alright. I'll keep it simple though, don't want 100MB+ pics

Quote:

ScanTailor

Checking it out. Thank you very much

Quote:

If you just want to have a (crappy) OCR text backend on PDFs that you release, then running it through Finereader should be ok

well, I actually missed adding that part... I did mention what OCR does, but I did also mention that ABBYY FineReader does leave a lot of mistakes (especially with O turned into 0, m turned into rn, and no tildes... since all university-related text is in Spanish, tildes are in every text, every paragraph, almost every line). I did mention the need for manual correction, just not manual transcription as a tabula rasa.

They agreed. So far they'll let me get the text, OCR it on my comp (they're not gonna spend $160 on good ol' Abby) and send the OCR'd text so they can work on it. The staff assigned to do the digitalization will be reassigned to do the correction.. if the project is approved.

Quote:

https://www.mobileread.com/forums/sho...d.php?t=223817

More detailed info. Wow, man. Thanks A LOT

Quote:

snip on OCR

I was going to give them a presentation on OCR and how it worked (since one or two were still clueless on how things worked). I'm going to delay it and give it tomorrow. Your information is definitely worth adding. It'll add a boost to the pressi.

Quote:

If you have Microsoft Office, you can export from Finereader -> DOC -> use Toxaris's tools, which will most likely speed up a lot of this manual cleaning step:

You normally work with EPUB. What would you recommend for basic image-to-text, no specific format required? Only HTML?

To be honest I never worked with HTML. All the old books I scanned were passed only to DOC and worked from there. It seems to work just fine. However, that's just me being a total amateur.

I know they have a crew that will work on OCR'd text, however they did not mention HTML nor the intentions of touching HTML. Do you know of a tutorial on HTML for text processing? I'm pretty clueless on that /:

Quote:

Word Macro
ebook Tools

Checking. Thank you very much for the links.

Quote:

you are bringing life to them by digitizing

Can't say I'm doing so; this still needs full approval. If it goes through, I can immerse myself into university-based-texts. Even so, I would not be authorized to release them outside of the uni.

Furthermore, the restricted access section (which has the locked books) is not taken into consideration when it comes to the en masse digitalization. Therefore I cannot scan them books nor bring them home (since they're totally locked). If this works, though, I can definitely present a proposal to digitalize the old books that are locked from the public.. Once their priority on university-stuff is OCR'd and fully digitalized, I'm sure they'll have space for a fully student-handled project; and my hands will be all over the library by then so they'll know how much of an efficient tool in the shed I am.

All of the text is in Spanish (for the most part that's unfortunate, since the internet and its userbase are English-speaking), but this will definitely boost Spanish books on the internet, which is quite an oasis in the desert if you ask me.

Most of the books in the restricted area are public domain (or so the librarian assigned to that section told me). I just came from checking out the restricted section and yeah, some of the books I checked are as old as 90 years. I asked the librarian to give me the links on copyright holders to see which can be launched online. She sent me an email with the contact details and also told me that she herself can do the talking, as long as I have the authorization from the uni to scan the books.
That'll be something I'll be pushing myself into as of next year. This year is to get the university-texts done and the project approved.

Thank you so much, Tex. I seriously appreciate the intense amount of detail that you've added on your post. Thanks a bunch

Cheers

mrmikel · 02-18-2014, 06:46 PM

Since you are in an academic environment, you are going to find some problems if you try to venture outside of PDF for heavily formatted books. Epubs are based on reflowing text, which means NO fixed page numbers, unless you leave visible page numbers in the text. It also means that text can be hard to pin down alongside images and that tables are the work of the devil. This variable page size also means that footnotes can end up some distance from the original text on the page. Many go to notes at the end of a chapter to solve this. Small text like footnote citations can be hard to see on a small device. Inset descriptions or biographies of people mentioned in the texts can cause a major break in the flow of the text in smaller readers.

You may be the first to make this all work and have your future made....or they may cart you away to somewhere for troubled people muttering, but why can't I get that table to work??????

You will need to check the death dates of any author you propose to digitize, 70 years ago being a good average, but check the laws of your country. Copyright can be a gigantic headache.
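As a first-pass filter only, the death-date check can be sketched like this. The 70-year term is the "good average" mentioned above and is an assumption; the sketch also assumes the term runs to the end of the calendar year, as in many life-plus-N regimes. It is not legal advice, and the actual law of the country decides.

```python
import datetime

def term_expired(death_year: int, term_years: int = 70, on_year=None) -> bool:
    """True if a life-plus-term copyright would have lapsed by on_year.

    Assumes protection lasts through the end of death_year + term_years,
    so the work is free from 1 January of the following year.
    """
    if on_year is None:
        on_year = datetime.date.today().year
    return on_year > death_year + term_years

# Hypothetical authors, checked as of 2014.
for name, died in (("Author A", 1930), ("Author B", 1950)):
    status = "likely expired" if term_expired(died, on_year=2014) else "likely still protected"
    print(f"{name} (d. {died}): {status}")
```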

Tex2002ans · 02-19-2014, 03:24 AM

Quote:

Originally Posted by lol.systema

Gotcha. I'm currently gonna go test lighting and start toying around with my SLR and find out on standard and high quality pics. I was told size of each pic doesn't matter, as long as the OCR and quality go smoothly, it's alright. I'll keep it simple though, don't want 100MB+ pics

I wish I had more info on the hardware/scanning side... the only thing I have dabbled with is a destructive method: cutting the binding off and feeding the pages into a sheet-fed scanner. It went quite fast, but the disadvantage with that method is that you have to destroy the book.... and I do not know how well very old pieces of paper would handle that method (probably not well at all).

Quote:

Originally Posted by lol.systema

well, I actually missed adding that part... I did mention what OCR does, but I did also mention that ABBYY FineReader does leave a lot of mistakes (especially with O turned into 0, m turned into rn, and no tildes... since all university-related text is in Spanish, tildes are in every text, every paragraph, almost every line). I did mention the need for manual correction, just not manual transcription as a tabula rasa.

Typically with OCR, the further you move away from English, the worse the OCR accuracy becomes. I don't have too much experience with Spanish (typically the books that I convert have lots of French/German names/references).

The only book that I recall working on that had a massive amount of Spanish was, "The Socialist Empire: The Incas of Peru" by Louis Baudin:

Original PDF: https://mises.org/document/4336/A-So...-Incas-of-Peru
EPUB version on my site: http://misesbooks.blogspot.com/2012/...-by-louis.html

The OCR from Finereader turned out fine with Spanish, but if I recall, I still had to do a lot of manual checking. Accents and tildes are especially rough (seems to me it is highly dependent on the font used in the original book as well... sometimes it recognizes accents perfectly, other times it misses even the simplest/clearest cases!), so you typically need even higher quality initial material (compared to a purely English book).

I must admit though, I don't have too much experience on the Spanish side of things (maybe someone here knows a lot more about digitizing Spanish).

Just don't forget in Finereader, at the very top, to set the Language to "Spanish" (or maybe, "English; Spanish")....... I have run through too many books with the wrong languages selected, and by the time I notice all the missing accents, it is too late.
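To help the people doing the manual pass, the typical slips mentioned in this thread (O read as 0, m read as rn, missing accents) can be flagged automatically. Below is a minimal sketch, not a FineReader feature: `suspicious_tokens` is a made-up helper, and `known_words` stands in for a real Spanish word list that you would have to supply yourself.

```python
import re
import unicodedata

def strip_accents(word: str) -> str:
    """Remove combining marks, e.g. 'educación' -> 'educacion'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def suspicious_tokens(text: str, known_words: set) -> list:
    """Return (token, reason) pairs for tokens that look like OCR slips."""
    # Map accent-less spellings back to the accented dictionary word.
    unaccented = {strip_accents(w): w for w in known_words if strip_accents(w) != w}
    flagged = []
    for token in re.findall(r"\w+", text):
        lower = token.lower()
        if lower in known_words:
            continue
        if any(c.isdigit() for c in token) and any(c.isalpha() for c in token):
            flagged.append((token, "digit mixed into a word (O/0?)"))
            continue
        # Try replacing each 'rn' with 'm', one occurrence at a time.
        rn_fix = next((lower[:i] + "m" + lower[i + 2:]
                       for i in range(len(lower) - 1)
                       if lower[i:i + 2] == "rn"
                       and lower[:i] + "m" + lower[i + 2:] in known_words), None)
        if rn_fix:
            flagged.append((token, f"'rn' may be a misread 'm' ({rn_fix})"))
        elif lower in unaccented:
            flagged.append((token, f"accent may be missing ({unaccented[lower]})"))
    return flagged

known = {"educación", "moderna", "número"}  # tiny stand-in word list
for token, reason in suspicious_tokens("La educacion rnoderna del 2O14", known):
    print(f"{token}: {reason}")
```

This only points a proofreader at likely trouble spots; it does not replace the A/B comparison against the page image.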

Quote:

Originally Posted by lol.systema

They agreed. So far they'll let me get the text, OCR it on my comp (they're not gonna spend $160 on good ol' Abby) and send the OCR'd text so they can work on it. The staff assigned to do the digitalization will be reassigned to do the correction.. if the project is approved.

Toxaris recommends exporting from Finereader as an "Editable Copy" DOC. (Which is what I assume his tool works best with).

If a third party is going to be manually going over the OCR, that might make it easiest to A/B compare with the PDF. I believe that method still keeps all the pages, just throws out a nice amount of formatting overhead, and doesn't try to place text EXACTLY where it appeared... but it still inserts page breaks, so each "page" in the DOC still matches each page in the PDF.

Again, sorry I don't have more information, I don't export to DOC, or use Toxaris's macro (since I don't use Microsoft Word).

Quote:

Originally Posted by lol.systema

You normally work with EPUB. What would you recommend for basic image-to-text, no specific format required? Only HTML?

To be honest I never worked with HTML. All the old books I scanned were passed only to DOC and worked from there. It seems to work just fine. However, that's just me being a total amateur.

I know they have a crew that will work on OCR'd text, however they did not mention HTML nor the intentions of touching HTML. Do you know of a tutorial on HTML for text processing? I'm pretty clueless on that /:

Heh, not one that is to my standards. (You can see the outline I have written for PDF -> EPUB method in that previous topic I linked to... I have yet to flesh it out/expand on it). Maybe someone else can point out some tutorials (with a focus on digitizing text).

I find the HTML output from Finereader to be quite dreadful, but the EPUB output (added in Finereader 11) is some pretty minimalist/clean HTML (it only leaves in the basics, italic, bold, underline, sub/superscript, headings, ...). All the other font/layout junk code is completely nonexistent.

For someone who doesn't know their way around HTML... DOC output MIGHT be your best bet. (Especially if this is going to be handed off to others to check/clean... I doubt they will know much HTML either).

Explore Toxaris's stuff... From what I gather, his tools really can help clean up any DOCs (and DOCs exported from Finereader), and his tools can be used to do quite a good job at exporting a very clean EPUB.

If you are using Libre or Open Office, you can use Writer2EPUB: https://www.mobileread.com/forums/forumdisplay.php?f=230
and/or his other tool, PerfectEPUB: http://lukesblog.it/ebooks/ebook-tools/perfectepub/

Quote:

Originally Posted by lol.systema

Can't say I'm doing so; this still needs full approval. If it goes through, I can immerse myself into university-based-texts. Even so, I would not be authorized to release them outside of the uni.

Furthermore, the restricted access section (which has the locked books) is not taken into consideration when it comes to the en masse digitalization. Therefore I cannot scan them books nor bring them home (since they're totally locked). If this works, though, I can definitely present a proposal to digitalize the old books that are locked from the public.. Once their priority on university-stuff is OCR'd and fully digitalized, I'm sure they'll have space for a fully student-handled project; and my hands will be all over the library by then so they'll know how much of an efficient tool in the shed I am.

Bah, some digitization (for sharing within the university) is better than nothing though... but just think of all that duplicated waste of manpower going on in all the different universities! (Each school would be wasting time taking images of the same exact books + manually converting/checking the OCR).

There might be some sort of system in place to share digital texts between universities, but I have no clue (I am not in academe). Typically getting access to those things is insanely expensive (just like many of these academic journals... don't get me started on that racket).

And another reminder, since a lot of this might be older theses.... in my experience, Finereader does a HORRIBLE job on typewriter text (maybe there is a setting I have missed somewhere). But the few books that I had to convert that were typed from typewriters... it was HORRIBLY inaccurate (and on the very slow end of conversion).

Quote:

Originally Posted by lol.systema

Thank you so much, Tex. I seriously appreciate the intense amount of detail that you've added on your post. Thanks a bunch

May this information help you, and all other future digitizers!!!

And I think we should change "intense amount of detail" into "a TEX amount of detail".

Quote:

Originally Posted by mrmikel

Since you are in an academic environment, you are going to find some problems if you try to venture outside of PDF for heavily formatted books. Epubs are based on reflowing text, which means NO fixed page numbers, unless you leave visible page numbers in the text.

I don't see TOO much of a problem if the original PDF is released right alongside. IF someone must absolutely reference something formally using AMA/APA/MLA/[ZZZ is what that makes me want to do] based off of the page numbers, they can always look back at the PDF. If they want to read it for the knowledge, they can choose their preferred format.

You CAN spend your time and create a page-map (specifically for EPUB).... but the tools to create a page-map automatically are to my knowledge, non-existent (it is a giant pain in the butt). And the amount of readers who actually know how to look through the code, figure out the page-map, OR even know that this specific book uses one (and not the typical ADE/Calibre/whatever numbering schemes).... I can probably say, abysmally small. Plus who knows in the future after EPUB, if any of these conversion programs will properly be able to convert a page-map to that future format.

Plus these same problems just occur when placing text in HTML form on a website... there are no such things as "pages" on a site. You can split them in logical locations based on chapter, and SOME sites have some sort of paragraph numbering system in place... but these digitization methods abandon the entire "page" system (which makes ZERO sense in the digital realm).

I believe as long as the original scan/images/PDF is accessible alongside the HTML/EPUB version... that should be good enough.
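If the OCR export keeps page markers like the `<div class="newpage" id="page-179"></div>` seen in the Archive.org EPUB quoted earlier in this thread, the raw material for a page list can at least be pulled out with a few lines of script. This is only a sketch: it assumes every marker has exactly that form, and it stops at listing page numbers and positions rather than writing any particular page-map file.

```python
import re

# Marker format taken from the Archive.org EPUB excerpt quoted in this thread.
PAGE_MARKER = re.compile(r'<div class="newpage" id="page-(\d+)"></div>')

def page_offsets(html: str) -> list:
    """Return (page number, character offset) for every page marker found."""
    return [(int(match.group(1)), match.start())
            for match in PAGE_MARKER.finditer(html)]

sample = ('no answer to say that\n'
          '<div class="newpage" id="page-179"></div>\n'
          'the prosecutor is as bad.')
for page, offset in page_offsets(sample):
    print(f"page {page} starts at character {offset}")
```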

Quote:

Originally Posted by mrmikel

It also means that text can be hard to pin down alongside images and that tables are the work of the devil. This variable page size also means that footnotes can end up some distance from the original text on the page. Many go to notes at the end of a chapter to solve this. Small text like footnote citations can be hard to see on a small device.

Indeed... Non-fiction works with Tables/Figures/Formulas/Footnotes/Images... these are a HUGE slowdown in the digitization process.

Tables: Some people/companies take a "snapshot" of the table and include it as an image. I digitize them completely (I believe it is much better for the long-run of the book, and it allows it to be copy/pastable/scalable/readable by the blind). I explained some of my table ideas in this topic: https://www.mobileread.com/forums/sho...d.php?t=223062

Warning With Images of Tables: If you insist on taking a dreaded snapshot of a table, USE PNG. AVOID JPG LIKE THE PLAGUE.

Footnotes: I explained my Footnote ideas in this topic (the real fun begins around post #16 hahaha): https://www.mobileread.com/forums/sho...d.php?t=225045

Formulas: There is no good way to do this in EPUB/MOBI at the moment... perhaps in future formats this will be better (although it will still require a MASSIVE amount of manpower). I explained a lot of the ideas in this topic: https://www.mobileread.com/forums/sho...d.php?t=228413

I also explained how I handle generating higher resolution PNGs of formulas (and having the formulas saved in a more easily convertible form) in my "Formulas to PNG Tutorial": https://www.mobileread.com/forums/sho...d.php?t=223254

Figures: Many non-fiction books tend to have figures that "flow" around the text. My method is almost always to push the figure "down" to the end of the split paragraph. For example, on Page 72 of that Jevons PDF I linked above is "Fig. 1":

[Image: JevonsPDFpg72.png]

[Image: JevonsEPUBpg72.png]

Images: If the images are "artificial" (charts, graphs, text), go PNG! If they are "natural" (photographs), the argument could be made for JPG (if it is a grayscale image, please save as grayscale JPG).

I explained why JPG = junk for artificial images up in the Tables topic I linked above, Post #8: https://www.mobileread.com/forums/sho...54&postcount=8

I explained some of my PNG compression methods here (and reasoning to go PNG over JPG in the case of "artificial"/"few color" images): https://www.mobileread.com/forums/sho...5&postcount=26
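To make the artificial-vs-natural rule of thumb a bit more concrete, here is a minimal Python sketch. It is only an illustration: the function name `suggest_format`, the plain list-of-RGB-tuples input, and the 64-color cutoff are assumptions of mine, not anything taken from the linked posts. A real workflow would read the pixels with an image library and you would still eyeball the result.

```python
def suggest_format(pixels, max_flat_colors=64):
    """Suggest an image format from a list of (r, g, b) pixel tuples.

    max_flat_colors is a hypothetical cutoff, not a measured value.
    """
    unique_colors = set(pixels)
    # Few distinct colors: likely a chart, table snapshot, or text -> PNG.
    if len(unique_colors) <= max_flat_colors:
        return "PNG"
    # Many shades, but every one has r == g == b: a grayscale photograph.
    if all(r == g == b for r, g, b in unique_colors):
        return "grayscale JPG"
    # Many colors: most likely a natural photograph.
    return "JPG"

# A black-and-white table snapshot has only two colors.
table_snapshot = [(255, 255, 255)] * 90 + [(0, 0, 0)] * 10
print(suggest_format(table_snapshot))
```

Counting colors is a crude stand-in for "artificial vs. natural", so treat the answer as a hint and not a verdict.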

Quote:

Originally Posted by mrmikel

You will need to check the death dates of any author you propose to digitize, 70 years ago being a good average, but check the laws of your country. Copyright can be a gigantic headache.

Indeed... Gigantic is not the correct word for this... there must be a larger word.

lol.systema · 02-19-2014, 05:53 PM

Quote:

Originally Posted by mrmikel

Since you are in an academic environment, you are going to find some problems if you try to venture outside of PDF for heavily formatted books. Epubs are based on reflowing text, which means NO fixed page numbers, unless you leave visible page numbers in the text. It also means that text can be hard to pin down along side images and that tables are the work of the devil. This variable page size also means that footnotes can end up some distance from the original text on the page. Many go to notes at the end of a chapter to solve this. Small text like footnote citations can be hard to see on a small device. Inset descriptions by biographies of people mentioned in the texts can cause a major break in the flow of the text in smaller readers.

You may be the first to make this all work and have your future made....or they may cart you away to somewhere for troubled people muttering, but why can't I get that table to work??????

You will need to check the death dates of any author you propose to digitize, 70 years ago being a good average, but check the laws of your country. Copyright can be a gigantic headache.

Yea. So I've seen. Epubs are a bit of a pain when it comes to situations like the one I'm gonna go through. You couldn't have said it better: work of the devil.

I gave a quick glance at the text I'm working. It's heavy on footnotes. Maybe I can lure the crew into working in another way; possibly doc.... they prefer Epub style. They also don't want to pack themselves with too much work. Maybe I can get to them that way: Epub's more work than what they can chew on.

Quote:

Originally Posted by mrmikel

You may be the first to make this all work and have your future made.... or they may cart you away to somewhere for troubled people muttering, but why can't I get that table to work??????

We may be subversive brats from a third world country, but one thing's for sure: we're economic, we're quick and we're efficient. If we like it, we get it done in the most cost-effective manner and without losing any quality of work. Ruthless pragmatism is what I use to call it.. others call it different though lol.

Having said this, you don't need to worry about watching me strapped in white and in a fetal position; for that matter, I should've been in that condition a long time ago for other reasons (; lol

Quote:

Originally Posted by Tex2002ans

a destructive method.... you have to destroy the book.... and I do not know how well very old pieces of paper would handle that method (probably not well at all)

nope nopitty nope NOPE... Not even gonna think about that one.

Just came from testing lighting and DPI in the university's photo studio. Scanned a few pieces and did the OCR in my comp. Seems to work just fine. As a matter of fact, the OCR showed little to no mistakes.
The book was quite clean: clean characters, clean pages, clean footers and headers. Clean everything.. I'll be returning in a few hours in order to check out the typewriter text and old papers.

Quote:

Originally Posted by Tex2002ans

Typically with OCR, the further you move away from English, the worse the OCR accuracy becomes. I don't have too much experience with Spanish (typically the books that I convert have lots of French/German names/references).

The only book that I recall working on that had a massive amount of Spanish was, "The Socialist Empire: The Incas of Peru" by Louis Baudin

I just tried scanning a few pages of a Spanish book I have. Turns out the OCR went quite well. Even most tildes were picked up. So yea, as you said: it really depends on the font.
German and French... OOH I thank my sweet baby Jesus that I ain't touching any of those. I asked if any text had any other languages. They said "just Spanish/English". I left quite relieved

Ah! El Imperio Socialista de los Incas de Louis Baudin.. REALLY interesting read. Left it halfway through since I had to read other stuff but I might try again and finish it sometime.

Quote:

Originally Posted by Tex2002ans

Just don't forget in Finereader, at the very top, to set the Language to "Spanish" (or maybe, "English; Spanish")....... I have run through too many books with the wrong languages selected, and by the time I notice all the missing accents, it is too late.

Gotcha
lol that must be a pain

Quote:

Originally Posted by Tex2002ans

[editable copies]might make it easiest to A/B compare with the PDF. I believe that method still keeps all the pages, just throws out a nice amount of formatting overhead, and doesn't try to place text EXACTLY where it appeared

ACTUALLY, now that you mention it, an editable copy is the best one. It keeps some headers/footers; if the font in Word is the same font as the book, then the page formatting leaves each page 99% identical with the original book page. Titles and headings are also identified (although badly, and sometimes not at all, but that can be worked with easily).

You're right. An editable copy would be the best choice in this case. Gonna try it out along with a few other formats right now.

Quote:

Originally Posted by Tex2002ans

For someone who doesn't know their way around HTML... DOC output MIGHT be your best bet. (Especially if this is going to be handed off to others to check/clean... I doubt they will know much HTML either)

I got into a meeting with the staffing team.. none of them has a clue of what's going on. Hell, one even confused Java with HTML. I was later told that I will need to "capacitate" (train) the crew depending on the means I find more suitable. So yeah.. I'm sure I can work with them. DOC is a fairly easy format and goes hand in hand with the original text images for comparison purposes. Two birds, one shot. All I need is to polish the team's efficiency. I'll go check out Scan Tailor and all the tools from Toxaris. If things go through, I'll push into getting someone with HTML experience, just so we can keep that stored in a backup.

I'll also keep record of digitalization, do some experimenting and see what goes and comes around each playthrough. Maybe I can work on providing better insight on digitalization and share it in here.

Quote:

Originally Posted by Tex2002ans

Bah, some digitization (for sharing within the university) is better than nothing though...

I threw in a hint on that. They said NOPE on sharing.
I did, though, hint at getting my scanner (once it's done) into the restricted area to work on the books. They said yea, BUT only once the initial campaign is done. So I guess I got that going for me.

Quote:

Originally Posted by Tex2002ans

but just think of all that duplicated waste of manpower going on in all the different universities!

In here it's more of "think of all the universities that don't give a ship about digitalization". You might have it good out there in your country but in here, getting physical books is a pain. Don't even get me started on digitals, ebooks, pdf... ANYTHING because it's mostly non-existent.

There SHOULD be communication between universities and institutions, but the thing is there's not even the slightest shadow of interest... only this uni I'm studying in shows any, and it has just a barebones idea of how to get things done

Quote:

Originally Posted by Tex2002ans

snip on tables, images, footnotes, formulas, figures, typewriter OCR, APA/AMA/MLA

After checking on some text, most of what I'll be working with is footnotes, tons of footnotes; few tables, few images, few figures, tons of formulas. I was informed that someone in the architecture department can work on setting up the formulas as long as we send the images attached on each request. I'll go check out the PNG/Formulas thread. That'll serve as good reference.

Yea. I've fallen in love with the PNG/JPG duo ever since I started toying around with my scanner. I'll keep grayscale on everything since Abby seems to digest it better. She's on a diet, you know..

I'm gonna check out the PNG compression methods right now. There are going to be more meetings with more detailed info requests. I can give poop covered in tin foil, sell it as Avant-garde stuff and still convince the old geezers, but every single bit of detail helps.

Thanks for the TEX amount of detail (wink wink see what I did there? lol)
And also, thanks for sharing as well, mrmikel. Now it's time to work

Cheers,

PeterT · 02-19-2014, 07:00 PM

You might like to check out the Distributed Proofreading project http://www.pgdp.net/c/ . I seem to recall that there was a way of installing a copy of this on your own server, which would allow you to get multiple bodies involved in the validation part of the OCR work.

http://www.pgdp.net/phpBB2/viewtopic.php?t=21864 seems to cover installation

Tex2002ans · 02-19-2014, 08:59 PM

Quote:

Originally Posted by lol.systema

Yea. So I've seen. Epubs are a bit of a pain when it comes to situations like the one I'm gonna go through. You couldn't have said it better: work of the devil.

I do most of my work for a non-profit economics website, and we release a huge amount of academic material (non-fiction economics/history books mostly), and it works for us just releasing all of the scans instantly as PDFs, and then when EPUBs are converted, just releasing those side by side.

Everyone who reads the books, can just do a search on the site and go reference the PDFs that are right next door. Seems to work out well for the hundreds/thousands of academics (and non-academics) who use our resources.

I don't see any reason why it has to be any different in a formal academic setting.

Quote:

Originally Posted by lol.systema

I gave a quick glance at the text I'm working. It's heavy on footnotes. Maybe I can lure the crew into working in another way; possibly doc.... they prefer Epub style. They also don't want to pack themselves with too much work. Maybe I can get to them that way: Epub's more work than what they can chew on.

Don't want to "pack themselves with too much work"... I work at this stuff full-time... just finished converting my ~210th book. Working from images/PDFs is about as painful as it can get, and converting from non-fiction is even more painful.... The only thing that is probably worse is converting math.

After working at this stuff full-time for about a year and a half, it takes me ~8-15 hours on average to convert a scanned (non-fiction economics) book -> OCR -> completed EPUB. So at the pace of around one book every one or two days.... (Of course, some only take a few hours, and some take MUCH longer (30+ hours)).

As I said, when you first start book conversion... it will be SLOOOOOOWWWWWW (took me a week or two, so I assume my pace used to be ~40-80 hours per book).

I assume some sort of distributed system would bring about even more overhead in actual manhours. And this is not taking into account the manpower it takes to initially get the books into images/PDFs.

Book Digitization = time consuming.
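A back-of-the-envelope sketch of what those per-book numbers mean for a whole collection. The 100-book collection size and the 40-hour work week are made-up assumptions; the per-book hours are just the rough ranges quoted above, so treat the result as an order of magnitude and nothing more.

```python
def estimate_hours(num_books, hours_low, hours_high):
    """Return the (low, high) total hours needed to convert num_books."""
    return num_books * hours_low, num_books * hours_high

def to_work_weeks(hours, hours_per_week=40):
    """Convert hours to full-time work weeks (40 h/week is an assumption)."""
    return hours / hours_per_week

NUM_BOOKS = 100  # hypothetical collection size

# Per-book ranges: ~8-15 h with experience, ~40-80 h when starting out.
for label, low, high in [("experienced", 8, 15), ("beginner", 40, 80)]:
    total_low, total_high = estimate_hours(NUM_BOOKS, low, high)
    print(f"{label}: {total_low}-{total_high} hours, "
          f"{to_work_weeks(total_low):.1f}-{to_work_weeks(total_high):.1f} work weeks")
```

None of this counts the scanning/photographing itself, which is extra manpower on top.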

Quote:

Originally Posted by lol.systema

German and French... OOH I thank my sweet baby Jesus that I ain't touching any of those. I asked if any text had any other languages. They said "just Spanish/English". I left quite relieved

Don't believe one word they say!!! Sure, the books you just have are "just Spanish/English"... the books I work on are "just English"!!! But there are a lot of names/references that will have German/French accents, and a lot of quotations that may be in different languages, or single French/German/Spanish words that are in italics and accented.

Finereader Tip: Setting the "Language" up top activates the OCR to look for certain characters. This was something I learned after too many headaches (that stupid cedilla below the 'c' in "François" was what finally pushed me over the "English" Language edge). So now I just set Finereader to convert all books as "English; French; German".

Side Note: Selecting the "Language" in Finereader also activates dictionaries for those languages as well.... I found that when I activated "Spanish" as a language, sure, the OCR might catch a few tildes/accents the other languages would have missed, but then Finereader started producing WAY too many false positives (markings in the PDF were considered accents), AND, the Spanish dictionary started to interfere with the actual words (so it was telling me things were spelled wrong when they weren't). This might not affect you so much though if you were just doing your editing using an outside program (like Microsoft Word).
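One cheap way to notice a wrong Language setting BEFORE you are hours into proofing: count the accented letters in the OCR output. Below is a minimal Python sketch using only the standard library. The function names and the "1 accented letter per 1000 characters" threshold are my own assumptions for illustration, and this has nothing to do with Finereader's internals; it just looks at the exported text.

```python
import unicodedata

def accented_letters(text):
    """Return the accented letters found in text (e.g. é, ñ, ç)."""
    found = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        # An accented letter decomposes into a base letter plus combining marks.
        if len(decomposed) > 1 and any(unicodedata.combining(c) for c in decomposed):
            found.append(ch)
    return found

def looks_like_wrong_language(text, min_per_1000_chars=1.0):
    """Flag OCR output whose accent rate is suspiciously low.

    min_per_1000_chars is a hypothetical threshold; tune it per collection.
    """
    if not text:
        return False
    rate = 1000 * len(accented_letters(text)) / len(text)
    return rate < min_per_1000_chars

print(looks_like_wrong_language("Francois vivio muchos anos en Espana"))
```

A Spanish book that comes back with almost no accents is a strong hint that the page was recognized with the wrong language selected.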

Quote:

Originally Posted by lol.systema

Ah! El Imperio Socialista de los Incas de Louis Baudin.. REALLY interesting read. Left it halfway through since I had to read other stuff but I might try again and finish it sometime.

Well I can guarantee you that is the greatest EPUB that exists of the book! A faithful reproduction if I do say so myself.

Quote:

Originally Posted by lol.systema

ACTUALLY, now that you mention it, an editable copy is the best one. It keeps some headers/footers; if the font in Word is the same font as the book, then the page formatting leaves each page 99% identical with the original book page. Titles and headings are also identified (although badly, and sometimes not at all, but that can be worked with easily).

The thing that is horrible though is that you cannot rely on Finereader marking things properly (headers/pagenumbers, footers/footnotes).

The actual CODE in the backend making the DOC look close to the actual page is ABSOLUTELY DREADFUL.

You may potentially dig yourself into a hole where you will have to waste lots more future manpower going out from a HORRIBLY designed DOC -> HTML (or whatever other format you want).

Which is why I personally just jump from OCR -> EPUB (barebones HTML), and do my fixing directly. HTML + CSS is not going anywhere... and I keep the code extremely minimal/consistent throughout all my books, which makes it easy as pie to just copy/paste to sites/anywhere.
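To give an idea of what "barebones" means here, this is a minimal Python sketch that turns plain OCR text (paragraphs separated by blank lines) into bare `<p>` tags and nothing else. The function name and the blank-line convention are assumptions for the example; it is not the actual workflow described above, just the general shape of it.

```python
import html

def paragraphs_to_html(text):
    """Wrap blank-line-separated paragraphs in minimal <p> tags."""
    # Re-join lines that the OCR broke in the middle of a paragraph.
    paragraphs = [" ".join(block.split()) for block in text.split("\n\n")]
    # Escape &, <, > so stray characters cannot break the markup.
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs if p)

ocr_text = "First paragraph, broken\nacross two lines.\n\nSupply & demand."
print(paragraphs_to_html(ocr_text))
```

No inline styles, no positioning, no fonts: everything visual is left to a shared CSS file, which is what keeps the markup easy to copy/paste anywhere.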

Although again, Toxaris's tools... huge time saver if you use Microsoft Office.

Quote:

Originally Posted by lol.systema

I'll also keep record of digitalization, do some experimenting and see what goes and comes around each playthrough. Maybe I can work on providing better insight on digitalization and share it in here.

Can't wait to hear more info from you... the hardware side of digitizing books is interesting.

Quote:

Originally Posted by lol.systema

In here it's more of "think of all the universities that don't give a ship about digitalization". You might have it good out there in your country but in here, getting physical books is a pain. Don't even get me started on digitals, ebooks, pdf... ANYTHING because it's mostly non-existent.

There SHOULD be communication between universities and institutions, but the thing is there's not even the slightest shadow of interest... only this uni I'm studying in shows any, and it has just a barebones idea of how to get things done

Indeed indeed... academe is always living in the stone ages and moves glacially slow.

I jumped ship from physical books once I stumbled upon the treasure trove of all PDFs/EPUBs for free. Now I will NEVER touch a physical book again (unless I have to digitize it).

I dedicate all my time now towards getting books into EPUB (VASTLY SUPERIOR to reading some crappy pictures/scanned PDF).

Most of the books that we work on went out of print, got lost in time, etc. etc. Now, ANYONE around the world can have access to them within a minute of searching/downloading.

Having them up in digital form is ALSO fantastic when you yourself are needing to use them for reference. You can quickly look up the PDF version, pull out what you need, and move on with typing your paper.

Stone Ages:
- Go to the library, they don't have it.
- They search around... only one library across the country has it.
- Weeks later, they get some dusty tome shipped to them.
  - Or better yet, it is locked up, and you have to spend a whole day traveling to get it.
- Only one person can use the book at a time.
Now:
- Search in your browser
- Download PDF/EPUB/XYZ format
- Copy/Paste into your paper
- Move on without ever having to leave your desk.
- Everyone can use the book at the same time.

Quote:

Originally Posted by lol.systema

After checking on some text, most of what I'll be working with is footnotes, tons of footnotes; few tables, few images, few figures, tons of formulas. I was informed that someone in the architecture department can work on setting up the formulas as long as we send the images attached on each request. I'll go check out the PNG/Formulas thread. That'll serve as good reference.

The cheapest way is to just leave the original formulas as snapshots right out of the PDF.

I would not recommend fully digitizing the formulas if you are doing archival. It is not worth the amount of time/money AT ALL.

I personally do it because I want the highest quality in my EPUBs, and if we ever DO reprint one of these older books with a new edition, a horrible scanned formula would look QUITE out of place. So you want the stuff in some sort of vector form that can easily be scaled.

But since you are not in the business of publishing.... I wouldn't recommend it.

Quote:

Originally Posted by lol.systema

Thanks for the TEX amount of detail (wink wink see what I did there? lol)

Side tip that seems obvious: Start off with the EASY stuff. Work on small material first. Articles (maybe up to 30 pages), small journals. Then tackle much harder works later. You feel like you are making much more progress when you fully digitize 30 articles instead of ONE 600 page book with millions of footnotes/tables/diagrams.

Quote:

Originally Posted by PeterT

You might like to check out the Distributed Proofreading project http://www.pgdp.net/c/ . I seem to recall that there was a way of installing a copy of this on your own server, which would allow you to get multiple bodies involved in the validation part of the OCR work.

http://www.pgdp.net/phpBB2/viewtopic.php?t=21864 seems to cover installation

Definitely read a lot of the other material in their forums too, there is lots of good stuff.

doubleshuffle · 02-20-2014, 12:07 AM

Quote:

Originally Posted by Tex2002ans

Stone Ages:
- Go to the library, they don't have it.
- They search around... only one library across the country has it.
- Weeks later, they get some dusty tome shipped to them.
  - Or better yet, it is locked up, and you have to spend a whole day traveling to get it.
- Only one person can use the book at a time.
Now:
- Search in your browser
- Download PDF/EPUB/XYZ format
- Copy/Paste into your paper
- Move on without ever having to leave your desk.
- Everyone can use the book at the same time.

Nicely put. Certainly used to get more exercise in the olden times though, didn't we?
