Quote:
Originally Posted by mrmikel
Since you are in an academic environment, you are going to find some problems if you try to venture outside of PDF for heavily formatted books. Epubs are based on reflowing text, which means NO fixed page numbers, unless you leave visible page numbers in the text. It also means that text can be hard to pin down along side images and that tables are the work of the devil. This variable page size also means that footnotes can end up some distance from the original text on the page. Many go to notes at the end of a chapter to solve this. Small text like footnote citations can be hard to see on a small device. Inset descriptions by biographies of people mentioned in the texts can cause a major break in the flow of the text in smaller readers.
You may be the first to make this all work and have your future made....or they may cart you away to somewhere for troubled people muttering, but why can't I get that table to work??????
You will need to check the death dates of any author you propose to digitize, 70 years ago being a good average, but check the laws of your country. Copyright can be a gigantic headache.
|
Yea. So I've seen. Epubs are a bit of a pain when it comes to situations like the one I'm gonna go through. You couldn't have said it better: work of the devil.
I gave a quick glance at the text I'm working on. It's heavy on footnotes. Maybe I can lure the crew into working another way, possibly DOC... but they prefer Epub style. They also don't want to pack themselves with too much work. Maybe I can get to them that way: Epub's more work than they can chew.
Quote:
Originally Posted by mrmikel
You may be the first to make this all work and have your future made.... or they may cart you away to somewhere for troubled people muttering, but why can't I get that table to work??????
|
We may be subversive brats from a third world country, but one thing's for sure: we're economical, we're quick and we're efficient. If we like it, we get it done in the most cost-effective manner and without losing any quality of work. Ruthless pragmatism is what I like to call it... others call it something different though lol.
Having said this, you don't need to worry about watching me strapped in white and in a fetal position. For that matter, I should've been in that condition a long time ago for other reasons (; lol
Quote:
Originally Posted by Tex2002ans
a destructive method.... you have to destroy the book.... and I do not know how well very old pieces of paper would handle that method (probably not well at all)
|
nope nopitty nope NOPE... Not even gonna think about that one.
Just came from testing lighting and DPI in the university's photo studio. Scanned a few pieces and ran the OCR on my computer. Seems to work just fine. As a matter of fact, the OCR showed little to no mistakes.
The book was quite clean: clean characters, clean pages, clean footers and headers. Clean everything.. I'll be returning in a few hours in order to check out the typewriter text and old papers.
Quote:
Originally Posted by Tex2002ans
Typically with OCR, the further you move away from English, the worse the OCR accuracy becomes. I don't have too much experience with Spanish (typically the books that I convert have lots of French/German names/references).
The only book that I recall working on that had a massive amount of Spanish was, "The Socialist Empire: The Incas of Peru" by Louis Baudin
|
I just tried scanning a few pages of a Spanish book I have. Turns out the OCR went quite well. Even most of the accent marks (tildes) came through. So yea, as you said: it really depends on the font.
German and French... OOH I thank my sweet baby Jesus that I ain't touching any of those. I asked if any text had any other languages. They said "just Spanish/English". I left quite relieved.
Ah! El Imperio Socialista de los Incas de Louis Baudin.. REALLY interesting read. Left it halfway through since I had to read other stuff but I might try again and finish it sometime.
Quote:
Originally Posted by Tex2002ans
Just don't forget in Finereader, at the very top, to set the Language to "Spanish" (or maybe, "English; Spanish")....... I have run through too many books with the wrong languages selected, and by the time I notice all the missing accents, it is too late.
|
Gotcha
lol that must be a pain
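A rough way to catch that problem early might be to count accented characters in the OCR text before going through a whole book. This is only a sketch in Python: the function names, the character set and the 0.5% threshold are assumptions made up for illustration, not anything FineReader provides.

```python
import unicodedata

# Characters Spanish text normally contains (assumed set for this sketch).
SPANISH_MARKS = set("áéíóúüñÁÉÍÓÚÜÑ¿¡")


def accent_ratio(text: str) -> float:
    """Share of letters (plus ¿ and ¡) that are Spanish-specific marks."""
    # Normalize so that "n" + combining tilde is counted as a single "ñ".
    text = unicodedata.normalize("NFC", text)
    letters = [ch for ch in text if ch.isalpha() or ch in "¿¡"]
    if not letters:
        return 0.0
    marked = sum(1 for ch in letters if ch in SPANISH_MARKS)
    return marked / len(letters)


def looks_like_wrong_language(text: str, threshold: float = 0.005) -> bool:
    """Flag OCR output whose accent ratio is suspiciously low for Spanish.

    The threshold is a hypothetical starting point; tune it on real pages.
    """
    return accent_ratio(text) < threshold


good = "El niño comió rápido en la estación. ¿Dónde está?"
bad = "El nino comio rapido en la estacion. Donde esta?"
print(looks_like_wrong_language(good), looks_like_wrong_language(bad))
```

Running it on the first few recognized pages, rather than the finished book, would flag a wrong language setting while it is still cheap to redo.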
Quote:
Originally Posted by Tex2002ans
[editable copies]might make it easiest to A/B compare with the PDF. I believe that method still keeps all the pages, just throws out a nice amount of formatting overhead, and doesn't try to place text EXACTLY where it appeared
|
ACTUALLY, now that you mention it, an editable copy is the best one. It keeps some headers/footers; if the font in Word is the same font as the book, then the page formatting leaves each page 99% identical to the original book page. Titles and headings are also identified (although badly, and sometimes not at all, but that's easy to work around).
You're right. An editable copy would be the best choice in this case. Gonna try it out along with a few other formats right now.
Quote:
Originally Posted by Tex2002ans
For someone who doesn't know their way around HTML... DOC output MIGHT be your best bet. (Especially if this is going to be handed off to others to check/clean... I doubt they will know much HTML either)
|
I got into a meeting with the staffing team... none of them have a clue of what's going on. Hell, one even confused Java with HTML. I was later told that I will need to "capacitate" (train) the crew using whichever method I find most suitable. So yeah... I'm sure I can work with them. DOC is a fairly easy format and goes hand in hand with the original text images for comparison purposes. Two birds, one shot. All I need is to polish the team's efficiency. I'll go check out Scan Tailor and all the work from Toraxis. If things go through, I'll push for getting someone with HTML experience, just so we can keep that stored as a backup.
I'll also keep a record of the digitalization, do some experimenting and see what comes out of each playthrough. Maybe I can work on providing better insight on digitalization and share it here.
Quote:
Originally Posted by Tex2002ans
Bah, some digitization (for sharing within the university) is better than nothing though...
|
I threw in a hint on that. They said NOPE on sharing.
I did, though, hint at taking my scanner (once it's done) into the restricted area to work on the books. They said yea, BUT only once the initial campaign is done. So I guess I got that going for me.
Quote:
Originally Posted by Tex2002ans
but just think of all that duplicated waste of manpower going on in all the different universities!
|
Over here it's more of "think of all the universities that don't give a ship about digitalization". You might have it good out there in your country, but over here, getting physical books is a pain. Don't even get me started on digitals, ebooks, PDF... ANYTHING, because it's mostly non-existent.
There SHOULD be communication between universities and institutions, but the thing is there's not even the slightest shadow of interest... only the uni I'm studying at shows any, and it has just a barebones idea of how to get things done.
Quote:
Originally Posted by Tex2002ans
snip on tables, images, footnotes, formulas, figures, typewriter OCR, APA/AMA/MLA
|
After checking some of the text, most of what I'll be working with is footnotes, tons of footnotes; few tables, few images, few figures, tons of formulas. I was informed that someone in the architecture department can work on setting up the formulas as long as we send the images attached to each request. I'll go check out the PNG/Formulas thread. That'll serve as good reference.
Yea. I've fallen in love with the PNG/JPG duo ever since I started toying around with my scanner. I'll keep grayscale on everything since ABBYY seems to digest it better. She's on a diet, you know...
I'm gonna check out the PNG compression methods right now. There are going to be more meetings with requests for more detailed info. I can give poop covered in tin foil, sell it as avant-garde stuff and still convince the old geezers, but every single bit of detail helps.
Thanks for the TEX amount of detail (wink wink see what I did there? lol)
And also, thanks for sharing as well, mrmikel. Now it's time to work
Cheers,