05-10-2011, 02:50 PM   #14
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
I'm possibly veering somewhat off-topic here, but when you have both the OCR'ed text and the scanned image of each page, I like to make a two-column HTML table with the images on the right and the text on the left. Then I can import that file into OpenOffice Writer, proofread, edit, and XHTML-format it, and save it as plain text. I enclose the bash script I use on the PDF files of public-domain works available from the Norwegian National Library, and a screen dump of what a file looks like in OpenOffice. It could work on books from Google too, I think, though it would probably have to be tweaked a bit.
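
The general idea of the two-column table can be sketched roughly as below. This is not the attached script, only a minimal illustration under assumptions of my own: each page is assumed to already exist as an image `page-NNN.png` with its OCR text in `page-NNN.txt` in the same directory, and the names `html_escape` and `make_two_column` are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: build a two-column HTML table, OCR text on the left, page scan on the right.
# Assumed (hypothetical) layout: DIR/page-001.png with DIR/page-001.txt beside it.

# Escape the characters that would otherwise be read as HTML markup.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Print the HTML document for all page images found in the given directory.
make_two_column() {
    local dir=$1 img txt
    printf '<html><body>\n<table border="1">\n'
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue          # no images: the glob stayed literal
        txt=${img%.png}.txt                # text file that belongs to this page
        printf '<tr>\n<td valign="top"><pre>\n'
        [ -f "$txt" ] && html_escape < "$txt"
        # Image is referenced by bare file name, so the HTML file must sit next to it.
        printf '</pre></td>\n<td valign="top"><img src="%s" width="600"></td>\n</tr>\n' "${img##*/}"
    done
    printf '</table>\n</body></html>\n'
}
```

Something like `make_two_column ./pages > ./pages/twocolumn.html` would then give a file that can be opened in OpenOffice Writer. How the page images and text get out of the PDF in the first place is left out here; poppler's `pdftoppm` and `pdftotext` are one option.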

<Grumble> Why are Windows users allowed to upload their .bat files, while Linux users must zip their .sh files to upload them?</Grumble>
Attached Thumbnails
Name:	screendump-twocolumn-pdf-view.jpg
Views:	407
Size:	109.8 KB
ID:	71213  
Attached Files
File Type: gz twocolumnPDFview.sh.gz (1.5 KB, 288 views)