#16 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
For one's own personal use, one can leave errors.
My experience has been that you should not rely on OCR too much for a nonfiction work, because I have seen a number of very plausible errors that change the meaning considerably. I make it a practice to download the PDF source along with the OCRed document. It is a devourer of hard drive space. Thank goodness for my 600 GB hard drive. Unfortunately, the Distributed Proofreaders software is written in PHP, which I am not running. My kingdom for a Windows binary!
#17 | |
Grand Sorcerer
Posts: 28,339
Karma: 203719142
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
#18 | |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Righto..
First, page splits:
Code:
# Replace each form feed with a page-break marker followed by a newline
sed 's/\f/<!-- PAGEBREAK -->\n/g' in.txt >out.txt
Code:
# Wrap the line following each page break as a page header
sed '/PAGEBREAK/{n;s/.*/<!-- PAGEHEADER: & -->/}' in.txt >out.txt
Code:
# Reverse the file so the same trick wraps the line preceding each page break as a page footer
tac in.txt|sed '/PAGEBREAK/{n;s/.*/<!-- PAGEFOOTER: & -->/}'|tac >out.txt
Code:
# Join words that were split by a hyphen at the end of a line
sed '/-$/{h;s/.* //;s/$/#/; x; s/[^ ]\+$//; :a; n; /^[a-z]/!b a; H;x; s/\n//}' file.txt >new-file.txt
The first part of a split word is prepended to the next line starting with a lower-case letter, so it will work across page breaks. Therefore extraneous leading/trailing blanks will cause problems, as will hyphenated capitalized names (e.g. Karl-Otto).
Next, chapters. First I do a grep to check that all headings are present. E.g. assuming chapter headings are "CHAPTER 1,2,3....":
Code:
grep CHAPTER file.txt
Code:
# Show each heading together with the two lines that follow it
grep -A 2 CHAPTER file.txt
Having satisfied myself that all the chapters are of the form
Quote:
Code:
# Mark up the CHAPTER line, the title on the next line, and the first word of the paragraph two lines further down
sed '/CHAPTER/{s/.*/<\/p>\n<hr class="endchapter"\/>\n<h2 class="chapter">&<\/h2>/; n; s/.*/<h2 class="chapter_title">&<\/h2>/; n;n; s/\([^A-Z]*[A-Z]\)\([^ ]*\)/<p class="initial"><span class="drop">\1<\/span><span class="first">\2<\/span>/}' file.txt> new_file.txt
Then you can do a
Code:
# A line indented by 3 to 12 blanks starts a new paragraph
sed 's/^ \{3,12\}\([^ ]\)/<\/p>\n<p>\1/'
If your chapter heading instead is just a centered roman numeral, you can use /^ \{20,\}[IVX]\+\.$/ instead of /CHAPTER/.
And for my final trick, I'll split this file into one xhtml file for each chapter:
Code:
sed -e '/<h2 class="chapter"/i</body>\n</html>\n<html>\n<body>' -e '1i<html>\n<body>' -e '$a</body>\n</html>' file.txt|\
csplit -f epub/OEBPS/ -b "%2.2d.xhtml" - '/<html>/' '{*}' \
&& rm epub/OEBPS/00.xhtml
00.xhtml is an empty file, and is therefore removed. In real life, you'll want a proper header for your xhtml files, but I figured that putting in a proper header wouldn't improve the readability...
My own workflow differs slightly from this. These commands have been made into shell functions, and chapter headings etc. are converted into an intermediate, self-composed markup that doesn't ruin the readability of the text in the same way as html. I also do some contorted sed&awk expressions to handle footnotes and images.
There you basically have it. Working out how the commands work is left as an exercise for the reader. Not fully automatic, but not many repetitive operations either, all done using computer stone-age tools. Try to do this with yer fancy new-fangled colour-coded GOEEY interfaces!
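To see what the de-hyphenation command does in practice, here is a minimal sketch on a made-up two-line sample. It assumes GNU sed (the `\+` and the one-line `:a; n;` syntax are GNU extensions). The `dehyphenate` wrapper and the sample text are hypothetical; the sed script itself is copied unchanged from the post. The final clean-up step is an assumption of mine, not part of the posted workflow.

```shell
#!/bin/sh
# Hypothetical wrapper around the de-hyphenation command from the post (GNU sed assumed).
dehyphenate() {
  sed '/-$/{h;s/.* //;s/$/#/; x; s/[^ ]\+$//; :a; n; /^[a-z]/!b a; H;x; s/\n//}'
}

# Made-up sample: "extraordinary" split across two lines.
sample='The fox was extra-
ordinary in every way.'

joined=$(printf '%s\n' "$sample" | dehyphenate)
printf '%s\n' "$joined"
# The first line now ends after "was " and the second line starts with
# "extra-#ordinary": the "-#" marks the join so it can be reviewed.

# Assumption, not from the post: once the joins look right,
# strip the "-#" markers and any trailing blanks.
cleaned=$(printf '%s\n' "$joined" | sed -e 's/-#//g' -e 's/ *$//')
printf '%s\n' "$cleaned"
```

Leaving the `-#` marker in place until the end keeps genuine hyphenated words (which should keep their hyphen) distinguishable from line-break hyphens.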
|
#19 | |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Quote:
I can open Foxit Reader and Sigil in Windows at the same time, but they tend to jump back and forth vying for attention, which is an annoyance.
|
#20 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
My issue with distributed proofreading solutions is that since you're limited to editing one page at a time, you are prevented from efficiently dealing with systematic errors (e.g. I will often want to do an interactive search&replace 'U' -> 'll' for the entire document).
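A non-interactive approximation of that kind of document-wide fix, as a rough sketch. It assumes GNU grep and sed, and it assumes that a capital 'U' directly after a lowercase letter is always a misread 'll', which will not hold for every text. The function names `preview_U` and `fix_U` are made up for illustration.

```shell
#!/bin/sh
# List every line with a suspicious 'U' (one directly after a lowercase
# letter), with line numbers, so the hits can be reviewed first.
preview_U() {
  LC_ALL=C grep -n '[a-z]U' "$1"
}

# Replace each such 'U' with 'll'. A 'U' at the start of a word
# ("USA", "Under") is left alone.
fix_U() {
  LC_ALL=C sed 's/\([a-z]\)U/\1ll/g' "$1"
}
```

Typical use would be `preview_U book.txt` to inspect the hits, then `fix_U book.txt > book-fixed.txt` once they look right.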
A thought has struck me - would it be possible to have a proofreading plugin in Sigil? I've never used Sigil myself, but maybe it could do the same as I do in LibreOffice, viz. convert a PDF/DjVu file into an HTML file with the scanned image in the left column and the text from the image in the right?
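As a rough sketch of that side-by-side layout outside any particular editor (this is neither a Sigil plugin nor the LibreOffice workflow itself): the shell functions below build one HTML table row per page, with the scan on the left and the OCR text on the right. They assume each page already exists as an image plus a text file; `proof_row`, `proof_page`, and the `page-NNN.png` / `page-NNN.txt` naming are hypothetical.

```shell
#!/bin/sh
# Emit one table row: scanned image in the left cell, OCR text in the right.
# $1 = image path, $2 = text file holding the OCR output for that page.
proof_row() {
  printf '<tr>\n<td><img src="%s" alt="scanned page"/></td>\n<td><pre>' "$1"
  # Escape &, < and > so the OCR text cannot break the markup.
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$2"
  printf '</pre></td>\n</tr>\n'
}

# Wrap the rows for every page-*.txt file in directory $1 into one HTML
# document. Each text file is paired with the .png of the same basename.
proof_page() {
  printf '<html>\n<body>\n<table>\n'
  for txt in "$1"/page-*.txt; do
    [ -e "$txt" ] || continue
    proof_row "${txt%.txt}.png" "$txt"
  done
  printf '</table>\n</body>\n</html>\n'
}
```

Running `proof_page scans > proof.html` from the directory that contains `scans/` would give a single page to read through, though the text still has to be corrected in the `.txt` files themselves.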
#21 | ||
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
Quote:
Plus, if you can't edit the HTML code (as PeterT indicates in post #12), there's no use. It's more efficient to do proofreading and formatting fixes in the same process.
Quote:
|
||
#22 |
Grand Sorcerer
Posts: 5,680
Karma: 23983815
Join Date: Dec 2010
Device: Kindle PW2
|
@SBT: Thanks for posting the scripts. I only have rudimentary sed scripting skills and your scripts will help me get started.
|
#23 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
|
#24 | |
Enthusiast
Posts: 48
Karma: 854254
Join Date: Nov 2016
Device: none
|
Quote:
Also, what would be the workflow for editing or changing the book afterwards, such as deleting a chapter or section and having the TOC and other places updated automatically? Perhaps not exactly 'automagically', but automagically the sed way. Thanks.
|
Tags |
ocr, proof-reading |