06-18-2012, 07:09 AM   #16
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
For one's own personal use, one can leave errors.

My experience has been that you should not rely on the OCR text too much for a non-fiction work, because I have seen a number of very plausible-looking errors that change the meaning considerably.

I make it a practice to download the PDF source along with the OCRed document. It is a devourer of hard drive space. Thank goodness for my 600 GB hard drive.

Unfortunately, the Distributed Proofreaders software is in PHP which I am not running. My kingdom for a Windows binary!

06-18-2012, 08:03 AM   #17
DiapDealer
Grand Sorcerer
Posts: 27,552
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Unfortunately, the Distributed Proofreaders software is in PHP which I am not running. My kingdom for a Windows binary!
Why would you think you couldn't run PHP on an IIS Webserver?

06-18-2012, 08:38 AM   #18
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
Quote:
Originally Posted by Doitsu View Post
Could you please post your sed script(s)?
Righto..
First,
Page splits:
Code:
sed 's/\f/<!-- PAGEBREAK -->\n/g' in.txt >out.txt
Page headers/footers: Very often the first line on a page is the page header, so surprisingly often this works:
Code:
sed '/PAGEBREAK/{n;s/.*/<!-- PAGEHEADER: & -->/}' in.txt >out.txt
For footers, I do this:
Code:
tac in.txt|sed '/PAGEBREAK/{n;s/.*/<!-- PAGEFOOTER: & -->/}'|tac >out.txt
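As a rough check of the three commands above, here is a minimal sketch that chains them on a made-up two-page sample. It assumes GNU sed (for `\f` and `\n` in the `s` command) and `tac` from GNU coreutils; the sample text and the variable names are invented for illustration.

```shell
#!/bin/sh
# Two made-up "pages" separated by a form feed (\f),
# each with a header line first and a page number last.
sample=$(printf 'HEADER ONE\nbody one\n1\n\fHEADER TWO\nbody two\n2\n')

marked=$(printf '%s\n' "$sample" \
  | sed 's/\f/<!-- PAGEBREAK -->\n/g' \
  | sed '/PAGEBREAK/{n;s/.*/<!-- PAGEHEADER: & -->/}' \
  | tac | sed '/PAGEBREAK/{n;s/.*/<!-- PAGEFOOTER: & -->/}' | tac)

printf '%s\n' "$marked"
```

In this sample the header of the first page and the footer of the last page stay unmarked, because neither has a PAGEBREAK line next to it; those two probably need to be handled by hand.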
Words split over lines:
Code:
sed '/-$/{h;s/.* //;s/$/#/;
x;
s/[^ ]\+$//;
:a;
n;
/^[a-z]/!b a;
H;x;
s/\n//}' file.txt >new-file.txt
Now every split word is put at the head of the following line, and the hyphen replaced by '-#'. Do an interactive search&replace '-#' -> '-' to keep those hyphens you want, and then do a batch search & replace to remove the rest.
The first part of a split word is prepended to the next line starting with a lower-case letter, so it will work across page breaks. However, extraneous leading/trailing blanks will cause problems, as will hyphenated capitalized names (e.g. Karl-Otto).
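To see what the hyphenation script does, here is a small sketch that feeds it a made-up sample through a pipe and then applies the batch removal step. It assumes GNU sed; setting `LC_ALL=C` is my own addition so that `[a-z]` only matches ASCII lower case, and the variable names are invented.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL   # assumption: keep [a-z] limited to ASCII lower case

sample_lines=$(printf '%s\n' \
  'It was a remark-' \
  'able day. Nobody' \
  'could forget the unfor-' \
  '<!-- PAGEBREAK -->' \
  'gettable storm.')

# Same script as in the post, reading from a pipe instead of file.txt.
marked_hyphens=$(printf '%s\n' "$sample_lines" | sed '/-$/{h;s/.* //;s/$/#/;
x;
s/[^ ]\+$//;
:a;
n;
/^[a-z]/!b a;
H;x;
s/\n//}')

# Batch step: drop every remaining "-#" marker, joining the word halves.
joined=$(printf '%s\n' "$marked_hyphens" | sed 's/-#//g')
printf '%s\n' "$joined"
```

The intermediate result contains `remark-#able` and `unfor-#gettable`, the second one carried across the PAGEBREAK line. The lines that lost their last word keep a trailing blank. Note also that when two consecutive lines both end in a hyphen, the second one is not picked up in the same pass.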

Next, chapters. First I do a grep to check that all the headings are present. E.g. assuming chapter headings are "CHAPTER 1, 2, 3...":
Code:
grep CHAPTER file.txt
If chapters have a title on the following line, I inspect that:
Code:
grep -A 2 CHAPTER file.txt
That gives me the two following lines, so I can see whether the title is split over more than one line; if so, I manually concatenate the title onto one line.
Having satisfied myself that all the chapters are of the form
Quote:
CHAPTER 3
An old cliché

It was a dark and stormy night....
I execute the following abomination:
Code:
sed '/CHAPTER/{s/.*/<\/p>\n<hr class="endchapter"\/>\n<h2 class="chapter">&<\/h2>/;
n;
s/.*/<h2 class="chapter_title">&<\/h2>/;
n;n;
s/\([^A-Z]*[A-Z]\)\([^ ]*\)/<p class="initial"><span class="drop">\1<\/span><span class="first">\2<\/span>/}' file.txt> new_file.txt
, which closes the last paragraph and appends a line at the end of the preceding chapter, wraps chapter number and title in suitable tags, and inserts a drop cap à la Jellby in the first paragraph.
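A minimal sketch of the chapter command on a made-up four-line chapter opening, assuming GNU sed (for `\n` in the replacement). The sample text and the variable name are invented; note the closing quote after `endchapter` in the `hr` tag.

```shell
#!/bin/sh
chapter_html=$(printf '%s\n' \
  'last words of the previous chapter.' \
  'CHAPTER 3' \
  'A dark night' \
  '' \
  'It was a dark and stormy night....' \
| sed '/CHAPTER/{s/.*/<\/p>\n<hr class="endchapter"\/>\n<h2 class="chapter">&<\/h2>/;
n;
s/.*/<h2 class="chapter_title">&<\/h2>/;
n;n;
s/\([^A-Z]*[A-Z]\)\([^ ]*\)/<p class="initial"><span class="drop">\1<\/span><span class="first">\2<\/span>/}')
printf '%s\n' "$chapter_html"
```

The first capital letter of the paragraph ends up in the `drop` span and the rest of the first word in the `first` span. The opening `<p class="initial">` is left unclosed here; it gets closed once the following paragraphs are marked up.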

Then you can do a
Code:
sed 's/^ \{3,12\}\([^ ]\)/<\/p>\n<p>\1/'
to convert text indents to paragraphs.
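For illustration, the indent command applied to a made-up sample, assuming GNU sed; the variable name is invented.

```shell
#!/bin/sh
paragraphs=$(printf '%s\n' \
  '    First paragraph starts here' \
  'and continues on this line.' \
  '    Second paragraph.' \
| sed 's/^ \{3,12\}\([^ ]\)/<\/p>\n<p>\1/')
printf '%s\n' "$paragraphs"
```

Every line indented by 3 to 12 blanks becomes `</p>` followed by a new `<p>`. In an isolated sample like this the first `</p>` has nothing to close and the last `<p>` stays open, so the very start and end of the text need a manual touch.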

If your chapter heading instead is just a centered roman numeral, you can use /^ \{20,\}[IVX]\+\.$/ instead of /CHAPTER/.
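Before swapping the address in, it may be worth checking what it matches. A small sketch with `sed -n ... p` on made-up lines, assuming GNU sed (for `\+`); the variable name is invented.

```shell
#!/bin/sh
roman_headings=$(printf '%s\n' \
  '                         IV.' \
  'IV. was how he signed his letters' \
  '                         THE END' \
| sed -n '/^ \{20,\}[IVX]\+\.$/p')
printf '%s\n' "$roman_headings"
```

Only the centered `IV.` line is printed: the second line has no indent, and the third is not a roman numeral followed by a full stop.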

And for my final trick, I'll split this file into one xhtml file for each chapter:
Code:
 sed -e '/<h2 class="chapter"/i</body>\n</html>\n<html>\n<body>'  -e '1i<html>\n<body>' -e '$a</body>\n</html>' file.txt|\
csplit -f epub/OEBPS/ -b "%2.2d.xhtml" - '/<html>/' '{*}' \
 && rm epub/OEBPS/00.xhtml
This'll put files 00.xhtml, 01.xhtml,... in directory epub/OEBPS
00.xhtml is an empty file, and is therefore removed.
In real life, you'll want a proper header for your xhtml files, but I figured that putting in a proper header wouldn't improve the readability...
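A sketch of the split on a made-up five-line file, run in a scratch directory. It assumes GNU sed (one-line `i`/`a` with `\n`) and GNU csplit; the `-s` flag is my addition to silence the byte counts, and `workdir` and the sample content are invented. The target directory has to exist before csplit runs.

```shell
#!/bin/sh
workdir=$(mktemp -d)
mkdir -p "$workdir/epub/OEBPS"
cd "$workdir" || exit 1

# Made-up input: some front matter, then two already tagged chapters.
printf '%s\n' \
  '<p>Title page</p>' \
  '<h2 class="chapter">CHAPTER 1</h2>' '<p>one</p>' \
  '<h2 class="chapter">CHAPTER 2</h2>' '<p>two</p>' > file.txt

sed -e '/<h2 class="chapter"/i</body>\n</html>\n<html>\n<body>' \
    -e '1i<html>\n<body>' -e '$a</body>\n</html>' file.txt |
  csplit -s -f epub/OEBPS/ -b "%2.2d.xhtml" - '/<html>/' '{*}' \
  && rm epub/OEBPS/00.xhtml

ls epub/OEBPS
```

With this input, 01.xhtml holds the front matter and 02.xhtml and 03.xhtml one chapter each. If the very first line of the input is itself a chapter heading, both insert commands fire on line 1 and produce a stray closing block, so it helps to have at least one line before the first heading.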

My own workflow differs slightly from this. These commands have been made into shell functions, and chapter headings etc. are converted into an intermediate, self-composed markup that doesn't ruin the readability of the text in the same way as html.
I also do some contorted sed&awk expressions to handle footnotes and images.

There you basically have it. Working out how the commands work is left as an exercise to the reader. Not fully automatic, but not many repetitive operations either, all done using computer stone-age tools. Try to do this with yer fancy new-fangled colour-coded GOEEY interfaces!

06-18-2012, 08:48 AM   #19
mrmikel
Quote:
Originally Posted by DiapDealer View Post
Why would you think you couldn't run PHP on an IIS Webserver?
No doubt I could...if I were running an IIS Webserver. I am running just Vista Home Premium, 64-bit version. Since I don't do any web server development, I don't want to install software just to run one program.

I can open Foxit Reader and Sigil in windows at the same time, but they tend to jump back and forth vying for attention, so it is an annoyance.

06-18-2012, 09:02 AM   #20
SBT
My issue with distributed proofreading solutions is that since you're limited to editing one page at a time, you are prevented from efficiently dealing with systematic errors (e.g. I will often want to do an interactive search&replace 'U' -> 'll' for the entire document).
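As an illustration of such a whole-document fix, here is a rough sketch with a review step followed by a batch step. It is not truly interactive, it assumes GNU sed and a grep that supports `-o`, and the rule "a U directly after a lower-case letter is a misread ll" is my own simplification; the sample sentence and variable names are made up.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL   # assumption: keep [a-z] limited to ASCII lower case
ocr_text='He leaned on the waU and caUed out to the USA team.'

# Review step: list each suspicious word on its own line before changing anything.
suspects=$(printf '%s\n' "$ocr_text" | grep -o '[a-z]*[a-z]U[A-Za-z]*')

# Batch step: a "U" directly after a lower-case letter becomes "ll".
fixed=$(printf '%s\n' "$ocr_text" | sed 's/\([a-z]\)U/\1ll/g')

printf '%s\n' "$suspects" "$fixed"
```

The review step lists `waU` and `caUed`, while `USA` is left alone because its U does not follow a lower-case letter.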

A thought has struck me - would it be possible to have a proofreading plugin in Sigil? I've never used Sigil myself, but maybe it could do the same as I do in LibreOffice, viz. convert a pdf/djvu into an html file with the scanned image in the left column and the text from the image in the right?

06-18-2012, 09:58 AM   #21
Iznogood
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
Quote:
Originally Posted by SBT View Post
My issue with distributed proofreading solutions is that since you're limited to editing one page at a time, you are prevented from efficiently dealing with systematic errors (e.g. I will often want to do an interactive search&replace 'U' -> 'll' for the entire document)

Plus, if you can't edit the html code (as PeterT indicates in post #12), there's no point. It's more efficient to do proof-reading and fixing formatting in the same process.

Quote:
Originally Posted by SBT View Post
A thought has struck me - would it be possible to have a proofreading plugin in Sigil? I've never used Sigil myself, but maybe it could do the same as I do in Libreoffice, viz. convert a pdf/djvu into a html file with the scanned image in the left column and the text from the image in the right?
I have tried to use Sigil for proof-reading but gave it up; there was too much automatic clean-up that messed with my code. But when 0.6.0 comes out, I hope that this can be turned off completely.

06-18-2012, 10:08 AM   #22
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
@SBT: Thanks for posting the scripts. I only have rudimentary sed scripting skills and your scripts will help me get started.

06-18-2012, 05:12 PM   #23
SBT
Quote:
Originally Posted by Doitsu View Post
@SBT: Thanks for posting the scripts. I only have rudimentary sed scripting skills and your scripts will help me get started.
You're welcome!
Oodles of fun can be had with sed&awk - need I mention that I use them to create the toc & ncx files as well?

12-05-2016, 10:43 AM   #24
pluma
Enthusiast
Posts: 48
Karma: 854254
Join Date: Nov 2016
Device: none
Quote:
Originally Posted by SBT View Post
[snip: the sed scripts from post #18]
Interesting, I'd like to try this method. But are those commands compatible with the sed of 2016? Your post is from 2012.

Also, what would be the workflow for later editing/changes? Such as deleting a chapter or section and having the TOC and other places updated automatically. Perhaps not exactly 'automagically', but automagically the sed way.

thanks.