10-02-2013, 11:18 PM   #10
Tex2002ans
Quote:
Originally Posted by willus
I'm thinking this thread is almost worthy of a sticky in the PDF forum. Lots of good info. Any admins watching?
There is no way this topic is worthy of stickiness!

Months ago I jotted down a rough outline for a "Tutorial" I have been putting together for all of the steps I take from PDF (Finereader) -> EPUB (Sigil). So many books to convert... I haven't gone back to flesh it out.

I should really take a break from conversion and get it done.

(It would most likely be like the Formula to PNG Tutorial I posted here: https://www.mobileread.com/forums/sho...d.php?t=223254) Step by step, lots of pictures, and lots of REAL LIFE examples (of actual books I worked on).

I wanted to have a "guinea pig" who is quite interested in OCR, so I can refine a lot of what I wrote (expand certain sections, remove others, simplify areas, etc. etc.).

I just took a look at the rough draft and I have 196 lines. Here is the rough draft outline:

Code:
Step 1, getting Finereader prepared

	English; French; German
		façade
	B&W
	Tools - Options - Scan/Open - Do not read and analyze acquired page images
	Tools - Options - Read - Thorough Reading
	Tools - Options - Save - PDF - Use Mixed Raster Content, Enable Tagged PDF, Best Quality
		FB2/EPUB - Title, Author, Best Quality
	

Step 1.5, cleaning the PDF

Step 2, Importing the PDF

Step 3, Layout

Step 4, Editing
	
	
EPUB
	- Clean Finereader output
	- Side by side comparison 
		- Fix Footnotes first
		- Add blockquotes
		- Keep an eye out for indentation on new pages
	- Catch hyphenation (if plenty of time, search for ([a-zA-Z])-([a-zA-Z]))
		- Index Trick
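
The hyphenation search above can be tried outside of Sigil. This is only a rough Python sketch (Python is my choice, not part of the original workflow); the function name `hyphen_hits` and the `width` parameter are made up for illustration. It lists every letter-hyphen-letter match with a little context so each one can be checked by eye:

```python
import re

# Pattern from the outline: a letter, a hyphen, a letter.
HYPHEN = re.compile(r"([a-zA-Z])-([a-zA-Z])")

def hyphen_hits(text, width=10):
    """Return each letter-hyphen-letter match with `width` characters of context."""
    hits = []
    for m in HYPHEN.finditer(text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append(text[start:end])
    return hits

sample = "The govern-ment ignored long-term risk."
for hit in hyphen_hits(sample):
    print(hit)
```

It only reports matches; deciding whether a hyphen is real ("long-term") or a leftover line break ("govern-ment") is still a manual call.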
		
Helpful character sites:
		
http://www.fileformat.info/info/unicode/char/search.htm
		
https://en.wikipedia.org/wiki/Macron
https://en.wikipedia.org/wiki/Grave_accent
https://en.wikipedia.org/wiki/Acute_accent
https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
https://en.wikipedia.org/wiki/Circumflex
https://en.wikipedia.org/wiki/Caron
https://en.wikipedia.org/wiki/Dagger_%28typography%29
https://en.wikipedia.org/wiki/Greek_letters
	
Finereader:
	
Settings: Document Language: English; French; German;
	
1. Go through the scan and erase artifacts
2. Analyze (Optional)
	-  Tweaking of areas (Optional)
3. Read
4. OCR
	- Spellcheck round
		- Ignore All
			- If you run into a person's name that is "misspelled" I usually ignore all, so all further instances are ignored.
		- Check all blue highlighted areas
		- Pay attention to blue punctuation
	- Pay attention to hyphens between years/numbers
		- Quite often in numbers, a number 1 is accidentally OCRed as lowercase "l" or capital "I"
	- Pay attention to hyphens along the right side margin (if the word carries over to the next line it will have a ¬ instead of a hyphen)
		- Anti-
		- Semi-
		- Quasi-
		- pseudo-
		- Non-
		- Pre-
		- Post-
		- Neo-
		- self-
		- well-being
		- short-term
		- long-term
		- time-preference
		- South-west
		- If a hyphen is there accidentally, and it should be a ¬, this is impossible to fix in Finereader.  It will be fixed at the EPUB step.
	- Pay attention to missing apostrophes
		- Böhm-Bawerks -> Böhm-Bawerk’s
		- mans -> man’s
	- Pay attention to superscripts (quite often they get messed up since they are in a smaller font)
	- Pay attention to italics
		- Sometimes characters/words before/after actual italics get dragged in by accident
		- Sometimes italic letters become symbols.  Example: <i>I</i> becoming /
	- Do not worry much about bold, unless it is actually used for emphasis
		- I find it easiest to strip out all bold in the EPUB step, and then add it in if needed.
		- This takes care of all the accidentally bolded words from the OCR
	- Pay attention to parentheses
		- Sometimes they are accidentally OCRed as braces { } or brackets [ ] instead
	- Try to change all tables into HTML
		- The current recommendation is to take "images" of very large/complex tables, but this has a few disadvantages:
			- They are not searchable/copy/pastable
			- The images are unreadable on smaller devices
			- Does not scale with the text
			- Images take up a lot more space than HTML
			- A 100% HTML version of the table is better for long-term storage of the books.
				- In a future format, the HTML table can be converted easily.
				- Resolutions are only going to go up.
				- HTML tables are more accessible to those with disabilities
	- Keep an eye out for "rn" OCRed as "m" (fix "m" -> "rn")
		- govemment -> government
		- bom -> born
		- tum -> turn
		- retum -> return
		- eam -> earn
		- modem -> modern
		- com -> corn
	- Keep an eye out for the capital letter "I" and the number "1"
		- I942 -> 1942
	- All errors in the actual text of the book should be marked in the format:
		- Page ###: paragraph #: Fixed Text
			This is the copied and pasted <b><i>sentance</i></b> out of the book as it appears in the text with the error highlighted.
		- Paragraph 0 is used if the top paragraph has carried over from the previous page.
		- Make the actual text fix in the OCR
	- Try to pay attention to missing accents on characters
		- Bohm-Bawerk -> Böhm-Bawerk
	- Pay attention to missing quotation marks.  Depending on the book, sometimes ending punctuation + quotation marks are missing or become slashes.
		- . . / -> . . .” or . . .’
		- Sometimes an entire ellipsis is just not OCRed
	- Pay attention to italicized variables
		- a, b, c, d  -> <i>a</i>, <i>b</i>, <i>c</i>, <i>d</i>
	- Quite often misspellings occur when a word carries over to the next page, or carries over to the next line (split by a hyphen).
	- Pay attention to two single left/right quotation marks ‘‘ ’’ instead of their proper left/right double quotes “ ”
		- Sometimes there are spaces between ‘ ‘ ’ ’
		-  ‘ ’
	- Do not try to "modernize" the words
		- If the book uses "co-ordination", "to-day", "per cent." or "coöperate", make sure you match the text exactly.
	- Ellipsis: I personally avoid the Unicode ellipsis character … and instead stick with normal periods
		- Normal periods work better/look more consistent when there are more than three
		- Mixing ellipsis + normal periods looks very inconsistent.
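
A couple of the checks in this list ("rn" read as "m", capital "I" read as "1") can be roughly automated. What follows is a hedged Python sketch, not anything Finereader does itself; `RN_FIXES`, `DIGIT_LOOKALIKE` and `flag_suspects` are names invented here, and the word list is just a few of the examples from the outline:

```python
import re

# A few words from the outline where OCR read "rn" as "m".
RN_FIXES = {"govemment": "government", "bom": "born", "tum": "turn",
            "eam": "earn", "modem": "modern", "com": "corn"}

# A token made only of digits, "I" and "l" that contains at least one digit
# and at least one look-alike letter, e.g. "I942".
DIGIT_LOOKALIKE = re.compile(r"(?=.*\d)(?=.*[Il])[\dIl]+")

def flag_suspects(text):
    """Return (word, suggested fix) pairs for manual review; nothing is replaced."""
    suspects = []
    for m in re.finditer(r"\b\w+\b", text):
        word = m.group()
        if word.lower() in RN_FIXES:
            suspects.append((word, RN_FIXES[word.lower()]))
        elif DIGIT_LOOKALIKE.fullmatch(word):
            suspects.append((word, word.replace("I", "1").replace("l", "1")))
    return suspects

print(flag_suspects("The govemment fell in I942, not 1942."))
```

The sketch only flags words. "modem" and "com" can be perfectly valid, so every hit still needs to be compared against the scan.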
		

		
After the OCR round:
	- Replace ONE BY ONE.
		- Search for ' and replace with a right single quote ’
			- Keep a left single quote ‘ in your copy/paste, so you can paste that instead if needed.
			- Quite often a prime is used.  Just keep that as a dumb apostrophe. (Many ereaders cannot handle the actual unicode prime character).
		- Search for " and replace with a right double quote ”
			- Keep a left double quote “ in your copy/paste, so you can paste that instead if needed.
			- Sometimes "two dumb apostrophes" '' are OCRed instead of the “smart quotes”
		- Search for /
			- These are not often used in books; usually they are OCR errors
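
To make the one-by-one quote pass a bit faster, a script can list every straight quote along with a guess at which curly quote belongs there. This is a minimal Python sketch under my own assumptions: the function name `suggest_quotes` is hypothetical, and the "opening quote after whitespace or an opening bracket" rule is a common heuristic, not something from the outline:

```python
import re

def suggest_quotes(text):
    """Return (position, straight quote, suggested curly quote) for each ' or "."""
    curly = {"'": ("‘", "’"), '"': ("“", "”")}
    suggestions = []
    for m in re.finditer(r"['\"]", text):
        left, right = curly[m.group()]
        prev = text[m.start() - 1] if m.start() > 0 else " "
        # Assumed heuristic: a quote after whitespace or an opening bracket opens,
        # anything else closes (this also covers apostrophes like man's).
        guess = left if prev.isspace() or prev in "([{" else right
        suggestions.append((m.start(), m.group(), guess))
    return suggestions

for pos, straight, guess in suggest_quotes('He said "man\'s fate" twice.'):
    print(pos, straight, "->", guess)
```

It deliberately does not rewrite the text. Primes and leading apostrophes would be guessed wrong, which is exactly why the replacements are done ONE BY ONE.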
	
EPUB Steps: Good to know basic XHTML/CSS.  Great to know Regular Expressions:

Regular Expressions
	http://www.regular-expressions.info/
	https://www.mobileread.com/forums/sho...d.php?t=167971

	- Versioning system
		- Use the date and/or version number
			- Last,First.-.Title.of.Book[MM.DD.YYYY].epub
			- Last,First.-.Title.of.Book[MM.DD.YYYY][v.1].epub
			- Save OFTEN, and save a different version BEFORE you try to do any large search/replaces, and up the version number after doing a major "pass" on the EPUB
				- For example, after you finish a spellcheck pass, bump the version from v.2 -> v.3.
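
The naming scheme above is easy to generate so the date and version never get typed wrong. A small Python sketch, assuming the title has no characters that are illegal in filenames; `versioned_name` is a made-up helper name:

```python
from datetime import date

def versioned_name(last, first, title, day, version=None):
    """Build Last,First.-.Title.of.Book[MM.DD.YYYY][v.N].epub (version is optional)."""
    stamp = day.strftime("%m.%d.%Y")
    name = f"{last},{first}.-.{title.replace(' ', '.')}[{stamp}]"
    if version is not None:
        name += f"[v.{version}]"
    return name + ".epub"

print(versioned_name("Last", "First", "Title of Book", date(2013, 10, 2), 3))
# Last,First.-.Title.of.Book[10.02.2013][v.3].epub
```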

Step 1: Split chapters
Step 2: Set up headers
Step 3: Set up blockquotes
	- I try to combine the paragraphs here, match the formatting
		- For example, if the author of the quote is right justified
	- Combine paragraphs if needed
Step 4: Find/Order footnotes
	- Doublecheck that all footnotes are there.
	- Place them at the end of the chapter in the order they were found in the book.
Step 5: Combine paragraphs:
	- Search: -</p>\s+<p>
	- Replace: (BLANK)
		- This will combine paragraphs that end with a hyphen (usually a word carried over from one page to the next)
		- If it is an actual hyphen, then manually combine the paragraphs
	- Search: ([^>”\?\!\.])</p>\s+<p>
	- Replace: \1 
		- This will catch any paragraph that does not end in a punctuation mark
		- Note: There is a space after the \1
		- Note: This WILL catch paragraphs that end with a right parenthesis ) and colon :.
			- These are sometimes valid, sometimes not.
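
For anyone who wants to test the two Step 5 searches before running them in Sigil, here they are as a Python sketch. The function name `combine_paragraphs` is hypothetical, and it assumes bare `<p>` tags with no attributes, same as the searches themselves:

```python
import re

def combine_paragraphs(xhtml):
    """Apply both Step 5 search/replace passes in order."""
    # Pass 1: paragraph ends with a hyphen -> join with nothing (BLANK).
    xhtml = re.sub(r"-</p>\s+<p>", "", xhtml)
    # Pass 2: paragraph does not end in >, ”, ?, ! or . -> join with a single space.
    xhtml = re.sub(r"([^>”\?\!\.])</p>\s+<p>", r"\1 ", xhtml)
    return xhtml

sample = "<p>The govern-</p>\n<p>ment fell</p>\n<p>apart.</p>\n<p>Next.</p>"
print(combine_paragraphs(sample))
# <p>The government fell apart.</p>
# <p>Next.</p>
```

Running it blindly over a whole book joins real hyphens too (pass 1) and paragraphs ending in ) or : (pass 2), so the replace-and-check approach described above still applies.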

Step ?: Pay attention to the upper left corner of each page.  If you see no indentation and a capital letter, this most likely means it is a continuation of the paragraph from the page before.  Search for that phrase in the EPUB, and make sure it is combined with the paragraph above.
	
	
Index
	- Search: ([0-9]) ([A-Z])
	- Replace: \1</p> <p class="index">\2
		- This will look for any digit followed by a space and a capital letter
		- Make sure you have selected "Current File" in Sigil, to make sure this only affects the Index.
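
The Index trick can be previewed on a single run-together paragraph before touching the real file. A minimal Python sketch with the same search and replace; `split_index_entries` is an invented name and the sample entries are made up:

```python
import re

def split_index_entries(xhtml):
    """Start a new index paragraph wherever a digit is followed by a space and a capital."""
    return re.sub(r"([0-9]) ([A-Z])", r'\1</p> <p class="index">\2', xhtml)

sample = '<p class="index">Capital, 12, 45 Interest, 7</p>'
print(split_index_entries(sample))
# <p class="index">Capital, 12, 45</p> <p class="index">Interest, 7</p>
```

Anywhere else in a book this pattern would split ordinary sentences (a year followed by a capitalized word, for example), which is why it should only ever run on the Index file.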
		

		
Final Passes:
	en dash
	- Search: ([0-9])-([0-9])
	- Replace: \1–\2
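
The en dash pass, sketched in Python with the same search and replace (the function name `en_dash_ranges` is hypothetical):

```python
import re

def en_dash_ranges(text):
    """Replace a hyphen between two digits with an en dash."""
    return re.sub(r"([0-9])-([0-9])", r"\1–\2", text)

print(en_dash_ranges("pp. 12-15, 1914-1918"))
# pp. 12–15, 1914–1918
```

One limitation of the pattern: matches cannot overlap, so in a chain of single digits like 1-2-3 only the first hyphen is changed. It will also happily convert hyphens in things like ISBNs, so this pass is worth stepping through rather than using Replace All.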
Now THAT Tutorial might be worthy of stickiness.

Last edited by Tex2002ans; 10-02-2013 at 11:22 PM.