MobileRead Forums - View Single Post - Best way to copy text from a PDF or MOBI?

Tex2002ans · 10-02-2013, 11:18 PM

Quote:

Originally Posted by willus

I'm thinking this thread is almost worthy of a sticky in the PDF forum. Lots of good info. Any admins watching?

There is no way this topic is worthy of stickiness!

Months ago I jotted down a rough outline for a "Tutorial" I have been putting together for all of the steps I take from PDF (Finereader) -> EPUB (Sigil). So many books to convert... I haven't gone back to flesh it out.

I should really take a break from conversion and get it done.

(It would most likely be like the Formula to PNG Tutorial I posted here: https://www.mobileread.com/forums/sho...d.php?t=223254) Step by step, lots of pictures, and lots of REAL LIFE examples (of actual books I worked on).

I wanted to have a "guinea pig" who is quite interested in OCR, so I can refine a lot of what I wrote (expand certain sections, remove others, simplify areas, etc. etc.).

I just took a look at the rough draft and I have 196 lines. Here is the rough draft outline:

Code:

Step 1, getting Finereader prepared

	English; French; German
		façade
	B&W
	Tools - Options - Scan/Open - Do not read and analyze acquired page images
	Tools - Options - Read - Thorough Reading
	Tools - Options - Save - PDF - Use Mixed Raster Content, Enable Tagged PDF, Best Quality
		FB2/EPUB - Title, Author, Best Quality
	

Step 1.5, cleaning the PDF

Step 2, Importing the PDF

Step 3, Layout

Step 4, Editing
	
	
EPUB
	- Clean Finereader output
	- Side by side comparison 
		- Fix Footnotes first
		- Add blockquotes
		- Keep an eye out for indentation on new pages
	- Catch hyphenation (if you have plenty of time, search for ([a-zA-Z])-([a-zA-Z]))
		- Index Trick
		
Helpful character sites:
		
http://www.fileformat.info/info/unicode/char/search.htm
		
https://en.wikipedia.org/wiki/Macron
https://en.wikipedia.org/wiki/Grave_accent
https://en.wikipedia.org/wiki/Acute_accent
https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
https://en.wikipedia.org/wiki/Circumflex
https://en.wikipedia.org/wiki/Caron
https://en.wikipedia.org/wiki/Dagger_%28typography%29
https://en.wikipedia.org/wiki/Greek_letters
	
Finereader:
	
Settings: Document Language: English; French; German;
	
1. Go through the scan and erase artifacts
2. Analyze (Optional)
	-  Tweaking of areas (Optional)
3. Read
4. OCR
	- Spellcheck round
		- Ignore All
			- If you run into a person's name that is "misspelled" I usually ignore all, so all further instances are ignored.
		- Check all blue highlighted areas
		- Pay attention to blue punctuation
	- Pay attention to hyphens between years/numbers
		- Quite often in numbers, a number 1 is accidentally OCRed as lowercase "l" or capital "I"
	- Pay attention to hyphens along the right side margin (if the word carries over to the next line it will have a ¬ instead of a hyphen)
		- Anti-
		- Semi-
		- Quasi-
		- pseudo-
		- Non-
		- Pre-
		- Post-
		- Neo-
		- self-
		- well-being
		- short-term
		- long-term
		- time-preference
		- South-west
		- If a hyphen is there accidentally, and it should be a ¬, this is impossible to fix in Finereader.  It will be fixed at the EPUB step.
	- Pay attention to missing apostrophes
		- Böhm-Bawerks -> Böhm-Bawerk’s
		- mans -> man’s
	- Pay attention to superscripts (quite often they get messed up since they are in a smaller font)
	- Pay attention to italics
		- Sometimes characters/words before/after actual italics get dragged in by accident
		- Sometimes italic letters become symbols.  Example: <i>I</i> becoming /
	- Do not care much about bold, unless they are actually used for emphasis
		- I find it easiest to strip out all bold in the EPUB step, and then add it in if needed.
		- This takes care of all the accidentally bolded words from the OCR
	- Pay attention to parentheses
		- Sometimes they are accidentally OCRed as braces { } or brackets [ ] instead
	- Try to change all tables into HTML
		- The current recommendation is to take "images" of very large/complex tables, but this has a few disadvantages:
			- They are not searchable/copy/pastable
			- The images are unreadable on smaller devices
			- Does not scale with the text
			- Images take up a lot more space than HTML
			- A 100% HTML version of the table is better for long-term storage of the books.
				- In a future format, the HTML table can be converted easily.
				- Resolutions are only going to go up.
				- HTML tables are more accessible to those with disabilities
	- Keep an eye out for "m" -> "rn"
		- govemment -> government
		- bom -> born
		- tum -> turn
		- retum -> return
		- eam -> earn
		- modem -> modern
		- com -> corn
	- Keep an eye out for the capital letter "I" and the number "1"
		- I942 -> 1942
	- All errors in the actual text of the book should be marked in the format:
		- Page ###: paragraph #: Fixed Text
			This is the copied and pasted <b><i>sentance</i></b> out of the book as it appears in the text with the error highlighted.
		- Paragraph 0 is used if the top paragraph has carried over from the previous page.
		- Make the actual text fix in the OCR
	- Try to pay attention to missing accents on characters
		- Bohm-Bawerk -> Böhm-Bawerk
	- Try to pay attention to missing quotation marks.  Depending on the book, sometimes ending punctuation + quotation marks are missing or become slashes.
		- . . / -> . . .” or . . .’
		- Sometimes an entire ellipsis is just not OCRed
	- Pay attention to italicized variables
		- a, b, c, d  -> <i>a</i>, <i>b</i>, <i>c</i>, <i>d</i>
	- Quite often misspellings occur when a word carries over to the next page, or carries over to the next line (split by a hyphen).
	- Pay attention to two single left/right quotation marks ‘‘ ’’ instead of their proper left/right double quotes “ ”
		- Sometimes there are spaces between ‘ ‘ ’ ’
		-  ‘ ’
	- Do not try to "modernize" the words
		- If the book uses "co-ordination", "to-day", "per cent." or "coöperate", make sure you match the text exactly.
	- Ellipses: I personally avoid the Unicode ellipsis character … and instead stick with normal periods
		- Normal periods work better/look more consistent when there are more than three
		- Mixing ellipsis + normal periods looks very inconsistent.
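
A few of the OCR pitfalls above could be pre-flagged with a script before the manual read-through. This is only a rough sketch in Python (not part of Finereader, and not something from my actual workflow). The names CHECKS and flag_suspects are made up for illustration, and the word list is just the examples from this outline. Words like "modem" and "com" are real words too, so every hit still needs a human look.

```python
import re

# Hypothetical checks, one per pitfall from the outline above.
CHECKS = {
    # A "number" that contains a capital I or lowercase l, e.g. I942.
    "letter inside a number": re.compile(r"\b(?=\w*\d)(?=\w*[Il])[\dIl]{2,}\b"),
    # Two single curly quotes (optionally separated by one space) instead of one double quote.
    "doubled single quotes": re.compile(r"‘\s?‘|’\s?’"),
    # Words from the outline where "rn" is often read as "m".
    "possible rn read as m": re.compile(r"\b(?:govemment|bom|tum|retum|eam|modem|com)\b"),
}

def flag_suspects(text):
    """Return (label, matched_text) pairs for a human to review."""
    hits = []
    for label, pattern in CHECKS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits

if __name__ == "__main__":
    for label, found in flag_suspects("In I942 the govemment said ‘‘no’’ in 1942."):
        print(f"{label}: {found}")
```

The script only reports; it never changes the text, which matches the "check by eye" approach of the list above.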
		

		
After the OCR round:
	- Replace ONE BY ONE.
		- Search for ' and replace with a right single quote ’
			- Keep a left single quote ‘ in your copy/paste, so you can paste that instead if needed.
			- Quite often a prime is used.  Just keep that as a dumb apostrophe. (Many ereaders cannot handle the actual unicode prime character).
		- Search for " and replace with a right double quote ”
			- Keep a left double quote “ in your copy/paste, so you can paste that instead if needed.
			- Sometimes "two dumb apostrophes" '' are OCRed instead of the “smart quotes”
		- Search for /
			- These are not often used in books, and usually these are typos from the OCR
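
Since the replacements above are done ONE BY ONE, a small helper that lists each straight quote or slash with a bit of surrounding text could make the review faster. A minimal Python sketch, assuming the OCR output has been saved as plain text; straight_quote_contexts and the width parameter are hypothetical names, not anything from Finereader or Sigil.

```python
import re

# Straight apostrophe, straight double quote, and slash: the three searches above.
SUSPECT = re.compile("['\"/]")

def straight_quote_contexts(text, width=10):
    """Return (character, surrounding_text) for every suspect character."""
    results = []
    for match in SUSPECT.finditer(text):
        start = max(0, match.start() - width)
        end = match.end() + width
        results.append((match.group(0), text[start:end]))
    return results

if __name__ == "__main__":
    for char, context in straight_quote_contexts('He said "no" and left. . /'):
        print(repr(char), "in", repr(context))
```

It deliberately makes no replacement: whether a given mark should become a left quote, a right quote, or stay a dumb apostrophe (the prime case) is still a per-instance decision.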
	
EPUB Steps: Good to know basic XHTML/CSS.  Great to know Regular Expressions:

Regular Expressions
	http://www.regular-expressions.info/
	https://www.mobileread.com/forums/sho...d.php?t=167971

	- Versioning system
		- Use the date and/or version number
			- Last,First.-.Title.of.Book[MM.DD.YYYY].epub
			- Last,First.-.Title.of.Book[MM.DD.YYYY][v.1].epub
			- Save OFTEN, and save a different version BEFORE you try to do any large search/replaces, and up the version number after doing a major "pass" on the EPUB
				- For example, after you finish a spellcheck pass, bump the version from v.2 -> v.3.
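
The naming scheme above can be generated instead of typed by hand. A minimal Python sketch, assuming spaces in the title become periods as in the pattern shown; versioned_name and its parameters are hypothetical names used only for illustration.

```python
from datetime import date

def versioned_name(last, first, title, when, version=None):
    """Build Last,First.-.Title.of.Book[MM.DD.YYYY][v.N].epub"""
    stem = f"{last},{first}.-.{title.replace(' ', '.')}"
    stamp = when.strftime("[%m.%d.%Y]")
    # The version tag is optional, matching the two patterns in the outline.
    suffix = f"[v.{version}]" if version is not None else ""
    return f"{stem}{stamp}{suffix}.epub"

if __name__ == "__main__":
    print(versioned_name("Last", "First", "Title of Book", date(2013, 10, 2), 3))
    # prints: Last,First.-.Title.of.Book[10.02.2013][v.3].epub
```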

Step 1: Split chapters
Step 2: Set up headers
Step 3: Set up blockquotes
	- I try to combine the paragraphs here, match the formatting
		- For example, if the author of the quote is right justified
	- Combine paragraphs if needed
Step 4: Find/Order footnotes
	- Doublecheck that all footnotes are there.
	- Place them at the end of the chapter in the order they were found in the book.
Step 5: Combine paragraphs:
	- Search: -</p>\s+<p>
	- Replace: (BLANK)
		- This will combine paragraphs that end with a hyphen (usually a word carried over from one page to the next)
		- If it is an actual hyphen, then manually combine the paragraphs
	- Search: ([^>”\?\!\.])</p>\s+<p>
	- Replace: \1 
		- This will catch any paragraph that does not end in a punctuation mark
		- Note: There is a space after the \1
		- Note: This WILL catch paragraphs that end with a right parenthesis ) and colon :.
			- These are sometimes valid, sometimes not.

Step ?: Pay attention to the upper left corner of each page.  If you see no indentation and a capital letter, this most likely means it is a continuation of the paragraph from the page before.  Search for that phrase in the EPUB, and make sure it is combined with the paragraph above.
	
	
Index
	- Search: ([0-9]) ([A-Z])
	- Replace: \1</p> <p class="index">\2
		- This will look for any number followed by a space, and a capital letter
		- Make sure you have selected "Current File" in Sigil, to make sure this only affects the Index.
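
The Index trick above, tried out in Python on a single run-together line. This is only a sketch (split_index_entries is a hypothetical name); it assumes each entry ends in a page number and the next entry starts with a capital letter, exactly as the search does.

```python
import re

# A digit, a space, then a capital letter marks the start of a new index entry.
INDEX_BREAK = re.compile(r"([0-9]) ([A-Z])")

def split_index_entries(xhtml):
    """Close the paragraph after the page number and open a new index paragraph."""
    return INDEX_BREAK.sub(r'\1</p> <p class="index">\2', xhtml)

if __name__ == "__main__":
    print(split_index_entries('<p class="index">Capital, 12, 45 Credit, 88</p>'))
```

Entries that begin with a lowercase word or do not end in a number are not split, so the result still needs a quick look.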
		

		
Final Passes:
	en dash
	- Search: ([0-9])-([0-9])
	- Replace: \1–\2
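
And the en dash pass as a Python sketch, for testing on a copy of the text (to_en_dash is a made-up name). One thing this makes easy to see: the search hits ANY digit-hyphen-digit, so things like ISBNs or phone numbers would get changed as well, and in a chain like 1-2-3 only the first hyphen is replaced in a single pass. That is why I would still review the matches rather than Replace All.

```python
import re

# A hyphen with a digit on both sides, as in year or page ranges.
NUMBER_RANGE = re.compile(r"([0-9])-([0-9])")

def to_en_dash(text):
    """Replace the hyphen between two digits with an en dash."""
    return NUMBER_RANGE.sub(r"\1–\2", text)

if __name__ == "__main__":
    print(to_en_dash("1914-1918, pp. 3-7, well-being"))
    # prints: 1914–1918, pp. 3–7, well-being
```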

Now THAT Tutorial might be worthy of stickiness.