Quote:
Originally Posted by willus
I'm thinking this thread is almost worthy of a sticky in the PDF forum. Lots of good info. Any admins watching?
There is no way this topic is worthy of stickiness!
Months ago I jotted down a rough outline for a "Tutorial" I have been putting together for all of the steps I take from PDF (Finereader) -> EPUB (Sigil). So many books to convert... I haven't gone back to flesh it out.
I should really take a break from conversion and get it done.
(It would most likely be like the Formula to PNG Tutorial I posted here:
https://www.mobileread.com/forums/sho...d.php?t=223254) Step by step, lots of pictures, and lots of REAL LIFE examples (of actual books I worked on).
I wanted to have a "guinea pig" who is quite interested in OCR, so I can refine a lot of what I wrote (expand certain sections, remove others, simplify areas, etc. etc.).
I just took a look at the rough draft and I have 196 lines. Here is the rough draft outline:
Code:
Step 1, getting Finereader prepared
English; French; German
façade
B&W
Tools - Options - Scan/Open - Do not read and analyze acquired page images
Tools - Options - Read - Thorough Reading
Tools - Options - Save - PDF - Use Mixed Raster Content, Enable Tagged PDF, Best Quality
FB2/EPUB - Title, Author, Best Quality
Step 1.5, cleaning the PDF
Step 2, Importing the PDF
Step 3, Layout
Step 4, Editing
EPUB
- Clean Finereader output
- Side by side comparison
- Fix Footnotes first
- Add blockquotes
- Keep an eye out for indentation on new pages
- Catch hyphenation (if plenty of time, do ([a-zA-Z])-([a-zA-Z]))
- Index Trick
Helpful character sites:
http://www.fileformat.info/info/unicode/char/search.htm
https://en.wikipedia.org/wiki/Macron
https://en.wikipedia.org/wiki/Grave_accent
https://en.wikipedia.org/wiki/Acute_accent
https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
https://en.wikipedia.org/wiki/Circumflex
https://en.wikipedia.org/wiki/Caron
https://en.wikipedia.org/wiki/Dagger_%28typography%29
https://en.wikipedia.org/wiki/Greek_letters
Finereader:
Settings: Document Language: English; French; German;
1. Go through the scan and erase artifacts
2. Analyze (Optional)
- Tweaking of areas (Optional)
3. Read
4. OCR
- Spellcheck round
- Ignore All
- If you run into a person's name that is "misspelled" I usually ignore all, so all further instances are ignored.
- Check all blue highlighted areas
- Pay attention to blue punctuation
- Pay attention to hyphens between years/numbers
- Quite often in numbers, a number 1 is accidentally OCRed as lowercase "l" or capital "I"
- Pay attention to hyphens along the right side margin (if the word carries over to the next line it will have a ¬ instead of a hyphen)
- Anti-
- Semi-
- Quasi-
- pseudo-
- Non-
- Pre-
- Post-
- Neo-
- self-
- well-being
- short-term
- long-term
- time-preference
- South-west
- If a hyphen is there accidentally, and it should be a ¬, this is impossible to fix in Finereader. It will be fixed at the EPUB step.
- Pay attention to missing apostrophes
- Böhm-Bawerks -> Böhm-Bawerk’s
- mans -> man’s
- Pay attention to superscripts (quite often they get messed up since they are smaller font)
- Pay attention to italics
- Sometimes characters/words before/after actual italics get dragged in by accident
- Sometimes italic letters become symbols. Example: <i>I</i> becoming /
- Do not care much about bold, unless they are actually used for emphasis
- I find it easiest to strip out all bold in the EPUB step, and then add it in if needed.
- This takes care of all the accidentally bolded words from the OCR
- Pay attention to parentheses
- Sometimes they are accidentally OCRed as braces { } or brackets [ ] instead
- Try to change all tables into HTML
- The current recommendation is to take "images" of very large/complex tables, but this has a few disadvantages:
- They are not searchable/copy/pastable
- The images are unreadable on smaller devices
- Does not scale with the text
- Images take up a lot more space than HTML
- A 100% HTML version of the table is better for the long-term storage of the books.
- In a future format, the HTML table can be converted easily.
- Resolutions are only going to go up.
- HTML tables are more accessible to those with disabilities
- Keep an eye out for "m" -> "rn"
- govemment -> government
- bom -> born
- tum -> turn
- retum -> return
- eam -> earn
- modem -> modern
- com -> corn
- Keep an eye out for the capital letter "I" and the number "1"
- I942 -> 1942
- All errors in the actual text of the book should be marked in the format:
- Page ###: paragraph #: Fixed Text
This is the copied and pasted <b><i>sentance</i></b> out of the book as it appears in the text with the erro highlighted.
- Paragraph 0 is used if the top paragraph has carried over from the previous page.
- Make the actual text fix in the OCR
- Try to pay attention to missing accents on characters
- Bohm-Bawerk -> Böhm-Bawerk
- Try to pay attention to missing quotation marks. Depending on the book, sometimes ending punctuation + quotation marks are missing or become slashes.
- . . / -> . . .” or . . .’
- Sometimes entire ellipses are just not OCRed
- Pay attention to italicized variables
- a, b, c, d -> <i>a</i>, <i>b</i>, <i>c</i>, <i>d</i>
- Quite often misspellings occur when a word carries over to the next page, or carries over to the next line (split by a hyphen).
- Pay attention to two single left/right quotation marks ‘‘ ’’ instead of their proper left/right double quotes “ ”
- Sometimes there are spaces between ‘ ‘ ’ ’
- ‘ ’
- Do not try to "modernize" the words
- If the book uses "co-ordination", "to-day", "per cent." or "coöperate", make sure you match the text exactly.
- Ellipsis: I personally avoid the Unicode ellipsis character … and instead stick with normal periods
- Normal periods work better/look more consistent when there are more than three
- Mixing ellipsis + normal periods looks very inconsistent.
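A few of the checks above (a "1" OCRed as "l" or "I", and two single quotes standing in for a double quote) can be flagged automatically on text exported from the OCR. The following is only a rough sketch in Python, not part of Finereader; the function name and patterns are assumptions for illustration, and it only reports suspects for a human to look at.

```python
import re

# Tokens made only of digits and the letters "l"/"I", containing at least one
# digit and at least one of those letters, e.g. I942 (assumed pattern).
DIGIT_LOOKALIKE = re.compile(r"\b(?=\w*\d)(?=\w*[lI])[\dlI]+\b")
# Two single curly quotes, optionally separated by one space.
DOUBLED_SINGLE = re.compile(r"‘ ?‘|’ ?’")

def flag_suspects(text):
    """Return (kind, matched text) pairs to review by hand; nothing is changed."""
    hits = [("digit", m.group()) for m in DIGIT_LOOKALIKE.finditer(text)]
    hits += [("quotes", m.group()) for m in DOUBLED_SINGLE.finditer(text)]
    return hits
```

The "m" vs "rn" cases are deliberately left out: "modem" and "com" are real words, so those still need a spellcheck or a manual read.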
After the OCR round:
- Replace ONE BY ONE.
- Search for ' and replace with a right single quote ’
- Keep a left single quote ‘ in your copy/paste, so you can paste that instead if needed.
- Quite often a prime is used. Just keep that as a dumb apostrophe. (Many ereaders cannot handle the actual unicode prime character).
- Search for " and replace with a right double quote ”
- Keep a left double quote “ in your copy/paste, so you can paste that instead if needed.
- Sometimes "two dumb apostrophes" '' are OCRed instead of the “smart quotes”
- Search for /
- These are not often used in books, and usually these are typos from the OCR
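Since these replacements are done ONE BY ONE, a script should not make the choice between left quote, right quote, and prime. A minimal sketch in Python (the function name and the context width are assumptions) that only lists every straight quote and slash with some surrounding text, so each can be checked by hand:

```python
import re

# Straight apostrophe, straight double quote, and slash.
STRAIGHT = re.compile(r"['\"/]")

def straight_quote_hits(text, width=15):
    """Yield (offset, character, context) for each match; the text is not modified."""
    for m in STRAIGHT.finditer(text):
        start = max(0, m.start() - width)
        yield m.start(), m.group(), text[start:m.end() + width]
```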
EPUB Steps: Good to know basic XHTML/CSS. Great to know Regular Expressions:
Regular Expressions
http://www.regular-expressions.info/
https://www.mobileread.com/forums/sho...d.php?t=167971
- Versioning system
- Use the date and/or version number
- Last,First.-.Title.of.Book[MM.DD.YYYY].epub
- Last,First.-.Title.of.Book[MM.DD.YYYY][v.1].epub
- Save OFTEN, and save a different version BEFORE you try to do any large search/replaces, and up the version number after doing a major "pass" on the EPUB
- For example, after you finish a spellcheck pass, bump the version from v.2 -> v.3.
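The naming pattern above can be generated rather than typed each time. A small sketch in Python; the function name and arguments are hypothetical, and only the filename layout comes from the outline:

```python
from datetime import date

def versioned_name(last, first, title, when, version=None):
    """Build Last,First.-.Title.of.Book[MM.DD.YYYY][v.N].epub (the version tag is optional)."""
    stem = f"{last},{first}.-.{'.'.join(title.split())}"
    stamp = f"[{when:%m.%d.%Y}]"
    tag = f"[v.{version}]" if version is not None else ""
    return f"{stem}{stamp}{tag}.epub"
```

For example, `versioned_name("Last", "First", "Title of Book", date(2014, 3, 9), 3)` gives `Last,First.-.Title.of.Book[03.09.2014][v.3].epub` (the date here is made up).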
Step 1: Split chapters
Step 2: Set up headers
Step 3: Set up blockquotes
- I try to combine the paragraphs here, match the formatting
- For example, if the author of the quote is right justified
- Combine paragraphs if needed
Step 4: Find/Order footnotes
- Doublecheck that all footnotes are there.
- Place them at the end of the chapter in the order they were found in the book.
Step 5: Combine paragraphs:
- Search: -</p>\s+<p>
- Replace: (BLANK)
- This will combine paragraphs that end with a hyphen (usually a word carried over from one page to the next)
- If it is an actual hyphen, then manually combine the paragraphs
- Search: ([^>”\?\!\.])</p>\s+<p>
- Replace: \1
- This will catch any paragraph that does not end in a punctuation mark
- Note: There is a space after the \1
- Note: This WILL catch paragraphs that end with a right parenthesis ) and colon :.
- These are sometimes valid, sometimes not.
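For testing the two Step 5 searches outside of Sigil, here is a rough Python equivalent. The function names are assumptions; the patterns are the ones listed above, and like them it only matches a bare <p> with no attributes.

```python
import re

def join_hyphen_breaks(xhtml):
    # Search: -</p>\s+<p>   Replace: (blank)
    return re.sub(r"-</p>\s+<p>", "", xhtml)

def join_unpunctuated_breaks(xhtml):
    # Search: ([^>”\?\!\.])</p>\s+<p>   Replace: \1 followed by one space
    return re.sub(r"([^>”\?\!\.])</p>\s+<p>", r"\1 ", xhtml)
```

Run the hyphen search first. A hyphen is not in the excluded set of the second pattern, so running that one first would join "govern-" and "ment" with a space in between. Blind replace-all also removes real hyphens at paragraph ends, so in practice these are still better stepped through one match at a time.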
Step ?: Pay attention to the upper left corner of the page. If you see no indentation and a capital letter, this most likely means it is a continuation of the paragraph from the page before. Search for that phrase in the EPUB, and make sure it is combined with the paragraph above.
Index
- Search: ([0-9]) ([A-Z])
- Replace: \1</p> <p class="index">\2
- This will look for any number followed by a space, and a capital letter
- Make sure you have selected "Current File" in Sigil, to make sure this only affects the Index.
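The same Index search/replace as a Python sketch, handy for trying it on a copy of the index text first (the function name is an assumption; the pattern and the "index" class are from the outline):

```python
import re

def split_index_entries(xhtml):
    # Search: ([0-9]) ([A-Z])   Replace: \1</p> <p class="index">\2
    return re.sub(r"([0-9]) ([A-Z])", r'\1</p> <p class="index">\2', xhtml)
```

It should only ever be fed the Index file; anywhere else in the book, something like "1942 The" would get split into a new paragraph too.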
Final Passes:
en dash
- Search: ([0-9])-([0-9])
- Replace: \1–\2
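And the en dash pass as a Python sketch (the function name is an assumption; the pattern is the one above):

```python
import re

def en_dash_ranges(text):
    # Search: ([0-9])-([0-9])   Replace: \1–\2 (en dash, U+2013)
    return re.sub(r"([0-9])-([0-9])", "\\1\u2013\\2", text)
```

Two things to watch for: in a chain like 1-2-3 only the first hyphen is replaced in one pass, because the middle digit is consumed by the first match, and anything that is digits-hyphen-digits but not a range (an ISBN, for instance) gets changed as well.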
Now THAT Tutorial might be worthy of stickiness.