#1 |
Enthusiast
Posts: 25
Karma: 6052
Join Date: Jul 2013
Device: Kobo Touch and Mini
|
Best way to copy text from a PDF or MOBI?
I tried using Calibre but got a lot of errors with PDF and MOBI files. I thought it would be great to be able to just extract the text. I love using TXT files on ereaders because they are so quick to manage and control (page turning takes a fraction of the time), and you seem to get a lot of font control because the format doesn't get in the way as much as some others do.
So the quest is how best to extract that text from PDF and MOBI files?
#2 | |
Fuzzball, the purple cat
Posts: 1,299
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
|
|
#3 |
Enthusiast
Posts: 25
Karma: 6052
Join Date: Jul 2013
Device: Kobo Touch and Mini
|
How about whole lines missing? Pretty hard to live with!
|
#4 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
PDF is just about the worst format to convert FROM. PDF was built as a final-output print format.
See this in Calibre's help files: http://manual.calibre-ebook.com/conv...#pdfconversion
In all cases, it is best to go back to the source document and work from there.
The best of the bad scenarios is a PDF created directly from the source (InDesign, Quark, etc.). You can tell because, when you zoom in on the PDF, the text/graphs stay extremely crisp. You might be able to extract text from these OK (although a lot of errors can/will still be introduced). I believe Calibre uses xpdf in the backend to handle pulling text out of PDFs: http://www.foolabs.com/xpdf/download.html
Someone on the forums probably has a lot more experience with this type. I never work from it (we usually have the source files for these).
Sounds to me like you have a scanned book. This is the worst-case scenario. The text layer in the PDF was most likely just fed through Tesseract, Finereader, the scanner's built-in OCR, etc. and spit out with no human intervention. This is the case, for example, with the conversions to different formats on archive.org. There will be a ton of errors.
Your best bet would be to start from scratch, using the latest version of the OCR programs (later versions most likely have more accurate OCR). Here is a whole list of different OCR programs: https://en.wikipedia.org/wiki/Compar...ition_software
If you want higher-quality output, you would also have to painstakingly go through and manually fix the errors that you find. It is very laborious work.
I personally use ABBYY Finereader (this is a paid program, quite expensive, but well worth it if you do a lot of conversions): http://finereader.abbyy.com/
It is very accurate with a whole host of texts/languages, and lets you easily compare the image to the OCRed text side by side (it highlights unsure characters in light blue).
Here is a book in Finereader that I am currently working on converting:
Left = Original Document
Right = OCRed Text
Bottom = Magnified area in the original document
Even after export, you must still spend a lot of time fixing the output (combining paragraphs, removing accidental hyphens, adding formatting, splitting chapters, etc. etc.). Overall, PDF -> anything is horrible.
Quote:
If the book is in the public domain, try to get it from Project Gutenberg, where it goes through multiple human revisions. Or the MR ebook sections:
Kindle: https://www.mobileread.com/forums/forumdisplay.php?f=128
EPUB: https://www.mobileread.com/forums/forumdisplay.php?f=130
Last edited by Tex2002ans; 10-02-2013 at 03:34 AM.
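For a PDF that was created directly from the source, a command-line extractor is usually the quickest way to get plain text. The following is a minimal sketch, not a description of what Calibre does internally: it assumes the `pdftotext` tool (shipped with xpdf, and also with Poppler) is installed and on the PATH. The function names `build_pdftotext_command` and `extract_text` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for the pdftotext command-line tool."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical page layout (columns, tables).
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Write book.txt next to book.pdf and return the path of the text file."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, keep_layout),
        check=True,  # raise if pdftotext reports a failure
    )
    return txt_path
```

On a scanned book this only dumps whatever OCR text layer is already embedded in the PDF, errors included, so it does not replace a fresh OCR pass.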
|
#5 |
Enthusiast
Posts: 25
Karma: 6052
Join Date: Jul 2013
Device: Kobo Touch and Mini
|
Tex2002ans, that was one marvelous and informative post! It will take me a while to go through it but much of what you said rings true already. PDF is often just a snapshot so OCR programs are going to make errors. Most people don't realize this but understanding it helps the user understand errors and why they occur.
|
#6 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
The OCR step is part of the reason why there are so many errors in ebook versions. Usually the publisher decides to go the cheapest route and pays a (crappy) conversion company to do the PDF -> text conversion.
The quality of these will be slightly better than Archive.org (pure OCR with no intervention), but usually the texts are still riddled with more typos than a reader would want. Common mistakes:
Part of the reason I got into this was the horrors I was running into when reading EPUBs. So I decided to take a stab at it. In the past year I have converted over 160 books from PDF -> EPUB. I started in ~December 2011 by taking apart EPUBs and fixing typos as I read, ramped up EPUB production in October 2012, and have been officially hired since April 2013... so now I just sit around all day doing PROPER conversions.
Quote:
But as I said, it is quite time intensive. When I first got started it took me about one to two weeks to get from one PDF -> completed EPUB. Nowadays I have it whittled down to ~8-15 hours of work for your average book, some more, some less. (I convert non-fiction economics books for the most part.)
Also, if you want to convert fiction, just keep in mind that you may spoil the book for yourself while OCRing! After working on these for so long you learn how to "not read" while fixing, but you still risk spoiling the story! Luckily with non-fiction, if I "accidentally read" I actually learn stuff!
Last edited by Tex2002ans; 10-02-2013 at 05:57 PM.
|
#7 |
Enthusiast
Posts: 25
Karma: 6052
Join Date: Jul 2013
Device: Kobo Touch and Mini
|
All my prospective conversions are non-fiction. I know what you mean... correcting errors in fiction would destroy the flow of the story. It would demolish it!
|
#8 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Glorious!
If you are serious about OCRing and getting high-quality work out there, I would not mind teaching everything I know. (I am free over AIM/YIM/MSN/Skype/email.)
While you can OCR for your own personal benefit, the benefit does not outweigh the costs (I spend about 8-15 hours just to get a great EPUB, but when just starting out, you might spend 40+ hours on a book).
In my opinion, you should try to tackle works that are in the public domain, or books that are released as CC (Creative Commons). After finishing your OCR and making a clean EPUB, you can then post it on MobileRead/elsewhere so that the ENTIRE WORLD can benefit from your conversion (instead of just you).
Archive.org has scans of a massive number of public domain books. Or if you are interested in some "training materials", I have a bunch of journal articles that need OCR (~13 pages each).
Tackling the easy/short stuff first, I believe, would have built up my skills/familiarity with the tools way faster, and it definitely keeps the motivation up (makes you feel like you are actually ACCOMPLISHING SOMETHING). When I first jumped into OCR I decided it would be a good idea to tackle all the hard stuff first... I wish I hadn't done that!
|
#9 |
Fuzzball, the purple cat
Posts: 1,299
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
I'm thinking this thread is almost worthy of a sticky in the PDF forum. Lots of good info. Any admins watching?
|
#10 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Months ago I jotted down a rough outline for a "Tutorial" I have been putting together, covering all of the steps I take from PDF (Finereader) -> EPUB (Sigil). So many books to convert... I haven't gone back to flesh it out. I should really take a break from conversion and get it done.
(It would most likely be like the Formula to PNG Tutorial I posted here: https://www.mobileread.com/forums/sho...d.php?t=223254) Step by step, lots of pictures, and lots of REAL LIFE examples (of actual books I worked on).
I wanted to have a "guinea pig" who is quite interested in OCR, so I can refine a lot of what I wrote (expand certain sections, remove others, simplify areas, etc. etc.).
I just took a look at the rough draft and I have 196 lines. Here is the rough draft outline:
Code:
Step 1, getting Finereader prepared
    English; French; German
    façade
    B&W
    Tools - Options - Scan/Open - Do not read and analyze acquired page images
    Tools - Options - Read - Thorough Reading
    Tools - Options - Save
        - PDF - Use Mixed Raster Content, Enable Tagged PDF, Best Quality
        - FB2/EPUB - Title, Author, Best Quality
Step 1.5, cleaning the PDF
Step 2, Importing the PDF
Step 3, Layout
Step 4, Editing EPUB
    - Clean Finereader output
    - Side by side comparison
    - Fix Footnotes first
    - Add blockquotes
    - Keep an eye out for indentation on new pages
    - Catch hyphenation (if plenty of time, do ([a-zA-Z])-([a-zA-Z]))
    - Index Trick

Helpful character sites:
    http://www.fileformat.info/info/unicode/char/search.htm
    https://en.wikipedia.org/wiki/Macron
    https://en.wikipedia.org/wiki/Grave_accent
    https://en.wikipedia.org/wiki/Acute_accent
    https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
    https://en.wikipedia.org/wiki/Circumflex
    https://en.wikipedia.org/wiki/Caron
    https://en.wikipedia.org/wiki/Dagger_%28typography%29
    https://en.wikipedia.org/wiki/Greek_letters

Finereader:
Settings:
    Document Language: English; French; German;
1. Go through the scan and erase artifacts
2. Analyze (Optional)
    - Tweaking of areas (Optional)
3. Read
4. OCR
    - Spellcheck round
        - Ignore All
            - If you run into a person's name that is "misspelled" I usually ignore all, so all further instances are ignored.
    - Check all blue highlighted areas
    - Pay attention to blue punctuation
    - Pay attention to hyphens between years/numbers
        - Quite often in numbers, a number 1 is accidentally OCRed as lowercase "l" or capital "I"
    - Pay attention to hyphens along the right side margin (if the word carries over to the next line it will have a ¬ instead of a hyphen)
        - Anti-
        - Semi-
        - Quasi-
        - pseudo-
        - Non-
        - Pre-
        - Post-
        - Neo-
        - self-
        - well-being
        - short-term
        - long-term
        - time-preference
        - South-west
        - If a hyphen is there accidentally, and it should be a ¬, this is impossible to fix in Finereader. It will be fixed at the EPUB step.
    - Pay attention to missing apostrophes
        - Böhm-Bawerks -> Böhm-Bawerk’s
        - mans -> man’s
    - Pay attention to superscripts (quite often they get messed up since they are in a smaller font)
    - Pay attention to italics
        - Sometimes characters/words before/after actual italics get dragged in by accident
        - Sometimes italic letters become symbols. Example: <i>I</i> becoming /
    - Do not care much about bold, unless it is actually used for emphasis
        - I find it easiest to strip out all bold in the EPUB step, and then add it in if needed.
        - This takes care of all the accidentally bolded words from the OCR
    - Pay attention to parentheses
        - Sometimes they are accidentally brackets instead: { } or [ ]
    - Try to change all tables into HTML
        - The current recommendation is to take "images" of very large/complex tables, but this has a few disadvantages:
            - They are not searchable/copy/pastable
            - The images are unreadable on smaller devices
            - Does not scale with the text
            - Images take up a lot more space than HTML
        - A 100% HTML version of the table is better for the long-term storage of the books.
            - In a future format, the HTML table can be converted easily.
            - Resolutions are only going to go up.
        - HTML tables are more accessible to those with disabilities
    - Keep an eye out for "m" -> "rn"
        - govemment -> government
        - bom -> born
        - tum -> turn
        - returm -> return
        - eam -> earn
        - modem -> modern
        - com -> corn
    - Keep an eye out for the capital letter "I" and the number "1"
        - I942 -> 1942
    - All errors in the actual text of the book should be marked in the format:
        - Page ###: paragraph #: Fixed Text
          This is the copied and pasted <b><i>sentance</i></b> out of the book as it appears in the text with the error highlighted.
        - Paragraph 0 is used if the top paragraph has carried over from the previous page.
        - Make the actual text fix in the OCR
    - Try to pay attention to missing accents on characters
        - Bohm-Bawerk -> Böhm-Bawerk
    - Try to pay attention to missing quotation marks. Depending on the book, sometimes ending punctuation + quotation marks are missing or become slashes.
        - . . / -> . . .” or . . .’
        - Sometimes entire ellipses are just not OCRed
    - Pay attention to italicized variables
        - a, b, c, d -> <i>a</i>, <i>b</i>, <i>c</i>, <i>d</i>
    - Quite often misspellings occur when a word carries over to the next page, or carries over to the next line (split by a hyphen).
    - Pay attention to two single left/right quotation marks ‘‘ ’’ instead of their proper left/right double quotes “ ”
        - Sometimes there are spaces between ‘ ‘ ’ ’
        - ‘ ’
    - Do not try to "modernize" the words
        - If the book uses "co-ordination", "to-day", "per cent." or "coöperate", make sure you match the text exactly.
    - Ellipsis: I personally avoid the unicode ellipsis character … and instead stick with normal periods
        - Normal periods work better/look more consistent when there are more than three
        - Mixing ellipsis + normal periods looks very inconsistent.

After the OCR round:
    - Replace ONE BY ONE.
    - Search for ' and replace with a right single quote ’
        - Keep a left single quote ‘ in your copy/paste, so you can paste that instead if needed.
        - Quite often a prime is used. Just keep that as a dumb apostrophe. (Many ereaders cannot handle the actual unicode prime character.)
    - Search for " and replace with a right double quote ”
        - Keep a left double quote “ in your copy/paste, so you can paste that instead if needed.
        - Sometimes "two dumb apostrophes" '' are OCRed instead of the “smart quotes”
    - Search for /
        - These are not often used in books, and usually these are typos from the OCR

EPUB Steps:
Good to know basic XHTML/CSS.
Great to know Regular Expressions:
    Regular Expressions
    http://www.regular-expressions.info/
    https://www.mobileread.com/forums/sho...d.php?t=167971
- Versioning system
    - Use the date and/or version number
        - Last,First.-.Title.of.Book[MM.DD.YYYY].epub
        - Last,First.-.Title.of.Book[MM.DD.YYYY][v.1].epub
    - Save OFTEN, and save a different version BEFORE you try to do any large search/replaces, and up the version number after doing a major "pass" on the EPUB
        - For example, when you have just finished a spellcheck pass, bump the version from v.2 -> v.3.
Step 1: Split chapters
Step 2: Set up headers
Step 3: Set up blockquotes
    - I try to combine the paragraphs here, match the formatting
        - For example, if the author of the quote is right justified
    - Combine paragraphs if needed
Step 4: Find/Order footnotes
    - Doublecheck that all footnotes are there.
    - Place them at the end of the chapter in the order they were found in the book.
Step 5: Combine paragraphs:
    - Search: -</p>\s+<p>
    - Replace: (BLANK)
        - This will combine paragraphs that end with a hyphen (usually a word carried over from one page to the next)
        - If it is an actual hyphen, then manually combine the paragraphs
    - Search: ([^>”\?\!\.])</p>\s+<p>
    - Replace: \1
        - This will catch any paragraph that does not end in a punctuation mark
        - Note: There is a space after the \1
        - Note: This WILL catch paragraphs that end with a right parenthesis ) and colon :.
            - These are sometimes valid, sometimes not.
Step ?: Pay attention to the upper left corner of the page. If you see no indentation and a capital letter, this most likely means it is a continuation of the paragraph from the page before. Search for that phrase in the EPUB, and make sure it is combined with the above paragraph.
Index
    - Search: ([0-9]) ([A-Z])
    - Replace: \1</p> <p class="index">\2
        - This will look for any number followed by a space, and a capital letter
        - Make sure you have selected "Current File" in Sigil, to make sure this only affects the Index.
Final Passes:
en dash
    - Search: ([0-9])-([0-9])
    - Replace: \1–\2
Last edited by Tex2002ans; 10-02-2013 at 11:22 PM.
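The Step 5 and en dash search/replace pairs from the outline can also be run outside Sigil. What follows is a rough sketch in Python using the same patterns; the function names `merge_split_paragraphs` and `number_ranges_to_en_dash` are invented for illustration, and it assumes the chapter XHTML has already been read into a string. It applies every match at once, so it skips the "replace ONE BY ONE" review the outline recommends.

```python
import re

# Paragraph that ends in a hyphen: a word carried over a page or line break.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

# Paragraph that does not end in >, a right double quote, ?, ! or a period.
UNFINISHED_BREAK = re.compile(r"([^>”?!.])</p>\s+<p>")

# A hyphen sitting directly between two digits, e.g. 1914-1918.
NUMBER_RANGE = re.compile(r"([0-9])-([0-9])")


def merge_split_paragraphs(xhtml):
    """Join paragraphs that were split in the middle of a sentence."""
    # The hyphen case must run first; otherwise the second pattern would
    # also match it and leave "govern- ment" behind.
    xhtml = HYPHEN_BREAK.sub("", xhtml)
    # Keep the last character and add the space the outline mentions.
    return UNFINISHED_BREAK.sub(r"\1 ", xhtml)


def number_ranges_to_en_dash(xhtml):
    """Replace a hyphen between two digits with an en dash."""
    return NUMBER_RANGE.sub(r"\1–\2", xhtml)


if __name__ == "__main__":
    sample = "<p>The govern-</p>\n<p>ment acted in 1914-1918</p>\n<p>and after.</p>"
    print(number_ranges_to_en_dash(merge_split_paragraphs(sample)))
```

As the outline warns, the second pattern also joins paragraphs that legitimately end in a right parenthesis or a colon, so the result still needs a read-through.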
|
#11 |
Enthusiast
Posts: 25
Karma: 6052
Join Date: Jul 2013
Device: Kobo Touch and Mini
|
I have many, many books I'd like to convert, so each one can't take much time. These are reference materials I want on my ereader, and what I'm looking for is fewer errors. Copying and pasting from a PDF to TXT worked OK. If I could reduce the errors, that would be golden. I was using Foxit to display the PDF. Maybe other programs display with fewer errors? That's the only thing I can think of.
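When there is no time for a full proofreading pass, one cheap option is to flag only the OCR confusions listed earlier in the thread ("rn" read as "m", and a capital "I" or lowercase "l" read in place of the digit 1). The sketch below is only an illustration under that assumption: the word list is copied from the outline, the name `flag_suspects` is made up, and it reports candidates rather than changing anything, since "modem" and "com" can be perfectly valid words.

```python
import re

# "rn" misread as "m": examples taken from the outline earlier in the thread.
KNOWN_RN_MISREADS = {
    "govemment": "government",
    "bom": "born",
    "tum": "turn",
    "returm": "return",
    "eam": "earn",
    "modem": "modern",
    "com": "corn",
}

# A capital I or lowercase l followed by at least two digits, e.g. I942.
DIGIT_LOOKALIKE = re.compile(r"\b[Il]\d{2,}\b")
WORD = re.compile(r"[A-Za-z]+")


def flag_suspects(text):
    """Return (line number, token as found, suggested fix) for each suspect."""
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in WORD.finditer(line):
            suggestion = KNOWN_RN_MISREADS.get(match.group().lower())
            if suggestion is not None:
                suspects.append((line_number, match.group(), suggestion))
        for match in DIGIT_LOOKALIKE.finditer(line):
            token = match.group()
            suspects.append((line_number, token, "1" + token[1:]))
    return suspects


if __name__ == "__main__":
    for suspect in flag_suspects("The govemment fell in I942.\nA modem army."):
        print(suspect)
```

The list is short on purpose; extending it with confusions that show up in a particular batch of books is probably more useful than trying to anticipate every error.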
|
#12 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Generally it is better to place this kind of stuff in our wiki and then reference it in a discussion with a back reference in the wiki to the discussion forum. Not every solution requires a hammer.
Dale |
#13 |
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2013
Device: Kindle
|
I think it's impossible to extract text unless you crack the PDF (rather difficult) or convert it to text with something like http://pdftoword.pro/
After that, text extraction is obviously no problem ^^
#15 |
Enthusiast
Posts: 25
Karma: 6052
Join Date: Jul 2013
Device: Kobo Touch and Mini
|
How about errors with either of these programs? People rarely talk about errors but they are often present. The trick is to be able to look around them when reading so they don't distract you.
|
Tags |
copy text, mobi, pdf |