Best way to copy text from a PDF or MOBI?

mb2u · 10-01-2013, 10:19 AM

I tried using Calibre but got a lot of errors with PDF and MOBI files. I thought it would be great to be able to just extract the text. I love using txt files on ereaders as they are so quick to manage and control (page turning takes a fraction of the time) and you seem to have a lot of font control as it doesn't seem to get in the way as much as some other formats.

So the quest is how to best extract that text from PDF and MOBI files?

willus · 10-01-2013, 07:38 PM

Quote:

Originally Posted by mb2u

I tried using Calibre but got a lot of errors with PDF and MOBI files. I thought it would be great to be able to just extract the text. I love using txt files on ereaders as they are so quick to manage and control (page turning takes a fraction of the time) and you seem to have a lot of font control as it doesn't seem to get in the way as much as some other formats.

So the quest is how to best extract that text from PDF and MOBI files?

This is somewhat dependent on the PDF file itself. Can you post any examples of the PDF files and the errors you got when extracting the text with Calibre?

mb2u · 10-01-2013, 08:37 PM

How about whole lines missing? Pretty hard to live with!

Tex2002ans · 10-02-2013, 02:53 AM

PDF is just about the worst format to convert FROM. PDF was built as a final output print format.

See this in Calibre's help files: http://manual.calibre-ebook.com/conv...#pdfconversion

In all cases, it is best to go back to the source document and work from there.

The best of the worst case scenario would be having a PDF created directly from the source (InDesign, Quark, etc.). You can tell when zooming in on the PDF, the text/graphs stay extremely crisp.

[Image: page5.png]

[Image: page5zoom.png]

These might be able to have text extracted from them ok (although still a lot of errors can/will be introduced). I believe Calibre uses xpdf in the backend to handle pulling text out of PDFs:

http://www.foolabs.com/xpdf/download.html

Someone on the forums probably has a lot more experience with this type. I never work from this type (we usually have the source files for these).
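
For a PDF of this kind (real text, not a scan), one low-effort option is to call xpdf's command-line extractor directly instead of running a full conversion. The sketch below is only an illustration: it assumes the `pdftotext` tool from the xpdf package linked above is installed and on the PATH, and the function names and file paths are made up for the example.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for xpdf's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout as best it can
        cmd.append("-layout")
    # -enc selects the output text encoding
    cmd += ["-enc", "UTF-8", pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext, failing early if the tool is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, keep_layout), check=True)


if __name__ == "__main__":
    # Only prints the command; call extract_text() to actually run it.
    print(build_pdftotext_cmd("book.pdf", "book.txt"))
```

Without `-layout` the tool outputs text in reading order, which may suit a plain TXT file better. Neither mode can recover lines that are missing from the PDF's own text layer.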

Quote:

Originally Posted by mb2u

How about whole lines missing? Pretty hard to live with!

Sounds to me like you have a scanned book.

This is the worst case scenario. The text backend in the PDF most likely was just fed through Tesseract, Finereader, the scanner's built-in OCR, etc... and spit out with no human intervention. This is the case, for example, on the conversions to different formats on archive.org. There will be a ton of errors.

Your best bet would be to start from scratch, using the latest version of the OCR programs (later versions most likely have more accurate OCR).

Here is a whole list of different OCR programs: https://en.wikipedia.org/wiki/Compar...ition_software

If you want higher quality output, you would also have to painstakingly go through and manually fix errors that you find. It is very laborious work.

I personally use ABBYY Finereader (this is a paid program, quite expensive, but well worth it if you do a lot of conversions): http://finereader.abbyy.com/

It is very accurate with a whole host of texts/languages, and allows you to easily side-by-side compare image to OCRed text (highlights unsure characters in light blue).

Here is a book in Finereader that I am currently working on converting:

[Image: FinereaderSidebySide.png]

Left = Original Document
Right = OCRed Text
Bottom = Magnified area in the original document

Even after export, you must still spend a lot of time fixing the output (combining paragraphs, removing accidental hyphens, adding formatting, splitting chapters, etc. etc.).

Overall, PDF -> anything is horrible.

Quote:

Originally Posted by willus

This is somewhat dependent on the PDF file itself. Can you post any examples of the PDF files and the errors you got when extracting the text with Calibre?

Indeed. Post samples.

If the book is in the public domain, try to get it from Project Gutenberg, where it goes through multiple human revisions.

Or the MR ebook sections:

Kindle: https://www.mobileread.com/forums/forumdisplay.php?f=128
EPUB: https://www.mobileread.com/forums/forumdisplay.php?f=130

mb2u · 10-02-2013, 12:28 PM

Tex2002ans, that was one marvelous and informative post! It will take me a while to go through it but much of what you said rings true already. PDF is often just a snapshot so OCR programs are going to make errors. Most people don't realize this but understanding it helps the user understand errors and why they occur.

Tex2002ans · 10-02-2013, 05:50 PM

The OCR step is part of the reason why there are so many errors in ebook versions. Usually the publisher decides to go the cheapest route, and pay a (crappy) conversion company to do the PDF -> text conversion.

The quality of these will be slightly better than Archive.org (pure OCR with no intervention), but usually the texts are still riddled with more typos than a reader would want.

Common mistakes:

0 -> O
m -> rn
Hyphenation problems
Missing punctuation
Missing quotation marks
Missing accents: à, ö, ê, ǒ, Å
Missing symbols: ¢, £
Missing ligatures: Æ, œ
Wrong foreign characters: α, ß, ε
Wrongfully combined/uncombined paragraphs
Wrongful bold/italics
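
A few of the character-level mistakes in this list can be flagged automatically before proofreading by hand. What follows is a minimal sketch, not a real spellchecker: the watch list of "rn read as m" words is a hypothetical starter set, the function name is invented, and every hit still needs a human decision (for instance, "modem" and "com" can be legitimate).

```python
import re

# A token made of digits plus I, l or O (e.g. "I942") is probably a misread number.
MIXED_NUMBER = re.compile(r"\b(?=\w*\d)(?=\w*[IlO])[\dIlO]{3,}\b")

# Hypothetical watch list: strings that often appear when "rn" is OCRed as "m".
RN_SUSPECTS = {"govemment", "modem", "bom", "tum", "eam", "com"}


def flag_suspects(text):
    """Return (line number, token, reason) tuples for tokens worth a second look."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MIXED_NUMBER.finditer(line):
            hits.append((lineno, match.group(), "digit/letter mix"))
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() in RN_SUSPECTS:
                hits.append((lineno, word, "possible rn read as m"))
    return hits


if __name__ == "__main__":
    sample = "Bom in I942, he joined the govemment.\nIn 1942 nothing changed."
    for hit in flag_suspects(sample):
        print(hit)
```

The script only reports candidates and never rewrites the text, since an automatic replace would also "fix" words that were correct.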

Part of the reason I got into this was the horrors I was running into when reading EPUBs. So I decided to take a stab at it. In the past year I have converted over 160 books from PDF -> EPUB.

Started in ~December 2011 by taking apart EPUBs and fixing typos as I read. Ramped up EPUB production in October 2012, and officially hired since April 2013... so now I just sit around all day doing PROPER conversions.

Quote:

Originally Posted by mb2u

Tex2002ans, that was one marvelous and informative post! It will take me a while to go through it but much of what you said rings true already.

Thank you for the compliments.

But as I said, it is quite time intensive. When I first got started it took about one to two weeks for me to get through one PDF -> completed EPUB.

Nowadays I have it whittled down to ~8-15 hours of work for your average book, some more, some less. (I convert non-fiction economics books for the most part).

Also, if you want to convert fiction, just keep in mind that you may potentially spoil the book for yourself while OCRing! After working on these for so long you learn how to "not read" while fixing, but you still risk potentially spoiling the story!

Luckily with non-fiction, if I "accidentally read" I actually learn stuff!

mb2u · 10-02-2013, 09:01 PM

All my prospective conversions are non-fiction. I know what you mean....it would destroy the flow of the story correcting errors in fiction. It would demolish it!

Tex2002ans · 10-02-2013, 09:44 PM

Quote:

Originally Posted by mb2u

All my prospective conversions are non-fiction.

Glorious!

If you are serious about OCRing and getting high quality work out there, I would not mind teaching everything I know. (I am free over AIM/YIM/MSN/Skype/email).

While you can OCR for your own personal benefit, the benefit does not outweigh the costs (I spend about 8-15 hours just to get a great EPUB, but just starting, you might be spending 40+ hours on a book).

In my opinion, you should try to tackle works that are in the public domain, or books that are released as CC (Creative Commons). After finishing your OCR, and making a clean EPUB, you can then post it on MobileRead/elsewhere so that the ENTIRE WORLD can benefit from your conversion (instead of just you).

Archive.org has scans of a massive amount of public domain books. Or if you are interested in some "training materials", I have a bunch of journal articles that need OCR (~13 pages each).

Tackling the easy/short stuff I believe would have built up my skills/familiarity with the tools way faster, and it definitely keeps the motivation up (makes you feel like you are actually ACCOMPLISHING SOMETHING).

When I first jumped in to OCR I decided it would be a good idea to tackle all the hard stuff first... I wish I didn't do that!

When I used to tackle these large books that were complex/way out of my league, I would spend an entire week on it and felt like I got nowhere!

Quote:

Originally Posted by mb2u

I know what you mean....it would destroy the flow of the story correcting errors in fiction. It would demolish it!

The few fiction books that I actually wanted to read (that were PDF only)... I pretty much just had to feed it through OCR, export, split chapters really fast, and run a few basic cleanup regex. Then I read through the book in Sigil and fixed the errors as I came across them while reading. Took forever, but nothing was spoiled.

willus · 10-02-2013, 10:59 PM

I'm thinking this thread is almost worthy of a sticky in the PDF forum. Lots of good info. Any admins watching?

Tex2002ans · 10-02-2013, 11:18 PM

Quote:

Originally Posted by willus

I'm thinking this thread is almost worthy of a sticky in the PDF forum. Lots of good info. Any admins watching?

There is no way this topic is worthy of stickiness!

Months ago I jotted down a rough outline for a "Tutorial" I have been putting together for all of the steps I take from PDF (Finereader) -> EPUB (Sigil). So many books to convert... I haven't gone back to flesh it out.

I should really take a break from conversion and get it done.

(It would most likely be like the Formula to PNG Tutorial I posted here: https://www.mobileread.com/forums/sho...d.php?t=223254) Step by step, lots of pictures, and lots of REAL LIFE examples (of actual books I worked on).

I wanted to have a "guinea pig" who is quite interested in OCR, so I can refine a lot of what I wrote (expand certain sections, remove others, simplify areas, etc. etc.).

I just took a look at the rough draft and I have 196 lines. Here is the rough draft outline:

Code:

Step 1, getting Finereader prepared

	English; French; German
		façade
	B&W
	Tools - Options - Scan/Open - Do not read and analyze acquired page images
	Tools - Options - Read - Thorough Reading
	Tools - Options - Save - PDF - Use Mixed Raster Content, Enable Tagged PDF, Best Quality
		FB2/EPUB - Title, Author, Best Quality
	

Step 1.5, cleaning the PDF

Step 2, Importing the PDF

Step 3, Layout

Step 4, Editing
	
	
EPUB
	- Clean Finereader output
	- Side by side comparison 
		- Fix Footnotes first
		- Add blockquotes
		- Keep an eye out for indentation on new pages
	- Catch hyphenation (if plenty of time, do ([a-zA-Z])-([a-zA-Z]))
		- Index Trick
		
Helpful character sites:
		
http://www.fileformat.info/info/unicode/char/search.htm
		
https://en.wikipedia.org/wiki/Macron
https://en.wikipedia.org/wiki/Grave_accent
https://en.wikipedia.org/wiki/Acute_accent
https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29
https://en.wikipedia.org/wiki/Circumflex
https://en.wikipedia.org/wiki/Caron
https://en.wikipedia.org/wiki/Dagger_%28typography%29
https://en.wikipedia.org/wiki/Greek_letters
	
Finereader:
	
Settings: Document Language: English; French; German;
	
1. Go through the scan and erase artifacts
2. Analyze (Optional)
	-  Tweaking of areas (Optional)
3. Read
4. OCR
	- Spellcheck round
		- Ignore All
			- If you run into a person's name that is "misspelled" I usually ignore all, so all further instances are ignored.
		- Check all blue highlighted areas
		- Pay attention to blue punctuation
	- Pay attention to hyphens between years/numbers
		- Quite often in numbers, a number 1 is accidentally OCRed as lowercase "l" or capital "I"
	- Pay attention to hyphens along the right side margin (if the word carries over to the next line it will have a ¬ instead of a hyphen)
		- Anti-
		- Semi-
		- Quasi-
		- pseudo-
		- Non-
		- Pre-
		- Post-
		- Neo-
		- self-
		- well-being
		- short-term
		- long-term
		- time-preference
		- South-west
		- If a hyphen is there accidentally, and it should be a ¬, this is impossible to fix in Finereader.  It will be fixed at the EPUB step.
	- Pay attention to missing apostrophes
		- Böhm-Bawerks -> Böhm-Bawerk’s
		- mans -> man’s
	- Pay attention to superscripts (quite often they get messed up since they are smaller font)
	- Pay attention to italics
		- Sometimes characters/words before/after actual italics get dragged in by accident
		- Sometimes italic letters become symbols.  Example: <i>I</i> becoming /
	- Do not care much about bold, unless they are actually used for emphasis
		- I find it easiest to strip out all bold in the EPUB step, and then add it in if needed.
		- This takes care of all the accidentally bolded words from the OCR
	- Pay attention to parentheses
		- Sometimes they are accidentally brackets instead { } or [ ]
	- Try to change all tables into HTML
		- The current recommendation is to take "images" of very large/complex tables, but this has a few disadvantages:
			- They are not searchable/copy/pastable
			- The images are unreadable on smaller devices
			- Does not scale with the text
			- Images take up a lot more space than HTML
			- A 100% HTML version of the table is better for long-term storage of the books.
				- In a future format, the HTML table can be converted easily.
				- Resolutions are only going to go up.
				- HTML tables are more accessible to those with disabilities
	- Keep an eye out for "m" -> "rn"
		- govemment -> government
		- bom -> born
		- tum -> turn
		- returm -> return
		- eam -> earn
		- modem -> modern
		- com -> corn
	- Keep an eye out for the capital letter "I" and the number "1"
		- I942 -> 1942
	- All errors in the actual text of the book should be marked in the format:
		- Page ###: paragraph #: Fixed Text
			This is the copied and pasted <b><i>sentance</i></b> out of the book as it appears in the text with the error highlighted.
		- Paragraph 0 is used if the top paragraph has carried over from the previous page.
		- Make the actual text fix in the OCR
	- Try to pay attention to missing accents on characters
		- Bohm-Bawerk -> Böhm-Bawerk
	- Try to pay attention for missing quotation marks.  Depending on the book, sometimes ending punctuation + quotation marks are missing or become slashes.
		- . . / -> . . .” or . . .’
		- Sometimes entire ellipsis just are not OCRed
	- Pay attention for italicized variables
		- a, b, c, d  -> <i>a</i>, <i>b</i>, <i>c</i>, <i>d</i>
	- Quite often misspellings occur when a word carries over to the next page, or carries over to the next line (split by a hyphen).
	- Pay attention to two single left/right quotation marks ‘‘ ’’ instead of their proper left/right double quotes “ ”
		- Sometimes there are spaces between ‘ ‘ ’ ’
		-  ‘ ’
	- Do not try to "modernize" the words
		- If the book uses "co-ordination", "to-day", "per cent." or "coöperate", make sure you match the text exactly.
	- ellipsis, I personally avoid the unicode ellipsis character … and instead stick with the normal periods
		- Normal periods work better/look more consistent when there are more than three
		- Mixing ellipsis + normal periods looks very inconsistent.
		

		
After the OCR round:
	- Replace ONE BY ONE.
		- Search for ' and replace with a right single quote ’
			- Keep a left single quote ‘ in your copy/paste, so you can paste that instead if needed.
			- Quite often a prime is used.  Just keep that as a dumb apostrophe. (Many ereaders cannot handle the actual unicode prime character).
		- Search for " and replace with a right double quote ”
			- Keep a left double quote “ in your copy/paste, so you can paste that instead if needed.
			- Sometimes "two dumb apostrophes" '' are OCRed instead of the “smart quotes”
		- Search for /
			- These are not often used in books, and usually these are typos from the OCR
	
EPUB Steps: Good to know basic XHTML/CSS.  Great to know Regular Expressions:

Regular Expressions
	http://www.regular-expressions.info/
	https://www.mobileread.com/forums/sho...d.php?t=167971

	- Versioning system
		- Use the date and/or version number
			- Last,First.-.Title.of.Book[MM.DD.YYYY].epub
			- Last,First.-.Title.of.Book[MM.DD.YYYY][v.1].epub
			- Save OFTEN, and save a different version BEFORE you try to do any large search/replaces, and up the version number after doing a major "pass" on the EPUB
				- For example, you just finish a spellcheck pass, bump the version from v.2 -> v.3.

Step 1: Split chapters
Step 2: Set up headers
Step 3: Set up blockquotes
	- I try to combine the paragraphs here, match the formatting
		- For example, if the author of the quote is right justified
	- Combine paragraphs if needed
Step 4: Find/Order footnotes
	- Doublecheck that all footnotes are there.
	- Place them at the end of the chapter in the order they were found in the book.
Step 5: Combine paragraphs:
	- Search: -</p>\s+<p>
	- Replace: (BLANK)
		- This will combine paragraphs that end with a hyphen (usually a word carried over from one page to the next)
		- If it is an actual hyphen, then manually combine the paragraphs
	- Search: ([^>”\?\!\.])</p>\s+<p>
	- Replace: \1 
		- This will catch any paragraph that does not end in a punctuation mark
		- Note: There is a space after the \1
		- Note: This WILL catch paragraphs that end with a right parenthesis ) and colon :.
			- These are sometimes valid, sometimes not.

Step ?: Pay attention to upper left corner of page, if you see no indentation, and a capital letter (this most likely means it is a continuation of the paragraph from the page before).  Search for that phrase in the EPUB, and make sure it is combined with the above paragraph.
	
	
Index
	- Search: ([0-9]) ([A-Z])
	- Replace: \1</p> <p class="index">\2
		- This will look for any number followed by a space, and a capital letter
		- Make sure you have selected "Current File" in Sigil, to make sure this only affects the Index.
		

		
Final Passes:
	en dash
	- Search: ([0-9])-([0-9])
	- Replace: \1–\2

Now THAT Tutorial might be worthy of stickiness.
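
The search/replace passes from the outline above (the two Step 5 searches and the en dash pass) can also be tried outside Sigil. This is a sketch under assumptions: it ports those three patterns to Python's `re` module, the function names are invented, and it assumes bare `<p>` tags with no attributes, which real exported XHTML may not have.

```python
import re


def join_hyphenated_breaks(xhtml):
    # Step 5, first search: a paragraph ending in "-" is assumed to be a word
    # carried over a page break, so the hyphen and the break are both removed.
    return re.sub(r"-</p>\s+<p>", "", xhtml)


def join_unfinished_paragraphs(xhtml):
    # Step 5, second search: a paragraph that does not end in closing
    # punctuation is joined to the next one with a single space.
    return re.sub(r"([^>”?!.])</p>\s+<p>", r"\1 ", xhtml)


def en_dash_number_ranges(xhtml):
    # Final pass: a hyphen between two digits becomes an en dash.
    return re.sub(r"([0-9])-([0-9])", r"\1–\2", xhtml)


if __name__ == "__main__":
    sample = (
        "<p>The govern-</p>\n"
        "<p>ment said that</p>\n"
        "<p>prices rose in 1942-1945.</p>"
    )
    # Order matters: the hyphen pass must run before the general join.
    result = en_dash_number_ranges(
        join_unfinished_paragraphs(join_hyphenated_breaks(sample))
    )
    print(result)
```

Running the hyphen pass first matters, because the second pattern would otherwise treat the trailing hyphen as ordinary text and produce "govern- ment". As the outline warns, genuine hyphens and paragraphs ending in ) or : are still caught, so it is safer to review the matches one by one than to replace everything blindly.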

mb2u · 10-03-2013, 09:00 AM

I have many books I'd like to convert so it can't take much time. What I'm looking for is fewer errors. These are reference materials I want on my ereader and there are many, many books. It can't take much time. Copying and pasting from a PDF to TXT worked OK. If I could reduce the errors that would be golden. I was using Foxit to display the PDF. Maybe other programs display with fewer errors? That's the only thing I can think of.

DaleDe · 10-03-2013, 03:52 PM

Generally it is better to place this kind of stuff in our wiki and then reference it in a discussion with a back reference in the wiki to the discussion forum. Not every solution requires a hammer.

Dale

SaintsRaw · 12-31-2013, 04:05 AM

I think it's impossible to extract text, unless you crack the PDF (rather difficult) or convert it to text with something like http://pdftoword.pro/
After that, text extraction is obviously no problem ^^

crich70 · 01-10-2014, 12:16 PM

There is also this program:
click It turns pdf into html, epub and mobi format.

mb2u · 01-10-2014, 01:09 PM

How about errors with either of these programs? People rarely talk about errors but they are often present. The trick is to be able to look around them when reading so they don't distract you.
