Old 02-06-2013, 07:57 PM   #1
Txomin
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
Best format to extract text from: speed vs accuracy

Good folk.

I have been given the task of extracting the text from thousands of ebooks. Some of these are available in several formats (EPUB, LIT, MOBI, PDF, etc.).

For the purpose of extracting the text (as Unicode):

1. Which source format is the best to extract from?
2. Which source format would be fastest to extract from?

Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy.

Does anyone have any experience with this?

Thank you all in advance.
Old 02-06-2013, 08:31 PM   #2
theducks
Well trained by Cats
Posts: 29,689
Karma: 54369090
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Your experimental findings are pretty accurate.

I know you can open (explode) an EPUB pretty easily: just use Calibre's built-in Tweak tool. HTML is also accessible.

If you have Acrobat, the PDFs might not be so bad.
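
To illustrate why EPUB is easy to get at: an EPUB is a ZIP archive whose chapters are (X)HTML files, so the text can be pulled out with nothing but the Python standard library. The following is only a minimal sketch, not how Calibre or Tweak does it. The names `epub_text` and `_TextCollector` are made up for this example. It assumes UTF-8 content and reads files in archive order, which is not necessarily the reading order defined by the book's OPF spine.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_text(source):
    """Return the text of every (X)HTML file in an EPUB, in archive order.

    `source` is a file path or a binary file-like object.
    Assumption: the content files are UTF-8 encoded.
    """
    chunks = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append(" ".join(parser.parts))
    return "\n".join(chunks)
```

Calling `epub_text("some_book.epub")` returns one line of text per content file. Text nodes are joined with single spaces, so paragraph breaks inside a chapter are lost; that is a simplification of this sketch, not a property of the format.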
Old 02-06-2013, 08:38 PM   #3
Txomin
Thank you for the quick reply and please forgive my ignorance.

What is Tweak? I've been playing with ebook-convert.
Old 02-06-2013, 09:45 PM   #4
theducks
Highlight the book entry in the Calibre library and press 'T'.
This tool unpacks the book's pieces so you can make (small) edits, then puts them back together when you are done, maintaining the original structure. For bigger edits (adding or removing chapters, ...), Sigil is easier for the novice to intermediate user.
Old 02-06-2013, 10:24 PM   #5
BetterRed
null operator (he/him)
Posts: 20,457
Karma: 26645808
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Converting to what format? TXT (i.e. a file with the extension .txt), I assume.

AFAIK, what you see in the EPUB viewer is what you'll get in the TXT output file, but without any formatting, styling or images. The important settings are the TXT Output settings.

Given that EPUB is close to the XHTML that Calibre works with internally during conversion, I would anticipate it might be the faster source format.
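
For reference, a single conversion with `ebook-convert` takes the input file, the output file, and then options. Here is a rough Python sketch that builds and runs such a command. The helper names `txt_convert_command` and `convert_to_txt` are hypothetical, and I am assuming `ebook-convert` is on your PATH. `--txt-output-encoding` is one of the TXT Output settings; running `ebook-convert book.epub book.txt -h` should list the rest for your version.

```python
import subprocess
from pathlib import Path


def txt_convert_command(source, out_dir, encoding="utf-8"):
    """Build the ebook-convert argument list for converting one book to TXT."""
    source = Path(source)
    target = Path(out_dir) / (source.stem + ".txt")
    return [
        "ebook-convert",  # assumed to be on PATH
        str(source),
        str(target),
        "--txt-output-encoding", encoding,
    ]


def convert_to_txt(source, out_dir):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(txt_convert_command(source, out_dir), check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with spaces in book titles.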

If you don't have access to PDF editing software like Acrobat, Nitro etc. to do the conversions, then you could try:
  1. Use the MobiCreator program to convert the PDFs to PRC.
  2. Use Calibre to convert the PRC to EPUB.
  3. Use Sigil to tidy up the EPUB. This is needed on most books; the Quality Check plugin could also help in this regard.
  4. Then use Calibre to convert the Sigil'd/QC'd EPUB to TXT.
But if you have a lot of PDFs, you would probably save a lot of time with Acrobat (you'll need the full product), since you could then create the TXT files directly. I can't recall whether Acrobat does bulk operations. Another possibility is the Nitro PDF tools; some people here have them.

I suggest you steer clear of the "Free PDF to ..." converters unless you get a specific recommendation - as in the case of MobiCreator.

BR

Last edited by BetterRed; 02-06-2013 at 10:29 PM.
Old 02-06-2013, 10:41 PM   #6
JSWolf
Resident Curmudgeon
Posts: 73,660
Karma: 127838196
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Calibre can convert to TXT. Just add your eBooks (not the PDFs) to Calibre and batch-convert them to TXT. You can leave it running overnight, so you don't have to care which format is faster; it will just work through them while you are away from the computer. I don't know the maximum number of books you can queue at one time, but Calibre can do it.
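
If some titles exist in several formats, you may want to feed only one file per title into the batch. A small sketch of that selection step in Python follows. The function `pick_sources` and the preference order EPUB, then MOBI, then LIT are my assumptions (PDF is skipped, as suggested above), and it treats files in the same folder with the same name but different extensions as the same title. Adjust the order to match your own tests.

```python
from pathlib import Path

# Assumed preference order, best first. PDF is deliberately not listed.
PREFERRED = (".epub", ".mobi", ".lit")


def pick_sources(paths):
    """Keep one file per title, choosing the most preferred format available."""
    best = {}
    for path in map(Path, paths):
        ext = path.suffix.lower()
        if ext not in PREFERRED:
            continue  # skips PDF and anything else not listed
        key = path.with_suffix("")  # folder plus title, without extension
        current = best.get(key)
        if current is None or PREFERRED.index(ext) < PREFERRED.index(current.suffix.lower()):
            best[key] = path
    return sorted(best.values())
```

The returned list can then be added to Calibre, or converted one file at a time from a script.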
Old 02-07-2013, 12:54 AM   #7
Txomin
Thank you all for the answers and leads.
All times are GMT -4.

