Old 12-27-2013, 02:38 PM   #1
diazlaz
Extract PDF text and store in custom column

Hi All,

I am interested in developing a plugin that will allow me to extract the text of a PDF file and store it in a custom text or comment column. I have around 8,000 PDFs (already OCRed; a small amount of text, usually one page or less, from scanned images) imported into calibre from a document management system. I would like to be able to search and tag documents based on the text extracted and stored in this custom column.

I was thinking of the following:

1. enumerate the selected files (select PDF types)
2. open/extract the PDF text content from PDF file
3. store in column
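
The three steps above can be sketched in plain Python as follows. This is only an outline under stated assumptions: `collect_pdf_text`, `store_in_column`, and the `extract_text` / `set_value` callables are hypothetical names introduced for illustration and are not part of calibre's plugin API. The actual text extractor and the code that writes the custom column would have to be supplied by the plugin.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def collect_pdf_text(
    books: Iterable[Tuple[int, str]],
    extract_text: Callable[[str], str],
) -> Dict[int, str]:
    """Return {book_id: extracted text} for every PDF among the selected books.

    `books` is assumed to be (book_id, file_path) pairs for the selection;
    `extract_text` is a caller-supplied function (hypothetical) that returns
    the text of one PDF file.
    """
    results: Dict[int, str] = {}
    for book_id, path in books:
        # Step 1: keep only PDF files from the selection.
        if Path(path).suffix.lower() != ".pdf":
            continue
        # Step 2: extract the text with the supplied extractor.
        text = extract_text(path)
        # Collapse runs of whitespace so the stored value stays compact.
        results[book_id] = " ".join(text.split())
    return results


def store_in_column(
    results: Dict[int, str],
    set_value: Callable[[int, str], None],
) -> None:
    """Step 3: pass each value to whatever writes the custom column."""
    for book_id, text in results.items():
        set_value(book_id, text)
```

Keeping the extractor and the column writer as parameters means the loop can be tried out on a few files with stand-in functions before it is wired into calibre.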

Is there a Python library or API already included that will allow me to easily extract the text from a PDF file? I'm new to Python and will look to learn some of it this weekend; I'm fairly comfortable with Perl.

Many thanks!
Laz.
Old 12-27-2013, 03:52 PM   #2
BetterRed
My experience is that searches will be slower if the Comments column or similar long text columns are included in 'the places to search' - AFAIK the contents aren't indexed.

But that's no reason not to go ahead with a plugin.

BR
Old 12-30-2013, 10:00 PM   #3
kovidgoyal
creator of calibre
Simply use calibre to bulk convert your PDF files to txt. That will extract the text from them (assuming they have actual extractable text, which is not always the case with PDF files).
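
For anyone who wants to script the conversion rather than use the GUI, here is a minimal sketch. It assumes that calibre's `ebook-convert` command-line tool is installed and on the PATH, and that its TXT output is UTF-8. The helper names `convert_command` and `pdf_to_text` are made up for this example. If a PDF has no extractable text, the resulting file may simply be empty.

```python
import subprocess
from pathlib import Path
from typing import List


def convert_command(pdf_path: str, out_dir: str) -> List[str]:
    """Build the ebook-convert command line for one PDF (hypothetical helper)."""
    pdf = Path(pdf_path)
    # The output format is chosen by the .txt extension of the output file.
    txt = Path(out_dir) / (pdf.stem + ".txt")
    return ["ebook-convert", str(pdf), str(txt)]


def pdf_to_text(pdf_path: str, out_dir: str) -> str:
    """Convert one PDF to TXT with ebook-convert and return the text."""
    cmd = convert_command(pdf_path, out_dir)
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(cmd, check=True)
    # Assumption: the TXT output is UTF-8; undecodable bytes are replaced.
    return Path(cmd[2]).read_text(encoding="utf-8", errors="replace")
```

Calling `pdf_to_text("scan.pdf", "out")` would run `ebook-convert scan.pdf out/scan.txt` and return the contents of the new file, provided the `out` directory already exists.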
All times are GMT -4.