12-27-2013, 02:38 PM | #1 |
Member
Posts: 15
Karma: 10
Join Date: Dec 2003
Device: Toshiba E755
|
Extract PDF text and store in custom column
Hi All,
I'm interested in developing a plugin that will let me extract the text of a PDF file and store it in a custom text or comments column. I have around 8,000 PDFs imported into calibre from a document management system. They are scanned images that have already been OCRed, and each holds a small amount of text, usually one page or less. I would like to be able to search and tag documents using the text extracted and stored in this custom column.
I was thinking of the following:
1. Enumerate the selected files (PDF format only)
2. Open each PDF file and extract its text content
3. Store the text in the column
Is there a Python library or API already included with calibre that will let me easily extract the text from a PDF file? I'm new to Python and will look to learn some of it this weekend; I'm fairly comfortable with Perl.
Many thanks! Laz. |
12-27-2013, 03:52 PM | #2 |
null operator (he/him)
Posts: 20,616
Karma: 26960534
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
My experience is that searches will be slower if the Comments column or similar long text columns are included in 'the places to search' - AFAIK the contents aren't indexed.
But that's no reason not to go ahead with a plugin. BR |
12-30-2013, 10:00 PM | #3 |
creator of calibre
Posts: 43,926
Karma: 22669820
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Simply use calibre to bulk convert your PDF files to txt. That will extract the text from them (assuming they have actual extractable text, which is not always the case with PDF files).
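For anyone who wants to script the bulk conversion rather than use the GUI, here is a minimal sketch that drives calibre's `ebook-convert` command-line tool from Python. It assumes the PDFs have been gathered into a single folder and that `ebook-convert` is on the PATH; the function names and folder names are hypothetical and only there for illustration. It covers the extraction step only. Getting the resulting text into a custom column would be a separate step (for example with `calibredb set_custom`), which is not shown here.

```python
import subprocess
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def convert_command(pdf_path, out_dir):
    """Build the ebook-convert command line that turns one PDF into a TXT file."""
    txt_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    return ["ebook-convert", str(pdf_path), str(txt_path)]


def convert_all(folder, out_dir):
    """Convert every PDF in `folder` to TXT in `out_dir`.

    Needs calibre installed so that ebook-convert can be found on PATH.
    A PDF with no extractable text layer will yield an (almost) empty file.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in find_pdfs(folder):
        # check=True raises CalledProcessError if a conversion fails
        subprocess.run(convert_command(pdf, out_dir), check=True)
```

Calling `convert_all("pdf_inbox", "pdf_text")` would then leave one `.txt` file per PDF in `pdf_text`, named after the source file.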
|