12-27-2013, 02:38 PM   #1
diazlaz
Extract PDF text and store in custom column

Hi All,

I am interested in developing a plugin that will allow me to extract the text of a PDF file and store it in a custom text or comments column. I have around 8,000 PDFs imported into calibre from a document management system. They are already OCRed and each holds a small amount of text, usually one page or less, from scanned images. I would like to be able to search and tag documents based on the text extracted and stored in this custom column.

I was thinking of the following:

1. enumerate the selected files (select PDF types)
2. open/extract the PDF text content from PDF file
3. store in column
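
A minimal sketch of these three steps in plain Python, to make the idea concrete. It assumes the external pdftotext tool (from poppler) is installed and on PATH, which may not match what calibre itself ships or exposes. The function names and the `store` callback are hypothetical; inside a real plugin the callback would be replaced by whatever calibre's database API provides for writing to a custom column.

```python
import os
import re
import subprocess


def select_pdfs(paths):
    """Step 1: keep only the paths that end in .pdf (case-insensitive)."""
    return [p for p in paths if os.path.splitext(p)[1].lower() == ".pdf"]


def extract_pdf_text(pdf_path):
    """Step 2: run pdftotext and return the text it writes to stdout.

    Assumption: pdftotext is on PATH. The trailing "-" asks it to write
    to stdout instead of a .txt file.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if extraction fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def normalize(text):
    """Collapse runs of whitespace so the stored text stays compact."""
    return re.sub(r"\s+", " ", text).strip()


def extract_to_column(paths, store, extract=extract_pdf_text):
    """Step 3: pass each PDF's cleaned text to store(path, text).

    `store` is a hypothetical callback standing in for the code that
    writes to the custom column.
    """
    for path in select_pdfs(paths):
        store(path, normalize(extract(path)))
```

For a dry run outside calibre, a plain dict can stand in for the column: `column = {}` followed by `extract_to_column(paths, column.__setitem__)` fills it with one entry per PDF. Keeping the extractor and the store step as parameters means either one can be swapped out without touching the loop.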

Is there a Python library or API already included that will allow me to easily extract the PDF text from the file? I'm new to Python and will look to learn some of it this weekend. I am fairly comfortable with Perl.

Many thanks!
Laz.