MobileRead Forums - View Single Post - Extract PDF text and store in custom column

diazlaz · 12-27-2013, 03:38 PM

Hi All,

I am interested in developing a plugin that will allow me to extract the text of a PDF file and store it in a custom text or comment column. I have around 8,000 PDFs imported into calibre from a document management system. They are scanned images that have already been OCR'd, and each contains a small amount of text, usually one page or less. I would like to be able to search and tag documents using the extracted text stored in this custom column.

I was thinking of the following:

1. enumerate the selected files (select PDF types)
2. open/extract the PDF text content from PDF file
3. store in column
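
A rough sketch of steps 1 and 2, under some assumptions: it shells out to poppler's `pdftotext` command-line tool (assumed to be installed and on the PATH; whether calibre exposes an equivalent internally is not confirmed here), and the `books` mapping of book id to file paths is a hypothetical stand-in for whatever the plugin gets from the current selection. The function names are illustrative only.

```python
import subprocess
from pathlib import Path


def extract_pdf_text(pdf_path, runner=subprocess.run):
    """Return the text of one PDF via poppler's `pdftotext` (assumed on PATH).

    `runner` is injectable so the function can be exercised without poppler.
    """
    # "-" as the output file makes pdftotext write the text to stdout.
    result = runner(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace").strip()


def collect_pdf_text(books, extract=extract_pdf_text):
    """Steps 1 and 2: pick the PDF format of each selected book and extract it.

    `books` is a hypothetical mapping {book_id: [paths of that book's formats]}.
    Returns {book_id: extracted_text} for the books that have a PDF.
    """
    texts = {}
    for book_id, paths in books.items():
        for path in paths:
            if Path(path).suffix.lower() == ".pdf":
                texts[book_id] = extract(path)
                break  # one PDF per book is enough
    return texts
```

For step 3, the resulting dictionary would then be written to the custom column through calibre's database API. In current calibre versions that is probably something like `db.new_api.set_field('#pdftext', texts)`, where `#pdftext` is a made-up column name; treat that call as an assumption to check against the calibre documentation.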

Is there a Python library or API already included in calibre that will allow me to easily extract the text from a PDF file? I'm new to Python and plan to start learning it this weekend; I'm fairly comfortable with Perl.

Many thanks!
Laz.
