Old 03-27-2013, 07:39 AM   #1
Mr.Castro0o
pdf to txt shell script

Hia.

I have made a simple shell script that runs character recognition (OCR) on an image PDF and writes the recognized text to a .txt file.

You can get it from the git repository https://github.com/andrecastro0o/ocr



Its main component is tesseract-ocr.
Besides tesseract-ocr, the script makes use of:
pdftk - to burst the pdf into single page pdfs
imagemagick - to convert the pdf to tiff (tesseract only accepts .tif inputs)
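Roughly, the three tools chain together as burst, convert, recognize. The following is only a sketch of that idea, not the actual script from the repository: the function names ocr_pdf and run, the pg_%04d page naming, the 300 dpi density and the DRY_RUN switch are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a burst -> convert -> OCR pipeline (NOT the repository's script).
# ocr_pdf, run and DRY_RUN are hypothetical names chosen for this example.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ocr_pdf INPUT.pdf OUTPUT.txt WORKDIR
ocr_pdf() {
    input=$1
    output=$2
    workdir=$3

    # 1. pdftk: burst the PDF into single-page PDFs (pg_0001.pdf, pg_0002.pdf, ...)
    run pdftk "$input" burst output "$workdir/pg_%04d.pdf"

    # Start with an empty output file.
    : > "$output"

    for page in "$workdir"/pg_*.pdf; do
        [ -e "$page" ] || continue
        base=${page%.pdf}

        # 2. imagemagick: render the page as a TIFF (300 dpi is an assumed value)
        run convert -density 300 "$page" -depth 8 "$base.tif"

        # 3. tesseract: OCR the TIFF; it writes its result to "$base.txt"
        run tesseract "$base.tif" "$base"

        # Append this page's text, if any was produced, to the final file.
        if [ -f "$base.txt" ]; then
            cat "$base.txt" >> "$output"
        fi
    done
}
```

Something like `ocr_pdf book.pdf book.txt "$(mktemp -d)"` would then leave the text of all pages in book.txt, in page order. Setting DRY_RUN=1 first only prints the commands that would be run.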

It is simple, but without too much work it could be developed further to remove page headings (the author name and book title repeated on each page). It could also precede another script that streamlines the creation of an epub version.
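As a starting point for the heading removal, one naive approach would be to drop the first non-blank line of each page's text before the pages are joined. This is just a sketch under the assumption that the heading always sits on that first line; strip_heading is a made-up name and nothing like it exists in the script yet.

```shell
#!/bin/sh
# Hypothetical helper: drop the first non-blank line of a page's text,
# assuming that line is the running heading (author name / book title).
# Reads the named files, or standard input when no file is given.
strip_heading() {
    awk 'NF && !dropped { dropped = 1; next } { print }' "$@"
}
```

It could be applied per page, e.g. `strip_heading page.txt >> book.txt`. Pages where the heading is missing or split over two lines would need a smarter rule.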

Hope it might be useful for some of you. And if you'd like to push new developments, or simply comment on possible developments, that would be great. :]