pdf to txt shell script

Mr.Castro0o · 03-27-2013, 07:39 AM

Hia.

I have made a simple shell script that runs character recognition (OCR) on an image-only PDF and writes the recognized text to a .txt file.

You can get it from the git repository https://github.com/andrecastro0o/ocr

Its main component is tesseract-ocr.
Besides tesseract-ocr, the script makes use of:
pdftk - to burst the PDF into single-page PDFs
imagemagick - to convert each page PDF to TIFF (tesseract only accepts .tif inputs)
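
For anyone who wants the general idea without opening the repository, the three steps above can be sketched roughly as follows. This is not the script from the repository, only a minimal sketch that assumes pdftk, ImageMagick's convert (with Ghostscript available for reading PDFs) and tesseract are on the PATH. The function names ocr_pdf and txt_name_for, the 300 dpi density and the page_%04d file naming are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the burst -> convert -> OCR pipeline.
# Not the script from the repository; names and values are assumptions.

# txt_name_for FILE.pdf -> prints FILE.txt (assumed default output name)
txt_name_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# ocr_pdf INPUT.pdf [OUTPUT.txt]
ocr_pdf() {
    input=$1
    output=${2:-$(txt_name_for "$input")}
    if [ ! -f "$input" ]; then
        echo "ocr_pdf: no such file: $input" >&2
        return 1
    fi
    # Scratch directory for the per-page files.
    workdir=$(mktemp -d) || return 1

    # 1. Burst the PDF into single-page PDFs (page_0001.pdf, ...).
    pdftk "$input" burst output "$workdir/page_%04d.pdf" || return 1

    # Start with an empty output file, then append page by page.
    : > "$output"
    for page in "$workdir"/page_*.pdf; do
        base=${page%.pdf}
        # 2. Rasterise the page to TIFF; 300 dpi is an assumed value.
        convert -density 300 "$page" -depth 8 "$base.tif" || return 1
        # 3. OCR the TIFF; tesseract writes its result to "$base.txt".
        tesseract "$base.tif" "$base" || return 1
        cat "$base.txt" >> "$output"
    done
    rm -rf "$workdir"
}
```

After sourcing the file, calling ocr_pdf scanned-book.pdf would write scanned-book.txt next to the input. The zero-padded page numbers keep the shell glob in page order.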

It is simple, but without too much work it could be developed further to remove page headings (the author name and book title repeated on each page). Or it could precede another script that streamlines the creation of an epub version.
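
As a rough idea of how the heading removal could work, here is a minimal sketch that drops every line consisting of exactly the author name or exactly the book title. The function name strip_headings and the example author and title are hypothetical, and the exact-match rule is an assumption: real OCR output often has page numbers or recognition errors in the heading line, which would need a looser match.

```shell
#!/bin/sh
# Hypothetical sketch: filter running page headings out of OCR text.

# strip_headings AUTHOR TITLE < in.txt > out.txt
# Drops lines that are exactly AUTHOR or exactly TITLE.
strip_headings() {
    # -F: fixed strings, -x: whole-line match, -v: keep non-matching lines.
    # grep exits with 1 when nothing is left to print; treat that as success.
    grep -v -x -F -e "$1" -e "$2" || [ $? -eq 1 ]
}
```

It could be used as a filter on the finished text, for example: strip_headings 'Jane Doe' 'Some Book' < book.txt > book-clean.txt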

Hope it might be useful for some of you. And if you'd like to push new developments, or simply comment on possible developments, that would be great. :]

