|03-27-2013, 07:39 AM||#1|
pdf to txt shell script
I have made a simple shell script that runs character recognition (OCR) on an image PDF and writes the recognized text to a .txt file.
You can get it from the git repository https://github.com/andrecastro0o/ocr
Its main component is tesseract-ocr.
Besides tesseract-ocr, the script makes use of:
pdftk - to burst the pdf into single-page pdfs
imagemagick - to convert each pdf page to tiff (tesseract does not read pdfs directly; the version I use only accepts .tif inputs)
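For anyone who wants to see how the three tools fit together before cloning the repository, here is a rough sketch of the burst, convert, OCR sequence. It is not the script from the repository: the function name ocr_pdf, the page_%04d naming, the temporary directory and the 300 dpi density are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: burst -> tiff -> OCR -> one .txt file.
# Assumes pdftk, ImageMagick's convert and tesseract are installed and on PATH.
# ocr_pdf, the page_%04d naming and the 300 dpi density are illustrative choices.

ocr_pdf() {
    input=$1     # path to the image pdf
    output=$2    # path of the resulting .txt file
    workdir=$(mktemp -d) || return 1

    # 1. Split the pdf into single-page pdfs: page_0001.pdf, page_0002.pdf, ...
    pdftk "$input" burst output "$workdir/page_%04d.pdf" || return 1

    : > "$output"    # start with an empty output file
    for page in "$workdir"/page_*.pdf; do
        base=${page%.pdf}
        # 2. Rasterize the page to an 8-bit tiff
        convert -density 300 "$page" -depth 8 "$base.tif" || return 1
        # 3. OCR the tiff; tesseract appends .txt to the output base name
        tesseract "$base.tif" "$base" || return 1
        # 4. Append this page's text to the combined output
        cat "$base.txt" >> "$output"
    done

    rm -rf "$workdir"
}

# Run the pipeline only when the script is called with two arguments
if [ "$#" -eq 2 ]; then
    ocr_pdf "$1" "$2"
fi
```

Saved as, say, ocr-sketch.sh, it would be called as `sh ocr-sketch.sh book.pdf book.txt`. The zero-padded page numbers keep the pages in the right order when the shell expands page_*.pdf.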
It is simple, but without too much work it could be developed further to remove page headings (the author name and book title repeated on each page). It could also feed another script that streamlines the creation of an epub version.
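As a starting point for the heading removal, here is a small sketch. It assumes the running heading is always the first non-empty line of each page's text, which will not hold for every book; strip_heading is a hypothetical helper and is not part of the current script.

```shell
# Sketch only: print a page's text without its first non-empty line.
# Assumes that line is the running heading (author name / book title).
strip_heading() {
    awk 'NF && !seen { seen = 1; next } { print }' "$1"
}
```

It would be applied to each per-page .txt file before the pages are joined, for example `strip_heading page_0001.txt >> book.txt`.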
Hope it might be useful for some of you. And if you'd like to push new developments, or simply comment on possible developments, that would be great. :]
|ocr, pdf conversion|