
MobileRead Forums > E-Book Formats > PDF


Old 03-27-2013, 08:39 AM   #1
Junior Member
Mr.Castro0o began at the beginning.
Posts: 1
Karma: 10
Join Date: Mar 2013
Device: none
pdf to txt shell script


I have made a simple shell script that runs character recognition (OCR) on an image-only PDF and writes the recognized text to a .txt file.

You can get it from the git repository

Its main component is tesseract-ocr.
Besides tesseract-ocr, the script makes use of:
pdftk - to burst the pdf into single page pdfs
imagemagick - to convert the pdf to tiff (tesseract only accepts .tif inputs)
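For anyone who wants a feel for the flow before cloning, here is a rough sketch of the three steps above (burst, convert to TIFF, OCR) chained together. It is not the script from the repository. The function name pdf_to_txt, the 300 dpi density, the English language setting, and the output naming are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: burst a PDF, rasterize each page, OCR it, concatenate the text.
# Assumed for illustration (not taken from the original script):
# the name pdf_to_txt, 300 dpi, "-l eng", and <input>.txt as default output.

pdf_to_txt() (
    # The function body runs in a subshell, so set -eu and the trap stay local.
    set -eu
    if [ "$#" -lt 1 ]; then
        echo "usage: pdf_to_txt input.pdf [output.txt]" >&2
        exit 2
    fi
    input=$1
    output=${2:-"${input%.pdf}.txt"}
    workdir=$(mktemp -d)
    trap 'rm -rf "$workdir"' EXIT

    # 1. pdftk: split the PDF into single-page PDFs (pg_0001.pdf, pg_0002.pdf, ...).
    pdftk "$input" burst output "$workdir/pg_%04d.pdf"

    # Start with an empty output file, then append page by page.
    : > "$output"
    for page in "$workdir"/pg_*.pdf; do
        base=${page%.pdf}
        # 2. ImageMagick: render the page as an 8-bit TIFF.
        convert -density 300 "$page" -depth 8 "$base.tif"
        # 3. tesseract: writes the recognized text to "$base.txt".
        tesseract "$base.tif" "$base" -l eng
        cat "$base.txt" >> "$output"
    done
)

# Run only when the script is called with arguments.
if [ "$#" -gt 0 ]; then
    pdf_to_txt "$@"
fi
```

Saved as, say, pdf2txt.sh (a made-up name), it would be called as `sh pdf2txt.sh book.pdf` and should leave book.txt next to the input. The zero-padded page names keep the pages in order when the shell expands the glob.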

It is simple, but without too much work it could be developed further to remove the page headings (the author name and book title repeated on every page). It could also feed another script that streamlines the creation of an epub version.
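As a starting point for the heading removal, here is a minimal sketch. It assumes the running heading is always the first non-blank line of a page's OCR text, which will not hold for every book. The function name strip_page_heading is made up for this example and is not part of the script.

```shell
#!/bin/sh
# Sketch only: drop the first non-blank line of one page's text.
# Assumption: that line is the running heading (author name / book title).
# strip_page_heading is a hypothetical helper, not part of the original script.

strip_page_heading() {
    # Reads one page of text on stdin and writes it to stdout without the
    # first non-blank line. Leading blank lines and everything after the
    # heading pass through unchanged.
    awk 'skipped || NF == 0 { print; next } { skipped = 1 }'
}
```

It would have to run on each page's text before the pages are concatenated, for example `strip_page_heading < page.txt >> book.txt`, where both file names are placeholders.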

Hope it might be useful for some of you. And if you'd like to push new developments, or simply comment on possible developments, that would be great. :]

ocr, pdf conversion
