
MobileRead Forums > E-Book Formats > PDF

Old 03-27-2013, 07:39 AM   #1
Mr.Castro0o
pdf to txt shell script

Hia.

I have made a simple shell script that runs character recognition (OCR) on an image pdf and outputs the recognized text into a .txt file.

You can get it from the git repository https://github.com/andrecastro0o/ocr



Its main component is tesseract-ocr.
Besides tesseract-ocr, the script makes use of:
pdftk - to burst the pdf into single page pdfs
imagemagick - to convert each single page pdf to tiff (older tesseract versions only accept .tif inputs)
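
To make the three steps concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not the script from the repository; the function name ocr_pdf, the pg_%04d naming and the 300 dpi setting are assumptions chosen for illustration. It expects pdftk, ImageMagick's convert (which needs Ghostscript to read pdfs) and tesseract to be installed.

```shell
#!/bin/sh
# Sketch of an OCR pipeline: image pdf -> single pages -> tiff -> text.
# ocr_pdf is a hypothetical name; it is not taken from the linked repository.

ocr_pdf() {
    input=$1      # image pdf to read
    output=$2     # text file to write
    workdir=$(mktemp -d) || return 1

    # 1. Burst the pdf into one pdf per page: pg_0001.pdf, pg_0002.pdf, ...
    pdftk "$input" burst output "$workdir/pg_%04d.pdf" || return 1

    : > "$output"   # start with an empty output file
    for page in "$workdir"/pg_*.pdf; do
        base=${page%.pdf}
        # 2. Rasterize the page to a greyscale tiff (300 dpi is an assumed value)
        convert -density 300 "$page" -depth 8 -colorspace Gray "$base.tif" || return 1
        # 3. OCR the tiff; tesseract adds .txt to the output base name itself
        tesseract "$base.tif" "$base" || return 1
        # 4. Append this page's text to the final file, in page order
        cat "$base.txt" >> "$output"
    done

    rm -rf "$workdir"
}
```

After sourcing the file, a call such as `ocr_pdf book.pdf book.txt` would write the text of all pages into book.txt. The zero-padded page numbers keep the shell glob in page order, and on a failed step the temporary directory is left behind so the intermediate files can be inspected.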

It is simple, but without too much work it could be developed further to remove page headings (the author name and book title repeated on each single page). Or it could precede another script that streamlines the creation of an epub version.
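
As a rough idea of what heading removal could look like, the sketch below drops the first non-blank line of one page's text. It rests on a strong assumption: that the running heading is always the first non-blank line tesseract produces for a page, which will not hold for every book. The name strip_heading is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print a page's text without its first non-blank line.
# Assumes the running heading (author / title) is that first non-blank line.

strip_heading() {
    # Skip leading blank lines and the first non-blank line; print the rest.
    awk 'seen { print; next } NF { seen = 1 }' "$1"
}
```

It would be run on each page's .txt file before the pages are joined, for example `strip_heading pg_0001.txt >> book.txt`.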

Hope it might be useful for some of you. And if you'd like to push new developments, or simply comment on possible developments, that would be great. :]

Tags
ocr, pdf conversion
