Quote:
Originally Posted by shevirsy
GIMP? No OCR.
ImageMagick? No OCR.
Popper OCR? Not that I know.
But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?
Step 1 - crop the PDF with pdfscissors to cut away the margins with page numbers and anything else you do not want to OCR. It is a small Java program that runs directly from the net. Works on Windows, Linux, Mac
Step 2 - use the convert program from ImageMagick to render the PDF pages into bitmaps. Works on Windows, Linux, Mac
Step 3 - use Tesseract, an open source OCR engine. Works on Windows, Linux, Mac
Step 4 - use an advanced text editor (in my case Vim) to reformat the text, because the OCR output has its lines broken by default. The paragraphs are separated by empty lines, so it is very easy to join all the lines that are not separated by an empty line. Works on Windows, Linux, Mac
(Commands for Vim in normal mode [press Escape twice to get there]:
:set tw=10000
gggqG
[gg means "go to the first line of the file"
gq means "rewrap the text to the text width set with the previous command :set tw=10000"
G means "to the end of the file"]
)
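If you do not have Vim at hand, the same joining can probably be done with awk, which is on practically every Linux and Mac box. This is only a sketch: join_paragraphs is a name I made up, and it assumes the paragraphs are separated by truly empty lines, as described in step 4.

```shell
#!/bin/sh
# join_paragraphs: hypothetical helper, reads text on stdin, writes to stdout.
# RS="" puts awk in paragraph mode: each block of lines between empty lines
# becomes one record, so replacing the newlines inside a record joins the
# broken lines of that paragraph. ORS="\n\n" keeps one empty line between
# paragraphs in the output.
join_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }'
}
```

Use it as: join_paragraphs < book.txt > book-joined.txt. Lines that contain only spaces do not count as empty lines here, so such lines would not split paragraphs.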
Tesseract is now being developed by Google - it used to be a heavy-duty commercial OCR engine
http://en.wikipedia.org/wiki/Tesseract_%28software%29
I have used this to OCR files on Linux.
For steps 2 and 3 I have used the following script:
Code:
#!/bin/sh
STARTPAGE=13 # set to the page number of the first page of the PDF you wish to convert
ENDPAGE=253 # set to the page number of the last page of the PDF you wish to convert
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)
touch "$OUTPUT"
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    # ImageMagick counts PDF pages from 0, hence the "- 1"
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" page.tif
    echo "processing page $i"
    # tesseract adds the .txt extension to the output name itself
    tesseract page.tif tempoutput
    cat tempoutput.txt >> "$OUTPUT"
done
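Before letting a 600 dpi run chew through a couple of hundred pages, it may be worth printing what would be executed first. A rough sketch: nothing is converted or OCRed here, plan_ocr is a made-up name, and the printed commands simply mirror the script above, on the assumption that ImageMagick counts PDF pages from 0.

```shell
#!/bin/sh
# plan_ocr SOURCE FIRST LAST RESOLUTION
# Hypothetical dry-run helper: prints the convert and tesseract commands for
# each page in the range instead of running them. Note that plain sh has no
# local variables, so src, first, last, res and page are set globally.
plan_ocr() {
    src=$1
    first=$2
    last=$3
    res=$4
    page=$first
    while [ "$page" -le "$last" ]; do
        # page N of the PDF is index N-1 for ImageMagick
        printf 'convert -monochrome -density %s %s[%s] page.tif\n' \
            "$res" "$src" "$((page - 1))"
        printf 'tesseract page.tif tempoutput\n'
        page=$((page + 1))
    done
}
```

For example, plan_ocr book.pdf 13 253 600 | head shows the first few commands, which makes an off-by-one in the page range easy to spot.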
There are various graphical front-ends for Tesseract, so you do not HAVE to use the command line. But this is what worked for me.
Please note that the output from Tesseract is a plain txt file that doesn't contain formatting info such as bold or italics, which other [commercial] programs can produce. An example of a really good commercial program is ABBYY FineReader.
At work I use a very old version of Recognita that doesn't process PDF, so I have to convert with ImageMagick first. It was bundled with a scanner that we purchased a very long time ago.
At work I also use Readiris, which does process PDF. It was bundled with an HP scanner/printer/copier/fax combo some 5 years ago.