12-08-2011, 08:47 AM   #1
vdp
Enthusiast
Posts: 45
Karma: 10842
Join Date: Aug 2010
Device: Kindle DXG
LaunchHack - an OCR-based companion to LaunchPad

This kludge can be used along with LaunchPad to run an action on the file that is currently selected in the Kindle's stock shell. It currently works on the Kindle DXG, but I believe it can be modified relatively easily to work on the other models (although I don't own the smaller devices and have no experience with them).

How does it work?
The program has three main steps:

1) First, the framebuffer is scanned to find the bold underline of the currently selected file. The title of this file is then cropped and sent to the Tesseract OCR engine.

2) Tesseract then converts the image to a recognized string. For the time being I am only interested in English text, and that's what the software currently assumes. I use the English Tesseract model data and haven't tested how well it handles, for example, German umlauts or Cyrillic scripts (the latter will almost certainly require a different model).
Because the title image can contain both the document's name and its metadata (the producer of the document etc.), the word bounding boxes returned by Tesseract are used to strip the metadata. The criterion is the distance from the end of one word's bounding box to the beginning of the next one's. If the gap exceeds a certain threshold, the next word and everything after it is considered part of the metadata.

3) The OCR result is not always perfect. Errors like "Introduct1on" instead of "Introduction" sometimes occur, so some kind of approximate string matching is desirable. A standard metric for the similarity between two strings is the Levenshtein distance, but computing it takes quadratic time for each pair of strings, so it seems too "expensive" to run against every filename. I am also aware of Levenshtein automata, but as I understand it they assume a bounded number of errors that has to be known beforehand.
Then I found SimString (paper, code). It uses a weaker notion of similarity based on the number of common letter n-grams (short sequences of letters), but it is fast. Actually I reimplemented the algorithm from the paper (as I understand it), because SimString's implementation uses a persistent database and I wanted to be able to build the index in memory on demand. Anyway, it seems to work OK and finds the best matching filename in under a second. Note that it is not always possible to find the true file, because the titles in the shell are truncated. E.g. if you have the files "Very very very very long filename1.pdf" and "Very very very very long filename2.pdf", the shell will truncate both to "Very very very very long file...", and the two cannot be told apart.
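
To make the n-gram idea in step 3 a bit more concrete, here is a minimal Python sketch. It is not the code from the archive and not SimString's algorithm either: the names trigrams and NgramIndex are made up for illustration, and the score is assumed to be simply the fraction of the query's 3-grams that also occur in a candidate name.

```python
from collections import defaultdict


def trigrams(s, n=3):
    """Return the set of letter n-grams of s, padded with spaces at both ends."""
    s = s.lower()
    padded = " " * (n - 1) + s + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class NgramIndex:
    """In-memory inverted index: n-gram -> indices of the names containing it."""

    def __init__(self, names):
        self.names = list(names)
        self.postings = defaultdict(set)
        for idx, name in enumerate(self.names):
            for gram in trigrams(name):
                self.postings[gram].add(idx)

    def best_match(self, query, threshold=0.6):
        """Return (name, score) of the best candidate, or (None, 0.0)."""
        grams = trigrams(query)
        shared = defaultdict(int)
        # Count, per candidate, how many of the query's n-grams it contains.
        for gram in grams:
            for idx in self.postings.get(gram, ()):
                shared[idx] += 1
        best, best_score = None, 0.0
        for idx, count in shared.items():
            score = count / len(grams)  # assumed score: fraction of query n-grams matched
            if score >= threshold and score > best_score:
                best, best_score = self.names[idx], score
        return best, best_score
```

With an index built from a few titles, a query with a single OCR error such as "Introduct1on to Algorithms" only loses the three 3-grams that contain the wrong character, so it still scores well above 0.6 against the correct name.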

And that is all this software does. When started, it reads the framebuffer, tries to find the selected title, and prints the absolute filename of the file it thinks is the best match.
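
For the curious, steps 1 and 2 could look roughly like the Python sketch below. This is only an illustration under loud assumptions, not the actual implementation: the framebuffer is assumed to be already unpacked into rows of grayscale values with 0 = black and 255 = white, the words are assumed to arrive as (text, left, right) tuples taken from the OCR bounding boxes, and all function names and thresholds are hypothetical.

```python
def find_bold_underline(rows, dark=64, min_thickness=3, min_length=100):
    """Return (top, bottom, left, right) of the first thick dark bar, or None.

    rows: list of equal-length lists of grayscale values (assumed 0 = black).
    """
    def longest_dark_run(row):
        best_start, best_len, start = 0, 0, None
        for x, value in enumerate(row + [255]):  # white sentinel closes a trailing run
            if value <= dark:
                if start is None:
                    start = x
            elif start is not None:
                if x - start > best_len:
                    best_start, best_len = start, x - start
                start = None
        return best_start, best_len

    y = 0
    while y < len(rows):
        left, length = longest_dark_run(rows[y])
        if length < min_length:
            y += 1
            continue
        # Extend downwards while the same columns stay dark.
        bottom = y
        while bottom + 1 < len(rows) and all(
                v <= dark for v in rows[bottom + 1][left:left + length]):
            bottom += 1
        if bottom - y + 1 >= min_thickness:
            return y, bottom, left, left + length - 1
        y = bottom + 1  # a thin line, not the bold underline; keep scanning
    return None


def crop_title(rows, underline, title_height=40):
    """Cut out the strip of title_height rows directly above the underline."""
    top, _bottom, left, right = underline
    return [row[left:right + 1] for row in rows[max(0, top - title_height):top]]


def strip_metadata(words, max_gap=40):
    """Keep words until the first horizontal gap wider than max_gap pixels.

    words: list of (text, left, right) tuples sorted from left to right.
    """
    kept, prev_right = [], None
    for text, left, right in words:
        if prev_right is not None and left - prev_right > max_gap:
            break  # everything after a wide gap is treated as metadata
        kept.append(text)
        prev_right = right
    return " ".join(kept)
```

The cropped strip would be handed to the OCR engine, and the words it returns would go through strip_metadata before the fuzzy filename lookup. The real thresholds depend on the screen resolution and font size, so the defaults here are placeholders.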

How can it be used?

For example, to start hawhill's promising PDF viewer from LaunchPad you can use a kpdfview.ini like this:

Code:
[Actions]
;; run kpdfviewer
P D = !cd /mnt/us/kpdfview; ./reader.lua "`/mnt/us/launchpad/lhack /mnt/us/documents *.pdf 0.6`" &
The tool takes three arguments: the root directory to be searched for a matching file, a comma-separated list of filters that narrow the search to specific types of files, and a similarity coefficient. This coefficient is the fraction of letter 3-grams that must match; files with a lower overlap with the query are not considered. In my very limited experiments, values in the range 0.5-0.6 work well. If the coefficient is too low the search will be slower, and if it is too high the true matching file can be rejected and not found.
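
To illustrate the first two arguments, here is a rough Python sketch of how a root directory and a comma-separated filter list could be turned into a candidate list. This is an assumption about the behaviour, not the tool's actual code: collect_candidates is a made-up name, and the filters are assumed to be shell-style wildcards matched against the file name only.

```python
import fnmatch
import os


def collect_candidates(root, filters):
    """Walk root recursively and return the files matching any of the filters.

    filters: comma-separated shell-style patterns, e.g. "*.pdf,*.djvu".
    """
    patterns = [p.strip() for p in filters.split(",") if p.strip()]
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

The third argument would then act as a cut-off on the similarity score of each candidate: anything below it is dropped before the best match is picked.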

Installation
To install launchhack, just extract the attached archive into the /mnt/us/launchpad directory. The source code is here (it still doesn't have a Makefile or even a README).

Finally, a note to the brave developers who might want to look at the code: if you think the code is crap, I agree with you. I admit my guilt in writing overly long functions, using sloppy variable names, using classes for code that should really be just a function, not checking for error conditions (if something goes wrong it will just happily explode) and a multitude of other unforgivable sins. Maybe some of these will be fixed, but frankly I don't want to spend much more time on this.

Edit:
BTW there is a potential use of this tool which may not be immediately obvious. You can also use it to start readers for file formats that the stock software does not support. Say you want to be able to open epub files straight from the stock shell in fbKindle (I don't read much fiction and this is just what I have installed).

First, modify the goqt.sh script so that it passes an argument on to the reader app (add a "$2").
Code:
./"$1" -qws "$2"
Then create a script named lhindex.sh
Code:
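#!/bin/sh

# Usage: lhindex.sh <extension>, e.g. lhindex.sh epub
LHIDX="/mnt/us/documents/lhindex/$1"

# in case it doesn't exist yet ...
mkdir -p "$LHIDX"

# remove the old entries (-f: no error if there are none)
rm -f "$LHIDX"/*

# create a new 'index': one small .txt stub per matching file
find /mnt/us/documents/ -name "*.$1" | sed -e "s/\.$1\$/.txt/" | awk -F'/' -v I="$LHIDX" '{print I"/"$NF;}' | while read -r f; do
  # the stub must not be zero-length, because the shell hides empty files
  echo 1 > "$f"
done

# Force Kindle to scan the docs folder
dbus-send --system /default com.lab126.powerd.resuming int32:1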
Finally create an epub.ini launchpad script
Code:
[Actions]
;; "Index" epubs
E I = !/mnt/us/launchpad/lhindex.sh epub &

;; run fbKindle
F E = !/mnt/us/fbKindle/goqt.sh FBReader "`/mnt/us/launchpad/lhack /mnt/us/documents *.epub 0.6`"&
Now you can "index" your epubs by pressing shift-E-I. This creates a dummy text file in /mnt/us/documents/lhindex/epub for every epub you have on your Kindle (the file is not zero-length, because the shell doesn't show empty files). Then you can select a title in the shell, press shift-F-E, and voila, fbKindle opens its epub counterpart.

Edit 2: Added a binary for K3.
Attached Files
File Type: gz launchhack.tar.gz (5.70 MB, 342 views)
File Type: gz launchhack-k3.tar.gz (5.74 MB, 363 views)

Last edited by vdp; 12-27-2011 at 04:27 AM.