Correcting 'll' errors (al , final y, etc)


flameproof
05-08-2012, 11:16 AM
Some epub files that I converted from PDF with Calibre have an issue with one missing 'l' where there should be a double 'll'.

I know, the issue is still open and will hopefully be fixed one day.

In the meantime I wrote a PHP script that can fix the issue. However, there seem to be endless variants of the error and it's hard to cover them all.

Here is what I mean:

str_replace(" al ", " all ", $input);

is OK for 'al' in running text, but ' al ,' becomes ' all ,' (the stray space before the comma stays), and then there are ' al ?', ' al -', ' al !' and so on.

I would need a few hundred str_replace commands.

Is there a cleaner way to use REGEX?

I am not a REGEX expert, but I know you can have alternatives, something like ' al (\s|\.|,|\?|!|-)'. That would make the script more readable.

Just for starters:
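One way to fold the punctuation cases into a single pattern is a character class inside a lookahead. This is only a sketch: it assumes the broken word is followed by exactly one stray space before the punctuation mark, and the function name fixAl is made up for illustration.

```php
<?php
// Hypothetical helper: repair "al" -> "all" when the stray space is
// followed by punctuation, and drop that stray space.
function fixAl(string $input): string
{
    // (?<=\s)      : "al" must be preceded by whitespace (not consumed)
    // \s           : the stray space left by the conversion (consumed)
    // (?=[.,?!-])  : next character must be one of . , ? ! - (not consumed)
    return preg_replace('/(?<=\s)([Aa]l)\s(?=[.,?!-])/', '$1l', $input);
}

echo fixAl("That is al , is that al ? Yes, al !"), "\n";
// prints: That is all, is that all? Yes, all!
```

Words that merely end in "al" (for example "metal ,") are left alone, because the lookbehind requires whitespace directly before the "a".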



$path = 'C:/xampp/tmp/test.epub';

$zip = new ZipArchive;
if ($zip->open($path) === true) {
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $zip->extractTo('me/', array($zip->getNameIndex($i)));
        // here you can run a custom function for the particular extracted file
    }
    $zip->close();
}

$directory = "C:/xampp/htdocs/bom3/scams/me/";

$html = glob($directory . "*.html");
foreach ($html as $value) {
    echo $value . "<br>";
}

// just an example with one file...
$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = str_replace("l ed ", "lled ", $data);
$replacer = str_replace("l ed.", "lled.", $replacer);
$replacer = str_replace("l ed,", "lled,", $replacer);

echo $replacer;



Update: str_replace is OK, but preg_replace is far more flexible. I am now at about an 80% correction level and I am still working on it! Below is what I have so far:



$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = preg_replace("/\s([Aa]l)\s(\.|\?|!|,|-)?/", " $1l$2", $data); // all
$replacer = preg_replace("/l\sed([^a-z])/", "lled$1", $replacer); // lled
$replacer = preg_replace("/l y([^a-z])/", "lly$1", $replacer); // lly
$replacer = preg_replace("/\s([WwCc])al\s([^a-z])/", " $1all$2", $replacer); // wall and call
$replacer = preg_replace("/\b([Ss])til\s([^a-z])/", "$1till$2", $replacer); // still
$replacer = preg_replace("/\b([Ss])kul\s([^a-z])/", "$1kull$2", $replacer); // skull
$replacer = preg_replace("/Quel\s([^a-z])/", "Quell$1", $replacer); // Quell
$replacer = preg_replace("/ol\saps/", "ollaps", $replacer); // collaps
$replacer = preg_replace("/il\sage/", "illage", $replacer); // village
$replacer = preg_replace("/tel\sige/", "tellige", $replacer); // telligence
$replacer = preg_replace("/al\sucin/", "allucin", $replacer); // hallucina
$replacer = preg_replace("/(\s[Rr])ol\s/", "$1oll", $replacer); // roll
$replacer = preg_replace("/\s([Ww]el)\s([^a-z])/", " $1l$2", $replacer); // well
$replacer = preg_replace("/\s([Ff][aeu]l)\s([^a-z])/", " $1l$2", $replacer); // fall/fell/full
$replacer = preg_replace("/([ao])l\sow/", "$1llow", $replacer); // allow ollow

mrmikel
05-09-2012, 07:55 AM
You can try converting the files from PDF with Mobipocket Creator and take the resulting HTML either into calibre or directly into Sigil and see if the result is better.

frostschutz
05-09-2012, 08:02 AM
Regular expressions usually support "word boundary" matching in some way or other. (I forget which pattern, maybe \< \> or \b?)

So you could replace "boundary a l boundary" with all.

\<al\> -> all.
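For PHP's preg_* functions (PCRE) the word-boundary assertion is \b; as far as I know, \< and \> are GNU grep/vi style and are not understood there. A minimal sketch, with a made-up function name and the sample sentence from this thread:

```php
<?php
// Hypothetical helper: turn a standalone "al"/"Al" into "all"/"All".
// \b matches between a word character and a non-word character (or at
// the start/end of the text) without consuming anything.
function fixStandaloneAl(string $text): string
{
    return preg_replace('/\b([Aa])l\b/', '$1ll', $text);
}

echo fixStandaloneAl("al of them, final y al are wel ."), "\n";
// prints: all of them, final y all are wel .
```

The "al" inside "final" is untouched because there is no boundary before its "a". A genuine standalone "al", as in "et al.", would be changed too, so the rule is not safe for every text.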

flameproof
05-09-2012, 10:47 AM
frostschutz wrote:
Regular expressions usually support "word boundary" matching in some way or other. (I forget which pattern, maybe \< \> or \b?)

So you could replace "boundary a l boundary" with all.

\<al\> -> all.

There is very little HTML inside the actual text. In the HTML files it is really 'final y al are wel .'

I have finished my PHP script now. I can choose an ePub file, open it, correct all HTML files inside the ePub, REGEX clean them, and reconstruct the ePub file.
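For anyone who wants to try the same open/correct/rebuild cycle, it might look roughly like the sketch below. This is not the script from this post: the function name, the callback parameter, and the choice to rewrite entries inside the archive with ZipArchive::addFromString (instead of unpacking to a folder and zipping again) are all assumptions.

```php
<?php
// Hypothetical helper: apply $fix to every (X)HTML entry of an ePub and
// write the changed entries back into the same archive.
// Returns the number of entries that were changed.
function fixEpub(string $epubPath, callable $fix): int
{
    $zip = new ZipArchive();
    if ($zip->open($epubPath) !== true) {
        throw new RuntimeException("Cannot open $epubPath");
    }

    // First pass: read and correct, without modifying the archive yet.
    $changed = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if (!preg_match('/\.(html?|xhtml)$/i', $name)) {
            continue; // leave images, CSS, OPF, mimetype etc. alone
        }
        $old = $zip->getFromIndex($i);
        $new = $fix($old);
        if ($new !== $old) {
            $changed[$name] = $new;
        }
    }

    // Second pass: replace the changed entries, then save on close().
    foreach ($changed as $name => $content) {
        $zip->addFromString($name, $content);
    }
    $zip->close();

    return count($changed);
}
```

A call could look like fixEpub('C:/xampp/tmp/test.epub', 'myRegexClean'), where myRegexClean is whatever function holds the preg_replace rules. Work on a copy of the file, since the archive is modified in place.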

@mrmikel
I tried a few conversion tools and got the same errors. I see the PHP as a little fun brain exercise anyway.

SBT
05-10-2012, 03:55 AM
Just a thought:
Since most of the errors are 'll' -> 'l ', what about putting all double-l words in a loop, something like (Sorry, don't know PHP):
for word in [actually, hallucinate, tellige, ...]
    pattern = replace(word, 'll', 'l\s')
    replace(all_text, pattern, word)
next
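In PHP, that loop might look something like the following. The helper name is invented, the word list is just the three examples from the pseudocode, and matching is case-sensitive to keep the sketch short.

```php
<?php
// Hypothetical helper: for every known double-l word (or fragment),
// build its broken form ("ll" -> "l" + one whitespace) and replace it.
function fixKnownWords(string $text, array $words): string
{
    foreach ($words as $word) {
        // preg_quote() guards against regex metacharacters in the list.
        $pattern = '/' . str_replace('ll', 'l\s', preg_quote($word, '/')) . '/';
        $text = preg_replace($pattern, $word, $text);
    }
    return $text;
}

$words = ['actually', 'hallucinate', 'tellige'];
echo fixKnownWords("He actual y had to hal ucinate intel igence.", $words), "\n";
// prints: He actually had to hallucinate intelligence.
```

Fragments such as 'tellige' cover several words at once (intelligence, intelligent), which keeps the list shorter than one entry per full word.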

To find all instances of possibly missing l's, you can try
grep -o "[^ ]\+[^l ]l [a-z][^ ]*" text_file
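The same search can be run from PHP, which also makes it easy to count how often each candidate occurs. A sketch only: the function name is made up, and the pattern is the grep one carried over to PCRE syntax (where + needs no backslash). Like the grep version it only lists candidates, so correct text such as "until now" will show up as well.

```php
<?php
// Hypothetical helper: list candidate "missing l" fragments with counts.
function findCandidates(string $text): array
{
    // One or more non-spaces, a character that is neither "l" nor space,
    // then "l", a space, a lowercase letter and the rest of that word.
    preg_match_all('/[^ ]+[^l ]l [a-z][^ ]*/', $text, $matches);
    $counts = array_count_values($matches[0]);
    arsort($counts); // most frequent first
    return $counts;
}

print_r(findCandidates("He final y kil ed it. She final y left."));
```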

Ahh... sufficiently advanced regexp is indistinguishable from keyboard white noise :-)

flameproof
05-10-2012, 08:44 AM
SBT wrote:
Just a thought:
Since most of the errors are 'll' -> 'l ', what about putting all double-l words in a loop, something like ...

There are way, no WAY, too many of them. And then the variants...

Kill
kill
killing
killer

And I'd like to catch 'skill' too.

If I put a space in front (\s that is) I will not catch words that are at the beginning of a line. It's quite tedious work to program.
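A word boundary (\b) instead of a leading \s may help with both problems, since it also matches at the start of a line and one rule per stem can cover the variants. The sketch below assumes the converter turned every "ll" into "l" plus one space, so "killing" became "kil ing" and "skill is" became "skil  is" (two spaces); the function name is made up.

```php
<?php
// Hypothetical helper: repair kill/Kill/killing/killer/skill in one rule.
function fixKillStem(string $text): string
{
    // \b      : start of a word, including the very first word of a line
    // (s?kil) : "kil" or "skil", any capitalisation (i flag), kept via $1
    // " "     : the stray space that replaced the second "l" (dropped)
    return preg_replace('/\b(s?kil) /i', '$1l', $text);
}

echo fixKillStem("Kil ing time.\nHis skil  is real, a kil er."), "\n";
// prints:
// Killing time.
// His skill is real, a killer.
```

This only works for stems like "kil" that never occur as a real word followed by a space; a stem such as "wel" or "al" still needs a check on what follows.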