MobileRead Forums - View Single Post - Correcting 'll' errors (al , final y, etc)

flameproof · 05-08-2012, 12:16 PM

Some epub files that I converted from PDF with Calibre have an issue: one 'l' is missing (replaced by a space) wherever there should be a double 'll'.

I know the issue is still open and will hopefully be fixed one day.

In the meantime I wrote a PHP script that can fix the issue. However, there seem to be endless variations of the error and it's hard to cover them all.

Here is what I mean:

str_replace(" al ", " all ", $input);

is OK for ' al ' inside running text, but ' al ,' will be replaced with ' all ,' (the stray space before the comma stays), and then there are ' al ?', ' al -', ' al !' and so on.

I would need a few hundred str_replace commands.

Is there a cleaner way to use REGEX?

I am not a REGEX expert, but I know you can have alternatives, like ' al (\s|\.|,|\?|!|-)'. It would make the script more readable.

Just for starters:
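A rough sketch of how such a pattern might look, assuming the damaged text has the form "al", a stray space, then punctuation or more whitespace. The function name `fix_al` is hypothetical and the sample sentence is made up. A character class `[\s.,?!-]` covers all the variants in one go, and the lookahead checks the following character without consuming it.

```php
<?php
// Hypothetical helper: repairs "al" -> "all" when the stray space is
// followed by whitespace or punctuation.
function fix_al(string $text): string
{
    // \b([Aa]l)       the damaged word, captured so its case is kept
    // \s              the stray space left where the second "l" was
    // (?=[\s.,?!-])   lookahead: next character is whitespace or
    //                 punctuation (checked, not consumed)
    return preg_replace('/\b([Aa]l)\s(?=[\s.,?!-])/', '$1l', $text);
}

echo fix_al("That is al , folks. Al ? Yes, al  of it."), "\n";
// prints: That is all, folks. All? Yes, all of it.
```

Because of the `\b`, words that merely end in "al" (such as "final") are left alone. A single-space "al of" is also left alone here, on the assumption that the original word space survives next to the stray one.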

PHP Code:

$path = 'C:/xampp/tmp/test.epub';

// unpack every entry of the epub (a zip archive) into me/
$zip = new ZipArchive;
if ($zip->open($path) === true) {
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $zip->extractTo('me/', array($zip->getNameIndex($i)));
        // here you can run a custom function for the particular extracted file
    }
    $zip->close();
}

// list the extracted HTML files
$directory = "C:/xampp/htdocs/bom3/scams/me/";
$html = glob($directory . "*.html");
foreach ($html as $value) {
    echo $value . "<br>";
}

// just an example with one file...
$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = str_replace("l ed ", "lled ", $data);
$replacer = str_replace("l ed.", "lled.", $replacer);
$replacer = str_replace("l ed,", "lled,", $replacer);

echo $replacer;
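The "custom function for the particular extracted file" step could look roughly like the sketch below. `fix_file` is a hypothetical name, the only rule shown is the "lled" case from the code above, and writing the result back over the extracted file is an assumption about the desired workflow. The `'me/'` directory is likewise just the extraction target used earlier.

```php
<?php
// Hypothetical helper: apply the "lled" repair to one file in place.
// Returns the number of replacements made.
function fix_file(string $file): int
{
    $data = file_get_contents($file);
    if ($data === false) {
        return 0; // unreadable file: skip it
    }
    // "l ed" followed by a space, full stop or comma -> "lled" + that char
    $fixed = preg_replace('/l ed([ .,])/', 'lled$1', $data, -1, $count);
    if ($count > 0) {
        file_put_contents($file, $fixed);
    }
    return $count;
}

// Assumed location of the extracted files; adjust to your setup.
$directory = 'me/';
foreach (glob($directory . '*.html') ?: [] as $file) {
    echo $file, ': ', fix_file($file), " fixes\n";
}
```

The fifth argument of `preg_replace` receives the replacement count, which makes it easy to see which files were actually touched.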

Update: str_replace is OK, but preg_replace is far more flexible. I am now at about 80% correction level and I am still working on it! Below is what I've got so far:

PHP Code:

$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = preg_replace("/\s([Aa]l)\s(\.|\?|!|,|-)?/", " $1l$2", $data);       // all
$replacer = preg_replace("/l\sed([^a-z])/", "lled$1", $replacer);               // lled
$replacer = preg_replace("/l y([^a-z])/", "lly$1", $replacer);                  // lly
$replacer = preg_replace("/\s([WwCc])al\s([^a-z])/", " $1all$2", $replacer);    // wall and call
$replacer = preg_replace("/([Ss])til\s([^a-z])/", "$1till$2", $replacer);       // still
$replacer = preg_replace("/([Ss])kul\s([^a-z])/", "$1kull$2", $replacer);       // skull
$replacer = preg_replace("/Quel\s([^a-z])/", "Quell$1", $replacer);             // Quell
$replacer = preg_replace("/ol\saps/", "ollaps", $replacer);                     // collaps
$replacer = preg_replace("/il\sage/", "illage", $replacer);                     // village
$replacer = preg_replace("/tel\sige/", "tellige", $replacer);                   // telligence
$replacer = preg_replace("/al\sucin/", "allucin", $replacer);                   // hallucina
$replacer = preg_replace("/(\s[Rr])ol\s/", "$1oll", $replacer);                 // roll
$replacer = preg_replace("/\s([Ww]el)\s([^a-z])/", " $1l$2", $replacer);        // well
$replacer = preg_replace("/\s([Ff][aeu]l)\s([^a-z])/", " $1l$2", $replacer);    // fall/fell/full
$replacer = preg_replace("/([ao])l\sow/", "$1llow", $replacer);                 // allow ollow
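As the list of rules grows, one option is to keep them in a single table and let `preg_replace` run through all of them, since it accepts arrays of patterns and replacements. The sketch below is only an illustration: `$rules` and `apply_rules` are hypothetical names, only four of the rules are shown, and the sample sentence is made up. Note that the replacement for the "allow/ollow" rule has no leading space here, so a word like "follow" is not split.

```php
<?php
// Hypothetical rule table: pattern => replacement.
$rules = [
    '/l\sed([^a-z])/' => 'lled$1',  // called, pulled
    '/l y([^a-z])/'   => 'lly$1',   // finally, really
    '/il\sage/'       => 'illage',  // village
    '/([ao])l\sow/'   => '$1llow',  // allow, follow (no leading space)
];

// Hypothetical helper: run every rule over the text, in table order.
function apply_rules(array $rules, string $text): string
{
    // preg_replace pairs each pattern with the replacement at the
    // same position in the second array
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

// Made-up sample sentence for a quick check.
echo apply_rules($rules, "Final y he fol owed the path to the vil age and cal ed."), "\n";
// prints: Finally he followed the path to the village and called.
```

Keeping a few sample sentences like this around makes it easier to notice when a new rule breaks words that an earlier rule already fixed.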

Last edited by flameproof; 05-09-2012 at 02:22 AM.