05-08-2012, 11:16 AM   #1
flameproof
Device: Sony PRS-T1
Correcting 'll' errors (al , final y, etc)

Some epub files that I converted from PDF with Calibre have an issue where one 'l' is missing wherever there should be a double 'll'.

I know the issue is still open and will hopefully be fixed one day.

In the meantime I wrote a PHP script that can fix the issue. However, there seem to be endless variations of the error and it's hard to cover them all.

Here is what I mean:

str_replace(" al ", " all ", $input);

is ok for ' al ' in running text, but ' al ,' gets replaced with ' all ,' (leaving a stray space before the comma), and then there is ' al ?', ' al -', ' al !' and so on.

I would need a few hundred str_replace commands.

Is there a cleaner way using regex?

I am not a regex expert, but I know you can have alternatives, something like ' al (\s|\.|,|\?|!|-)'. It would make the script more readable.
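
A minimal sketch of that idea, assuming the lost 'l' always shows up as a single whitespace character directly after 'al' (so 'all,' arrives as 'al ,' and 'all the' as 'al  the'). The function name fix_al is made up for illustration:

```php
<?php
// Hypothetical helper: one regex instead of one str_replace per punctuation mark.
function fix_al(string $text): string
{
    // \b             word boundary, so "al" inside "final" is left alone
    // ([Aa]l)        the damaged word, captured to keep its capitalisation
    // \s             the stray whitespace that replaced the second "l"
    // (?=[\s.,?!-])  lookahead: whitespace or punctuation must follow,
    //                but it is not consumed, so it survives unchanged
    return preg_replace('/\b([Aa]l)\s(?=[\s.,?!-])/', '$1l', $text);
}

echo fix_al("Is that al ? Yes, that is al , I think. Al  the same."), "\n";
// prints: Is that all? Yes, that is all, I think. All the same.
```

Because of the lookahead, 'al' followed by a single space and a letter (as in 'al dente') is not touched. If the converted files do not follow the assumed pattern, the character class would need adjusting.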

Just for starters:

PHP Code:

$path = 'C:/xampp/tmp/test.epub';

// an epub is a zip archive, so unpack every entry into me/
$zip = new ZipArchive;
if ($zip->open($path) === true) {

    for ($i = 0; $i < $zip->numFiles; $i++) {

        $zip->extractTo('me/', array($zip->getNameIndex($i)));

        // here you can run a custom function for the particular extracted file

    }

    $zip->close();

}


$directory = "C:/xampp/htdocs/bom3/scams/me/";

// list the extracted HTML files
$html = glob($directory . "*.html");

foreach ($html as $value)
{
    echo $value . "<br>";
}

// just an example with one file...

$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = str_replace("l ed ", "lled ", $data);
$replacer = str_replace("l ed.", "lled.", $replacer);
$replacer = str_replace("l ed,", "lled,", $replacer);

echo $replacer;

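
The "custom function" mentioned in the extraction loop could look roughly like the sketch below, which runs a fixer over every extracted .html file and writes the result back. The names fix_html_files and $fixLled are invented for illustration, and the demo works on a throwaway temp directory rather than the real paths:

```php
<?php
// Hypothetical helper: apply a fixer callback to every .html file in a directory.
// Returns the number of files whose content actually changed.
function fix_html_files(string $directory, callable $fixer): int
{
    $changed = 0;
    $files = glob(rtrim($directory, '/') . '/*.html') ?: array();
    foreach ($files as $file) {
        $before = file_get_contents($file);
        $after  = $fixer($before);
        if ($after !== $before) {
            file_put_contents($file, $after);   // overwrite in place
            $changed++;
        }
    }
    return $changed;
}

// Example fixer: the three "lled" replacements in a single str_replace call.
$fixLled = function (string $text): string {
    return str_replace(
        array("l ed ", "l ed.", "l ed,"),
        array("lled ", "lled.", "lled,"),
        $text
    );
};

// Demo on a temporary directory so no real file is touched.
$dir = sys_get_temp_dir() . '/epubfix_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/a.html', '<p>He cal ed, then yel ed.</p>');

echo fix_html_files($dir, $fixLled), " file(s) changed\n";
echo file_get_contents($dir . '/a.html'), "\n";
```

Overwriting in place is only sensible on the extracted copies; the original epub stays untouched until the files are zipped up again.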
Update: str_replace is ok, but preg_replace is far more flexible. I am now at about an 80% correction level and I am still working on it! Below is what I have so far:

PHP Code:

$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = preg_replace("/\s([Aa]l)\s(\.|\?|!|,|-)?/", " $1l$2", $data);       // all
$replacer = preg_replace("/l\sed([^a-z])/", "lled$1", $replacer);               // lled
$replacer = preg_replace("/l y([^a-z])/", "lly$1", $replacer);                  // lly
$replacer = preg_replace("/\s([WwCc])al\s([^a-z])/", " $1all$2", $replacer);    // wall and call
$replacer = preg_replace("/([Ss])til\s([^a-z])/", "$1till$2", $replacer);       // still
$replacer = preg_replace("/([Ss])kul\s([^a-z])/", "$1kull$2", $replacer);       // skull
$replacer = preg_replace("/Quel\s([^a-z])/", "Quell$1", $replacer);             // Quell
$replacer = preg_replace("/ol\saps/", "ollaps", $replacer);                     // collapse
$replacer = preg_replace("/il\sage/", "illage", $replacer);                     // village
$replacer = preg_replace("/tel\sige/", "tellige", $replacer);                   // intelligence
$replacer = preg_replace("/al\sucin/", "allucin", $replacer);                   // hallucinate
$replacer = preg_replace("/(\s[Rr])ol\s/", "$1oll", $replacer);                 // roll
$replacer = preg_replace("/\s([Ww]el)\s([^a-z])/", " $1l$2", $replacer);        // well
$replacer = preg_replace("/\s([Ff][aeu]l)\s([^a-z])/", " $1l$2", $replacer);    // fall/fell/full
$replacer = preg_replace("/([ao])l\sow/", "$1llow", $replacer);                 // allow, follow, hollow
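
Since preg_replace also accepts arrays of patterns and replacements, the rules could be kept in one table and applied in a single call. A sketch with a few of the rules above, written with single quotes; $rules and fix_ll are illustrative names, and the sample sentence assumes the same damage pattern as before (the lost 'l' appears as one space):

```php
<?php
// Illustrative rule table: pattern => replacement, applied top to bottom.
$rules = array(
    '/\s([Aa]l)\s(\.|\?|!|,|-)?/' => ' $1l$2',   // all
    '/l\sed([^a-z])/'             => 'lled$1',   // called, yelled, ...
    '/l y([^a-z])/'               => 'lly$1',    // finally, really, ...
    '/\s([Ww]el)\s([^a-z])/'      => ' $1l$2',   // well
);

function fix_ll(string $text, array $rules): string
{
    // With array arguments, preg_replace runs each pattern in order
    // on the result of the previous one.
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

echo fix_ll("He final y cal ed. Al  is wel .", $rules), "\n";
// prints: He finally called. All is well.
```

Adding a new word then means adding one line to the table. The order can matter, because each rule sees the output of the rules before it.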

Last edited by flameproof; 05-09-2012 at 01:22 AM.