Some EPUB files that I converted from PDF with Calibre are missing one 'l' wherever there should be a double 'll'.
I know the issue is still open and will hopefully be fixed one day.
In the meantime I wrote a PHP script that can fix the issue. However, there seem to be endless variants of the error and it's hard to cover them all.
Here is what I mean:
str_replace(" al ", " all ", $input);
is OK for ' al ' in the text, but ' al ,' will be replaced with ' all ,' (with a stray space before the comma), and then there are ' al ?', ' al -', ' al !' and so on.
I would need a few hundred str_replace calls.
Is there a cleaner way to do this with a regex?
I am not a regex expert, but I know you can use alternation, something like ' al (\s|\.|,|\?|!|-)'. It would make the script much more readable.
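A minimal sketch of that alternation idea, written with a character class instead of a chain of | options. It assumes the missing 'l' always shows up as exactly one whitespace character, so 'all the' reads 'al  the' and 'all,' reads 'al ,'. The helper name fixAll and the sample sentence are made up for illustration.

```php
<?php
// Hypothetical helper: restores "all" where the converter left "al " behind.
function fixAll(string $text): string
{
    // \b       word boundary, so "al" inside e.g. "total" is skipped
    // ([Aa]l)  the damaged word, upper or lower case
    // \s       the whitespace sitting where the second "l" should be
    // (?=...)  lookahead: whitespace, punctuation or end of text must follow
    return preg_replace('/\b([Aa]l)\s(?=[\s.,?!-]|$)/', '$1l', $text);
}

echo fixAll("They al  left, al , al ? Al !");
// prints: They all left, all, all? All!
```

The lookahead only checks the following character without consuming it, so the punctuation does not have to be repeated in the replacement string.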
Just for starters:
PHP Code:
$path = 'C:/xampp/tmp/test.epub';
$zip = new ZipArchive;
if ($zip->open($path) === true) {
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $zip->extractTo('me/', array($zip->getNameIndex($i)));
        // here you can run a custom function for the particular extracted file
    }
    $zip->close();
}
$directory = "C:/xampp/htdocs/bom3/scams/me/";
$html = glob($directory . "*.html");
foreach ($html as $value) {
    echo $value . "<br>";
}
// just an example with one file...
$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");
$replacer = str_replace("l ed ", "lled ", $data);
$replacer = str_replace("l ed.", "lled.", $replacer);
$replacer = str_replace("l ed,", "lled,", $replacer);
echo $replacer;
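To go beyond a single file, the same replacements could be applied to every extracted file and written back. This is only a sketch: it assumes the files are plain .html and that overwriting them in place is acceptable. The function name fixDirectory is hypothetical.

```php
<?php
// Hypothetical helper: applies the "l ed" fixes to every .html file in $dir
// and overwrites each file that changed. Returns the number of files rewritten.
function fixDirectory(string $dir): int
{
    $changed = 0;
    // glob() can return false on error, so fall back to an empty list
    $files = glob(rtrim($dir, '/') . '/*.html') ?: [];
    foreach ($files as $file) {
        $data = file_get_contents($file);
        if ($data === false) {
            continue; // unreadable file, skip it
        }
        // str_replace accepts parallel arrays of search and replace strings
        $fixed = str_replace(
            ['l ed ', 'l ed.', 'l ed,'],
            ['lled ', 'lled.', 'lled,'],
            $data
        );
        if ($fixed !== $data) {
            file_put_contents($file, $fixed);
            $changed++;
        }
    }
    return $changed;
}
```

It could be called as fixDirectory("C:/xampp/htdocs/bom3/scams/me"). Working on a copy of the extracted files is safer while the rules are still changing.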
Update: str_replace is OK, but preg_replace is far more flexible. I am now at about 80% correction level and I am still working on it! Below is what I have so far:
PHP Code:
$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");
$replacer = preg_replace("/\s([Aa]l)\s(\.|\?|!|,|-)?/", " $1l$2", $data); // all
$replacer = preg_replace("/l\sed([^a-z])/", "lled$1", $replacer); // lled
$replacer = preg_replace("/l y([^a-z])/", "lly$1", $replacer); // lly
$replacer = preg_replace("/\s([WwCc])al\s([^a-z])/", " $1all$2", $replacer); // wall and call
$replacer = preg_replace("/\b([Ss])til\s([^a-z])/", "$1till$2", $replacer); // still (no leading space in the replacement, since the pattern does not match one)
$replacer = preg_replace("/\b([Ss])kul\s([^a-z])/", "$1kull$2", $replacer); // skull
$replacer = preg_replace("/Quel\s([^a-z])/", "Quell$1", $replacer); // Quell
$replacer = preg_replace("/ol\saps/", "ollaps", $replacer); // collaps
$replacer = preg_replace("/il\sage/", "illage", $replacer); // village
$replacer = preg_replace("/tel\sige/", "tellige", $replacer); // telligence
$replacer = preg_replace("/al\sucin/", "allucin", $replacer); // hallucina
$replacer = preg_replace("/(\s[Rr])ol\s/", "$1oll", $replacer); // roll
$replacer = preg_replace("/\s([Ww]el)\s([^a-z])/", " $1l$2", $replacer); // well
$replacer = preg_replace("/\s([Ff][aeu]l)\s([^a-z])/", " $1l$2", $replacer); // fall/fell/full
$replacer = preg_replace("/([ao])l\sow/", "$1llow", $replacer); // allow, follow, hollow (no leading space, the match can start mid-word)
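As the list grows, the rules could live in one array so that each new case is a single line. A sketch, assuming the same damage as above (one whitespace character where the second 'l' should be). The names $rules and applyRules are made up, only a few of the rules above are repeated, and \b is used here instead of a leading \s, which is a small variation on the originals.

```php
<?php
// Hypothetical rule table: pattern => replacement.
$rules = [
    '/l\sed([^a-z])/'            => 'lled$1',  // called, killed
    '/l\sy([^a-z])/'             => 'lly$1',   // really, finally
    '/\b([WwCc])al\s([^a-z])/'   => '$1all$2', // wall, call
    '/\b([Ff][aeu]l)\s([^a-z])/' => '$1l$2',   // fall, fell, full
];

function applyRules(string $text, array $rules): string
{
    // preg_replace takes parallel arrays and applies each pair in array order
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

echo applyRules("She cal ed, and the wal  fel  real y fast.", $rules);
// prints: She called, and the wall fell really fast.
```

Because the pairs run in order, a more specific rule can be placed before a more general one if two of them would otherwise compete for the same text.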