Correcting 'll' errors (al , final y, etc)

flameproof · 05-08-2012, 11:16 AM

Some ePub files that I converted from PDF with Calibre have an issue where one 'l' is missing wherever there should be a double 'll'.

I know, the issue is still open and will hopefully be fixed one day.

In the meantime I wrote a PHP script that can fix the issue. However, there seem to be endless variants of the error and it's hard to cover them all.

Here is what I mean:

str_replace(" al ", " all ", $input);

works for a plain 'al' in the text, but ' al ,' will be replaced with ' all ,' (the stray space stays), and then there are ' al ?', ' al -', ' al !' and so on.

I would need a few hundred str_replace calls.

Is there a cleaner way to do this with regex?

I am not a regex expert, but I know you can list alternatives, something like ' al \s|\.|,|\?|!|-'. It would make the script more readable.
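For what it's worth, a character class can stand in for that list of alternatives. Below is a minimal sketch, assuming the damage is always a lone 'al' followed by a space; the function name fix_al is made up for illustration and is not part of the script in this thread.

```php
<?php
// Hypothetical helper: repair a stand-alone "al" that should read "all".
function fix_al(string $text): string
{
    // (\s[Aa]l)      whitespace plus the damaged word, kept as $1
    // \s(?=[.?!,-])  a stray space in front of punctuation is dropped
    // (?=\s)         otherwise only require whitespace after the word and keep it
    return preg_replace('/(\s[Aa]l)(?:\s(?=[.?!,-])|(?=\s))/', '$1l', $text);
}

echo fix_al("That is al , I think. Are they al here? Not al !"), "\n";
// prints: That is all, I think. Are they all here? Not all!
```

Because the pattern insists on whitespace or punctuation after 'al', words such as 'also' are left alone. It still needs whitespace in front, so an 'Al' at the very start of the text is not caught.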

Just for a starter:

PHP Code:

$path = 'C:/xampp/tmp/test.epub';

$zip = new ZipArchive;
if ($zip->open($path) === true) {
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $zip->extractTo('me/', array($zip->getNameIndex($i)));
        // here you can run a custom function for the particular extracted file
    }
    $zip->close();
}

$directory = "C:/xampp/htdocs/bom3/scams/me/";

$html = glob($directory . "*.html");
foreach ($html as $value) {
    echo $value . "<br>";
}

// just an example with one file...
$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = str_replace("l ed ", "lled ", $data);
$replacer = str_replace("l ed.", "lled.", $replacer);
$replacer = str_replace("l ed,", "lled,", $replacer);

echo $replacer;

Update: str_replace is ok, but preg_replace is far more flexible. I am now at about an 80% correction level and I am still working on it! Below is what I have so far:

PHP Code:

$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = preg_replace("/(\s[Aa]l)(?:\s(?=[.?!,-])|(?=\s))/", "$1l", $data);  // all (keeps the space before a following word)
$replacer = preg_replace("/l\sed([^a-z])/", "lled$1", $replacer);               // lled
$replacer = preg_replace("/l y([^a-z])/", "lly$1", $replacer);                  // lly
$replacer = preg_replace("/\s([WwCc])al\s([^a-z])/", " $1all$2", $replacer);    // wall and call
$replacer = preg_replace("/([Ss])til\s([^a-z])/", "$1till$2", $replacer);       // still
$replacer = preg_replace("/([Ss])kul\s([^a-z])/", "$1kull$2", $replacer);       // skull
$replacer = preg_replace("/Quel\s([^a-z])/", "Quell$1", $replacer);             // Quell
$replacer = preg_replace("/ol\saps/", "ollaps", $replacer);                     // collaps
$replacer = preg_replace("/il\sage/", "illage", $replacer);                     // village
$replacer = preg_replace("/tel\sige/", "tellige", $replacer);                   // telligence
$replacer = preg_replace("/al\sucin/", "allucin", $replacer);                   // hallucina
$replacer = preg_replace("/(\s[Rr])ol\s/", "$1oll", $replacer);                 // roll
$replacer = preg_replace("/\s([Ww]el)\s([^a-z])/", " $1l$2", $replacer);        // well
$replacer = preg_replace("/\s([Ff][aeu]l)\s([^a-z])/", " $1l$2", $replacer);    // fall/fell/full
$replacer = preg_replace("/([ao])l\sow/", "$1llow", $replacer);                 // allow ollow
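A chain like this could probably be collapsed into a single call, since preg_replace also accepts arrays of patterns and replacements. A rough sketch with only four of the rules; the $rules table and the apply_rules name are assumptions for illustration:

```php
<?php
// Hypothetical table of pattern => replacement pairs; extend as needed.
$rules = [
    '/l\sed([^a-z])/' => 'lled$1',  // cal ed  -> called
    '/l y([^a-z])/'   => 'lly$1',   // final y -> finally
    '/ol\saps/'       => 'ollaps',  // col apse -> collapse
    '/il\sage/'       => 'illage',  // vil age -> village
];

function apply_rules(string $text, array $rules): string
{
    // With two arrays, preg_replace applies each pattern/replacement pair in turn.
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

echo apply_rules("The vil age was final y cal ed safe.", $rules), "\n";
// prints: The village was finally called safe.
```

Keeping the rules in one table makes it easier to add a new word without touching the rest of the script.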

mrmikel · 05-09-2012, 07:55 AM

You can try converting the files from PDF with Mobipocket Creator and take the resulting HTML either into calibre or directly into Sigil and see if the result is better.

frostschutz · 05-09-2012, 08:02 AM

Regular expressions usually support "word boundary" matching in some way or other. (I forget which pattern, maybe \< \> or \b?).

So you could replace "boundary al boundary" with all.

\<al\> -> all.
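In PHP's preg functions (PCRE) the word-boundary assertion is \b; the \< and \> forms come from GNU grep and vi-style tools. A minimal sketch of the idea, with fix_al_boundary as a made-up name:

```php
<?php
// \b matches at a word boundary, so no surrounding spaces are needed
// and a word at the very start or end of the text is caught as well.
function fix_al_boundary(string $text): string
{
    return preg_replace('/\b([Aa]l)\b/', '$1l', $text);
}

echo fix_al_boundary("Al is wel . That is al"), "\n";
// prints: All is wel . That is all
```

Note that this only touches the stand-alone word: 'final y' is left as it is, and a legitimate 'al' (as in 'et al.') would be changed too, so the result still wants a manual check.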

flameproof · 05-09-2012, 10:47 AM

Quote:

Originally Posted by frostschutz

Regular expressions usually support "word boundary" matching in some way or other. (I forget which pattern, maybe \< \> or \b?).

So you could replace "boundary al boundary" with all.

\<al\> -> all.

There is very little HTML inside the actual text. In the HTML files it really is 'final y al are wel .'

I have finished my PHP script now. I can choose an ePub file, open it, run the regex clean-up over all HTML files inside the ePub, and reconstruct the ePub file.
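For anyone wanting to try the same workflow, here is a rough sketch of how the open, clean and rebuild steps might look with ZipArchive. This is not the finished script mentioned above; clean_html and fix_epub are hypothetical names, and clean_html only carries a single rule as a stand-in.

```php
<?php
// Stand-in for the real clean-up rules; only the "l ed" -> "lled" case here.
function clean_html(string $html): string
{
    return preg_replace('/l\sed([^a-z])/', 'lled$1', $html);
}

// Hypothetical helper: rewrite every .html/.xhtml entry of the ePub in place
// and return the number of entries that were changed.
function fix_epub(string $path): int
{
    $zip = new ZipArchive;
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path");
    }
    // Collect the names first, because replacing entries alters the archive index.
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $changed = 0;
    foreach ($names as $name) {
        if (!preg_match('/\.x?html?$/i', $name)) {
            continue; // leave mimetype, CSS, images etc. untouched
        }
        $old = $zip->getFromName($name);
        if ($old === false) {
            continue;
        }
        $new = clean_html($old);
        if ($new !== $old) {
            $zip->addFromString($name, $new); // replaces the existing entry
            $changed++;
        }
    }
    $zip->close();
    return $changed;
}
```

Since this writes back into the same file, it is safer to run it on a copy of the ePub.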

@mrmikel
I tried a few conversion tools and got the same errors. I see the PHP as a little fun brain exercise anyway.

SBT · 05-10-2012, 03:55 AM

Just a thought:
Since most of the errors are 'll' -> 'l ', what about putting all double-l words in a loop, something like (Sorry, don't know PHP):

Code:

for word in [actually, hallucinate, tellige, ...]
   pattern=replace(word, 'll', 'l\s')
   replace(all_text,pattern,word)
next
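In PHP the loop above might look roughly like this. The word list and the name fix_from_wordlist are assumptions for illustration; a real list would be much longer.

```php
<?php
// Hypothetical word list; each entry is the correct spelling (or a fragment of it).
$words = ['actually', 'hallucinat', 'intelligen', 'village'];

function fix_from_wordlist(string $text, array $words): string
{
    foreach ($words as $word) {
        // Build the damaged form: every "ll" became "l" plus one whitespace character.
        $damaged = str_replace('ll', 'l\s', preg_quote($word, '/'));
        $text = preg_replace('/' . $damaged . '/', $word, $text);
    }
    return $text;
}

echo fix_from_wordlist("He actual y saw the vil age.", $words), "\n";
// prints: He actually saw the village.
```

The match is case-sensitive, so a capitalised 'Actual y' would need its own entry.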

To find all instances of possibly missing l's, you can try

Code:

grep -o "[^ ]\+[^l ]l [a-z][^ ]*" text_file

Ahh... sufficiently advanced regexp is indistinguishable from keyboard white noise :-)
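The same search could also be run from PHP, for example to get a count per candidate before writing any rules. A sketch using the grep pattern above (with + in place of \+); find_candidates is a made-up name:

```php
<?php
// Hypothetical helper: list fragments that look like a split "ll", with counts.
function find_candidates(string $text): array
{
    // non-space run ending in a single "l", one space, then a lowercase continuation
    preg_match_all('/[^ ]+[^l ]l [a-z][^ ]*/', $text, $matches);
    $counts = array_count_values($matches[0]);
    arsort($counts); // most frequent candidates first
    return $counts;
}

print_r(find_candidates("final y al are wel . He final y left."));
// finds 'final y' twice
```

Like the grep version, this also reports perfectly good text such as 'until he', so the output is a list to review rather than a list to fix blindly.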

flameproof · 05-10-2012, 08:44 AM

Quote:

Originally Posted by SBT

Just a thought:
Since most of the errors are 'll' -> 'l ', what about putting all double-l words in a loop, something like ...

There are way, no WAY, too many of them. And then the variants...

Kill
kill
killing
killer

And I'd like to catch 'skill' too.

If I put a space in front (\s, that is) I will not catch words that are at the beginning of a line. It's quite tedious work to program.
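One possible way around both problems is to anchor on \b instead of \s and to put the endings into an alternation, so one rule covers a whole word family. This is only a sketch under that assumption; fix_stems and the suffix list are made up and would need extending.

```php
<?php
// Hypothetical helper: $stems are the damaged stems ("kil" for kill, "skil" for skill).
function fix_stems(string $text, array $stems): string
{
    $alt = implode('|', array_map(fn($s) => preg_quote($s, '/'), $stems));
    // \b also matches at the start of a line, unlike a leading \s.
    // The lookahead decides whether the space is the lost "l" inside a word
    // (before ing/er/ers/ed/s) or a stray gap before punctuation.
    $joined = preg_replace(
        '/\b(' . $alt . ')\s(?=(?:ing|ers?|ed|s)\b|[.,;:?!])/i',
        '$1l',
        $text
    );
    // Whatever is left is the bare word followed by a normal space.
    return preg_replace('/\b(' . $alt . ')\b/i', '$1l', $joined);
}

echo fix_stems("Kil ing is what a kil er does.\nSkil matters, so kil !", ['kil', 'skil']), "\n";
// prints: Killing is what a killer does.
//         Skill matters, so kill!
```

The /i flag together with $1 keeps the original capitalisation, so 'Kil' and 'kil' do not need separate rules.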
