MobileRead Forums > E-Book Formats > ePub

Old 05-08-2012, 11:16 AM   #1
flameproof
Member
Posts: 17
Karma: 10
Join Date: Dec 2011
Device: Sony PRS-T1
Correcting 'll' errors (al , final y, etc)

Some ePub files that I converted from PDF with Calibre have an issue: one 'l' is missing wherever there should be a double 'll'.

I know the issue is still open, and hopefully it will be fixed one day.

In the meantime I wrote a PHP script that can fix the issue. However, there seem to be endless variants of the error and it's hard to cover them all.

Here is what I mean:

str_replace(" al ", " all ", $input);

is OK for ' al ' inside the text, but ' al ,' will be replaced with ' all ,' (leaving a stray space before the comma), and then there are ' al ?', ' al -', ' al !' and so on.

I would need a few hundred str_replace calls.

Is there a cleaner way using regex?

I am not a regex expert, but I know you can list alternatives, something like ' al \s|\.|,|\?|!|-'. That would make the script more readable.
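For what it's worth, here is a minimal sketch of that idea using a character class instead of a long list of alternatives. The helper name fixAl() is made up for illustration, and the sketch assumes the damaged text has exactly one space where the second 'l' used to be.

```php
<?php
// Sketch only: fixAl() is a hypothetical helper, not part of the original script.
function fixAl(string $text): string
{
    // "al " directly before punctuation: restore the 'l' and drop the stray space.
    // [.,?!-] is a character class: any ONE of these characters.
    $text = preg_replace('/(\s)([Aa])l ([.,?!-])/', '$1$2ll$3', $text);

    // "al " between two words: restore the 'l' and keep the following space.
    // (?=\w) is a lookahead, so the next word itself is not consumed.
    return preg_replace('/(\s)([Aa])l (?=\w)/', '$1$2ll ', $text);
}

echo fixAl("That is al , and al ! But al of it."), "\n";
```

Because both patterns require whitespace in front of "al", words such as "final y" are left alone, but an "al" at the very start of the text is not caught either.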

Just for a starter:

PHP Code:

$path = 'C:/xampp/tmp/test.epub';

$zip = new ZipArchive;
if ($zip->open($path) === true) {
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $zip->extractTo('me/', array($zip->getNameIndex($i)));
        // here you can run a custom function for the particular extracted file
    }
    $zip->close();
}

$directory = "C:/xampp/htdocs/bom3/scams/me/";

$html = glob($directory . "*.html");

foreach ($html as $value) {
    echo $value . "<br>";
}

// just an example with one file...

$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = str_replace("l ed ", "lled ", $data);
$replacer = str_replace("l ed.", "lled.", $replacer);
$replacer = str_replace("l ed,", "lled,", $replacer);

echo $replacer;

Update: str_replace is OK, but preg_replace is far more flexible. I am now at about an 80% correction rate and I am still working on it! Below is what I have so far:

PHP Code:

$data = file_get_contents("C:/xampp/htdocs/bom3/scams/me/index_split_005.html");

$replacer = preg_replace("/\s([Aa]l)\s(\.|\?|!|,|-)?/", " $1l$2", $data);           // all
$replacer = preg_replace("/l\sed([^a-z])/", "lled$1", $replacer);                   // lled
$replacer = preg_replace("/l y([^a-z])/", "lly$1", $replacer);                      // lly
$replacer = preg_replace("/\s([WwCc])al\s([^a-z])/", " $1all$2", $replacer);        // wall and call
$replacer = preg_replace("/([Ss])til\s([^a-z])/", "$1till$2", $replacer);           // still (no extra leading space: the pattern matches none)
$replacer = preg_replace("/([Ss])kul\s([^a-z])/", "$1kull$2", $replacer);           // skull (same fix)
$replacer = preg_replace("/Quel\s([^a-z])/", "Quell$1", $replacer);                 // Quell
$replacer = preg_replace("/ol\saps/", "ollaps", $replacer);                         // collaps
$replacer = preg_replace("/il\sage/", "illage", $replacer);                         // village
$replacer = preg_replace("/tel\sige/", "tellige", $replacer);                       // telligence
$replacer = preg_replace("/al\sucin/", "allucin", $replacer);                       // hallucina
$replacer = preg_replace("/(\s[Rr])ol\s/", "$1oll", $replacer);                     // roll
$replacer = preg_replace("/\s([Ww]el)\s([^a-z])/", " $1l$2", $replacer);            // well
$replacer = preg_replace("/\s([Ff][aeu]l)\s([^a-z])/", " $1l$2", $replacer);        // fall/fell/full
$replacer = preg_replace("/([ao])l\sow/", "$1llow", $replacer);                     // allow ollow (no extra leading space, or "fol ow" becomes "f ollow")
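A list like this may be easier to maintain as a table of pattern => replacement pairs, since preg_replace() also accepts arrays. The sketch below is only an illustration: applyRules() and $rules are hypothetical names, and the three rules are copied from the list above.

```php
<?php
// Sketch only: applyRules() and $rules are hypothetical names.
function applyRules(string $text): string
{
    // One row per rule: regex pattern => replacement.
    $rules = [
        '/l\sed([^a-z])/' => 'lled$1',  // lled
        '/l y([^a-z])/'   => 'lly$1',   // lly
        '/il\sage/'       => 'illage',  // village
    ];

    // Given two parallel arrays, preg_replace() applies the rules in order,
    // each one working on the result of the previous one.
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

echo applyRules("They final y cal ed the vil age."), "\n";
```

Adding a new word then means adding one row to the table rather than another preg_replace line.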

Last edited by flameproof; 05-09-2012 at 01:22 AM.
Old 05-09-2012, 07:55 AM   #2
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
You can try converting the files from PDF with Mobipocket Creator and take the resulting HTML either into calibre or directly into Sigil and see if the result is better.
Old 05-09-2012, 08:02 AM   #3
frostschutz
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
Regular expressions usually support word-boundary matching in some way or other (I forget which pattern, maybe \< \> or \b?).

So you could replace "boundary a l boundary" with all.

\<al\> -> all.
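In PHP's preg_* functions (PCRE) the word-boundary assertion is \b. A minimal sketch of the suggestion above, with a made-up helper name fixAlBoundary() and the assumption that a stray space before punctuation should be dropped:

```php
<?php
// Sketch only: fixAlBoundary() is a hypothetical helper name.
function fixAlBoundary(string $text): string
{
    // \b ... \b            : "al" / "Al" as a whole word, also at the start of the text
    // ( (?=[.,?!;:-]))?    : optionally swallow one space if punctuation follows it
    return preg_replace('/\b([Aa])l\b( (?=[.,?!;:-]))?/', '$1ll', $text);
}

echo fixAlBoundary("Al of us saw it al . Final y, al of it!"), "\n";
```

The "al" inside "Final" is untouched because there is no word boundary in front of it. A genuine "Al" (the name, for instance) would be changed too, so this is a heuristic rather than a safe rule.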
Old 05-09-2012, 10:47 AM   #4
flameproof
Quote:
Originally Posted by frostschutz
Regular expressions usually support word-boundary matching in some way or other (I forget which pattern, maybe \< \> or \b?).

So you could replace "boundary a l boundary" with all.

\<al\> -> all.
There is very little HTML inside the actual text. In the HTML files it is really 'final y al are wel .'

I have finished my PHP script now. I can choose an ePub file, open it, correct all the HTML files inside the ePub by cleaning them with regex, and reconstruct the ePub file.

@mrmikel
I tried a few conversion tools and got the same errors. I see the PHP script as a fun little brain exercise anyway.
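The actual script was not posted, so the following is only a guess at what the "correct all HTML files" step could look like once the ePub has been extracted to a directory. cleanHtmlFiles() and its parameters are hypothetical; the fixing function is passed in as a callback so that any set of regex rules can be plugged in.

```php
<?php
// Sketch only: cleanHtmlFiles() and its parameters are hypothetical names.
// Runs $fix over every *.html file in $directory and rewrites changed files.
// Returns the number of files that were modified.
function cleanHtmlFiles(string $directory, callable $fix): int
{
    $changed = 0;
    // glob() may return false on error, so fall back to an empty list.
    $files = glob(rtrim($directory, '/\\') . '/*.html') ?: [];

    foreach ($files as $file) {
        $original = file_get_contents($file);
        if ($original === false) {
            continue; // unreadable file: skip it
        }
        $fixed = $fix($original);
        if ($fixed !== $original) {
            file_put_contents($file, $fixed);
            $changed++;
        }
    }
    return $changed;
}
```

Repacking the directory into a valid ePub is a separate step that this sketch does not cover.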
Old 05-10-2012, 03:55 AM   #5
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
Just a thought:
Since most of the errors are 'll' -> 'l ', what about putting all double-l words in a loop, something like this (sorry, I don't know PHP):
Code:
for word in [actually, hallucinate, tellige, ...]
   pattern=replace(word, 'll', 'l\s')
   replace(all_text,pattern,word)
next
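In PHP the loop above might look roughly like this. It is a sketch under assumptions: restoreDoubleL() and the word list are made up, and matching is case-sensitive, so capitalised variants would need their own entries.

```php
<?php
// Sketch only: restoreDoubleL() and the word list are hypothetical.
function restoreDoubleL(string $text, array $words): string
{
    foreach ($words as $word) {
        // Build the damaged form of the word: every "ll" becomes "l" + whitespace.
        // preg_quote() keeps any regex metacharacters in the word literal.
        $damaged = str_replace('ll', 'l\s', preg_quote($word, '/'));
        $text = preg_replace('/' . $damaged . '/', $word, $text);
    }
    return $text;
}

echo restoreDoubleL("He was actual y hal ucinating.", ['actually', 'hallucinat']), "\n";
```

Word stems such as 'hallucinat' cover several endings at once, which keeps the list shorter.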
To find all instances of possibly missing l's, you can try
Code:
grep -o "[^ ]\+[^l ]l [a-z][^ ]*" text_file
Ahh... sufficiently advanced regexp is indistinguishable from keyboard white noise :-)
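The same candidate search can be done from PHP with preg_match_all(). In this sketch findCandidates() is a hypothetical name, and the pattern is the grep expression above rewritten in PCRE syntax (the escaped "\+" becomes a plain "+").

```php
<?php
// Sketch only: findCandidates() is a hypothetical helper name.
// Returns each suspicious "xxl yyy" fragment with the number of times it occurs.
function findCandidates(string $text): array
{
    // Same pattern as the grep call: non-space characters ending in a single 'l',
    // then one space, then a lowercase letter and the rest of that word.
    preg_match_all('/[^ ]+[^l ]l [a-z][^ ]*/', $text, $matches);

    // Count duplicates and list the most frequent candidates first.
    $counts = array_count_values($matches[0]);
    arsort($counts);
    return $counts;
}

print_r(findCandidates("He final y saw the vil age. It was final y over."));
```

Like the grep version, this also reports legitimate word pairs such as "until we", so the output is a list to review by hand, not a list to replace blindly.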
Old 05-10-2012, 08:44 AM   #6
flameproof
Quote:
Originally Posted by SBT
Just a thought:
Since most of the errors are 'll' -> 'l ', what about putting all double-l words in a loop, something like ...
There are way, no WAY, too many of them. And then the variants...

Kill
kill
killing
killer

And I would like to catch 'skill' too.

If I put a space in front (\s, that is), I will not catch words at the beginning of a line. It's quite tedious work to program.
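One possible way around both problems, sketched under assumptions: a word boundary (\b) instead of \s also matches at the beginning of a line, the i flag covers upper and lower case, and an optional suffix group handles the variants. fixKillFamily() and the suffix list are hypothetical.

```php
<?php
// Sketch only: fixKillFamily() and the suffix list are hypothetical.
function fixKillFamily(string $text): string
{
    // \b               : word boundary, also true at the very start of a line
    // (s?kil)          : "kil" or "skil"; the i flag also covers "Kil" and "Skil"
    // (?: (...)\b)?    : optionally swallow the stray space before a known suffix
    // $1 keeps the original capitalisation; $2 is empty when no suffix matched.
    return preg_replace('/\b(s?kil)\b(?: (ing|ers?|ed|s)\b)?/i', '$1l$2', $text);
}

echo fixKillFamily("Kil ing is what a kil er does.\nSkil helps."), "\n";
```

The suffix list has to be extended by hand, but one pattern per word stem is a lot less typing than one per variant.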