#1
Member
Posts: 17
Karma: 10
Join Date: Dec 2011
Device: Sony PRS-T1
Correcting 'll' errors (al , final y, etc)
Some EPUB files that I converted from PDF with Calibre are missing one 'l' wherever there should be a double 'll'.
I know the issue is still open and will hopefully be fixed one day. In the meantime I wrote a PHP script that can fix the issue. However, there seem to be endless variants and it's hard to cover them all. Here is what I mean: str_replace(" al ", " all ", $input); is fine for 'al' in running text, but ' al ,' needs its own replacement with ' all ,', and then there is ' al ?' and '-' and '!' and so on. I would need a few hundred str_replace calls. Is there a cleaner way using regex? I am not a regex expert, but I know you can have alternatives, something like ' al \s|\.|,|\?|!|-'. It would make the script more readable. Just for a starter:
PHP Code:
Last edited by flameproof; 05-09-2012 at 01:22 AM.
#2
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
You can try converting the files from PDF with Mobipocket Creator and take the resulting HTML either into calibre or directly into Sigil and see if the result is better.
#3
Linux User
Posts: 2,282
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
Regular expressions usually support word-boundary matching in some way or other (I forget which pattern, maybe \< \> or \b?).
So you could replace "boundary al boundary" with "all": \<al\> -> all.
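For what it's worth, \b is the form that PCRE-style engines such as PHP's preg_replace and Python's re understand, while \< and \> belong to GNU grep and sed. Here is a minimal sketch in Python rather than PHP, with a made-up helper name, assuming the broken word always shows up as a standalone 'al':

```python
import re


def fix_standalone_al(text):
    """Hypothetical helper: repair a standalone 'al' and nothing else."""
    # \b matches at a word boundary, so one pattern covers 'al' followed by
    # a space, comma, '?', '!', '-' or the end of the line.
    return re.sub(r"\bal\b", "all", text)


sample = "That is al , is that al ? Yes, al !"
print(fix_standalone_al(sample))
```

Words such as 'also' or 'final' are left alone because the 'al' in them does not sit between two word boundaries. The same pattern should carry over to PHP as preg_replace('/\bal\b/', 'all', $input).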
#4
Member
Posts: 17
Karma: 10
Join Date: Dec 2011
Device: Sony PRS-T1
Quote:
I have finished my PHP script now. I can choose an ePub file, open it, correct all HTML files inside the ePub, clean them with regex, and reconstruct the ePub file.
@mrmikel I tried a few conversion tools and got the same errors. I see the PHP script as a fun little brain exercise anyway.
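The PHP itself is not shown here, so the following is only a rough sketch of the same open, clean, rebuild round trip, written in Python for illustration. The function names and the single clean-up rule are assumptions, and it assumes the HTML files are UTF-8:

```python
import re
import zipfile

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def clean_text(markup):
    # Placeholder for the real clean-up rules; only standalone 'al' is fixed.
    return re.sub(r"\bal\b", "all", markup)


def rewrite_epub(src_path, dst_path):
    """Copy an ePub (a zip archive), cleaning every HTML file on the way."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        # infolist() keeps the archive order, so 'mimetype' stays first
        # if it was first in the source file.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.lower().endswith(HTML_SUFFIXES):
                data = clean_text(data.decode("utf-8")).encode("utf-8")
            # The 'mimetype' entry must be stored uncompressed;
            # everything else can be deflated.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
```

A call would look like rewrite_epub("book.epub", "book-fixed.epub"), with both file names made up. Note that the regex runs over the raw markup, so a rule can also hit tag attributes; restricting it to text nodes would be safer.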
#5
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
Just a thought:
Since most of the errors are 'll' -> 'l ', what about putting all double-l words in a loop, something like this (sorry, I don't know PHP):
Code:
for word in [actually, hallucinate, tellige, ...]
    pattern = replace(word, 'll', 'l\s')
    replace(all_text, pattern, word)
next
Code:
grep -o "[^ ]\+[^l ]l [a-z][^ ]*" text_file
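The loop idea above could be sketched in Python roughly as follows. The function names are invented for illustration, and the sketch assumes the second 'l' was turned into a single whitespace character in the middle of a word:

```python
import re


def build_pattern(word):
    # Split at the first 'll': "actually" -> head "actua", tail "y".
    head, _, tail = word.partition("ll")
    # Resulting pattern for "actually": (actual)\s(y).
    # Both halves are captured so their original case is kept.
    return re.compile(
        "(" + re.escape(head) + r"l)\s(" + re.escape(tail) + ")",
        re.IGNORECASE,
    )


def repair_double_l(text, words):
    for word in words:
        # Skip entries with no 'll' or with nothing after it ("all", "kill");
        # those are better handled with a word-boundary rule.
        if not word.partition("ll")[2]:
            continue
        text = build_pattern(word).sub(r"\1l\2", text)
    return text


broken = "Actual y, the intel igent man was hal ucinating."
print(repair_double_l(broken, ["actually", "hallucinat", "tellige"]))
```

Word fragments such as 'hallucinat' or 'tellige' cover more inflections than full words do. The price is false positives: the 'actually' rule would also glue 'actual year' together, so the word list needs some care.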
#6
Member
Posts: 17
Karma: 10
Join Date: Dec 2011
Device: Sony PRS-T1
Quote:
Kill, kill, killing, killer. And I'd like to catch 'skill' too. If I put a space in front (\s, that is), I will not catch words at the beginning of a line. It's quite tedious work to program.
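One way around the leading space, sketched here in Python with a made-up function name and an assumed suffix list, is to anchor only the end of the word. Then 'skil' and a line-initial 'Kil' are caught by the same rule:

```python
import re


def fix_kill_family(text):
    # Step 1: rejoin split forms such as "kil ing" or "skil ed".
    # The suffix list is an assumption and would need extending.
    text = re.sub(r"(kil)\s(ing|er|ed|s)\b", r"\1l\2", text, flags=re.IGNORECASE)
    # Step 2: the bare word. There is no \s or \b in front, so "skil" and a
    # "Kil" at the start of a line both match; \b behind keeps "kill" intact.
    return re.sub(r"(kil)\b", r"\1l", text, flags=re.IGNORECASE)


print(fix_kill_family("Kil the skil ed kil er.\nkil ing"))
```

The order matters: running step 2 first would turn 'kil ing' into 'kill ing'. Capturing the matched text and reusing it in the replacement keeps 'Kil' and 'kil' in their original case.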