#1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2022
Device: none
|
a little help
Hello everyone. I'm new to the forum, so I may be posting this request in the wrong place.
Can you please help me correct this type of error? Is there an automated way, maybe a plugin, to fix it across many pages of this epub? I attach two .txt files as an example. I look forward to your replies with confidence. Thanks a lot. |
#2 |
Resident Curmudgeon
Posts: 78,914
Karma: 143095300
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
What eBook is this and where did it come from?
|
#3 |
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
|
Not really. Do you have the PDF or DOC file?
Last edited by Sarmat89; 08-09-2022 at 11:58 AM. |
#4 |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2022
Device: none
|
|
#5 |
Bibliophagist
Posts: 44,441
Karma: 167726581
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
It is a fairly trivial job using Sigil.
The first step was to mend and prettify the files to get rid of the extra spaces at the ends of lines.
The second step was to convert each space-nbsp pair to a single space, and then each remaining nbsp to a single space.
In the third step, to merge the broken paragraphs into a single paragraph, I used a simple regex. Find: Code:
</p> <p class="p1">([a-z])
Replace: Code:
\1
Then, find: Code:
<span style="letter-spacing: 0.0682em;">
Replace: Code:
<span>
I then used DiapDealer's TagMechanic to remove all the naked spans. The result was: Spoiler:
|
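For anyone who would rather script this than work in an editor, the nbsp cleanup and the paragraph merge described above could be sketched in Python roughly as follows. This is not what was done in Sigil, only an illustration. The function name `merge_broken_paragraphs` and the sample markup are made up, the `\s*` between the tags is looser than the single space in the posted pattern, and the space put back at each join is an assumption.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def merge_broken_paragraphs(html):
    """Rough Python equivalent of steps two and three (hypothetical helper)."""
    # Step two: collapse space-nbsp pairs, then any remaining nbsp, to one space.
    html = html.replace(" " + NBSP, " ").replace(NBSP, " ")
    # Step three: a paragraph that starts with a lower case letter is treated
    # as a continuation of the previous one, so the tag pair is dropped.
    # Assumption: one space is put back so the two words do not run together.
    html = re.sub(r'</p>\s*<p class="p1">([a-z])', r" \1", html)
    # Strip the style from the letter-spacing spans (value taken from the post).
    html = html.replace('<span style="letter-spacing: 0.0682em;">', "<span>")
    return html


sample = (
    '<p class="p1">Era una notte\u00a0buia</p>\n'
    '<p class="p1">e tempestosa.</p>\n'
    '<p class="p1">Poi venne il giorno.</p>'
)
print(merge_broken_paragraphs(sample))
```

The third paragraph in the sample starts with a capital letter, so it is left as a separate paragraph. The naked `<span>` tags still have to be removed afterwards, which is the job TagMechanic did in the post.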
#6 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2022
Device: none
|
Quote:
As a complete novice on the subject, I will try your suggestions. |
|
#7 | |
Bibliophagist
Posts: 44,441
Karma: 167726581
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Quote:
As I mentioned, this was done using Sigil as the editor. It can also be done using calibre's editor, but some of the syntax is slightly different. |
|
#8 |
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
|
You should use \p{Ll} instead of [a-z], though.
|
#9 |
Bibliophagist
Posts: 44,441
Karma: 167726581
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
I thought that [a-z] would make it a bit clearer that I was matching lower case letters at the start of a line. \p{Ll} matches a lower case letter that has an upper case equivalent and, as far as I am aware, Italian lacks lower case letters that don't have an upper case equivalent. Even German now has an upper case ẞ (since 2017???).
Edit: Mostly a question of style, and evidence that if there is more than one way to do something, they will all be used.
Edit2: As pointed out, [a-z] does not catch accented letters. My bad!
Last edited by DNSB; 08-11-2022 at 04:06 PM. |
#10 |
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
|
Italian has quite a number of accented letters which are not caught by [a-z].
|
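To see the difference on Italian text, here is a small Python sketch. Python's built-in `re` module does not understand `\p{Ll}`, so as a stand-in this checks the Unicode category of the captured character with `unicodedata`; the third-party `regex` module accepts `\p{Ll}` directly. The names `PARA_BREAK` and `join_if_lowercase` and the sample sentence are made up for illustration, and the space added at the join is an assumption.

```python
import re
import unicodedata

# Capture the first word character of the following paragraph.
PARA_BREAK = re.compile(r'</p>\s*<p class="p1">(\w)')


def join_if_lowercase(match):
    first = match.group(1)
    # Category "Ll" is the Unicode "Letter, lowercase" class that \p{Ll} names.
    if unicodedata.category(first) == "Ll":
        return " " + first
    return match.group(0)  # not lower case: leave the paragraph break alone


text = '<p class="p1">La città</p> <p class="p1">è bella.</p>'

# The ASCII-only class does not match the accented "è", so nothing changes.
ascii_only = re.sub(r'</p>\s*<p class="p1">([a-z])', r" \1", text)
# The category check treats "è" as lower case and merges the paragraphs.
unicode_aware = PARA_BREAK.sub(join_if_lowercase, text)

print(ascii_only == text)
print(unicode_aware)
```

With `[a-z]` the broken paragraph is silently skipped, which is exactly the kind of miss that is easy to overlook in a long book.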
#11 |
Bibliophagist
Posts: 44,441
Karma: 167726581
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Betrayed by my English background! I had never noted that [a-z] did not match an accented lower case letter.
|
#12 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Last year, I wrote this post describing my 3-step "merge 'broken' paragraphs" method:
I deal with a lot of OCR from PDFs, so fixing up all that mess is very common.
I also listed a ton of other advanced tricks + common errors to look out for.
Sometimes it's better to use more "human-readable" examples instead of "correct, but extremely cryptic" regex... especially when teaching complete noobs.
I remember when I first learned about regex, they consistently used complicated "email verification" examples... where you have no idea what sort of voodoo made it work. It wasn't until years later, when actually working on the OCR stuff, that I figured out the true power of regex by building up from the very basic building blocks.
- - -
PS. If you want even more regex tips, I usually color-code and give step-by-step breakdowns of my examples. See:
Last edited by Tex2002ans; 08-11-2022 at 09:04 PM. |
|