Old 08-09-2022, 11:03 AM   #1
truborn
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2022
Device: none
a little help

Hello everyone. I'm new to the forum, so I may be posting this request in the wrong place.
Can you please help me correct this type of error?
Is there an automated way, maybe a plugin, to fix many pages of this epub?
I attach two .txt files as examples.
I look forward to your replies.
Thanks a lot
Attached Files
File Type: txt ec1.txt (3.0 KB, 113 views)
File Type: txt ec2.txt (5.0 KB, 95 views)
Old 08-09-2022, 11:46 AM   #2
JSWolf
Resident Curmudgeon
Posts: 78,914
Karma: 143095300
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
What eBook is this and where did it come from?
Old 08-09-2022, 11:54 AM   #3
Sarmat89
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
Not really. Do you have the PDF or DOC file?

Last edited by Sarmat89; 08-09-2022 at 11:58 AM.
Old 08-09-2022, 03:06 PM   #4
truborn
Quote:
Originally Posted by Sarmat89
Not really. Do you have the PDF or DOC file?
Sorry, no, I don't have a PDF or DOC. The epub was given to me as it is, and a line-by-line correction would be a huge job. Is there really nothing that can be done?
Thanks for your interest
Old 08-09-2022, 03:32 PM   #5
DNSB
Bibliophagist
Posts: 44,441
Karma: 167726581
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
It is a fairly trivial job using Sigil.

The first step was to mend and prettify the files to get rid of those extra spaces at the end of the line.

The second step was to convert the space-nbsp pairs to a single space and the remaining nbsp's to a single space.
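Outside of Sigil, the whitespace cleanup in the first two steps could be approximated with a short script. This is only a minimal sketch, not what Sigil's Mend and Prettify actually does: the function name `normalize_spaces` is hypothetical, and it assumes the nbsp's appear either as the literal U+00A0 character or as the `&nbsp;` / `&#160;` entities.

```python
import re

NBSP = "\u00a0"  # the non-breaking space character

def normalize_spaces(text: str) -> str:
    """Hypothetical helper approximating steps one and two."""
    # Assumption: nbsp may be written as an entity; turn it into the character first.
    text = re.sub(r"&nbsp;|&#160;", NBSP, text)
    # Step 1 (partial): drop spaces, tabs and nbsp's at the end of each line.
    text = re.sub(r"[ \t\u00a0]+$", "", text, flags=re.MULTILINE)
    # Step 2a: a space followed by an nbsp becomes a single space.
    text = text.replace(" " + NBSP, " ")
    # Step 2b: any remaining nbsp becomes a single space.
    text = text.replace(NBSP, " ")
    return text
```

Run it on a copy of each XHTML file from the epub, never on the only copy of the book.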

In the third step, to convert the multiple paragraphs to a single paragraph, I used a simple regex:
Code:
</p>

  <p class="p1">([a-z])
with the replacement being a space followed by \1:
Code:
 \1
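For anyone who prefers a script to Sigil's Find & Replace, the same search and replacement can be sketched in Python. The function name `merge_broken_paragraphs` is hypothetical, and the pattern uses `\s*` between the tags as an assumption, where the original search matches the exact line break and indentation.

```python
import re

# </p>, any whitespace, then an opening <p class="p1"> whose text starts
# with a lower case ASCII letter: a sign that the paragraph was split mid-sentence.
BROKEN_BREAK = re.compile(r'</p>\s*<p class="p1">([a-z])')

def merge_broken_paragraphs(xhtml: str) -> str:
    # Replace the paragraph break with a space, keeping the captured letter.
    return BROKEN_BREAK.sub(r" \1", xhtml)

sample = (
    '<p class="p1">Quando entrai</p>\n\n'
    '  <p class="p1">in camera mia.</p>\n\n'
    '  <p class="p1">Battei le palpebre.</p>'
)
merged = merge_broken_paragraphs(sample)
```

The third paragraph is left alone because it starts with an upper case letter.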
For the fourth step, I searched for:
Code:
<span style="letter-spacing: 0.0682em;">
and replaced it with:
Code:
<span>
.

I then used DiapDealer's TagMechanic to remove all the naked spans.
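The fourth step and the span cleanup can also be sketched in Python. This is not how TagMechanic works internally; it is a small stand-in under stated assumptions. The function names are hypothetical, the `letter-spacing` value is the one from the search above, and the code assumes well-formed markup with no self-closing `<span/>` tags.

```python
import re

# Matches either an opening <span ...> (attributes captured) or a closing </span>.
SPAN_TAG = re.compile(r"<span\b([^>]*)>|</span\s*>")

def strip_letter_spacing(xhtml: str) -> str:
    # Step 4: turn the styled span into a "naked" span.
    return xhtml.replace('<span style="letter-spacing: 0.0682em;">', "<span>")

def unwrap_naked_spans(xhtml: str) -> str:
    # Remove <span> tags that carry no attributes, keeping their contents.
    out, stack, pos = [], [], 0
    for m in SPAN_TAG.finditer(xhtml):
        out.append(xhtml[pos:m.start()])
        pos = m.end()
        if m.group(0).startswith("</"):
            # A closing tag belongs to the most recently opened span.
            naked = stack.pop() if stack else False
        else:
            naked = m.group(1).strip() == ""
            stack.append(naked)
        if not naked:
            out.append(m.group(0))  # keep spans that have attributes
    out.append(xhtml[pos:])
    return "".join(out)

def clean_spans(xhtml: str) -> str:
    return unwrap_naked_spans(strip_letter_spacing(xhtml))
```

The stack is what keeps nested spans paired correctly: a span with a class inside a naked span keeps both of its tags, while only the naked pair is dropped.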

The result was:
Spoiler:
Quote:
<p class="p1">Quando entrai in camera mia, credetti di aver lasciato uno di quei film acceso sul lettore DVD. Ma nel giro di pochi secondi capii che non era così. Ciò che vedevo era reale. Per quanto realistici si sforzassero di essere i film che guardavo, c’era sempre la consapevolezza che si trattava di finzione. Questo mi creava un blocco nella mente, una sorta di pozzo dell’orrore del quale non ero del tutto conscia finché non assistevo al dolore vero, alla sofferenza autentica. Alcune scene del notiziario serale mi disgustavano e mi turbavano e dovevo comunque arrendermi a guardare almeno in parte la robaccia che guardavano online i miei amici. Decapitazioni, incidenti d’auto, morti e omicidi della vita reale. Sapevo che scene del genere mi avevano traumatizzato sul serio. Inoltre, ricordi dell’incidente sepolti nella memoria riaffioravano nei momenti più inaspettati. Mi ci volle qualche istante per rendermi conto di quanto stavo osservando. Un ammasso di carne rossa e sanguinolenta che penzolava da una fune, ondeggiando al minimo tocco. Sullo sfondo, i resti di un paio di teloni per tende da campeggio rovesciati a terra e su uno di essi una sagoma travolta e aggredita, che muoveva in fretta gli arti. Sembrava uno di quei giocattoli a molla che era stato caricato troppo.</p>

<p class="p1">Battei le palpebre e mi sedetti sul letto. “È il posto che stavo guardando prima”.</p>

<p class="p1">La grotta ora era fuori dall’inquadratura e la videocamera era ferma immobile, come se fosse stata sistemata su un treppiede. Le luci appese tra le tende ancora in piedi si agitavano a mezz’aria, proiettando ombre inquiete.</p>

<p class="p1">Battei di nuovo le palpebre, come per resettare la vista. Tenni gli occhi chiusi per più tempo del normale, pensando: “Che cosa sto guardando?”.</p>

<p class="p1">Quando li riaprii, qualcuno sbucò dal fitto degli alberi e cercò di entrare in una delle grandi tende. Qualcosa – una sagoma nell’aria, una macchia sullo schermo, forse persino un’immagine fantasma – seguì i malcapitati attraverso la radura.</p>

<p class="p1">Non appena li toccò, quelli caddero a terra.</p>

<p class="p1">Il cuore mi galoppava nel petto, battendo dolorosamente all’impazzata. Mi avvicinai allo schermo, ma le persone era molto lontane, nascoste dalle ombre tremolanti, e la vicinanza non fece altro che rendere l’immagine ancora più sgranata. Sembrava che stessero lottando. Il loro viso non era più bianco.</p>

<p class="p1">Era rosso.</p>

<p class="p1">Se si fosse trattato di un film horror, avrei riso per gli effetti speciali. Non riuscivo a vedere cosa stesse accadendo. Era tutto così confuso. Le sagome massacrate ora si contorcevano appena, come se stessero per esaurire le forze. La carne continuava a dondolare.</p>

<p class="p1"><a id="a85"></a>Qualcosa si separò dall’oggetto appeso alla fune, da quelli che ormai avevo capito essere i resti di uno degli speleologi. Rimase lì aggrappata per un po’,</p>
Old 08-11-2022, 09:08 AM   #6
truborn
Thanks so much.
As a complete novice on the subject, I will try your suggestions.
Old 08-11-2022, 11:39 AM   #7
DNSB
Quote:
Originally Posted by truborn
Thanks so much.
As a complete novice on the subject, I will try your suggestions.
Basically, as long as you keep a backup of the book, you can always start over. So have fun.

As I mentioned, this was done using Sigil as the editor. It can also be done using calibre's editor, but some of the syntax is slightly different.
Old 08-11-2022, 12:26 PM   #8
Sarmat89
You should use \p{Ll} instead of [a-z], though.
Old 08-11-2022, 01:40 PM   #9
DNSB
Quote:
Originally Posted by Sarmat89
You should use \p{Ll} instead of [a-z], though.
I thought that [a-z] would make it a bit clearer that I was matching lower case letters at the start of a line. \p{Ll} matches a lower case letter that has an upper case equivalent and, as far as I am aware, Italian lacks lower case letters that don't have an upper case equivalent. Even German now has an upper case ẞ (since 2017???).

Edit: Mostly a question of style and evidence that if there is more than one way to do something, they will all be used.

Edit2: As pointed out, [a-z] does not catch accented letters. Me bad!

Last edited by DNSB; 08-11-2022 at 04:06 PM.
Old 08-11-2022, 03:09 PM   #10
Sarmat89
Italian has quite a number of accented letters which are not caught by [a-z].
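A quick way to see the difference is a small test. In Sigil, `\p{Ll}` can be used directly in the search field; Python's standard `re` module does not support `\p{...}` (the third-party `regex` module does). The sketch below therefore uses a stand-in: match any letter with `[^\W\d_]` and check `str.islower()` in a callback. The function name `merge_unicode` is hypothetical.

```python
import re

ASCII_LOWER = re.compile(r'</p>\s*<p class="p1">([a-z])')
# [^\W\d_] is "a word character that is neither a digit nor an underscore",
# i.e. any Unicode letter when the pattern is a str.
ANY_LETTER = re.compile(r'</p>\s*<p class="p1">([^\W\d_])')

def merge_unicode(xhtml: str) -> str:
    def join(m: re.Match) -> str:
        first = m.group(1)
        # Merge only when the next paragraph starts with a lower case letter.
        return " " + first if first.islower() else m.group(0)
    return ANY_LETTER.sub(join, xhtml)

sample = '<p class="p1">Sapevo che</p>\n<p class="p1">è vero.</p>'
ascii_hit = ASCII_LOWER.search(sample)   # None: "è" is outside a-z
fixed = merge_unicode(sample)
```

With `[a-z]`, a paragraph that was broken just before an accented letter such as "è" would be silently skipped.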
Old 08-11-2022, 04:06 PM   #11
DNSB
Betrayed by my English background! I had never noted that [a-z] did not match an accented lower case letter.
Old 08-11-2022, 08:52 PM   #12
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by DNSB
In the third step, to convert the multiple paragraphs to a single paragraph, I used a simple regex:
Good steps. I agree.

Last year, I wrote this post describing my 3-step "merge 'broken' paragraphs" method:

I deal with a lot of OCR from PDFs, so fixing up all that mess is very common.

I also listed a ton of other advanced tricks + common errors to look out for.

Quote:
Originally Posted by Sarmat89
You should use \p{Ll} instead of [a-z], though.
Sometimes it's better to use more "human-readable" examples instead of "correct, but extremely cryptic" regex... especially when teaching complete noobs.

I remember when I first learned about regex, they consistently used complicated "email verification" examples... where you have no idea what sort of voodoo made it work.

It wasn't until years later, when actually working on the OCR stuff, that I figured out the true power of regex by building up from the very basic building blocks.

- - -

PS. If you want even more regex tips, I usually color-code and give step-by-step breakdowns of my examples. See:

Last edited by Tex2002ans; 08-11-2022 at 09:04 PM.
All times are GMT -4.