#1 |
Tenrec
Posts: 724
Karma: 1076988
Join Date: Oct 2012
Device: Kobo Aura One, Kobo Glo
|
Trying to fix typos and bad formatting
I'm not really experienced with this sort of thing, but I have been reading a horribly typo-filled epub and finally decided to try to fix it. It's a public domain book (pretty sure in all countries), so I'm posting a copy here.

I tried using Tweak Book in calibre, so the first file has been edited, but go to page 143 out of 1114 in the calibre book viewer (not sure how else to mark the place) and it is in the original condition. I'm kind of mad I wasted the time on the first file, as that is about how far I'd already read on my Glo. It seems to be one of those books that was scanned, run through a crappy OCR, and that was it. Since it's free and there aren't lots of copies of this book out there, I guess I can't complain; I had trouble enough finding this copy. BUT fixing it has been a real pain. Even doing spell check is pretty annoying, since there is a lot of French in the book as well as a lot of names of people and places that don't repeat. I don't even know if I could do the spell check if I didn't already know how this author writes and how his work is usually translated.

On top of the typos, there are innumerable extra spaces *everywhere*, and sentences lopped in half with a new page and new paragraph marker. So I've been trying to do search and replace, but there are so many different cases.

So I guess the big question is: is there a simpler way to do what I've been doing? I've spent hours just doing the first file, and there are 8 or so. I could pretty much read it in the process. I'm not at all experienced at this type of thing, so maybe I'm missing something, or maybe this situation just requires this much work?

I was also curious about 2 formatting things:
1) The <p> and </p> markers: is it the fact that the next ones are on a new line that is making the spaces appear between paragraphs on my Glo?
2) There are randomly placed new pages using this code: <div class="newpage" id="page-6"/> Is that page ID info pertinent to counting pages using the ADE method, or is it a totally unnecessary new page that I can get rid of (especially since they are never anywhere near new chapters)?

If an expert wants to take a look at one of the files (not part0000...unless you can correct my edited version even more).
#2 |
Junior Member
Posts: 6
Karma: 10
Join Date: May 2012
Location: Australia
Device: Kobo Touch
|
I've used an ePub to PDF converter, then used MS Word to fix up the spelling, formatting and other mistakes, then loaded it into Calibre to convert back into ePub. However, it's a very laborious and incredibly time-consuming exercise. Others may have better methods.
|
#3 |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
When I fix books like this, I use Sigil. That allows me to see the epub and do search and replace across all files in it. For a book like this, which is probably an OCRed scan, that means I can fix the repeated mistakes easily. It also has a spelling checker, and you can add words or names to the list to ignore.

I use Tweak Book in calibre for simple things, mainly to edit the stylesheet or fix a single spelling mistake. Don't forget to press the "Rebuild" button. I've hit the cancel button or escape key a few times without thinking and wondered where the changes went.

Having the <p> tags start on a new line won't add spaces. From what I can see, the extra spaces are between the <p> and </p>. There are a lot wrapping the quotes and other punctuation. I know the first change I would make would be a global change of
Code:
<p> "
to
Code:
<p>"

The "<div class="newpage" id="page-405"/>" elements have probably been put there by the person who created the file to map back to the original book. The desktop ADE didn't seem to use them. As there isn't a definition of the "newpage" class, I don't think they will do anything. I have seen a similar thing in other books, but they used an anchor tag.

Just remembered: I looked in the MR library for a copy of this. There is only a German version. Unless you are trying to match a paper copy you have, it might be useful to look at it to get ideas of the styles used and to compare the punctuation.

Last edited by davidfor; 03-27-2013 at 11:36 PM. Reason: Accidentally hit the submit button.
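For anyone who would rather script the repeated fixes than click through them, here is a minimal Python sketch of the global change described above. It is only an illustration, not what Sigil does internally: the function name `tidy_xhtml` and the `drop_page_markers` option are made up for this example, and it assumes the page markers always look exactly like the `<div class="newpage" id="page-N"/>` lines quoted in this thread.

```python
import re

def tidy_xhtml(text: str, drop_page_markers: bool = False) -> str:
    """Apply a few conservative cleanups to the text of one XHTML file."""
    # Remove spaces or tabs directly after an opening <p> tag, so '<p> "' becomes '<p>"'.
    text = re.sub(r"(<p\b[^>]*>)[ \t]+", r"\1", text)
    # Remove spaces or tabs directly before a closing </p> tag.
    text = re.sub(r"[ \t]+(</p>)", r"\1", text)
    if drop_page_markers:
        # Assumption: markers are self-closing divs of the exact form seen in this book.
        text = re.sub(r'<div class="newpage" id="page-\d+"\s*/>', "", text)
    return text

sample = '<p> "Hello," he said. </p><div class="newpage" id="page-405"/>'
print(tidy_xhtml(sample, drop_page_markers=True))
```

Run it on a copy of the book first; a regex cannot tell whether a space after a quote mark is a scanning error or intended, which is why this sketch only touches whitespace at the very start and end of a paragraph.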
#4 | |
Wizard
Posts: 3,821
Karma: 19162882
Join Date: Nov 2012
Location: Te Riu-a-Māui
Device: Kobo Glo
|
Quote:
None of this is really Kobo-specific; you might get more help in the epub forum.
|
#5 | |
Bibliophagist
Posts: 44,887
Karma: 168802811
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Quote:
I inserted a more useful stylesheet which you might want to take a look at. It's where I set up stuff like left justification, line spacing, spacing between paragraphs and other fun stuff like an inline-block display to allow centering images on ADE and its equally braindead derivatives. I also added a cover image which is a bit easier to look at than a black block with a bar code.

This took about 20 minutes total. I've left most of the typographical errors for you to clean up since I didn't want to actually read the book.

Regards,
David
|
#6 |
Evangelist
Posts: 415
Karma: 469928
Join Date: Feb 2012
Device: Kobo Clara BW, Moaan Inkpalm 5.
|
I use calibre to explode the epub, then use bluefish to make corrections.
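If calibre isn't handy, the same "explode, edit, rebuild" round trip can be sketched with Python's standard `zipfile` module, since an EPUB is a zip archive. This is a rough sketch rather than calibre's own method; the function names are invented for the example. The one detail that matters is that the `mimetype` file has to be the first entry in the archive and stored uncompressed.

```python
import zipfile
from pathlib import Path

def explode_epub(epub_path: str, out_dir: str) -> None:
    """Unzip an EPUB into a folder so its files can be edited in any text editor."""
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out_dir)

def rebuild_epub(src_dir: str, epub_path: str) -> None:
    """Zip a folder back into an EPUB, writing `mimetype` first and uncompressed."""
    src = Path(src_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        zf.write(src / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src).as_posix()
            if path.is_file() and rel != "mimetype":
                zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)
```

Write the rebuilt file somewhere outside the exploded folder, otherwise the new archive would try to include itself.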
|
#7 |
Tenrec
Posts: 724
Karma: 1076988
Join Date: Oct 2012
Device: Kobo Aura One, Kobo Glo
|
Thanks to all for your replies, especially davidfor and DNSB!

davidfor, I had read a lot of people saying Sigil isn't supported any longer on Linux, so I hadn't bothered looking into it before. But when you mentioned that I can do all the files together, I thought it was worth a look, and realised that they must have meant that the support is lacking compared to the past. So I was able to get an (albeit older) version installed this morning. I guess if I ever learn how to build from source code, I can do that, as they do provide the source. But this should work for now. Thanks also for pointing out the German version; I'll take a look at that to give me ideas.

DNSB, thanks for going to the trouble of fixing up some of the bigger mistakes....that's awesome! And yeah, it is pretty much a reading job to fix the OCR mistakes....and a pretty unenjoyable reading experience at that. I will look at the stylesheet. I don't know much about that at this point, so I guess I should learn, but d*** ignorance was supposed to be bliss!!!

Another question: one weird thing did happen when I put the partially corrected version back onto my Glo. Last night, as I was reading the little section I'd corrected yesterday but hadn't previously read, I stumbled upon a sentence that doesn't appear correctly on my Glo: "To avoid having to make the journey on horseback." (page 56 of 436 on my Glo). On the Glo, it shows "To avoid having to make the journey on horse-". When I tried to highlight "horse-", it came up as horseback in the dictionary, but with the markers off screen and the word not shown as highlighted. When I highlight that sentence and the next, the highlighting works normally, and when I go into annotations, the whole sentence is there, but the rest of the word never shows on my screen. Did I mistakenly delete something in the coding, or is this some other issue? Does anyone know how I can fix it? The second half of the word is not missing when I use the calibre viewer.

Edit: When I look at it in Sigil, I don't see anything strange with the word horseback in that sentence, so it seems to be something else. I tried changing the font size to be smaller, and it shows the whole word. But when I change the font size to be bigger again, it doesn't show it. When I changed the font (this time to Caecilia) and the whole sentence fit on one line again, it showed, but when I enlarged that font so that the sentence wouldn't fit on one line, it once again cut off the second half of the word. I tried a few fonts, and as soon as I made them big enough that the sentence didn't fit on one line, the same thing happened. I haven't seen this happen before; it seems quite strange. Any ideas?

Last edited by Uschiekid; 03-28-2013 at 10:02 AM. Reason: update!
#8 |
Grand Sorcerer
Posts: 13,309
Karma: 78876004
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
The missing word (or part of a word) is an issue I have raised numerous times in the Kobo beta test, to an overwhelming silence.
I've even supplied a one-page test case that demonstrates it to them. It appears to be dependent on font and font size.
#9 | |
Resident Curmudgeon
Posts: 79,169
Karma: 144286760
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
Just edit the ePub and forget PDF even exists. |
|
#10 | |
Tenrec
Posts: 724
Karma: 1076988
Join Date: Oct 2012
Device: Kobo Aura One, Kobo Glo
|
Quote:
Thanks for letting me know, I won't waste time trying to fix it!

Edit: Interestingly, I just sent DNSB's copy to my Glo and that problem doesn't arise. The word properly goes onto a second line when the font is enlarged.

Last edited by Uschiekid; 03-28-2013 at 11:15 AM.
|
#11 | |
Tenrec
Posts: 724
Karma: 1076988
Join Date: Oct 2012
Device: Kobo Aura One, Kobo Glo
|
Quote:
For the copy I uploaded here on MR, I had just done the "generate cover" thing in metadata, so I had an OK cover on my Glo. But obviously the cover you added is better; I just am unaware of how to make that cover visible.

Edit: Still curious to know what I did wrong, but I ended up just copying the cover from the metadata of the German version that davidfor pointed out (which of course is the same one you added).

Last edited by Uschiekid; 03-28-2013 at 01:20 PM.
|
#12 |
Tenrec
Posts: 724
Karma: 1076988
Join Date: Oct 2012
Device: Kobo Aura One, Kobo Glo
|
This is turning out to be super helpful, as I read German (but for obvious reasons don't want to just read an originally French book in German!) and I can easily check it when the OCR mistakes are unintelligible or for unknown names/places! So thanks again!
Last edited by Uschiekid; 03-28-2013 at 01:52 PM.
#13 |
Guru
Posts: 691
Karma: 3026110
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro +
|
Did you download the original from archive.org?

If so, then it's not surprising that there is a lot of poor formatting: the OCR'd text in those books was really intended to provide a text search layer for the DJVU (and perhaps PDF) versions, so the text wasn't cleaned up before they used some process to reformat it as EPUB. You sometimes find whole pages that have missed being OCR'd; the original image shows up in the DJVU but is not mapped to any text.

The point here is that something like OpenOffice/LibreOffice is an ideal way to edit the basic words, sentences, etc. in the document without having to worry about the HTML markup in search & replace. My approach for these books is to download the text file and then edit it in OpenOffice using a set of macros for general text tidying, then either import the resulting .odt file into calibre or use the W2Epub extension to produce the EPUB.

I think the d/l'd text files are in markdown format, so you could do an initial conversion to extract headers etc. using some other s/w, or use macros to translate headings etc. Exporting the text using a DJVU viewer gives simple text without the markup. I'm pretty sure that neither has any italic or bold formatting in the extracted text, and the only way to re-introduce it is simply by comparing the image version with the corresponding text.

It's a painful process, as I know, and one made worse if your book is full of dialect or foreign words which show up as misspellings.

BobC
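As a rough idea of what "general text tidying" can look like outside of OpenOffice macros, here is a small Python sketch. It is not BobC's macro set (which isn't shown in this thread); the function name and the rejoining rule are assumptions chosen for illustration. It collapses runs of whitespace and rejoins a paragraph to the previous one when the previous one doesn't end in closing punctuation and the new one starts with a lowercase letter.

```python
import re

# Characters that plausibly end a complete paragraph (a heuristic, not a rule).
SENTENCE_END = (".", "!", "?", ":", '"', "'", ")", "\u201d", "\u2019")

def tidy_paragraphs(paragraphs: list[str]) -> list[str]:
    """Collapse extra whitespace and rejoin paragraphs that were cut mid-sentence."""
    merged: list[str] = []
    for para in paragraphs:
        para = re.sub(r"\s+", " ", para).strip()
        if not para:
            continue  # skip paragraphs that were only whitespace
        if merged and not merged[-1].endswith(SENTENCE_END) and para[0].islower():
            merged[-1] += " " + para  # looks like a continuation of the previous one
        else:
            merged.append(para)
    return merged

print(tidy_paragraphs(["He  rode to", "the   castle.", "Then he slept."]))
```

The heuristic will miss splits that happen right before a capitalised name, so the result still needs a read-through against the page images.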
#15 | |
Bibliophagist
Posts: 44,887
Karma: 168802811
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Quote:
Look for this line:
<meta content="cover-image" name="cover" />
and replace it with:
<meta content="cover" name="cover" />
and calibre should show the correct cover.

Last edited by DNSB; 03-28-2013 at 05:05 PM.
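The reason that change works, as far as I understand the OPF format, is that the `content` attribute of the `name="cover"` meta has to match the `id` of an item in the manifest. A quick way to check that is sketched below in Python; the function name is made up, and the sample OPF in the usage lines is a cut-down assumption, not the actual file from this book.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def cover_meta_is_valid(opf_xml: str) -> bool:
    """Return True if <meta name="cover"> points at an existing manifest item id."""
    root = ET.fromstring(opf_xml)
    meta = root.find(".//opf:metadata/opf:meta[@name='cover']", OPF_NS)
    if meta is None:
        return False  # no cover meta at all
    manifest_ids = {
        item.get("id") for item in root.findall(".//opf:manifest/opf:item", OPF_NS)
    }
    return meta.get("content") in manifest_ids

# Hypothetical, cut-down OPF showing the mismatch described above.
sample_opf = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata><meta content="cover-image" name="cover"/></metadata>
  <manifest><item id="cover" href="cover.jpeg" media-type="image/jpeg"/></manifest>
</package>"""
print(cover_meta_is_valid(sample_opf))
```

If the check fails, either edit the meta line as described above or rename the manifest item's id so the two agree.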
|