Old 10-17-2010, 07:54 AM   #1
cybmole
Wizard
Posts: 2,849
Karma: 1163098
Join Date: Sep 2010
Device: Kobo aura HD, Kobo Arc, Kindle Fire HDX 8.9 , Kindle for PC
pdf with split words at end of line - how best to convert

attempting a pdf to epub conversion ( & yes, I know it's a dumb thing to do ). All goes well except where the original PDF has split a word over 2 lines, which happens a lot in this document

e.g. if PDF goes
line 1: xxxxxxxxxxxxxxxxxxxxxx al-
line 2: so xxxxxxxxxxxxxxxxx

then the epub comes out as " al‐ so"
but with the hyphen replaced by a thick black (bold?) vertical line after the l of also. ( NB it doesn't appear when I copy from the epub reader & paste to here, but I also see it in the source window when I open the calibre wizard. )

a text version of the source ( In notepad) shows
al-
so

i.e. there's a line break in there.

it must be to do with how a line break character in the PDF is being translated.

is there any way to remove / suppress it ?

update - I ticked the transliterate unicode box & reconverted zip to epub - that removed the thick black character, so now I just see a broken word, e.g. "al- so".

is it possible to force an auto repair of all broken words somehow? It would be like a global replace of "- " with nothing, but filtering out the genuine uses of "-" characters - something like: remove all "- " except when preceded by a space ?

Last edited by cybmole; 10-17-2010 at 08:03 AM.
Old 10-17-2010, 08:51 AM   #2
sherman
Addict
Posts: 268
Karma: 6566
Join Date: Aug 2008
Location: New Plymouth - NZ
Device: Sony PRS-505/SC, B&N Nook, Sony PRS-650/BC, Kobo Glo
The main problem is that while many of the end of line hyphens are there to break up words to improve the typography of the book, some will be genuinely hyphenated words that should remain so.

And there probably isn't an automated way of determining this during conversion.
Old 10-17-2010, 09:16 AM   #3
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Actually Calibre does go through and remove end-of-line hyphens intelligently. It uses the document itself as a dictionary to see if there is a variant of the word without a hyphen, and deletes the hyphen if there is a match.

The problem in this case is it's a crappy pdf with some other character encoded in addition to the hyphen. Unless this is a common issue across many pdfs (and I've never seen it with lots of test cases), it's probably not something that will get covered in the code.
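
A rough sketch of the "document as its own dictionary" idea described above. This is not Calibre's actual code; the function name `dehyphenate` and the simple ASCII-only word pattern are assumptions made for illustration.

```python
import re

def dehyphenate(text):
    """Join 'al- so' style breaks when the joined word occurs elsewhere in the text."""
    # Collect every plain word the document itself contains (hypothetical vocabulary).
    vocabulary = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}

    def join_if_known(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        # Drop the hyphen only when the document also uses the joined form.
        if joined.lower() in vocabulary:
            return joined
        # Otherwise treat it as a genuine hyphenated word and keep the hyphen.
        return f"{head}-{tail}"

    # A word fragment, a hyphen, whitespace (the old line break), then the rest.
    return re.sub(r"([A-Za-z]+)-\s+([A-Za-z]+)", join_if_known, text)

sample = "They also went. He was al- so tired of the so-called well- known fix."
result = dehyphenate(sample)
print(result)
```

Here "al- so" is joined because "also" appears elsewhere in the sample, while "well- known" keeps its hyphen because "wellknown" never does. A short document gives a small vocabulary, so some real breaks would be left hyphenated.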
Old 10-17-2010, 10:44 AM   #4
cybmole
Wizard
Quote:
Originally Posted by ldolse View Post
Actually Calibre does go through and remove end-of-line hyphens intelligently. It uses the document itself as a dictionary to see if there is a variant of the word without a hyphen, and deletes the hyphen if there is a match.

The problem in this case is it's a crappy pdf with some other character encoded in addition to the hyphen. Unless this is a common issue across many pdfs (and I've never seen it with lots of test cases), it's probably not something that will get covered in the code.
Globalistan - Pepe Escobar
seems to be a good quality, non-commercial PDF, unless I'm misunderstanding the creative commons licence ?
QUOTE from http://www.nimblebooks.com/wordpress...mmons-license/
GLOBALISTAN free under Creative Commons License
Inspired by the example of the science fiction novelist Peter Watts, who released the full text of his outstanding novel BLINDSIGHT under a Creative Commons License last year to deservedly rapturous acclaim from Boing Boing! and many others, Pepe Escobar and I are happy to announce the Free GLOBALISTAN Project.

The full text of Pepe’s brilliant new book, GLOBALISTAN: HOW THE GLOBALIZED WORLD IS DISSOLVING INTO LIQUID WAR, is now available under a Creative Commons license in both PDF and html format
ENDQUOTE

maybe I should try grabbing & converting a html version instead ? Unfortunately the link to html version at the above site seems broken - only the pdf link is working.

PS could someone please explain - if the book is being legally distributed for free, with the author's blessing , how come Amazon still want £5.27 for a Kindle version ?
Last edited by cybmole; 10-17-2010 at 10:55 AM.
Old 10-17-2010, 11:43 AM   #5
cybmole
Wizard
Quote:
Originally Posted by sherman View Post
The main problem is that while many of the end of line hyphens are there to break up words to improve the typography of the book, some will be genuinely hyphenated words that should remain so.

And there probably isn't an automated way of determining this during conversion.
on my Kindle, all the genuine hyphenated words appear like this "xxxxx-xxxxx", all the faulty ones are like this "xxxxx- xxxx" i.e. only the faulty ones have a space after the hyphen, so maybe an auto-fix IS possible ?

UPDATE - I think I may have fixed it. I converted .mobi to .rtf & began a [ find "- " replace with nothing ] process in Word. After doing a few manually it seemed to be finding only correct items to fix, so I fired off replace all, which made 1100+ changes. I'll convert back into .mobi now & see how it goes - well, it improved the text, I think.

but a regex solution would maybe be better, I've preserved an unchanged epub version for possible further experimentation.

I see also that in the epub and mobi conversions some pictures are messed up - this is probably an epub format limitation. The original PDF contains charts that seem to be made of 6 or 7 panels appended together horizontally.
The conversion process has separated those into vertical stacks of picture slices. I guess I'll have to read the pdf to see those correctly.

Last edited by cybmole; 10-17-2010 at 12:04 PM.
Old 10-17-2010, 12:04 PM   #6
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cybmole View Post
PS could someone please explain - if the book is being legally distributed for free, with the author's blessing , how come Amazon still want £5.27 for a Kindle version ?
Because Amazon wants your money?
I was reading my Tom's Hardware recipe-created ebook today and saw that you could buy a Kindle version (probably with all the ads from the site) for only $.99 a month to replace my Calibre free version.
Old 10-17-2010, 04:17 PM   #7
ldolse
Wizard
Quote:
Originally Posted by cybmole View Post
on my Kindle, all the genuine hyphenated words appear like this "xxxxx-xxxxx", all the faulty ones are like this "xxxxx- xxxx" i.e. only the faulty ones have a space after the hyphen, so maybe an auto-fix IS possible ?

UPDATE - I think I may have fixed it. I converted .mobi to .rtf & began a [ find "- " replace with nothing ] process in Word. After doing a few manually it seemed to be finding only correct items to fix, so I fired off replace all, which made 1100+ changes. I'll convert back into .mobi now & see how it goes - well, it improved the text, I think.

but a regex solution would maybe be better, I've preserved an unchanged epub version for possible further experimentation.

I see also that in the epub and mobi conversions some pictures are messed up - this is probably an epub format limitation. The original PDF contains charts that seem to be made of 6 or 7 panels appended together horizontally.
The conversion process has separated those into vertical stacks of picture slices. I guess I'll have to read the pdf to see those correctly.
Just use the remove header/footer regex option to delete the hyphens then.
Code:
(?<=\w)‐\s
That is a different unicode code point than the hyphen that typically occurs in most documents. I'll look into adding that to the default de-hyphenation regex.
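
For anyone who wants to try the same pattern outside Calibre, here is a minimal sketch using Python's `re` module. It assumes the odd character is U+2010 (HYPHEN) rather than the usual U+002D (HYPHEN-MINUS), and the character class is widened to catch both; the helper name `join_split_words` is made up for this example.

```python
import re

# Assumption: the line-end character in this PDF is U+2010 (HYPHEN);
# the ordinary keyboard hyphen is U+002D (HYPHEN-MINUS). Match either one.
LINE_END_HYPHEN = re.compile(r"(?<=\w)[\u2010\-]\s")

def join_split_words(text):
    """Delete a hyphen plus the following whitespace when it directly follows a word character."""
    return LINE_END_HYPHEN.sub("", text)

broken = "al\u2010 so, al- so, counter\u2010narrative, a - b"
fixed = join_split_words(broken)
print(fixed)
```

The lookbehind `(?<=\w)` is what leaves " - " (a dash with a space in front) alone, and the trailing `\s` is what leaves "counter‐narrative" alone. Unlike the dictionary approach, this will also join a genuinely hyphenated word that happened to break at the end of a line.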
Old 10-18-2010, 02:15 AM   #8
cybmole
Wizard
thanks for the code - I'd already done it via rtf & Word but that looks much slicker.

I have more questions about regex for character replacement but I'll start a new thread
Old 10-19-2010, 08:27 AM   #9
cybmole
Wizard
my repair job via word is flowing well on Kindle, but as an additional test I ran the same Globalistan.pdf through DNAML software's pdftoepub, to see how it got on with the line break words :-

it screwed up: in epub reader I see the vertical bars, & here I see exclamation marks ( after copy paste).
book extract showing the bug:
"context of re‐medievalization, where those who control power control weapons, money and The Word, this book also aims to provide a counter‐narrative."
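
One way to check which character is actually sitting in an extract like the one above is to list the code point and Unicode name of every non-ASCII character. A small sketch; the sample string is a shortened version of the extract and assumes the character is U+2010, and `describe_non_ascii` is just an illustrative helper name.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (code point, Unicode name) for each distinct non-ASCII character, in order seen."""
    seen = []
    for ch in text:
        if ord(ch) > 127:
            entry = (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            if entry not in seen:
                seen.append(entry)
    return seen

# Assumption: the pasted text carries U+2010 where a plain hyphen was expected.
extract = "context of re\u2010medievalization ... a counter\u2010narrative."
report = describe_non_ascii(extract)
print(report)
```

If the report shows something other than U+2010 for a given file, the regex character class would need that code point instead.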

PS autokindle gave up after 5 seconds saying possible copy protection ( duh - it's an unprotected PDF ! )