10-17-2010, 07:54 AM | #1 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
pdf with split words at end of line - how best to convert
I'm attempting a PDF to EPUB conversion (and yes, I know it's a dumb thing to do). All goes well except where the original PDF has split a word over 2 lines, which happens a lot in this document.
e.g. if the PDF goes line 1: xxxxxxxxxxxxxxxxxxxxxx al- line 2: so xxxxxxxxxxxxxxxxx then the EPUB comes out as " al‐ so", but with the hyphen replaced by a thick black (bold?) vertical line after the "l" of "also". (NB it doesn't appear when I copy from the EPUB reader and paste to here, but I also see it in the source window when I open the calibre wizard.) A text version of the source (in Notepad) shows al- so, i.e. there's a line break in there. It must be to do with how a line break character in the PDF is being translated. Is there any way to remove / suppress it?
Update: I ticked the "transliterate unicode" box and reconverted zip to EPUB. That removed the thick black character, so now I just see a broken word, e.g. "al- so". Is it possible to force an automatic repair of all broken words somehow? It would be like a global replace of "- " with nothing, but filtering out the genuine uses of "-": something like remove all "- " except when preceded by a space? Last edited by cybmole; 10-17-2010 at 08:03 AM. |
10-17-2010, 08:51 AM | #2 |
Guru
Posts: 869
Karma: 2676800
Join Date: Aug 2008
Location: Taranaki - NZ
Device: Kobo Aura H2O, Kobo Forma
|
The main problem is that while many of the end of line hyphens are there to break up words to improve the typography of the book, some will be genuinely hyphenated words that should remain so.
And there probably isn't an automated way of determining this during conversion. |
10-17-2010, 09:16 AM | #3 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Actually Calibre does go through and remove hyphens from hyphenated words intelligently. It uses the document itself as a dictionary to see if there is a variant of the word without a hyphen, and deletes the hyphen if there is a match.
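To make the "document as its own dictionary" idea concrete, here is a minimal Python sketch. It is not Calibre's actual code: the function name `dehyphenate`, the set of hyphen characters, and the word-matching rules are all assumptions made for illustration.

```python
import re

# Characters treated as a line-end hyphen: ASCII hyphen-minus and U+2010 HYPHEN.
# This set is an assumption of the sketch, not something taken from Calibre.
HYPHENS = "-\u2010"

# "word" + hyphen + whitespace + "word", where a word is a run of letters.
SPLIT_WORD = re.compile(r"([^\W\d_]+)[" + HYPHENS + r"]\s+([^\W\d_]+)")


def dehyphenate(text: str) -> str:
    """Join a word split by hyphen + whitespace, but only when the joined
    spelling already occurs somewhere else in the same text."""
    # Build the "dictionary" from every run of letters in the document itself.
    known = {w.lower() for w in re.findall(r"[^\W\d_]+", text)}

    def join_if_known(match):
        joined = match.group(1) + match.group(2)
        # Drop the hyphen only if the document confirms the joined word.
        return joined if joined.lower() in known else match.group(0)

    return SPLIT_WORD.sub(join_if_known, text)
```

With this rule "al- so" is repaired when "also" appears unbroken elsewhere in the book, while something like "well- known" is left alone unless "wellknown" also occurs. A split word whose joined form never appears elsewhere stays broken, which is the price of being conservative.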
The problem in this case is it's a crappy pdf with some other character encoded in addition to the hyphen. Unless this is a common issue across many pdfs (and I've never seen it with lots of test cases), it's probably not something that will get covered in the code. |
10-17-2010, 10:44 AM | #4 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
It seems to be a good quality, non-commercial PDF, unless I'm misunderstanding the Creative Commons licence?
QUOTE from http://www.nimblebooks.com/wordpress...mmons-license/
GLOBALISTAN free under Creative Commons License
Inspired by the example of the science fiction novelist Peter Watts, who released the full text of his outstanding novel BLINDSIGHT under a Creative Commons License last year to deservedly rapturous acclaim from Boing Boing! and many others, Pepe Escobar and I are happy to announce the Free GLOBALISTAN Project. The full text of Pepe’s brilliant new book, GLOBALISTAN: HOW THE GLOBALIZED WORLD IS DISSOLVING INTO LIQUID WAR, is now available under a Creative Commons license in both PDF and html format
ENDQUOTE
Maybe I should try grabbing and converting an HTML version instead? Unfortunately the link to the HTML version at the above site seems broken; only the PDF link is working.
PS could someone please explain: if the book is being legally distributed for free, with the author's blessing, how come Amazon still want £5.27 for a Kindle version? Last edited by cybmole; 10-17-2010 at 10:55 AM. |
|
10-17-2010, 11:43 AM | #5 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
UPDATE: I think I may have fixed it. I converted the .mobi to .rtf and began a [find "- ", replace with nothing] process in Word. After doing a few manually it seemed to be finding only correct items to fix, so I fired off Replace All, which made 1100+ changes. I'll convert back into .mobi now and see how it goes.
Well, it improved the text, I think, but a regex solution would maybe be better. I've preserved an unchanged EPUB version for possible further experimentation.
I also see that in the EPUB and MOBI conversions some pictures are messed up; this is probably an EPUB format limitation. The original PDF contains charts that seem to be made of 6 or 7 panels appended together horizontally. The conversion process has separated those into vertical stacks of picture slices. I guess I'll have to read the PDF to see those correctly. Last edited by cybmole; 10-17-2010 at 12:04 PM. |
|
10-17-2010, 12:04 PM | #6 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I was reading my Tom's Hardware recipe-created ebook today and saw that you could buy a Kindle version (probably with all the ads from the site) for only $.99 a month to replace my Calibre free version. |
|
10-17-2010, 04:17 PM | #7 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
Code:
(?<=\w)‐\s |
|
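For anyone who would rather run the pattern above outside calibre, here is a small Python sketch. The hyphen in the posted pattern is the Unicode HYPHEN (U+2010), not the ASCII "-", so it is written as an explicit escape here. Accepting the ASCII "-" as well, and the names `BROKEN_WORD` and `join_broken_words`, are assumptions of this sketch.

```python
import re

# (?<=\w)  : the hyphen must directly follow a word character,
#            so a free-standing " - " dash is left alone.
# [\u2010-]: U+2010 HYPHEN (as in the posted pattern) or ASCII "-"
#            (the ASCII variant is an addition made for this sketch).
# \s       : the single whitespace character left behind by the line break.
BROKEN_WORD = re.compile(r"(?<=\w)[\u2010-]\s")


def join_broken_words(text: str) -> str:
    """Delete a hyphen plus the following whitespace when it follows a word."""
    return BROKEN_WORD.sub("", text)
```

Unlike a dictionary check, this joins every match blindly, so a genuine construction such as "pre- and post-war" would come out as "preand post-war". It is worth spot-checking a few replacements before applying it to a whole book.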
10-18-2010, 02:15 AM | #8 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Thanks for the code. I'd already done it via RTF and Word, but that looks much slicker.
I have more questions about regex for character replacement but I'll start a new thread |
10-19-2010, 08:27 AM | #9 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
my repair job via word is flowing well on Kindle, but as an additional test I ran the same Globalistan.pdf through DNAML software's pdftoepub, to see how it got on with the line break words :-
it screwed up: in epub reader I see the vertical bars, & here I see exclamation marks ( after copy paste). book extract showing the bug: "context of re‐medievalization, where those who control power control weapons, money and The Word, this book also aims to provide a counter‐narrative." PS autokindle gave up after 5 seconds saying possible copy protection ( duh - it's an unprotected PDF ! ) |
|