MobileRead Forums - View Single Post - pdf with split words at end of line

ldolse · 10-17-2010, 04:17 PM

Quote:

Originally Posted by cybmole

On my Kindle, all the genuine hyphenated words appear like this: "xxxxx-xxxxx", while all the faulty ones are like this: "xxxxx- xxxx", i.e. only the faulty ones have a space after the hyphen, so maybe an auto-fix IS possible?

UPDATE - I think I may have fixed it. I converted the .mobi to .rtf and began a [find "- ", replace with null] process in Word. After doing a few manually it seemed to be finding only correct items to fix, so I fired off Replace All, which made 1100+ changes. I'll convert back into .mobi now and see how it goes. Well, it improved the text, I think.

But a regex solution might be better, so I've preserved an unchanged epub version for possible further experimentation.

I also see that in the epub and mobi conversions some pictures are messed up. This is probably an epub format limitation: the original PDF contains charts that seem to be made of 6 or 7 panels appended together horizontally.
The conversion process has separated those into vertical stacks of picture slices. I guess I'll have to read the PDF to see those correctly.

Just use the remove header/footer regex option to delete the hyphens then.

Code:

(?<=\w)‐\s

The hyphen in that regex is a different Unicode code point than the hyphen that typically occurs in most documents, so a search for the ordinary keyboard hyphen won't match it. I'll look into adding that to the default de-hyphenation regex.
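
If you'd rather test the pattern outside the conversion dialog first, here is a minimal Python sketch of the same idea. It assumes the character in the regex above is U+2010 (HYPHEN); the name `dehyphenate` and the sample text are made up for illustration and are not part of calibre.

```python
import re

# Assumption: the hyphen in the posted regex is U+2010 (HYPHEN), not the
# ASCII hyphen-minus (U+002D) found in most documents.
UNICODE_HYPHEN = "\u2010"

# Same pattern as in the post: a hyphen preceded by a word character and
# followed by one whitespace character.
SPLIT_WORD = re.compile(r"(?<=\w)" + UNICODE_HYPHEN + r"\s")


def dehyphenate(text: str) -> str:
    """Join words that were split as 'xxxxx- xxxx' (hypothetical helper)."""
    return SPLIT_WORD.sub("", text)


# Made-up sample: one split word, one ASCII-hyphenated word, and one
# U+2010 hyphen with no space after it.
sample = "the conver\u2010 sion kept well-known and e\u2010book intact"
print(dehyphenate(sample))
```

Only the hyphen that is followed by whitespace gets removed, so "well-known" and the unspaced hyphen are left alone. A genuine suspended hyphen written with the same character and followed by a space (as in "pre- and post-war") would be joined too, so it's worth skimming the result.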