09-04-2010, 07:21 AM   #2
GrzegorzN
Device: Kindle 3
Oh yeah, I've noticed the exact same issue with some PDFs in my collection. It seems that every accented Polish letter is in fact composed of 2 (or more) characters, layered on top of one another (is this internally a ligature? probably not...)

Of course once you convert the PDF to a format that doesn't support overlapping characters, they get shifted into separate positions, and you get the result you're seeing.

Most of the documents I'm having trouble with are papers built with pdfTeX. I suppose PostScript->PDF documents might also suffer from the same problem. Do your PDF's properties say which tool was used to create it?

Anyway, I suppose that if the combinations used to represent accented characters in the PDF are specific enough, it might(?) be possible to patch it up with a series of 'search&replace' operations:

˛e -> ę
´s -> ś
´c -> ć
˛\n\na -> ą
(...)
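
To make the idea concrete, here is a rough Python sketch of that search&replace pass over text already exported from the PDF. It only covers the four pairs listed above; the function name `patch_accents` is made up for illustration, and I'm assuming the stray accents come out as U+00B4 (acute) and U+02DB (ogonek), which may not hold for every PDF.

```python
# Broken sequence -> intended letter, using only the pairs listed above.
# Assumption: the stray accents are U+00B4 (acute) and U+02DB (ogonek).
REPLACEMENTS = {
    "\u02db\n\na": "\u0105",  # ogonek, two line breaks, a -> ą
    "\u02dbe": "\u0119",      # ogonek, e -> ę
    "\u00b4s": "\u015b",      # acute, s -> ś
    "\u00b4c": "\u0107",      # acute, c -> ć
}


def patch_accents(text: str) -> str:
    """Replace each known broken sequence with its accented letter."""
    # Try longer patterns first so a short pattern never eats part of a longer one.
    for broken in sorted(REPLACEMENTS, key=len, reverse=True):
        text = text.replace(broken, REPLACEMENTS[broken])
    return text
```

You would run this on the exported text before converting it to epub; any combination missing from the table is simply left as it was.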

You can definitely 'fix' your document by exporting it to some editable format like text or RTF, and search-replacing all the broken substrings (and then converting it to epub), but it would be nice to have it built into the conversion process.

According to TeX, the way to typeset each 'accented' glyph (and there's a huge number of them) might differ between fonts, but maybe in practice it's not that bad. I see for example that your PDFs are mapping 'ą' in exactly the same way as mine -- using a string that includes two line breaks. So there might be a common pattern, and if that's the case, it might be possible to create a reverse mapping table. There might even be some industry standard that describes mappings like 'oacute -> ´o'.
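
If the pattern really is "stray accent, optional whitespace, base letter", one way to get a reverse mapping without listing every letter by hand might be to lean on Unicode composition. This is only a sketch under that assumption: the `SPACING_TO_COMBINING` table and the `recompose` name are mine, the three accents in it are guesses at what Polish text would need, and the code relies on Python's standard `unicodedata.normalize`.

```python
import re
import unicodedata

# Spacing accent -> Unicode combining mark (assumed set; extend as needed).
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent -> combining acute
    "\u02db": "\u0328",  # ogonek -> combining ogonek
    "\u02d9": "\u0307",  # dot above -> combining dot above
}

# A stray accent, optional whitespace (including line breaks), then a letter.
_PATTERN = re.compile(
    "([" + "".join(SPACING_TO_COMBINING) + r"])\s*([A-Za-z])"
)


def recompose(text: str) -> str:
    """Merge 'accent + letter' pairs into single precomposed characters."""

    def fix(match: re.Match) -> str:
        accent, base = match.groups()
        composed = unicodedata.normalize(
            "NFC", base + SPACING_TO_COMBINING[accent]
        )
        # If Unicode has no single precomposed character, leave the text alone.
        return composed if len(composed) == 1 else match.group(0)

    return _PATTERN.sub(fix, text)
```

The catch is that the whitespace-tolerant pattern could also swallow a genuine standalone accent followed by a word, so it would want checking against real documents before trusting it.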

I don't think Calibre supports anything like that at the moment though (?)