#1 | ||
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
PDF 2 EPUB - font problem
Hi,
I am trying to convert a pdf to epub using Calibre. This is a text fragment from pdf before conversion: Quote:
Quote:
I've attached a screenshot of the fonts used in the PDF; it seems that none of them is TrueType. What should I do to get a correct conversion? Can anyone help?
#2 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
|
Oh yeah, I've noticed the exact same issue with some PDFs in my collection. It seems that every accented Polish letter is in fact composed of 2 (or more) characters, layered on top of one another (is this internally a ligature? probably not...)

Of course, once you convert the PDF to a format that doesn't support overlapping characters, they get shifted into separate positions, and you get the result you're seeing. Most of the documents I'm having trouble with are papers built with pdfTeX. I suppose PostScript->PDF documents might also suffer from the same problem. Do your PDF properties say which tool was used to create it?

Anyway, I suppose that if the combinations used to represent accented characters in the PDF are specific enough, it might(?) be possible to patch it up with a series of 'search&replace':
˛e -> ę
´s -> ś
´c -> ć
˛\n\na -> ą
(...)

You can definitely 'fix' your document by exporting it to some editable format like text or RTF and search-replacing all the broken substrings (and then converting it to epub), but it would be nice to have it built into the conversion process.

According to TeX, the way to typeset each 'accented' glyph (and there's a huge number of them) might differ between fonts, but maybe in practice it's not that bad. I see for example that your PDFs are mapping 'ą' in exactly the same way as mine -- using a string that includes two linebreaks. So there might be a common pattern, and if that's the case, it might be possible to create a reverse mapping table. There might even be some industry standard that describes mappings like 'oacute -> ´o'. I don't think Calibre supports anything like that at the moment though (?)
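For anyone who wants to try the search&replace idea on an exported text file, here is a minimal Python sketch. It is only an illustration, not something Calibre does: the table holds just the four pairs listed above, the function name `fix_accents` is made up, and it assumes the stray marks come out as U+02DB (ogonek) and U+00B4 (acute accent) in the extracted text.

```python
# Hypothetical reverse mapping built from the four pairs quoted in the post.
# Assumption: the marks are U+02DB (ogonek) and U+00B4 (acute) in the output.
BROKEN_TO_FIXED = {
    "\u02db\n\na": "ą",  # ogonek, two line breaks, then 'a'
    "\u02dbe": "ę",
    "\u00b4s": "ś",
    "\u00b4c": "ć",
}


def fix_accents(text, table=BROKEN_TO_FIXED):
    """Replace each broken mark+letter sequence with the real letter."""
    # Longest keys first, so a multi-character pattern is never pre-empted
    # by a shorter one that shares its prefix.
    for broken in sorted(table, key=len, reverse=True):
        text = text.replace(broken, table[broken])
    return text


if __name__ == "__main__":
    print(fix_accents("pi\u02dbe\u00b4c"))
```

The table would have to be extended with the remaining letters (and uppercase forms) once their broken sequences are known.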
#3 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
Thanks for your reply. The PDF creator is GPL Ghostscript 8.56. I will try to put this into another format and then replace the broken letters, but it doesn't look like a very convenient method.
#4 |
Member
Posts: 23
Karma: 12
Join Date: Jul 2010
Device: Kindle
|
|
#5 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
Yes, I did it both ways, with and without, and there is no difference: the problem remains and both EPUBs look the same.
#6 | ||
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
Quote:
I just need the lowercase and uppercase Unicode character and the two characters that the PDF uses to represent them.
#7 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
That would be great! The question is how I can get the complete info that you need. For some letters it looks simple; for others (like ą) there are some additional linebreaks... I mean, is there a way to find the proper mapping for all the troublesome letters at once?
#8 | ||
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
Code:
A B C Quote:
Basically, knowing the alphabet for the language you're working with, identify the non-ASCII characters, convert a PDF, find all of those characters in the output, and put together the mapping.
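A rough Python sketch of that procedure, in case it helps with collecting the data. Everything here is an assumption made for illustration: the alphabet string, the function names, and the guess that a broken letter shows up in the converted text as a non-ASCII, non-letter mark followed by the base letter. The counts only suggest candidates, the actual mapping still has to be checked by eye.

```python
from collections import Counter

# Assumption: lowercase Polish alphabet, typed in by hand for this sketch.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"


def non_ascii_letters(alphabet):
    """Return the letters (lowercase, then uppercase) that need a mapping."""
    lower = [ch for ch in alphabet if ord(ch) > 127]
    return lower + [ch.upper() for ch in lower]


def candidate_pairs(converted_text):
    """Count each non-ASCII, non-letter mark plus the character after it."""
    pairs = Counter()
    for mark, following in zip(converted_text, converted_text[1:]):
        if ord(mark) > 127 and not mark.isalpha():
            pairs[mark + following] += 1
    return pairs


if __name__ == "__main__":
    print(non_ascii_letters(POLISH_ALPHABET))
    print(candidate_pairs("pi\u02dbe\u00b4c").most_common())
```

Sequences with line breaks between the mark and the letter (as reported for 'ą') will show up here as the mark followed by a newline, so those need a separate look.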
||
#9 | |
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
Quote:
Is that what you need? |
|
#10 |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
|
#11 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
Thank you very much!
#12 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
|
In my case the precise code for 'ą' seems to always be
<SPACE>˛<CR><LF><CR><LF>a
Or, in UTF-8 hex, with Windows-style newlines:
0x20 0xCB 0x9B 0x0D 0x0A 0x0D 0x0A 0x61
I suppose the <CR><LF> might be sensitive to EOL settings of the converter/output writer. All remaining accented characters in my PDFs match sulka's tables. Thanks!
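The byte sequence above can be checked, and the EOL sensitivity sidestepped, with a few lines of Python. This is just a sketch: `normalize_eol` is a made-up helper, and treating the leading space as part of the broken letter is an assumption.

```python
# The eight bytes quoted in the post: space, U+02DB in UTF-8, CRLF, CRLF, 'a'.
RAW = bytes([0x20, 0xCB, 0x9B, 0x0D, 0x0A, 0x0D, 0x0A, 0x61])


def normalize_eol(text):
    """Collapse Windows (CRLF) and old Mac (CR) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


decoded = RAW.decode("utf-8")          # ' ', ogonek, CR LF CR LF, 'a'
normalized = normalize_eol(decoded)    # ' ', ogonek, LF LF, 'a'

# Assumption: the leading space belongs to the broken letter and can go.
fixed = normalized.replace(" \u02db\n\na", "ą")

if __name__ == "__main__":
    print(repr(decoded), repr(normalized), fixed)
```

Normalizing line endings first means a single pattern covers output written with either EOL setting.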
#13 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
|
I've tested the latest Calibre release (0.7.18) -- it does a good job of converting accented characters in my PDFs, but 'ą' characters are now prefixed with a space. Maybe sulka will give it a go and report as well; it'd be good to know whether it's just my exotic PDFs that are causing problems.

I enabled debug in conversion options (should've done it right at the beginning...) and I see that in the input\index.html document the 'ą' character (in the middle of a word) is represented as
˛<br>[CR][LF]a

However, I don't see any way of fixing that -- leading spaces might be valid word separators (if a word begins with an accented character), so they shouldn't be automatically removed. I guess I'll have to use an intermediate output format and apply some manual fixes to it, but at least the converter does most of the work for me now, so that's a big improvement.
#14 |
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
I'll try it and let you know. Frankly, I don't recall any Polish words starting with "ą" (and this is the only letter with a leading space), so there is very little chance that those spaces are anything other than part of "ą". It happens only in some artificial forms (e.g. "Edward Ącki"), but this is extremely rare. Besides, there is a comma in between; usually there is a space after a comma and no space before it, so every combination of [space]+[,]+[a] should be "ą".
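The heuristic above could be sketched in Python roughly like this, run against the debug index.html. It is only an illustration under stated assumptions: the function name is made up, the mark is assumed to be U+02DB, the sequence is assumed to be the mark, an optional `<br>`, optional line breaks, then a lowercase 'a', and a space is dropped only when a letter comes right before it. A real word boundary before a word starting with lowercase "ą" would be joined by mistake, which is accepted here as extremely rare.

```python
import re

# Assumed broken sequence: ogonek (U+02DB), optional <br>, optional
# whitespace (including CR/LF), then a lowercase 'a'.
_MARK = "\u02db(?:<br\\s*/?>)?\\s*a"

# Mid-word case: a letter, one space, then the broken sequence.
_MID_WORD = re.compile(r"(?<=\w) " + _MARK)
_ANYWHERE = re.compile(_MARK)


def fix_a_ogonek(html):
    """Rebuild 'ą', dropping the stray space only right after a letter."""
    html = _MID_WORD.sub("ą", html)
    # Remaining occurrences keep whatever precedes them (e.g. ", ").
    return _ANYWHERE.sub("ą", html)


if __name__ == "__main__":
    print(fix_a_ogonek("m \u02db<br>\r\nadry"))
```

Uppercase "Ą" is deliberately left alone, since names like the "Ącki" example are exactly where the leading space can be a real separator.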
#15 | |||
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
|
OK, here we go. I see two different issues here (taken from the debugged html):
The first one: Quote:
The second one: Quote:
The corrected text should look like: Quote: