PDF 2 EPUB - font problem

sulka · 09-03-2010, 03:52 PM

Hi,

I am trying to convert a pdf to epub using Calibre. This is a text fragment from pdf before conversion:

Quote:

—Nie chodzi o to, że biorę—tłumaczył ktoś, gdy Case przeciskał się między
ludźmi stłoczonymi u wejścia do Chat.—Po prostu mój organizm cierpi na silny
niedostatek narkotyków.
To był głos z Ciągu i żart z Ciągu. W barze Chatsubo spotykali się zawodowi
ekspatrianci. Można tu było pić przez cały tydzień, nie słysząc nawet słowa po japońsku.

and this one is from epub after conversion:

Quote:

— Nie chodzi o to, ˙ze bior˛e — tłumaczył kto´s, gdy Case przeciskał si˛e mi˛edzy lud´zmi stłoczonymi u wej´scia do Chat. — Po prostu mój organizm cierpi na silny niedostatek narkotyków.

To był głos z Ci ˛

agu i ˙zart z Ci ˛

agu. W barze Chatsubo spotykali si˛e zawodowi

ekspatrianci. Mo˙zna tu było pi´c przez cały tydzie´n, nie słysz ˛

One more thing: when I select this text fragment in Acrobat and copy-paste it (Ctrl-C, Ctrl-V) into Notepad, the same problem shows up in Notepad as in the EPUB.

I've attached screenshot of fonts used in the pdf, seems like none of them is truetype. What should I do to convert correctly? Help anyone?

GrzegorzN · 09-04-2010, 07:21 AM

Oh yeah, I've noticed the exact same issue with some PDFs in my collection. It seems that every accented Polish letter is in fact composed of 2 (or more) characters, layered on top of one another (is this internally a ligature? probably not...)

Of course once you convert the PDF to a format that doesn't support overlapping characters, they get shifted into separate positions, and you get the result you're seeing.

Most of the stuff I'm having trouble with are papers built with pdfTeX. I suppose PostScript->PDF documents might also suffer from the same problem. Do your PDF properties say which tool was used to create it?

Anyway, I suppose that if the combinations used to represent accented characters in the PDF are specific enough, it might(?) be possible to patch it up with a series of 'search&replace':

˛e -> ę
´s -> ś
´c -> ć
˛\n\na -> ą
(...)

You can definitely 'fix' your document by exporting it to some editable format like text or RTF, and search-replacing all the broken substrings (and then converting it to epub), but it would be nice to have it built into the conversion process.
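The search-and-replace idea above could be scripted roughly like this. It is only a minimal sketch: `REPLACEMENTS` and `patch_text` are made-up names, the table holds just the four pairs listed above, and it assumes the stray accents are U+00B4 (´) and U+02DB (˛), which is worth checking against your own exported text.

```python
# Hypothetical replacement table built from the four pairs listed above.
# Assumption: the accents are U+00B4 (acute) and U+02DB (ogonek).
REPLACEMENTS = [
    ("\u02db\n\na", "ą"),  # ogonek, two line breaks, then 'a' (longest pattern first)
    ("\u02dbe", "ę"),
    ("\u00b4s", "ś"),
    ("\u00b4c", "ć"),
]


def patch_text(text: str) -> str:
    """Replace each broken accent+letter sequence with the proper letter."""
    for broken, fixed in REPLACEMENTS:
        text = text.replace(broken, fixed)
    return text


if __name__ == "__main__":
    print(patch_text("przeciska\u0142 si\u02dbe"))
```

Plain `str.replace` only matches the exact sequence, so any extra space or a different line-break style in the export needs its own entry.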

According to TeX the way to typeset each 'accented' glyph (and there's a huge number of them) might differ between fonts, but maybe in practice it's not that bad. I see for example that your PDFs are mapping 'ą' in exactly the same way as mine -- using a string that includes two linebreaks. So there might be a common pattern, and if that's the case, it might be possible to create a reverse mapping table. There might even be some industry standard that describes mappings like 'oacute -> ´o'

I don't think Calibre supports anything like that at the moment though (?)
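On the "industry standard" question: Unicode does define canonical compositions, so one possible reverse mapping is to swap each spacing accent for its combining counterpart and let NFC normalization build the letter. The sketch below is an illustration only, not how Calibre does it; `SPACING_TO_COMBINING` and `compose` are made-up names, and the three accent code points are assumptions based on the text quoted in this thread.

```python
import re
import unicodedata

# Spacing accent -> combining accent (assumed code points; check your own text).
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute
    "\u02d9": "\u0307",  # dot above
    "\u02db": "\u0328",  # ogonek
}

# An accent, optional whitespace (including line breaks), then an ASCII letter.
_PATTERN = re.compile(r"([\u00b4\u02d9\u02db])\s*([A-Za-z])")


def compose(text: str) -> str:
    """Turn 'accent + letter' pairs into single precomposed letters."""
    def _fix(match):
        accent, letter = match.group(1), match.group(2)
        composed = unicodedata.normalize(
            "NFC", letter + SPACING_TO_COMBINING[accent]
        )
        # Leave the text untouched when Unicode has no precomposed form.
        return composed if len(composed) == 1 else match.group(0)

    return _PATTERN.sub(_fix, text)
```

This covers upper and lower case without a hand-written table, but it will also "fix" a legitimate standalone accent that happens to be followed by a letter, so it is best tried on a copy first.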

sulka · 09-04-2010, 08:22 AM

Thanks for your reply. The pdf creator is GPL Ghostscript 8.56. I will try to put this into another format and then replace broken letters, but it doesn't look like a very convenient method though.

Glenndk · 09-04-2010, 08:42 AM

Quote:

Originally Posted by sulka

Hi,
...
What should I do to convert correctly? Help anyone?

Did you remember to switch on "keep ligatures" in:
Preferences ->Conversion.
?

sulka · 09-04-2010, 09:13 AM

Yes, I did it both ways, with and without, no difference, the problem remains and both EPUBS look visually the same.

user_none · 09-04-2010, 10:02 AM

Quote:

Originally Posted by GrzegorzN

Oh yeah, I've noticed the exact same issue with some PDFs in my collection. It seems that every accented Polish letter is in fact composed of 2 (or more) characters, layered on top of one another (is this internally a ligature? probably not...)

It's not a ligature; it's your first idea. The PDF stores the character as two characters that it draws over one another.

Quote:

Originally Posted by GrzegorzN

According to TeX the way to typeset each 'accented' glyph (and there's a huge number of them) might differ between fonts, but maybe in practice it's not that bad. I see for example that your PDFs are mapping 'ą' in exactly the same way as mine -- using a string that includes two linebreaks. So there might be a common pattern, and if that's the case, it might be possible to create a reverse mapping table. There might even be some industry standard that describes mappings like 'oacute -> ´o'

I don't think Calibre supports anything like that at the moment though (?)

PDF input uses a character mapping for the issue in the parent post. However, German, Spanish and French characters are all I added support for, because that's what I had test books for and am somewhat familiar with. I can easily add other characters to the mapping.

I just need the lowercase and upper case unicode character and the two characters that the PDF uses to represent them.

sulka · 09-04-2010, 03:37 PM

Quote:

Originally Posted by user_none

I can easily add other characters to the mapping.

I just need the lowercase and upper case unicode character and the two characters that the PDF uses to represent them.

That would be great! The question is how can I get this complete info that you need? For some letters it looks simple, for others (like ą) there are some additional linebreaks... I mean, is there a way to completely, at once and for all troubling letters, find the proper mapping?

user_none · 09-04-2010, 03:45 PM

Quote:

Originally Posted by sulka

That would be great! The question is how can I get this complete info that you need? For some letters it looks simple, for others (like ą) there are some additional linebreaks... I mean, is there a way to completely, at once and for all troubling letters, find the proper mapping?

I just need to know the characters (ordered by how they show in the document) and the letter it should be. calibre handles the line breaks and spaces when mapping. So a text like:

Code:

A B C

Quote:

Originally Posted by sulka

I mean, is there a way to completely, at once and for all troubling letters, find the proper mapping?

Unfortunately no. I went through and found PDFs in German, Spanish and French, identified all non-ASCII characters, converted them and made the mapping.

Basically, knowing the alphabet for the language you're working with, identify the non-ASCII characters, convert a PDF, find all of those characters in the output and put together the mapping.
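The hunting step described above could be partly automated by tallying every "accent followed by a letter" sequence in the converted text. This is a rough sketch under assumptions: `find_candidates` is a made-up helper, and the default accent set contains only the three marks that show up in this thread (U+00B4, U+02D9, U+02DB), so other languages would need more.

```python
import re
from collections import Counter


def find_candidates(text: str, accents: str = "\u00b4\u02d9\u02db") -> Counter:
    """Count accent+letter pairs, ignoring whitespace between the two."""
    # [^\W\d_] matches any Unicode letter (word characters minus digits and '_').
    pattern = re.compile("([" + re.escape(accents) + r"])\s*([^\W\d_])")
    return Counter(m.group(1) + m.group(2) for m in pattern.finditer(text))


if __name__ == "__main__":
    sample = "bior\u02dbe si\u02dbe kto\u00b4s \u02d9zart"
    for pair, count in find_candidates(sample).most_common():
        print(repr(pair), count)
```

The output is only a list of candidates; deciding which letter each pair should become still needs someone who knows the alphabet.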

sulka · 09-04-2010, 04:36 PM

Quote:

˙z -> ż
´s -> ś
˛a -> ą (there is an additional space before ˛ not visible in the forum...)
˛e -> ę
´n -> ń
´c -> ć
´z -> ź

˙Z -> Ż
´S -> Ś
˛A -> Ą (there is an additional space before ˛ not visible in the forum...)
˛E -> Ę
´N -> Ń
´C -> Ć
´Z -> Ź

Strange thing with ą, the first time I copied it from EPUB, it put additional line breaks, now after another conversion, there is only one space and ˛ sign...

Is that what you need?

user_none · 09-04-2010, 05:11 PM

Quote:

Originally Posted by sulka

Is that what you need?

Yep. I've added those characters to the mapping. If you find more let me know and I'll add those as well. It should be in the next release; if you experience issues let me know.

sulka · 09-04-2010, 05:13 PM

Thank you very much!

GrzegorzN · 09-05-2010, 06:41 PM

In my case the precise code for 'ą' seems to always be
<SPACE>˛<CR><LF><CR><LF>a

Or, in UTF-8 hex, Win-style newlines:
0x20 0xCB 0x9B 0x0D 0x0A 0x0D 0x0A 0x61

I suppose the <CR><LF> might be sensitive to EOL settings of the converter/output writer.
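The byte sequence above can be double-checked in a few lines of Python, and the EOL sensitivity can be sidestepped by normalizing line endings before any matching. `SEQUENCE` and `normalize_eol` are names made up for this sketch.

```python
# <SPACE>, ogonek (U+02DB), two Windows-style line breaks, then 'a'.
SEQUENCE = " \u02db\r\n\r\na"

# Encode as UTF-8 and format each byte the way it is written in the post.
hex_dump = " ".join(f"0x{byte:02X}" for byte in SEQUENCE.encode("utf-8"))


def normalize_eol(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    print(hex_dump)
    print(repr(normalize_eol(SEQUENCE)))
```

After normalization the same pattern matches whether the converter wrote Windows or Unix line endings.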

All remaining accented characters in my PDFs match sulka's tables.

Thanks!

GrzegorzN · 09-11-2010, 08:24 AM

I've tested the latest Calibre release (0.7.18) -- it does a good job of converting accented characters in my PDFs, but 'ą' characters are now prefixed with a space. Maybe sulka will give it a go and report as well; it'd be good to know whether it's just my exotic PDFs that are causing problems.

I enabled debug in conversion options (should've done it right at the beginning...) and I see that in the input\index.html document the 'ą' character (in the middle of a word) is represented as

 ˛ [CR][LF]a

However, I don't see any way of fixing that -- leading spaces might be valid word separators (if a word begins with an accented character), so they shouldn't be automatically removed. I guess I'll have to use an intermediate output format and apply some manual fixes to it, but at least the converter does most of the work for me now, so that's a big improvement

sulka · 09-13-2010, 08:26 AM

Quote:

Originally Posted by GrzegorzN

However, I don't see any way of fixing that -- leading spaces might be valid word separators (if a word begins with an accented character), so they shouldn't be automatically removed.

I'll try it and let you know. Frankly, I don't recall any Polish words starting with "ą" (and this is the only letter with a leading space), so there is very little chance those spaces are anything other than part of "ą". The exception is artificial forms (e.g. a name like "Edward Ącki"), but these are extremely rare. Besides, there is a comma in between; usually there is a space after a comma and no space before it, so every combination of [space]+[,]+[a] should be "ą".
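The heuristic above (a space before the ogonek in the middle of a word is never a real word separator) could be tried as a post-processing step on the intermediate text. This is a sketch of that assumption only, not something Calibre does; `join_a_ogonek` is a made-up name, the leading space is assumed to be a plain U+0020, and a form like "Edward Ącki" would be joined incorrectly if it came through in broken form.

```python
import re

# letter, space, ogonek (U+02DB), optional whitespace or line breaks, 'a' or 'A'
_BROKEN = re.compile(r"(?<=[^\W\d_]) \u02db\s*([aA])")
# letter, space, already-composed lowercase a-ogonek
_STRAY_SPACE = re.compile(r"(?<=[^\W\d_]) (?=ą)")


def join_a_ogonek(text: str) -> str:
    """Glue a mid-word a-ogonek back onto the preceding letter."""
    text = _BROKEN.sub(lambda m: "ą" if m.group(1) == "a" else "Ą", text)
    return _STRAY_SPACE.sub("", text)
```

Both patterns require a letter directly before the space, so a sequence that follows punctuation or starts a line is left alone.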

sulka · 09-13-2010, 01:46 PM

OK, here we go. I see two different issues here (taken from the debugged html):

The first one:

Quote:

Niebo nad portem miało barwę ekranu monitora nastrojonego na nieistniej ący
kanał.

This one looks like the conversion is almost ok, there is an additional space before "ą", see the word "nieistniej ący", should be "nieistniejący".

The second one:

Quote:

i wielu innym, którzy wiedz ˛
a, 
za co. 
CZ ˛
E Ś ´
C I
CHIBA CITY BLUES
Rozdział 1

This one is strange, as if the conversion tool does not take into account that the text is bold/italicized/underlined. The problem is not only with "ą", but also with "Ę" and "Ć" at least...

The corrected text should look like:

Quote:

i wielu innym, którzy wiedzą, 
za co. 
CZĘŚĆ I
CHIBA CITY BLUES
Rozdział 1



