Old 09-03-2010, 03:52 PM   #1
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
PDF 2 EPUB - font problem

Hi,

I am trying to convert a pdf to epub using Calibre. This is a text fragment from pdf before conversion:
Quote:
—Nie chodzi o to, że biorę—tłumaczył ktoś, gdy Case przeciskał się między
ludźmi stłoczonymi u wejścia do Chat.—Po prostu mój organizm cierpi na silny
niedostatek narkotyków.
To był głos z Ciągu i żart z Ciągu. W barze Chatsubo spotykali się zawodowi
ekspatrianci. Można tu było pić przez cały tydzień, nie słysząc nawet słowa po japońsku.
and this one is from epub after conversion:
Quote:
— Nie chodzi o to, ˙ze bior˛e — tłumaczył kto´s, gdy Case przeciskał si˛e mi˛edzy lud´zmi stłoczonymi u wej´scia do Chat. — Po prostu mój organizm cierpi na silny niedostatek narkotyków.

To był głos z Ci ˛

agu i ˙zart z Ci ˛

agu. W barze Chatsubo spotykali si˛e zawodowi

ekspatrianci. Mo˙zna tu było pi´c przez cały tydzie´n, nie słysz ˛
Additionally, when I select this text fragment in Acrobat and copy-paste it (Ctrl-C, Ctrl-V) into Notepad, the same problem occurs in Notepad as in the EPUB.

I've attached a screenshot of the fonts used in the PDF; it seems that none of them is TrueType. What should I do to convert it correctly? Can anyone help?
Attachment: Fonts.jpg (65.4 KB)
Old 09-04-2010, 07:21 AM   #2
GrzegorzN
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
Oh yeah, I've noticed the exact same issue with some PDFs in my collection. It seems that every accented Polish letter is in fact composed of two (or more) characters layered on top of one another (is this internally a ligature? probably not...)

Of course once you convert the PDF to a format that doesn't support overlapping characters, they get shifted into separate positions, and you get the result you're seeing.

Most of the stuff I'm having trouble with are papers built with pdfTeX. I suppose PostScript->PDF documents might also suffer from the same problem. Do your PDF properties say which tool was used to create it?

Anyway, I suppose that if the combinations used to represent accented characters in the PDF are specific enough, it might(?) be possible to patch it up with a series of 'search&replace' operations:

˛e -> ę
´s -> ś
´c -> ć
˛\n\na -> ą
(...)

You can definitely 'fix' your document by exporting it to some editable format like text or RTF, and search-replacing all the broken substrings (and then converting it to epub), but it would be nice to have it built into the conversion process.
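A minimal sketch of that search-and-replace pass, assuming the broken text has already been exported to plain text. The table holds only the four examples listed above; `REPLACEMENTS` and `patch_text` are hypothetical names, and the code points chosen for the accents (U+00B4 for ´, U+02DB for ˛) are an assumption about what the converter emits.

```python
# Hypothetical replacement table built from the four examples above.
# Keys are the broken sequences as they appear after conversion.
REPLACEMENTS = {
    "\u02dbe": "\u0119",      # ˛e -> ę
    "\u00b4s": "\u015b",      # ´s -> ś
    "\u00b4c": "\u0107",      # ´c -> ć
    "\u02db\n\na": "\u0105",  # ˛ + two line breaks + a -> ą
}


def patch_text(text):
    """Apply every replacement in turn (plain search & replace)."""
    for broken, fixed in REPLACEMENTS.items():
        text = text.replace(broken, fixed)
    return text


print(patch_text("kto\u00b4s, gdy Case przeciska\u0142 si\u02dbe"))
```

Any sequence missing from the table (for example ˙z) is left exactly as it was, so the table has to be extended by hand for the remaining letters.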

According to TeX the way to typeset each 'accented' glyph (and there's a huge number of them) might differ between fonts, but maybe in practice it's not that bad. I see for example that your PDFs are mapping 'ą' in exactly the same way as mine -- using a string that includes two linebreaks. So there might be a common pattern, and if that's the case, it might be possible to create a reverse mapping table. There might even be some industry standard that describes mappings like 'oacute -> ´o'
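One way such a reverse mapping table could be generated rather than typed by hand is Unicode normalization: swap the spacing accent for its combining counterpart, put it after the base letter, and let NFC composition produce the precomposed character. This is only a sketch under the assumption that the PDF emits U+00B4, U+02DB and U+02D9; `compose` and `build_reverse_table` are hypothetical names, and nothing here reflects how Calibre does it.

```python
import unicodedata

# Spacing accent found in the broken text -> combining counterpart.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # ´ ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u02db": "\u0328",  # ˛ OGONEK       -> COMBINING OGONEK
    "\u02d9": "\u0307",  # ˙ DOT ABOVE    -> COMBINING DOT ABOVE
}


def compose(accent, letter):
    """Return the NFC form of letter + combining accent."""
    return unicodedata.normalize("NFC", letter + SPACING_TO_COMBINING[accent])


def build_reverse_table(letters):
    """Map 'accent + letter' to a precomposed character where one exists."""
    table = {}
    for accent in SPACING_TO_COMBINING:
        for letter in letters:
            composed = compose(accent, letter)
            if len(composed) == 1:  # Unicode has a precomposed form
                table[accent + letter] = composed
    return table


table = build_reverse_table("acenoszACENOSZ")
for broken, fixed in sorted(table.items()):
    print(repr(broken), "->", fixed)
```

The generated table is broader than any single alphabet (it also contains letters such as á that Polish does not use), so in practice it would be filtered against the alphabet of the document's language.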

I don't think Calibre supports anything like that at the moment though (?)
Old 09-04-2010, 08:22 AM   #3
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
Thanks for your reply. The PDF creator is GPL Ghostscript 8.56. I will try converting to another format and then replacing the broken letters, but it doesn't look like a very convenient method.
Old 09-04-2010, 08:42 AM   #4
Glenndk
Member
Posts: 23
Karma: 12
Join Date: Jul 2010
Device: Kindle
Quote:
Originally Posted by sulka View Post
Hi,
...
What should I do to convert correctly? Help anyone?
Did you remember to switch on "keep ligatures" in:
Preferences ->Conversion.
?
Old 09-04-2010, 09:13 AM   #5
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
Yes, I tried it both with and without: no difference. The problem remains, and both EPUBs look the same.
Old 09-04-2010, 10:02 AM   #6
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by GrzegorzN View Post
Oh yeah, I've noticed the exact same issue with some PDFs in my collection. It seems that every accented Polish letter is in fact composed of two (or more) characters layered on top of one another (is this internally a ligature? probably not...)
It's not a ligature; it's your first idea. The PDF stores the character as two characters that it draws over one another.

Quote:
Originally Posted by GrzegorzN View Post
According to TeX the way to typeset each 'accented' glyph (and there's a huge number of them) might differ between fonts, but maybe in practice it's not that bad. I see for example that your PDFs are mapping 'ą' in exactly the same way as mine -- using a string that includes two linebreaks. So there might be a common pattern, and if that's the case, it might be possible to create a reverse mapping table. There might even be some industry standard that describes mappings like 'oacute -> ´o'

I don't think Calibre supports anything like that at the moment though (?)
PDF input uses a character mapping to handle the issue described in the parent post. However, German, Spanish and French characters are all I added support for, because that's what I had test books for and am somewhat familiar with. I can easily add other characters to the mapping.

I just need the lowercase and uppercase Unicode characters and the two characters that the PDF uses to represent each of them.
Old 09-04-2010, 03:37 PM   #7
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
Quote:
Originally Posted by user_none View Post
I can easily add other characters to the mapping.

I just need the lowercase and upper case unicode character and the two characters that the PDF uses to represent them.
That would be great! The question is how can I get this complete info that you need? For some letters it looks simple, for others (like ą) there are some additional linebreaks... I mean, is there a way to completely, at once and for all troubling letters, find the proper mapping?
Old 09-04-2010, 03:45 PM   #8
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by sulka View Post
That would be great! The question is how can I get this complete info that you need? For some letters it looks simple, for others (like ą) there are some additional linebreaks... I mean, is there a way to completely, at once and for all troubling letters, find the proper mapping?
I just need to know the characters (ordered by how they show in the document) and the letter it should be. calibre handles the line breaks and spaces when mapping. So a text like:

Code:
A B C
Quote:
Originally Posted by sulka View Post
I mean, is there a way to completely, at once and for all troubling letters, find the proper mapping?
Unfortunately no. I went through PDFs in German, Spanish and French, identified all the non-ASCII characters, converted the files and made the mapping.

Basically, knowing the alphabet of the language you're working with, identify its non-ASCII characters, convert a PDF, find all of those characters in the output and put together the mapping.
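The hunting step can be partly automated. The sketch below scans already-converted text for a spacing accent followed by a letter and counts each combination, which gives a list of candidates to check by eye. It assumes the three accent code points seen in this thread (U+00B4, U+02DB, U+02D9); `candidate_pairs` is a hypothetical helper and not part of Calibre.

```python
import re
from collections import Counter

# Assumption: these are the only stray accents in the converted text.
ACCENTS = "\u00b4\u02db\u02d9"  # ´ ˛ ˙

# An accent, optional whitespace (spaces or line breaks), then the base letter.
PAIR = re.compile(rf"([{ACCENTS}])\s*([A-Za-z])")


def candidate_pairs(converted_text):
    """Count every accent + letter combination found in converted text."""
    return Counter(a + l for a, l in PAIR.findall(converted_text))


sample = ("bior\u02dbe kto\u00b4s si\u02dbe mi\u02dbedzy "
          "Ci \u02db\n\nagu \u02d9zart")
for pair, count in candidate_pairs(sample).most_common():
    print(repr(pair), count)
```

The counts only say which combinations occur; deciding which letter each one stands for still needs someone who knows the language.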
Old 09-04-2010, 04:36 PM   #9
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
Quote:
˙z -> ż
´s -> ś
˛a -> ą (there is an additional space before ˛ not visible in the forum...)
˛e -> ę
´n -> ń
´c -> ć
´z -> ź

˙Z -> Ż
´S -> Ś
˛A -> Ą (there is an additional space before ˛ not visible in the forum...)
˛E -> Ę
´N -> Ń
´C -> Ć
´Z -> Ź
Strange thing with ą, the first time I copied it from EPUB, it put additional line breaks, now after another conversion, there is only one space and ˛ sign...

Is that what you need?
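For anyone who wants to try the table outside Calibre, here is a rough sketch that applies it with a regular expression. It tolerates any whitespace between the accent and the letter, which covers both variants of ą described above (with line breaks and without). The names `POLISH`, `PATTERN` and `fix` are hypothetical, and the space in front of ˛ is deliberately left in place.

```python
import re

# Lower-case rows of the table above: accent + letter -> Polish letter.
POLISH = {
    "\u02d9z": "\u017c",  # ˙z -> ż
    "\u00b4s": "\u015b",  # ´s -> ś
    "\u02dba": "\u0105",  # ˛a -> ą
    "\u02dbe": "\u0119",  # ˛e -> ę
    "\u00b4n": "\u0144",  # ´n -> ń
    "\u00b4c": "\u0107",  # ´c -> ć
    "\u00b4z": "\u017a",  # ´z -> ź
}
# The upper-case rows follow the same pattern, so derive them.
POLISH.update({k[0] + k[1].upper(): v.upper() for k, v in list(POLISH.items())})

# Accent, optional whitespace (including line breaks), base letter.
PATTERN = re.compile("([\u00b4\u02db\u02d9])\\s*([acenszACENSZ])")


def fix(text):
    """Replace known accent + letter pairs; leave unknown pairs untouched."""
    def repl(match):
        return POLISH.get(match.group(1) + match.group(2), match.group(0))
    return PATTERN.sub(repl, text)


print(fix("Mo\u02d9zna tu by\u0142o pi\u00b4c przez ca\u0142y tydzie\u00b4n"))
```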
Old 09-04-2010, 05:11 PM   #10
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by sulka View Post
Is that what you need?
Yep. I've added those characters to the mapping. If you find more let me know and I'll add those as well. It should be in the next release; if you experience issues let me know.
Old 09-04-2010, 05:13 PM   #11
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
Thank you very much!
Old 09-05-2010, 06:41 PM   #12
GrzegorzN
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
In my case the precise code for 'ą' seems to always be
<SPACE>˛<CR><LF><CR><LF>a

Or, in UTF-8 hex, Win-style newlines:
0x20 0xCB 0x9B 0x0D 0x0A 0x0D 0x0A 0x61

I suppose the <CR><LF> might be sensitive to EOL settings of the converter/output writer.
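The byte sequence can be checked quickly in Python, and normalizing the newlines first removes the dependence on EOL settings. This is just a verification sketch; the variable names are arbitrary.

```python
# The eight bytes quoted above.
raw = bytes([0x20, 0xCB, 0x9B, 0x0D, 0x0A, 0x0D, 0x0A, 0x61])

decoded = raw.decode("utf-8")
# Show each decoded character as a code point.
print([f"U+{ord(ch):04X}" for ch in decoded])

# Collapse Windows (CR LF) and old Mac (CR) line endings to plain LF,
# so a single mapping entry works regardless of the writer's EOL style.
normalized = decoded.replace("\r\n", "\n").replace("\r", "\n")
print(repr(normalized))
```

The two bytes 0xCB 0x9B decode to U+02DB (OGONEK), a spacing character rather than a combining one, which is why it shows up as a separate glyph.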

All remaining accented characters in my PDFs match sulka's tables.

Thanks!
Old 09-11-2010, 08:24 AM   #13
GrzegorzN
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
I've tested the latest Calibre release (0.7.18). It does a good job of converting accented characters in my PDFs, but 'ą' characters are now prefixed with a space. Maybe sulka will give it a go and report as well; it'd be good to know whether it's just my exotic PDFs that are causing problems.

I enabled debug in conversion options (should've done it right at the beginning...) and I see that in the input\index.html document the 'ą' character (in the middle of a word) is represented as

&nbsp;˛<br>[CR][LF]a

However, I don't see any way of fixing that -- leading spaces might be valid word separators (if a word begins with an accented character), so they shouldn't be automatically removed. I guess I'll have to use an intermediate output format and apply some manual fixes to it, but at least the converter does most of the work for me now, so that's a big improvement
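One possible heuristic for the manual fix, sketched here on the intermediate HTML string: drop the stray space only when a letter stands directly before it (the mid-word case), and otherwise keep whatever precedes the ogonek. `fix_a_ogonek` and the pattern names are hypothetical, and the assumed input shape is the `&nbsp;˛<br>` sequence quoted above.

```python
import re

OGONEK = "\u02db"
SPACE = r"(?:&nbsp;|[ \u00a0])"  # entity, plain space or no-break space
GAP = r"(?:<br\s*/?>|\s)*"       # line-break tags and raw whitespace

# Mid-word: a letter directly before the stray space, so the space is noise.
MID_WORD = re.compile(rf"(?<=[^\W\d_]){SPACE}{OGONEK}{GAP}([aA])")
# Anything else: keep whatever precedes the ogonek.
OTHER = re.compile(rf"{OGONEK}{GAP}([aA])")


def _letter(match):
    # Lower-case a becomes ą, upper-case A becomes Ą.
    return "\u0105" if match.group(1) == "a" else "\u0104"


def fix_a_ogonek(html):
    """Join ˛ + a, removing the leading space only in the mid-word case."""
    html = MID_WORD.sub(_letter, html)
    return OTHER.sub(_letter, html)


print(fix_a_ogonek("nieistniej&nbsp;\u02db<br>\r\nacy"))
```

This does not remove the ambiguity: if a word starting with the accented letter is separated from the previous word by a single space, the heuristic cannot tell that space from a stray one and will join the two words.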
Old 09-13-2010, 08:26 AM   #14
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
Quote:
Originally Posted by GrzegorzN View Post
However, I don't see any way of fixing that -- leading spaces might be valid word separators (if a word begins with an accented character), so they shouldn't be automatically removed.
I'll try it and let you know. Frankly, I don't recall any Polish words starting with "ą" (and this is the only letter with a leading space), so there is very little chance that those spaces are anything other than part of "ą". The exceptions are artificial forms (e.g. "Edward Ącki"), but these are extremely rare. Besides, the sign in between looks like a comma, and a real comma usually has a space after it and no space before it, so every combination of [space]+[˛]+[a] should be "ą".
Old 09-13-2010, 01:46 PM   #15
sulka
Member
Posts: 12
Karma: 10
Join Date: Sep 2010
Device: Nook
OK, here we go. I see two different issues here (taken from the debugged html):

The first one:
Quote:
Niebo nad portem miało barwę ekranu monitora nastrojonego na nieistniej ący</p><p>
kanał. </p><p>
In this one the conversion is almost OK, but there is an additional space before "ą": see the word "nieistniej ący", which should be "nieistniejący".

The second one:
Quote:
<i>i wielu innym, którzy wiedz ˛</i></p><p>
<i>a, </i></p><p>
<i>za co. </i></p><p>
<b>CZ ˛</b></p><p>
<b>E Ś ´</b></p><p>
<b>C I</b></p><p>
<b>CHIBA CITY BLUES</b></p><p>
<b>Rozdział 1</b></p><p>
This one is strange: it looks as if the conversion tool does not take into account that the text is bold/italicized/underlined. The problem is not only with "ą", but also with at least "Ę" and "Ć".

The corrected text should look like:

Quote:
<i>i wielu innym, którzy wiedzą, </i></p><p>
<i>za co. </i></p><p>
<b>CZĘŚĆ I</b></p><p>
<b>CHIBA CITY BLUES</b></p><p>
<b>Rozdział 1</b></p><p>
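A rough post-processing sketch for the second issue: let the match span the closing and reopening tags between the accent and its base letter, and drop that run only when it closes and reopens the same elements, so the markup stays balanced. `join_split` and the other names are hypothetical; the table is the one posted earlier in the thread, and the space before ˛ is removed on the assumption that no Polish word starts with ą or ę.

```python
import re

COMPOSED = {
    "\u02d9z": "\u017c", "\u00b4s": "\u015b", "\u02dba": "\u0105",
    "\u02dbe": "\u0119", "\u00b4n": "\u0144", "\u00b4c": "\u0107",
    "\u00b4z": "\u017a",
}
COMPOSED.update({k[0] + k[1].upper(): v.upper()
                 for k, v in list(COMPOSED.items())})

ACCENTS = "\u00b4\u02db\u02d9"
TAG = re.compile(r"<(/?)([a-zA-Z]+)>")
# Optional stray space, accent, run of whitespace and tags, base letter.
SPLIT = re.compile(
    rf"([ \u00a0]?)([{ACCENTS}])((?:\s|</?[a-zA-Z]+>)*)([A-Za-z])")


def _balanced(run):
    """True if the run closes some elements and reopens exactly the same ones."""
    tags = TAG.findall(run)
    closes = [name for slash, name in tags if slash]
    opens = [name for slash, name in tags if not slash]
    closes_first = tags == [("/", n) for n in closes] + [("", n) for n in opens]
    return closes_first and closes[::-1] == opens


def _join(match):
    space, accent, run, letter = match.groups()
    fixed = COMPOSED.get(accent + letter)
    if fixed is None or not _balanced(run):
        return match.group(0)  # leave anything doubtful untouched
    # Only the ogonek swallows its leading space; other accents keep it.
    return ("" if accent == "\u02db" else space) + fixed


def join_split(html):
    return SPLIT.sub(_join, html)


broken = ("<i>i wielu innym, kt\u00f3rzy wiedz \u02db</i></p><p>\n"
          "<i>a, </i></p><p>\n<i>za co. </i></p><p>")
print(join_split(broken))
```

On the italic fragment this yields the corrected form quoted above. The bold heading is only partly repaired, because the spaces between its letters are not tied to an ogonek and are left alone.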