Old 10-31-2011, 07:46 PM   #1
MacEvansCB
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2010
Location: Somewhere in Iowa
Device: Nook Color
double LL's in PDFs

A really annoying bug:
If and only if the main font in the PDF is sans serif (Helvetica or maybe Arial; there are no embedded fonts to tell me which), converting to any format turns every double L into a single L followed by a space.

Examples: collected -> col ected
call me -> cal me
intellectually -> intel ectual y
still there -> stil there

Note that if a space already follows the LL, a second space is not added.
If the LL is in the middle of a word, a space is added.
Does anyone have any idea how much fun this is to repair, word by word???
Old 10-31-2011, 07:52 PM   #2
alansplace
Grand Sorcerer
Posts: 5,886
Karma: 464403178
Join Date: Feb 2010
Location: 33.9388° N, 117.2716° W
Device: Kindles K-2, K-KB, PW 1 & 2, Voyage, Fire 2, 5 & HD 8, Surface 3, iPad
ligatures

IIRC, all you need to do is check 'Keep Ligatures' under Look & Feel before starting the conversion. Someone will correct me if I'm wrong (I've never actually done this).
Old 10-31-2011, 07:55 PM   #3
theducks
Well trained by Cats
Posts: 29,782
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Double ll's could be ligatures: single characters that replace certain letter sequences.
'ffi' and 'll' are just two examples. Check the conversion settings for ligatures.
Old 10-31-2011, 08:17 PM   #4
MacEvansCB
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2010
Location: Somewhere in Iowa
Device: Nook Color
Nope... wasn't ligatures... clicked on 'Keep Ligatures' under 'Look and Feel'... exact same results.

[Alan ... like your Dresden avatar]
Old 10-31-2011, 09:05 PM   #5
alansplace
Grand Sorcerer
Posts: 5,886
Karma: 464403178
Join Date: Feb 2010
Location: 33.9388° N, 117.2716° W
Device: Kindles K-2, K-KB, PW 1 & 2, Voyage, Fire 2, 5 & HD 8, Surface 3, iPad
thanks

Quote:
Originally Posted by MacEvansCB
Nope... wasn't ligatures... clicked on 'Keep Ligatures' under 'Look and Feel'... exact same results.

[Alan ... like your Dresden avatar]
Come hang out in the Dresden Files thread. Sorry my ligatures tip didn't work.
Old 10-31-2011, 09:08 PM   #6
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Threads: Missing ll's
Problem with double L's converting PDF to EPUB

I'd take a look around; I'm sure there are more.
Old 10-31-2011, 11:11 PM   #7
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
It would be nice if folks actually read the sticky posts in this forum, specifically, in this case, the "Read this before Posting PDF Questions" sticky.

Quote:
Originally Posted by MacEvansCB
Nope... wasn't ligatures... clicked on 'Keep Ligatures' under 'Look and Feel'... exact same results.
It is most definitely a ligature problem. calibre handles some ligatures, but others are outside calibre's control and thus not supported at this time.

From the Sticky post:
Quote:
Originally Posted by ldolse
Various character pairs like 'ff', 'll', etc are missing from my conversion

This is probably caused by the PDF containing what are called ligatures. These occur when the publisher changes certain pairs of characters into a single character to make the text 'look better'. Common are 'll', 'fl', 'fi', 'ff', 'ffl', and 'ffi'. Unfortunately, due to a bug in the third party library Calibre uses, in many cases ligatures simply aren't supported. Several users have reported having good luck with Mobipocket Creator or Acrobat Professional for these types of files.
On a more practical note.

Quote:
Originally Posted by MacEvansCB
Does anyone have any idea how much fun this is to repair, word by word???
We all do. On the rare occasion (once, I believe) that I decided to deal with this issue, I opened my EPUB in Sigil and used find and replace to change "col ected" to "collected" throughout the entire document in one fell swoop. Wash, rinse and repeat for each new word.

This is still a pain and most of the time I find a different source to use for the conversion.

Last edited by DoctorOhh; 10-31-2011 at 11:21 PM.
Old 11-01-2011, 02:56 PM   #8
MacEvansCB
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2010
Location: Somewhere in Iowa
Device: Nook Color
Get the new PDF converter finished!!!!!!

dwanthny ... I HAVE read the PDF stickie. Several times.

This is *NOT* a ligature problem. There are no problems at all with words that have any of the 'f' ligatures in them. And this is not a problem with an 'LL' ligature disappearing or turning into some other character.

I can copy/paste an offending paragraph from the PDF to TextEdit and all the double LL's copy/paste just fine AND show as two distinct characters... and, when moving the cursor thru the words in the PDF, the LL's show as two separate characters. If ligatures are single characters, are they displayed as a single character or as two separate characters in a PDF???? ... and do ligatures copy/paste as two separate characters?????

And please note that this ONLY happens with ALL conversions from a SANS SERIF font... all conversions from a SERIF font do not have this problem AT ALL. So do people generating PDFs only use ligatures with Helvetica or Arial, but not with Times???

Doing word search/replace is practically impossible, given the gigantic number of different words with both single 'L's and double 'LL's. I have started cleaning one file by doing one search/replace for "l " -> "ll" for words with an embedded "ll" and a second search/replace for "l " -> "ll " for words with "ll" at the end of the word. But both searches have to be run as "find next" followed by either "replace" or "ignore" and each hit has to be decided on individually. This is ridiculous in a 200,000 word document.

I guess I should just try copy/paste as it seems to work just as messily as Calibre conversion does.

I don't have a choice here most of the time ... I HAVE to work from PDF originals.

It would be REALLY REALLY WONDERFUL if the new PDF engine was given more priority.

Between problems like this and problems with wrap/unwrap, I have to spend WAY TOO MUCH time scrubbing thru PDF conversions.
Old 11-01-2011, 03:44 PM   #9
Ethelred?
Member
Posts: 13
Karma: 78
Join Date: Jul 2011
Device: kindle 2
I recently had this problem! I sympathize wholeheartedly. For the document I was converting I decided that losing formatting was okay, so I converted to text; strangely enough, using pdftotext worked, though if I remember correctly other pdf conversion utils like pdftohtml did not. I didn't try them all, though.

I did briefly consider writing a script to do an (ll-lossy) html conversion and check it against the text conversion, but decided it was too much work. If you come across this a lot it is an option, though.

In the worst case, the other thing I considered was using a lot of regular expressions to fix an ll-lossy file. If you like to use Word (with an RTF), it actually has halfway decent regex find/replace; otherwise I would do it against an html file with a good text editor. (LibreOffice has regex f/r too, but I've had some problems with it.) Automatically replacing things like an l followed by a space followed by some punctuation, or by an "ing" or "ed" or "er", etc., can save some time and frustration.
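The check-against-the-text-conversion idea could be sketched roughly like this in Python. It assumes you already have the clean pdftotext output and the ll-lossy text as strings, and that the damage is exactly "ll" turning into "l " as described in the first post. The names `build_repairs` and `repair` are made up for illustration and are not part of any tool mentioned in this thread.

```python
import re

WORD = re.compile(r"[A-Za-z']+")


def build_repairs(good_text):
    """Map each damaged spelling to its correct word, using the clean text as a dictionary."""
    words = set(WORD.findall(good_text))
    repairs = {}
    for word in words:
        if "ll" not in word:
            continue
        # Assumed damage model: "ll" -> "l " (a trailing space merges with the next one).
        damaged = word.replace("ll", "l ").rstrip()
        # If the damaged form is itself a real word in the clean text
        # (e.g. "cal" vs "call"), leave it for manual review.
        if damaged in words:
            continue
        repairs[damaged] = word
    return repairs


def repair(lossy_text, repairs):
    """Replace every damaged form that appears as a whole token (or token sequence)."""
    # Longest first, so "intel ectual y" is fixed before any shorter fragment.
    for damaged in sorted(repairs, key=len, reverse=True):
        pattern = r"(?<![A-Za-z'])" + re.escape(damaged) + r"(?![A-Za-z'])"
        lossy_text = re.sub(pattern, repairs[damaged], lossy_text)
    return lossy_text


good = "I still recall that she collected shells intellectually. Call me."
lossy = "I stil recal that she col ected shel s intel ectual y. Cal me."
print(repair(lossy, build_repairs(good)))
```

This only repairs words that actually occur in the clean text, which is the point: the text conversion supplies the spelling and the other conversion keeps the formatting.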

Last edited by Ethelred?; 11-01-2011 at 03:50 PM.
Old 11-01-2011, 07:37 PM   #10
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by MacEvansCB
dwanthny ... I HAVE read the PDF stickie. Several times.
Excellent. Did you have any success converting your PDF with the suggested Mobipocket Creator, or with Adobe Acrobat Pro (expensive, but maybe someone with access could experiment for you)?

Quote:
Originally Posted by MacEvansCB
I can copy/paste an offending paragraph from the PDF to TextEdit and all the double LL's copy/paste just fine AND show as two distinct characters... and, when moving the cursor thru the words in the PDF, the LL's show as two separate characters. If ligatures are single characters, are they displayed as a single character or as two separate characters in a PDF???? ... and do ligatures copy/paste as two separate characters?????
I'll admit a certain amount of ignorance, but my impression is that, because of how PDF writes the info to the display, what you can copy and paste and what you see are often different. It is akin to placing text over images so that search and define features work correctly. Again, someone with fuller knowledge will have to explain further.

Quote:
Originally Posted by MacEvansCB
I don't have a choice here most of the time ... I HAVE to work from PDF originals.
I'm curious, why don't you have a choice? If stuck with having to use PDFs then maybe procuring Adobe Acrobat Pro might be the way to go. Or purchase a reader or tablet that can accommodate PDFs natively without converting them.

Quote:
Originally Posted by MacEvansCB
It would be REALLY REALLY WONDERFUL if the new PDF engine was given more priority.
I'm sure there are those that agree with you. Personally I approve of Kovid prioritizing the new database design above the new PDF converting engine. Ligatures aside, Adobe software often has problems converting PDFs to clean html and if they have problems I hate to think of all of the other underlying problems awaiting Kovid in attempting to clean up the new PDF conversion engine.

Good Luck.
Old 11-01-2011, 08:05 PM   #11
frostschutz
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
There are a thousand reasons why text extraction from PDF does not work properly. PDF does not care about structure or even machine readability; all it cares about is what the page looks like on a sheet of paper (whether that sheet is real or just displayed at the given, fixed dimensions on screen). It's possible to construct a PDF so that all the text you extract comes out backwards or completely garbled. That's just how it is.

There may be an easy way to fix your particular problem but it's hard to say without having access to the PDF file itself.

If all else fails you could always try running it through OCR. As long as the text is clean in appearance this will work reasonably well.
Old 11-01-2011, 09:28 PM   #12
dwig
Wizard
Posts: 1,613
Karma: 6718479
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
Quote:
Originally Posted by MacEvansCB
...If ligatures are single characters, are they displayed as a single character or as two separate characters in a PDF????.
If they are ligatures then they are a single character, period. Some, but not all, conversion processes will recognize these and replace them with the appropriate individual characters.

The problem you are encountering may be an "un-ligature" problem where two separate standard Ls are being placed very close together in the PDF and the conversion engine is seeing the locations as being too similar to treat them as two separate characters when it attempts to assemble the various pieces of text. It may think the two Ls are in the same place so it only places one in that spot in the output string.

Last edited by dwig; 11-01-2011 at 09:31 PM.
Old 11-03-2011, 12:59 AM   #13
davidfor
Grand Sorcerer
Posts: 24,907
Karma: 47303748
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by MacEvansCB
Doing word search/replace is practically impossible, given the gigantic number of different words with both single 'L's and double 'LL's. I have started cleaning one file by doing one search/replace for "l " -> "ll" for words with an embedded "ll" and a second search/replace for "l " -> "ll " for words with "ll" at the end of the word. But both searches have to be run as "find next" followed by either "replace" or "ignore" and each hit has to be decided on individually. This is ridiculous in a 200,000 word document.
I played with this a while ago and wrote a regex to do it. The rules I worked out were:

- There has to be a vowel, "y" or an apostrophe before the double L.
- The character after the double L is a vowel, "y", punctuation or whitespace.

Using that, I wrote the regex and it covered 99% of things. It is on another machine, so I can't include it. From memory, I treated apostrophe-double L separately, which simplified the regex a bit.

There are some special cases. From memory, the first time I did this, I was converting a Sci Fi or Fantasy novel that had a name that broke the above rules. I had to fix the name before the rest of the words.
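Since the original regex isn't in the thread, here is only a guess at how the two rules might be turned into a candidate finder in Python. It assumes the damage looks like "col ected" (the "ll" became "l " mid-word) and covers only the case where a vowel or "y" follows; the punctuation and whitespace cases are left out. `CANDIDATE` and `find_candidates` are illustrative names, and the sketch tallies hits instead of replacing them, because ordinary word pairs also satisfy both rules.

```python
import re
from collections import Counter

# Rule 1: a vowel, "y" or an apostrophe sits right before the surviving "l".
# Rule 2: a vowel or "y" follows the inserted space.
# The lookahead keeps the second half unconsumed, so a word with two
# damaged spots ("intel ectual y") yields two candidates.
CANDIDATE = re.compile(r"\b(\w*[aeiouy']l) (?=([aeiouy]\w*))", re.IGNORECASE)


def find_candidates(text):
    """Count each distinct 'left right' pair that might be a split double L."""
    return Counter(f"{m.group(1)} {m.group(2)}" for m in CANDIDATE.finditer(text))


sample = "he col ected it; she col ected more, intel ectual y so, until a storm came"
for pair, count in find_candidates(sample).most_common():
    print(count, repr(pair))
```

In the sample, "until a" is flagged alongside the genuinely damaged words, so a list like this still needs a human (or a word list) to decide each distinct pair once, rather than deciding every hit individually.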
Old 11-03-2011, 10:42 AM   #14
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by dwanthny
Personally I approve of Kovid prioritizing the new database design above the new PDF converting engine.
I also approve of Kovid's priority. There are dozens of PDF conversion methods and programs currently available - from OCR to 3rd party conversion tools. I'd hate to see fundamental calibre design neglected just to create one more. Of course, I have the utmost confidence that Kovid's one-man effort will ultimately turn out to be better than all the other programs out there written by teams of programmers dedicated to solving this one complex problem.
Old 11-03-2011, 12:11 PM   #15
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
I second that. There are already several well-documented workarounds. I recall posting one some time ago after encountering this issue, so search similar threads.