double LL's in PDFs

MacEvansCB · 10-31-2011, 07:46 PM

A really annoying bug:
If and only if the main font in the PDF is sans serif (Helvetica or maybe Arial - no embedded fonts to tell me which), when converting to anything, every case of double LL's is turned into a single L followed by a space.

Examples: collected -> col ected
call me -> cal me
intellectually -> intel ectual y
still there -> stil there

Note that if there is a space following the LL, a second space is not added.
While if the LL is in the middle of a word, a space is added.
Does anyone have any idea how much fun this is to repair, word by word???

alansplace · 10-31-2011, 07:52 PM

Quote:

Originally Posted by MacEvansCB

A really annoying bug:
If and only if the main font in the PDF is sans serif (Helvetica or maybe Arial - no embedded fonts to tell me which), when converting to anything, every case of double LL's is turned into a single L followed by a space.

Examples: collected -> col ected
call me -> cal me
intellectually -> intel ectual y
still there -> stil there

Note that if there is a space following the LL, a second space is not added.
While if the LL is in the middle of a word, a space is added.
Does anyone have any idea how much fun this is to repair, word by word???

iirc all you need to do is in look & feel, check keep ligatures before starting the conversion. someone will correct me if i'm incorrect (as i've never actually done this)

theducks · 10-31-2011, 07:55 PM

Quote:

Originally Posted by MacEvansCB

A really annoying bug:
If and only if the main font in the PDF is sans serif (Helvetica or maybe Arial - no embedded fonts to tell me which), when converting to anything, every case of double LL's is turned into a single L followed by a space.

Examples: collected -> col ected
call me -> cal me
intellectually -> intel ectual y
still there -> stil there

Note that if there is a space following the LL, a second space is not added.
While if the LL is in the middle of a word, a space is added.
Does anyone have any idea how much fun this is to repair, word by word???

double ll's could be ligatures, single characters that replace certain letter sequences.
ffi and ll are just some of them. check the conversion settings for ligatures

MacEvansCB · 10-31-2011, 08:17 PM

Nope... wasn't ligatures... clicked on 'Keep Ligatures' under 'Look and Feel'... exact same results.

[Alan ... like your Dresden avatar]

alansplace · 10-31-2011, 09:05 PM

Quote:

Originally Posted by MacEvansCB

Nope... wasn't ligatures... clicked on 'Keep Ligatures' under 'Look and Feel'... exact same results.

[Alan ... like your Dresden avatar]

come hang out in The Dresden Files thread. sorry my ligatures tip didn't work

Serpentine · 10-31-2011, 09:08 PM

Thread : Missing ll's
Problem with double L's converting PDF to EPUB

I'd take a look around - sure there are more

DoctorOhh · 10-31-2011, 11:11 PM

It would be nice if folks actually read the sticky posts in this forum. Specifically in this case the Read this before Posting PDF Questions sticky post in this forum.

Quote:

Originally Posted by MacEvansCB

Nope... wasn't ligatures... clicked on 'Keep Ligatures' under 'Look and Feel'... exact same results.

It is most definitely a ligature problem, calibre will handle some ligatures, but others are outside of calibre's control and thus not supported at this time.

From the Sticky post:

Quote:

Originally Posted by ldolse

Various character pairs like 'ff', 'll', etc are missing from my conversion

This is probably caused by the PDF containing what are called ligatures. These occur when the publisher changes certain pairs of characters into a single character to make the text 'look better'. Common are 'll', 'fl', 'fi', 'ff', 'ffl', and 'ffi'. Unfortunately, due to a bug in the third party library Calibre uses, in many cases ligatures simply aren't supported. Several users have reported having good luck with Mobipocket Creator or Acrobat Professional for these types of files.

On a more practical note.

Quote:

Originally Posted by MacEvansCB

Does anyone have any idea how much fun this is to repair, word by word???

We all do. On the rare occasion (once I believe) I decided to deal with this issue I used Sigil, opened up my epub and did a find and replace to find col ected and change it to collected through the entire document in one fell swoop. Wash, rinse and repeat for each new word.

This is still a pain and most of the time I find a different source to use for the conversion.

MacEvansCB · 11-01-2011, 02:56 PM

dwanthny ... I HAVE read the PDF stickie. Several times.

This is *NOT* a ligature problem. There are no problems at all with words that have any of the 'f' ligatures in them. And this is not a problem with an 'LL' ligature disappearing or turning into some other character.

I can copy/paste an offending paragraph from the PDF to TextEdit and all the double LL's copy/paste just fine AND show as two distinct characters... and, when moving the cursor thru the words in the PDF, the LL's show as two separate characters. If ligatures are single characters, are they displayed as a single character or as two separate characters in a PDF???? ... and do ligatures copy/paste as two separate characters?????

And please note that this ONLY happens with ALL conversions from a SANS SERIF font... all conversions from a SERIF font do not have this problem AT ALL. So do people generating PDFs only use ligatures with Helvetica or Arial, but not with Times???

Doing word search/replace is practically impossible, given the gigantic number of different words with both single 'L's and double 'LL's. I have started cleaning one file by doing one search/replace for "l " -> "ll" for words with an embedded "ll" and a second search/replace for "l " -> "ll " for words with "ll" at the end of the word. But both searches have to be run as "find next" followed by either "replace" or "ignore" and each hit has to be decided on individually. This is ridiculous in a 200,000 word document.

I guess I should just try copy/paste as it seems to work just as messily as Calibre conversion does.

I don't have a choice here most of the time ... I HAVE to work from PDF originals.

It would be REALLY REALLY WONDERFUL if the new PDF engine was given more priority.

Between problems like this and problems with wrap/unwrap, I have to spend WAY TOO MUCH time scrubbing thru PDF conversions.

Ethelred? · 11-01-2011, 03:44 PM

I recently had this problem! I sympathize wholeheartedly. For the document I was converting I decided that losing formatting was okay, so I converted to text; strangely enough, using pdftotext worked, though if I remember correctly other pdf conversion utils like pdftohtml did not. I didn't try them all, though.

I did briefly consider writing a script to do an (ll-lossy) html conversion and check it against the text conversion, but decided it was too much work. If you come across this a lot it is an option, though.
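A minimal Python sketch of that comparison idea, under assumptions that are not in the post: the clean text (for example from pdftotext) and the ll-lossy text are already loaded as strings, the damage always turns "ll" into "l " as described in the first post, and the function names are hypothetical. It simulates the damage on every clean word that contains "ll" and uses the result as a lookup table.

```python
import re

WORD_RE = re.compile(r"[A-Za-z']+")

def lossy_form(word):
    """Mimic the reported damage: each "ll" becomes "l ", with no trailing space at word end."""
    return word.replace("ll", "l ").rstrip()

def build_repairs(clean_text):
    """Map each damaged form to the clean word it came from (case-sensitive)."""
    words = set(WORD_RE.findall(clean_text))
    repairs = {}
    for word in words:
        if "ll" not in word:
            continue
        broken = lossy_form(word)
        # Skip ambiguous cases where the damaged form is itself a word in the clean text.
        if broken in words:
            continue
        repairs[broken] = word
    return repairs

def repair(lossy_text, repairs):
    """Replace every damaged form found in the lossy text with its clean word."""
    if not repairs:
        return lossy_text
    # Longest first, so "intel ectual y" is tried before any shorter damaged form.
    keys = sorted(repairs, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z'])(" + "|".join(re.escape(k) for k in keys) + r")(?![A-Za-z'])"
    )
    return pattern.sub(lambda m: repairs[m.group(1)], lossy_text)

clean = "I collected it. Call me, still intellectually there."
lossy = "I col ected it. Cal me, stil intel ectual y there."
fixed = repair(lossy, build_repairs(clean))
```

Words whose damaged form also occurs as a real word in the clean text are left alone, and all-caps "LL" is not handled, so a manual pass would still be needed for those.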

In the worst case, the other thing I considered was using a lot of regular expressions to fix an ll-lossy file. If you like to use Word (with an RTF), it actually has halfway decent regex find/replace; otherwise I would do it against an html file with a good text editor. (LibreOffice has regex f/r too, but I've had some problems with it.) Automatically replacing things like an l followed by a space followed by some punctuation, or by an "ing" or "ed" or "er", etc., can save some time and frustration.
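As a rough illustration of those automatic replacements, here is a small Python sketch. The suffix list and punctuation set are only the examples named above, the function name is hypothetical, and the patterns assume the damage is always a single "l" followed by one space.

```python
import re

# Suffixes that rarely stand alone as lowercase words, so "l ing" is very likely "lling".
SUFFIXES = ("ing", "ed", "er")

# "l", one space, then a suffix that ends the word.
SUFFIX_RE = re.compile(r"l (" + "|".join(SUFFIXES) + r")\b")
# "l", one space, then a punctuation mark.
PUNCT_RE = re.compile(r"l ([.,;:!?])")

def fix_obvious(text):
    """Rejoin only the low-risk cases; everything else is left for manual review."""
    text = SUFFIX_RE.sub(r"ll\1", text)
    return PUNCT_RE.sub(r"ll\1", text)

sample = fix_obvious("He was cal ing and it fel .")
```

This deliberately leaves cases like "col ected" and "stil there" untouched, since neither pattern can tell them apart from ordinary text.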

DoctorOhh · 11-01-2011, 07:37 PM

Quote:

Originally Posted by MacEvansCB

dwanthny ... I HAVE read the PDF stickie. Several times.

Excellent, did you have any success with the suggested Mobipocket Creator to convert your pdf? Or using Adobe Acrobat Pro (expensive but maybe someone with access could experiment for you).

Quote:

Originally Posted by MacEvansCB

I can copy/paste an offending paragraph from the PDF to TextEdit and all the double LL's copy/paste just fine AND show as two distinct characters... and, when moving the cursor thru the words in the PDF, the LL's show as two separate characters. If ligatures are single characters, are they displayed as a single character or as two separate characters in a PDF???? ... and do ligatures copy/paste as two separate characters?????

I'll admit a certain amount of ignorance, but it is my impression that due to how PDF writes the info to the display, that what you can copy and paste and what you see are often different. Akin to placing text over images so search and define features work correctly. Again, someone else with complete and full knowledge will have to explain further.

Quote:

Originally Posted by MacEvansCB

I don't have a choice here most of the time ... I HAVE to work from PDF originals.

I'm curious, why don't you have a choice? If stuck with having to use PDFs then maybe procuring Adobe Acrobat Pro might be the way to go. Or purchase a reader or tablet that can accommodate PDFs natively without converting them.

Quote:

Originally Posted by MacEvansCB

It would be REALLY REALLY WONDERFUL if the new PDF engine was given more priority.

I'm sure there are those that agree with you. Personally I approve of Kovid prioritizing the new database design above the new PDF converting engine. Ligatures aside, Adobe software often has problems converting PDFs to clean html and if they have problems I hate to think of all of the other underlying problems awaiting Kovid in attempting to clean up the new PDF conversion engine.

Good Luck.

frostschutz · 11-01-2011, 08:05 PM

There are a thousand reasons why text extraction from PDF does not work properly - this is because PDF does not care about structure or even machine readability. All it cares about is how it looks on a sheet of paper (whether that sheet is real or just displayed in the given, fixed dimensions on screen). It's possible to construct a PDF so that all text you extract comes out backwards or completely garbled, that's just how it is.

There may be an easy way to fix your particular problem but it's hard to say without having access to the PDF file itself.

If all else fails you could always try running it through OCR. As long as the text is clean in appearance this will work reasonably well.

dwig · 11-01-2011, 09:28 PM

Quote:

Originally Posted by MacEvansCB

...If ligatures are single characters, are they displayed as a single character or as two separate characters in a PDF????.

If they are ligatures then they are a single character, period. Some, but not all, conversion processes will recognize these and replace them with the appropriate individual characters.

The problem you are encountering may be an "un-ligature" problem where two separate standard Ls are being placed very close together in the PDF and the conversion engine is seeing the locations as being too similar to treat them as two separate characters when it attempts to assemble the various pieces of text. It may think the two Ls are in the same place so it only places one in that spot in the output string.

davidfor · 11-03-2011, 12:59 AM

Quote:

Originally Posted by MacEvansCB

Doing word search/replace is practically impossible, given the gigantic number of different words with both single 'L's and double 'LL's. I have started cleaning one file by doing one search/replace for "l " -> "ll" for words with an embedded "ll" and a second search/replace for "l " -> "ll " for words with "ll" at the end of the word. But both searches have to be run as "find next" followed by either "replace" or "ignore" and each hit has to be decided on individually. This is ridiculous in a 200,000 word document.

I played with this a while ago and wrote a regex to do it. The rules I worked out were:

- There has to be a vowel, "y" or an apostrophe before the double L
- The character after the double L is one of a vowel, "y", punctuation or whitespace.

Using that, I wrote the regex and it covered 99% of things. It is on another machine, so I can't include it. From memory, I treated apostrophe-double L separately, which simplified the regex a bit.

There are some special cases. From memory, the first time I did this, I was converting a Sci Fi or Fantasy novel that had a name that broke the above rules. I had to fix the name before the rest of the words.
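Since the original regex isn't available, the following Python sketch is only a guess at how the two rules might be expressed, not a reconstruction of it. It covers just the mid-word case (a vowel or "y" after the stray space), because a lost "ll" at the end of a word looks the same as any ordinary word ending in a single "l". The function name is hypothetical.

```python
import re

# Rule 1: a vowel, "y" or an apostrophe comes before the damaged "l ".
# Rule 2 (mid-word case only): a vowel or "y" follows the stray space.
MID_WORD_RE = re.compile(r"(?<=[aeiouyAEIOUY'])l (?=[aeiouy])")

def rejoin_mid_word(text):
    """Turn "l " back into "ll" wherever both rules match."""
    return MID_WORD_RE.sub("ll", text)

sample = rejoin_mid_word("He col ected it intel ectual y, stil there")
```

A pattern like this will also join a genuine word ending in "l" to a following word that starts with a vowel (for example "animal at"), so the result still needs a review pass, and names that break the rules have to be fixed beforehand as described above.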

Starson17 · 11-03-2011, 10:42 AM

Quote:

Originally Posted by dwanthny

Personally I approve of Kovid prioritizing the new database design above the new PDF converting engine.

I also approve of Kovid's priority. There are dozens of PDF conversion methods and programs currently available - from OCR to 3rd party conversion tools. I'd hate to see fundamental calibre design neglected just to create one more. Of course, I have the utmost confidence that Kovid's one-man effort will ultimately turn out to be better than all the other programs out there written by teams of programmers dedicated to solving this one complex problem.

cybmole · 11-03-2011, 12:11 PM

i second that - there are already several well documented workarounds. - I recall that I posted one some time ago after encountering this issue - so search similar threads
