10-31-2011, 07:46 PM | #1 |
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2010
Location: Somewhere in Iowa
Device: Nook Color
|
double LL's in PDFs
A really annoying bug:
If and only if the main font in the PDF is sans serif (Helvetica or maybe Arial; there are no embedded fonts to tell me which), converting to any format turns every double L into a single L followed by a space. Examples:
collected -> col ected
call me -> cal me
intellectually -> intel ectual y
still there -> stil there
Note that if a space already follows the LL, a second space is not added, while if the LL is in the middle of a word, a space is added. Does anyone have any idea how much fun this is to repair, word by word??? |
10-31-2011, 07:52 PM | #2 | |
Grand Sorcerer
Posts: 5,886
Karma: 464403178
Join Date: Feb 2010
Location: 33.9388° N, 117.2716° W
Device: Kindles K-2, K-KB, PW 1 & 2, Voyage, Fire 2, 5 & HD 8, Surface 3, iPad
|
ligatures
Quote:
|
|
10-31-2011, 07:55 PM | #3 | |
Well trained by Cats
Posts: 29,782
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
ffi and ll are just some of them. Check the conversion settings for ligatures. |
|
10-31-2011, 08:17 PM | #4 |
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2010
Location: Somewhere in Iowa
Device: Nook Color
|
Nope... wasn't ligatures... clicked on 'Keep Ligatures' under 'Look and Feel'... exact same results.
[Alan ... like your Dresden avatar] |
10-31-2011, 09:05 PM | #5 | |
Grand Sorcerer
Posts: 5,886
Karma: 464403178
Join Date: Feb 2010
Location: 33.9388° N, 117.2716° W
Device: Kindles K-2, K-KB, PW 1 & 2, Voyage, Fire 2, 5 & HD 8, Surface 3, iPad
|
thanks
Quote:
|
|
10-31-2011, 09:08 PM | #6 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Thread: Missing ll's
Problem with double L's converting PDF to EPUB
I'd take a look around; I'm sure there are more. |
10-31-2011, 11:11 PM | #7 | |||
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
It would be nice if folks actually read the sticky posts in this forum, specifically, in this case, the "Read this before Posting PDF Questions" sticky post.
Quote:
From the Sticky post: Quote:
Quote:
This is still a pain and most of the time I find a different source to use for the conversion. Last edited by DoctorOhh; 10-31-2011 at 11:21 PM. |
|||
11-01-2011, 02:56 PM | #8 |
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2010
Location: Somewhere in Iowa
Device: Nook Color
|
Get the new PDF converter finished!!!!!!
dwanthny ... I HAVE read the PDF stickie. Several times.
This is *NOT* a ligature problem. There are no problems at all with words that have any of the 'f' ligatures in them. And this is not a problem with an 'LL' ligature disappearing or turning into some other character.
I can copy/paste an offending paragraph from the PDF to TextEdit and all the double LL's copy/paste just fine AND show as two distinct characters... and, when moving the cursor through the words in the PDF, the LL's show as two separate characters. If ligatures are single characters, are they displayed as a single character or as two separate characters in a PDF???? ... and do ligatures copy/paste as two separate characters?????
And please note that this ONLY happens with ALL conversions from a SANS SERIF font... all conversions from a SERIF font do not have this problem AT ALL. So do people generating PDFs only use ligatures with Helvetica or Arial, but not with Times???
Doing word search/replace is practically impossible, given the gigantic number of different words with both single 'L's and double 'LL's. I have started cleaning one file by doing one search/replace for "l " -> "ll" for words with an embedded "ll" and a second search/replace for "l " -> "ll " for words with "ll" at the end of the word. But both searches have to be run as "find next" followed by either "replace" or "ignore", and each hit has to be decided on individually. This is ridiculous in a 200,000 word document. I guess I should just try copy/paste, as it seems to work just as messily as Calibre conversion does.
I don't have a choice here most of the time ... I HAVE to work from PDF originals. It would be REALLY REALLY WONDERFUL if the new PDF engine was given more priority. Between problems like this and problems with wrap/unwrap, I have to spend WAY TOO MUCH time scrubbing through PDF conversions. |
11-01-2011, 03:44 PM | #9 |
Member
Posts: 13
Karma: 78
Join Date: Jul 2011
Device: kindle 2
|
I recently had this problem! I sympathize wholeheartedly. For the document I was converting I decided that losing formatting was okay, so I converted to text; strangely enough, using pdftotext worked, though if I remember correctly other pdf conversion utils like pdftohtml did not. I didn't try them all, though.
I did briefly consider writing a script to do an (ll-lossy) HTML conversion and check it against the text conversion, but decided it was too much work. If you come across this a lot it is an option, though.
In the worst case, the other thing I considered was using a lot of regular expressions to fix an ll-lossy file. If you like to use Word (with an RTF), it actually has halfway decent regex find/replace; otherwise I would do it against an HTML file with a good text editor. (LibreOffice has regex find/replace too, but I've had some problems with it.) Automatically replacing things like an l followed by a space followed by some punctuation, or by an "ing" or "ed" or "er", etc., can save some time and frustration. Last edited by Ethelred?; 11-01-2011 at 03:50 PM. |
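The cross-check idea above (repairing an ll-lossy conversion against a clean text conversion) could look roughly like the Python sketch below. It is only an illustration, not a tested tool. The function names `damage`, `build_repair_map` and `repair` are made up for this example, and it assumes the damage always follows the pattern reported at the top of the thread: a mid-word "ll" becomes "l " and a word-final "ll" becomes "l". The clean text is assumed to come from a converter that kept the double L's, such as the pdftotext output mentioned above.

```python
import re

# A word, optionally with internal apostrophes (e.g. "we'll").
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def damage(word):
    """Mimic the reported bug: 'll' -> 'l ' mid-word, 'll' -> 'l' at the end."""
    return word.replace("ll", "l ").rstrip()


def build_repair_map(clean_text):
    """Map each damaged form to its intact word, using the clean text as reference."""
    vocab = {w.lower() for w in WORD.findall(clean_text)}
    repairs = {}
    for word in vocab:
        if "ll" not in word:
            continue
        bad = damage(word)
        # If the shortened form is itself a real word in the clean text,
        # the two cannot be told apart automatically, so leave it alone.
        if bad in vocab:
            continue
        repairs[bad] = word
    return repairs


def repair(lossy_text, repairs):
    """Replace every known damaged form in the lossy text, longest forms first."""
    if not repairs:
        return lossy_text
    forms = sorted(repairs, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )

    def fix(match):
        bad = match.group(0)
        # Undo the mid-word damage while keeping the original letter case.
        restored = re.sub(r"([lL]) ", r"\1\1", bad)
        # Undo the word-final damage by doubling the last letter.
        if repairs[bad.lower()].endswith("ll"):
            restored += bad[-1]
        return restored

    return pattern.sub(fix, lossy_text)
```

Typical use would be to read the pdftotext output into `clean_text`, read the ll-lossy HTML or text into `lossy_text`, and call `repair(lossy_text, build_repair_map(clean_text))`. Stray spaces before punctuation are not touched by this sketch, and a match such as "cal Ed" would be wrongly joined, so the result still needs a read-through.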
11-01-2011, 07:37 PM | #10 | |||
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Excellent. Did you have any success converting your PDF with the suggested Mobipocket Creator, or with Adobe Acrobat Pro (expensive, but maybe someone with access could experiment for you)?
Quote:
Quote:
Quote:
Good Luck. |
|||
11-01-2011, 08:05 PM | #11 |
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
|
There are a thousand reasons why text extraction from PDF does not work properly, because PDF does not care about structure or even machine readability. All it cares about is how the page looks on a sheet of paper (whether that sheet is real or just displayed at the given, fixed dimensions on screen). It's possible to construct a PDF so that all the text you extract comes out backwards or completely garbled; that's just how it is.
There may be an easy way to fix your particular problem but it's hard to say without having access to the PDF file itself. If all else fails you could always try running it through OCR. As long as the text is clean in appearance this will work reasonably well. |
11-01-2011, 09:28 PM | #12 | |
Wizard
Posts: 1,613
Karma: 6718479
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
|
Quote:
The problem you are encountering may be an "un-ligature" problem: two separate standard Ls are placed very close together in the PDF, and when the conversion engine assembles the various pieces of text it sees their locations as too similar to treat them as two separate characters. It may think the two Ls are in the same place, so it only places one in that spot in the output string. Last edited by dwig; 11-01-2011 at 09:31 PM. |
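To make that hypothesis concrete, here is a toy Python sketch. It is not calibre's code, the `assemble` function is hypothetical, and the coordinates are invented for illustration. It only shows how a text assembler that treats same-character glyphs closer than some tolerance as duplicates would drop the second L when the glyphs are narrow, as a sans-serif "l" tends to be.

```python
def assemble(glyphs, tolerance):
    """Join (char, x) glyphs left to right, skipping a glyph that repeats the
    previous character within `tolerance` units of its position."""
    out = []
    last_char, last_x = None, None
    for char, x in sorted(glyphs, key=lambda g: g[1]):
        if char == last_char and x - last_x < tolerance:
            continue  # treated as a duplicate of the previous glyph
        out.append(char)
        last_char, last_x = char, x
    return "".join(out)


# Invented positions: the same word set with a narrow "l" and a wider "l".
narrow_font = [("c", 0.0), ("a", 5.0), ("l", 10.0), ("l", 12.2)]
wide_font = [("c", 0.0), ("a", 5.0), ("l", 10.0), ("l", 12.8)]
```

With a tolerance of 2.5 units, the narrow layout loses its second "l" while the wider one keeps both. This would match the sans-serif-only symptom, although it does not by itself explain the extra space that appears mid-word.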
|
11-03-2011, 12:59 AM | #13 | |
Grand Sorcerer
Posts: 24,907
Karma: 47303748
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
- There has to be a vowel, "y" or an apostrophe before the double L.
- The character after the double L is a vowel, "y", punctuation or whitespace.
Using that, I wrote the regex and it covered 99% of things. It is on another machine, so I can't include it. From memory, I treated apostrophe-double L separately, which simplified the regex a bit.
There are some special cases. From memory, the first time I did this, I was converting a Sci Fi or Fantasy novel that had a name that broke the above rules. I had to fix the name before the rest of the words. |
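Since the original regex is not available, the following Python sketch is only a guess at how the two conditions above might be applied to the damage reported in this thread ("ll" turned into "l " or "l"). The pattern and function names are invented for the example. One caveat that follows from the damage pattern itself: when a vowel or "y" sits on both sides, "al of" (all of) looks exactly like "col ected" (collected), so this sketch fixes only the unambiguous cases and merely lists the rest for review.

```python
import re

# Apostrophe case, handled separately: "I'l" -> "I'll", "we'l" -> "we'll".
APOSTROPHE = re.compile(r"(?<=[A-Za-z]['’])l\b")

# Vowel or y before, then "l", a space, and punctuation: "cal ." -> "call."
BEFORE_PUNCT = re.compile(r"(?<=[aeiouy])l (?=[.,;:!?])")

# Vowel or y on both sides of "l ": a candidate only, see the caveat above.
MID_WORD = re.compile(r"(?<=[aeiouy])l (?=[aeiouy])")


def fix_unambiguous(text):
    """Apply only the replacements that cannot merge two separate words."""
    text = APOSTROPHE.sub("ll", text)
    return BEFORE_PUNCT.sub("ll", text)


def mid_word_candidates(text, width=10):
    """Return short context snippets for hits that need a human decision."""
    return [
        text[max(0, m.start() - width): m.end() + width]
        for m in MID_WORD.finditer(text)
    ]
```

Names and other words that break the rules, like the one mentioned above, would still have to be fixed by hand before running anything like this. The patterns are lower-case only, so text set in capitals is not covered.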
|
11-03-2011, 10:42 AM | #14 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I also approve of Kovid's priority. There are dozens of PDF conversion methods and programs currently available - from OCR to 3rd party conversion tools. I'd hate to see fundamental calibre design neglected just to create one more. Of course, I have the utmost confidence that Kovid's one-man effort will ultimately turn out to be better than all the other programs out there written by teams of programmers dedicated to solving this one complex problem.
|
11-03-2011, 12:11 PM | #15 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
I second that. There are already several well-documented workarounds. I recall that I posted one some time ago after encountering this issue, so search similar threads.
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Missing ll's | chrishalliwelluk | Calibre | 2 | 12-10-2010 08:09 AM |
No double LL's-PDF to EPUB | Pancho Harrera | Workshop | 7 | 08-13-2010 10:28 PM |
KDX: Unable to search PDFs from main screen... PDFs not indexed? | unrequited | Amazon Kindle | 3 | 06-22-2009 07:59 PM |
Which program for double column PDFs? | mflood | Sony Reader | 13 | 02-25-2008 04:13 PM |
Convert print-protected pdfs into image-based pdfs? | magogo | Sony Reader | 3 | 12-04-2007 01:18 AM |