#1
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
pdf to epub - trouble with FL
I tried converting on default settings and found that words like flights get messed up: the f seems to be replaced with a strange compound character, so flights becomes ϐlights. If I tick the keep ligatures option, then flights becomes blights. Is there an option which will fix this?

PS: Some f characters are OK, so some f words come out fine. I cannot yet suss the rule in the PDF that makes some f's plain f and others not.

I can go ahead and patch it up with Sigil, as this source is potentially better than my existing LIT source of the same novel, but there could be other weird characters that I've not spotted yet. Sigil finds and replaces 1050 instances of ϐ. What is this ϐ thing anyway?

Last edited by cybmole; 03-28-2011 at 03:44 AM.
#2
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
fl is sometimes a ligature, the same as ff, ll, etc. Just like those others, it's broken and there's nothing to be done about it.
#3
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Hmm, I understand the double-letter ligature thing (we had a long discussion on that a while back), but in this book words like fight convert to ϐight, so it's an issue with (some) single letters also. Yet fill and flip are OK!

Fixing up with Sigil seems to have done the trick though, unless I come across other glitches once I am reading the conversion.
#4
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Hey, you asked
#5
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
#6
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
I think if I went back and re-ran the conversion with a correct search-and-replace regex (find ϐ, replace with f), then I could get it to work in calibre, but I've patched it in Sigil now.
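The find-and-replace described above can also be done outside calibre or Sigil. Below is a minimal Python sketch, assuming the stray character is U+03D0 (as it appears when pasted in this thread) and that the book has no legitimate uses of it. `fix_beta_symbol` is a hypothetical helper name, not part of calibre or Sigil.

```python
import re

# Assumption: the stray character is U+03D0, as pasted in this thread.
STRAY_CHAR = "\u03d0"


def fix_beta_symbol(text: str) -> tuple[str, int]:
    """Replace every U+03D0 with a plain 'f'.

    Returns the corrected text and the number of replacements made.
    """
    return re.subn(re.escape(STRAY_CHAR), "f", text)


# Usage on the test sentence from this thread.
fixed, count = fix_beta_symbol("There was no smoking on shuttle \u03d0lights.")
print(fixed, count)  # There was no smoking on shuttle flights. 1
```

Returning the count makes it easy to compare against the number of instances Sigil reported. Note that this only helps when the conversion leaves a marker character behind; it cannot restore an f that was dropped entirely.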
#7
Wizard
Posts: 1,613
Karma: 6718541
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
Quote:
Some PDFs are created by scanning a book and using OCR to create a text layer while retaining the scanned image layer. When you view these in Acrobat Reader you see the scanned layer, but when you do a word search or selection, you access the "hidden" text layer.

Try saving the PDF as text from Acrobat Reader and examine the resulting file. If the f's are betas or esszetts, then the fault is in the PDF and its creation.
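One way to examine the exported text file is to list every non-ASCII character it contains, with its Unicode name and how often it occurs. Here is a small Python sketch of that idea. `survey_non_ascii` is a hypothetical helper, and the sample string assumes the stray character is U+03D0, as pasted in this thread.

```python
import unicodedata
from collections import Counter


def survey_non_ascii(text: str) -> list[tuple[str, str, int]]:
    """Return (character, Unicode name, count) for each non-ASCII
    character in text, most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Sample sentence from this thread, with the stray character in place of f.
    sample = "There was no smoking on shuttle \u03d0lights."
    for ch, name, n in survey_non_ascii(sample):
        print(f"U+{ord(ch):04X} {name} x{n}")
```

For the sample sentence this reports U+03D0, whose Unicode name is GREEK BETA SYMBOL. In an English novel, a character like that appearing hundreds of times suggests the text layer itself is wrong.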
#8
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Good call - let me test and report...

OK, I did 3 tests using the same sentence, copying from Adobe and pasting 1) to Word, 2) to Notepad++, 3) directly to here. In all 3 tests the leading f simply vanished from my test sentence, i.e. flights became lights:

There was no smoking on shuttle lights.

(c.f. flights became ϐlights in the calibre EPUB output.)

So calibre actually did a better job than Word, in that the calibre output was fixable with a regex, whereas finding the lost f's in the Word output would be impossible. So would ALL PDF conversion programs fall at this hurdle?

Last edited by cybmole; 03-29-2011 at 02:08 AM.
#9
Wizard
Posts: 1,613
Karma: 6718541
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
Quote:
Another experiment:
1. Open the TXT file exported from Acrobat Reader in Notepad++.
2. Change the encoding to UTF-8 using the entry on the Encoding menu.
3. Save the file as TXT.
4. Convert using Calibre.
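Steps 1 to 3 can also be scripted. This is a rough Python sketch, assuming the exported file is in Windows-1252. That encoding is a guess, so check what Notepad++ reports for your file. `reencode_to_utf8` is a hypothetical helper name.

```python
from pathlib import Path


def reencode_to_utf8(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Read src using source_encoding and write the same text to dst as UTF-8.

    Assumption: source_encoding matches how the file was actually saved;
    a wrong guess will produce garbled text or a decoding error.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

For example, `reencode_to_utf8("export.txt", "export_utf8.txt")` (both file names are placeholders) writes a UTF-8 copy that can then be converted as in step 4.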
#10
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
On this particular file I would say that the answer is yes.
It sounds like the file in question was created by scanning in images of the book and then applying OCR technology to create the underlying text. In this case some characters were not recognised correctly at the OCR stage. Unless you had some way of re-applying the OCR step (and doing a better job than the original program), all conversion programs are going to fail with this PDF file.

Many (possibly the majority of) PDF files are created from the original word-processed document. In such a case the PDF file does not have the overlaying image and the underlying text is complete, so a conversion program has a chance. However, PDF conversion is still a little fraught even with files created this way, because of tricks that PDF does (ligatures, absolute placement of text, special symbols, etc.) that a conversion program can struggle to understand and convert sensibly.