07-20-2016, 05:49 AM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2016
Device: none
|
First line of chapter is "squashed"
I'm trying to convert an existing PDF into an eBook using Calibre v1.48.
Most of the process works fairly well, except for the first line of each chapter. The first line is "squashed": all the whitespace is removed and the words run together. This happens only on the first line of each chapter; all other paragraphs are fine. Heuristics are switched on, and whichever settings I use, the first line is always squashed.
I suspect this has to do with the first 3 lines in the PDF using a drop cap, as it's the only thing I can think of that is unique to the first line of each chapter. The 2nd and 3rd lines (which follow the drop cap in the PDF) appear just fine with their expected word spacing.
I've tried using "-d dirpath", and all of the debug output files already contain the squashed text, so I suspect the problem lies in the parsing of the drop cap in the original PDF.
Thanks in advance for any ideas about a cause or fix for this. |
07-20-2016, 09:59 AM | #2 | |
Well trained by Cats
Posts: 29,792
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
In this case, you have a line-height: <some value less than 1.2 (a typical value)>. This was probably inherited from a drop cap or other decoration in the original. Find the paragraph class in the CSS and fix or remove that line. PDF is a terrible source because it is a paste-up format, where the commands can be anywhere on the page. |
07-23-2016, 05:55 AM | #3 | |||
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2016
Device: none
|
Thanks for the feedback and suggestion. I'm not sure we're talking about quite the same thing, though. The plain text extracted from the PDF is already squashed together inside an XML/HTML tag. To me, this means that by the time I get to the parsed first line, it is no longer possible (via CSS or anything else) to identify the individual words, except when a human reads the line. This is the text/line I'm interested in:
Quote:
|
|||
07-23-2016, 07:16 AM | #4 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
There is no automated way to fix that. The problem is that the PDF format contains no words, so when converting it to text one has to guess where the word boundaries are, based on the spacing between individual characters. That guessing occasionally fails, as in this case.
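To illustrate the kind of guessing involved, here is a minimal Python sketch of a gap-based heuristic. It is not calibre's actual code: the `Glyph` type, the `layout` helper, and the `gap_ratio` threshold are all hypothetical, chosen only to show how a tightly spaced line can lose its word boundaries.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float  # left edge of the glyph on the page
    x1: float  # right edge of the glyph on the page


def join_glyphs(glyphs, gap_ratio=0.25):
    """Rebuild a line of text from positioned glyphs.

    A space is inserted wherever the horizontal gap between two
    neighbouring glyphs exceeds gap_ratio * average glyph width.
    The threshold is an assumption made for this illustration.
    """
    if not glyphs:
        return ""
    avg_width = sum(g.x1 - g.x0 for g in glyphs) / len(glyphs)
    threshold = gap_ratio * avg_width
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        if cur.x0 - prev.x1 > threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def layout(text, char_width, space_width):
    """Hypothetical helper: place glyphs left to right with fixed widths.

    Spaces produce no glyph, only a horizontal gap, which mirrors the
    point that the PDF stores positioned characters rather than words.
    """
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width
        else:
            glyphs.append(Glyph(ch, x, x + char_width))
            x += char_width
    return glyphs


normal_line = join_glyphs(layout("the first line", char_width=10, space_width=4))
tight_line = join_glyphs(layout("the first line", char_width=10, space_width=1))
```

With normal spacing the gaps clear the threshold and the words come back intact. If the line next to a drop cap happens to be set with much tighter spacing (an assumption about this particular PDF), the same gaps fall below the threshold and the words run together.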
|
07-25-2016, 05:19 AM | #5 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2016
Device: none
|
Ok, thanks for the info. I'll just have to fix it manually and go from there.
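For the manual fix, a small script can at least locate the affected lines. The following is a rough Python sketch, not a calibre feature: the `find_squashed` function and the 25-character cut-off are assumptions, and long URLs or similar unbroken strings will show up as false positives.

```python
import re


def find_squashed(lines, max_run=25):
    """Return (line_number, text) for lines that contain an unbroken
    run of more than max_run non-whitespace characters.

    max_run is an arbitrary cut-off chosen for this illustration.
    """
    pattern = re.compile(r"\S{%d,}" % (max_run + 1))
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


sample = [
    "Itwasadarkandstormynightandtherainfellhard.",
    "The second line of the chapter is fine.",
]
hits = find_squashed(sample)
```

The spaces themselves still have to be restored by hand, since the word boundaries are not recoverable from the extracted text.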
Many thanks.
|
Tags |
dropcap, ebook, first line, pdf, squashed |
|