Old 07-20-2016, 05:49 AM   #1
richardfoley
First line of chapter is "squashed"

I'm trying to convert an existing PDF into an eBook using Calibre v1.48.

Most of the process works fairly well, except for the first line of each chapter. The first line is "squashed", that is, all of the whitespace is removed and the words run together. This happens only on the first line of each chapter; all other paragraphs are fine.

Heuristics are switched on, and it seems not to matter which settings I use: the first line is always squashed. I suspect this has to do with the first 3 lines in the PDF using a drop cap, as it's the only thing I can think of which is unique to the first line of each chapter. The 2nd and 3rd lines (which follow the drop cap in the PDF) appear just fine, with their expected word spacing.

I've tried using "-d dirpath" and all of the (debug) output files have the squashed text in them already, so I suspect it's the parsing of the drop cap in the original PDF, somehow...

Thanks in advance for any ideas that might be a cause/fix for this.
Old 07-20-2016, 09:59 AM   #2
theducks
I use the Editor to fix whatever ills.

In this case, you have a line-height: <some value less than 1.2 (a typical value)>.
This was probably inherited from a drop cap or other decoration in the original. Find the paragraph class in the CSS and fix that value (or remove the line).
PDF is a terrible source because it is a paste-up format, where the commands can be anywhere on the page.
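
For anyone who wants to check for the line-height case described above, here is a minimal sketch that lists CSS rules with a unitless line-height below 1.2. The stylesheet text is made up for illustration (only the class names are borrowed from this thread), the function name `tight_line_heights` is hypothetical, and the regex approach is a rough stand-in for a real CSS parser.

```python
import re

# Very rough rule matcher: "selector { declarations }" without nesting.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
# Matches only unitless values such as "line-height: 0.9"; values with
# units (px, em, %) are deliberately skipped in this sketch.
LINE_HEIGHT_RE = re.compile(r"line-height\s*:\s*([0-9]*\.?[0-9]+)\s*(;|$)")


def tight_line_heights(css_text, minimum=1.2):
    """Return (selector, value) pairs whose unitless line-height is below minimum."""
    hits = []
    for selector, body in RULE_RE.findall(css_text):
        match = LINE_HEIGHT_RE.search(body)
        if match and float(match.group(1)) < minimum:
            hits.append((selector.strip(), float(match.group(1))))
    return hits


# Hypothetical stylesheet content, not taken from the actual conversion.
sample_css = """
.calibre3 { display: block; line-height: 0.9; margin: 0 }
.calibre1 { line-height: 1.2 }
.calibre7 { height: auto }
"""

for selector, value in tight_line_heights(sample_css):
    print(selector, value)
```

Any rule it reports can then be edited or removed by hand in the Editor.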
Old 07-23-2016, 05:55 AM   #3
richardfoley
Thanks for the feedback/suggestion. I'm not sure we're quite talking about the same thing, though. The plain text extracted from the PDF is already squashed together inside an XML/HTML tag. What this means (to me) is that by the time I get to the parsed first line, it is no longer possible (via CSS or anything else) to identify the individual words, except when a human reads the line. This is the text/line I'm interested in:

Quote:
TaniaandIstartedfromtheTegernseeBahnhofandtramped
This is the output of a grep of the (-d) files to demonstrate what I mean, where you can see that the rest of the text is parsed correctly:

Quote:
pub@thpad ~/test>grep -nriC3 TaniaandI *
input/index.html-338-<a name=17></a><img src="index-17_1.png"/><br>
input/index.html-339-<b>Tegernsee*to*Schliersee</b><br>
input/index.html-340-<i>Richard*Foley</i><br>
input/index.html:341:TaniaandIstartedfromtheTegernseeBah nhofandtramped<br>
input/index.html-342-along*the*easy*trail*up*into*the*base*of*the*steep *forest*above<br>
input/index.html-343-the*lake.*After*a*short*while,*we*reached*the*poin t*where*a<br>
input/index.html-344-smaller*trail*headed*up*the*hill*and*we*stopped*he re*briefly*to*allow<br>
--
parsed/index.html-133-
parsed/index.html-134-<h2 title="**Tegernsee to Schliersee**, *Richard Foley*"><b>Tegernsee to Schliersee</b></h2>
parsed/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
parsed/index.html:136:<p>TaniaandIstartedfromtheTegernsee Bahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
parsed/index.html-137-<p>17</p>
parsed/index.html-138-<p><img src="index-18_1.png"/></p>
parsed/index.html-139-<p>We made a steady pace, and ahead of us could see the small party which had passed us earlier, stopping for a short break. We passed them with the usual pleasantries, the smiles and the have*a*niceday's, which one commonly exchanges with fellow alpine hikers.</p>
--
processed/index.html-135-
processed/index.html-136-<h2 title="**Tegernsee to Schliersee**, *Richard Foley*" class="calibre1"><b class="calibre2">Tegernsee to Schliersee</b></h2>
processed/index.html-137-<h3 class="sigilNotInTOC"><i class="calibre4">Richard Foley</i></h3>
processed/index.html:138:<p class="calibre3">TaniaandIstartedfromtheTegernseeB ahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
processed/index.html-139-<p class="calibre3">17</p>
processed/index.html-140-<p class="calibre3"><img src="index-18_1.png" class="calibre7"/></p>
processed/index.html-141-<p class="calibre3">We made a steady pace, and ahead of us could see the small party which had passed us earlier, stopping for a short break. We passed them with the usual pleasantries, the smiles and the have*a*niceday's, which one commonly exchanges with fellow alpine hikers.</p>
--
structure/index.html-133-
structure/index.html-134-<h2 title="**Tegernsee to Schliersee**, *Richard Foley*" style="page-break-before:always"><b>Tegernsee to Schliersee</b></h2>
structure/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
structure/index.html:136:<p>TaniaandIstartedfromtheTegernsee Bahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
structure/index.html-137-<p>17</p>
structure/index.html-138-<p><img src="index-18_1.png"/></p>
structure/index.html-139-<p>We made a steady pace, and ahead of us could see the small party which had passed us earlier, stopping for a short break. We passed them with the usual pleasantries, the smiles and the have*a*niceday's, which one commonly exchanges with fellow alpine hikers.</p>
pub@thpad ~/test>grep -nriC1 TaniaandI *
input/index.html-340-<i>Richard*Foley</i><br>
input/index.html:341:TaniaandIstartedfromtheTegernseeBah nhofandtramped<br>
input/index.html-342-along*the*easy*trail*up*into*the*base*of*the*steep *forest*above<br>
--
parsed/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
parsed/index.html:136:<p>TaniaandIstartedfromtheTegernsee Bahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
parsed/index.html-137-<p>17</p>
--
processed/index.html-137-<h3 class="sigilNotInTOC"><i class="calibre4">Richard Foley</i></h3>
processed/index.html:138:<p class="calibre3">TaniaandIstartedfromtheTegernseeB ahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
processed/index.html-139-<p class="calibre3">17</p>
--
structure/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
structure/index.html:136:<p>TaniaandIstartedfromtheTegernsee Bahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
structure/index.html-137-<p>17</p>
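
The same check can be sketched as a small script that flags unusually long space-free tokens in one of the debug HTML files. This is only an illustration: the function name `find_squashed_lines` and the 25-character cutoff are assumptions, and the sample text is a shortened copy of the lines above with ordinary spaces.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")


def find_squashed_lines(html_text, max_token_len=25):
    """Return (line_number, token) pairs for suspiciously long space-free tokens."""
    hits = []
    for lineno, line in enumerate(html_text.splitlines(), start=1):
        # Replace tags with spaces so markup never glues two words together.
        text = unescape(TAG_RE.sub(" ", line))
        for token in text.split():
            # Only purely alphabetic tokens count, so URLs and long
            # numbers are less likely to be reported.
            if token.isalpha() and len(token) > max_token_len:
                hits.append((lineno, token))
    return hits


sample = (
    "<i>Richard Foley</i><br>\n"
    "TaniaandIstartedfromtheTegernseeBahnhofandtramped<br>\n"
    "along the easy trail up into the base of the steep forest above<br>\n"
)

for lineno, token in find_squashed_lines(sample):
    print(lineno, token)
```

The cutoff is a judgment call: too low and long German compounds are reported, too high and short squashed lines slip through.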
Old 07-23-2016, 07:16 AM   #4
kovidgoyal
creator of calibre
There is no automated way to fix that -- the problem is that the PDF format contains no words, so when converting it into text one has to guess where there are word boundaries, based on the spacing between individual characters. That guessing occasionally fails, as in this case.
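
To make the guessing concrete, here is a minimal sketch of gap-based word detection. It is not calibre's or any PDF library's actual code: the `Glyph` class, the `layout` helper, and all the widths and thresholds are invented for illustration. It shows that a line whose spaces happen to be narrower than the threshold comes out with no word breaks at all.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # advance width of the glyph


def join_glyphs(glyphs, gap_threshold):
    """Rebuild a line of text, inserting a space wherever the horizontal
    gap between consecutive glyphs exceeds gap_threshold."""
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > gap_threshold:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


def layout(text, char_width, space_width):
    """Hypothetical helper: position characters left to right. Spaces are
    not stored as glyphs; they only leave a gap, as on a PDF page."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width
        else:
            glyphs.append(Glyph(ch, x, char_width))
            x += char_width
    return glyphs


normal = layout("along the easy trail", char_width=5.0, space_width=2.5)
tight = layout("Tania and I started", char_width=5.0, space_width=0.75)

print(join_glyphs(normal, gap_threshold=1.0))  # along the easy trail
print(join_glyphs(tight, gap_threshold=1.0))   # TaniaandIstarted
```

Lowering the threshold would rescue the second line, but at the risk of splitting words on lines with looser letter spacing, which is why no single setting works for every PDF.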
Old 07-25-2016, 05:19 AM   #5
richardfoley
Ok, thanks for the info. I'll just have to fix it manually and go from there.

Many thanks.
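
As an aid to the manual fix, a dictionary-based splitter can propose word breaks for a squashed line. This is a different technique from anything calibre does, offered only as a sketch: the function name `segment` and the tiny word list are assumptions, and with a real dictionary some lines will have more than one valid split, so the result still needs a human check.

```python
def segment(text, words, max_word_len=20):
    """Split text into known words using dynamic programming.

    Returns a list of words (original capitalization kept), or None if the
    text cannot be fully split with the given vocabulary.
    """
    lowered = text.lower()
    n = len(lowered)
    # prev[i] is the start index of the last word in a split of text[:i],
    # or None if text[:i] cannot be split. prev[0] marks the empty prefix.
    prev = [None] * (n + 1)
    prev[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            if prev[start] is not None and lowered[start:end] in words:
                prev[end] = start
                break  # the first hit is the longest possible last word
    if prev[n] is None:
        return None
    pieces = []
    end = n
    while end > 0:
        start = prev[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Tiny illustrative vocabulary; a real run would need a full word list.
vocabulary = {"tania", "and", "i", "started", "from", "the",
              "tegernsee", "bahnhof", "tramped"}

squashed = "TaniaandIstartedfromtheTegernseeBahnhofandtramped"
print(" ".join(segment(squashed, vocabulary)))
```

With this vocabulary the sketch prints "Tania and I started from the Tegernsee Bahnhof and tramped". Proper nouns such as Tegernsee have to be added to the word list by hand.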


Tags
dropcap, ebook, first line, pdf, squashed

