Old 07-23-2016, 05:55 AM   #3
richardfoley
Junior Member
Posts: 3
Karma: 10
Join Date: Jul 2016
Device: none
Thanks for the feedback and suggestion. I'm not sure we're talking about quite the same thing, though. The plain text extracted from the PDF is already squashed together inside an XML/HTML tag. What this means (to me) is that by the time I get to the parsed first line, it is no longer possible, via CSS or anything else, to identify the individual words (except when a human reads the line). This is the line I'm interested in:

Quote:
TaniaandIstartedfromtheTegernseeBahnhofandtramped
Here is the output of a grep over the debug (-d) output files to show what I mean. You can see that the rest of the text is parsed correctly:

Quote:
pub@thpad ~/test>grep -nriC3 TaniaandI *
input/index.html-338-<a name=17></a><img src="index-17_1.png"/><br>
input/index.html-339-<b>Tegernsee*to*Schliersee</b><br>
input/index.html-340-<i>Richard*Foley</i><br>
input/index.html:341:TaniaandIstartedfromtheTegernseeBahnhofandtramped<br>
input/index.html-342-along*the*easy*trail*up*into*the*base*of*the*steep*forest*above<br>
input/index.html-343-the*lake.*After*a*short*while,*we*reached*the*point*where*a<br>
input/index.html-344-smaller*trail*headed*up*the*hill*and*we*stopped*here*briefly*to*allow<br>
--
parsed/index.html-133-
parsed/index.html-134-<h2 title="**Tegernsee to Schliersee**, *Richard Foley*"><b>Tegernsee to Schliersee</b></h2>
parsed/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
parsed/index.html:136:<p>TaniaandIstartedfromtheTegernseeBahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
parsed/index.html-137-<p>17</p>
parsed/index.html-138-<p><img src="index-18_1.png"/></p>
parsed/index.html-139-<p>We made a steady pace, and ahead of us could see the small party which had passed us earlier, stopping for a short break. We passed them with the usual pleasantries, the smiles and the have*a*niceday's, which one commonly exchanges with fellow alpine hikers.</p>
--
processed/index.html-135-
processed/index.html-136-<h2 title="**Tegernsee to Schliersee**, *Richard Foley*" class="calibre1"><b class="calibre2">Tegernsee to Schliersee</b></h2>
processed/index.html-137-<h3 class="sigilNotInTOC"><i class="calibre4">Richard Foley</i></h3>
processed/index.html:138:<p class="calibre3">TaniaandIstartedfromtheTegernseeBahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
processed/index.html-139-<p class="calibre3">17</p>
processed/index.html-140-<p class="calibre3"><img src="index-18_1.png" class="calibre7"/></p>
processed/index.html-141-<p class="calibre3">We made a steady pace, and ahead of us could see the small party which had passed us earlier, stopping for a short break. We passed them with the usual pleasantries, the smiles and the have*a*niceday's, which one commonly exchanges with fellow alpine hikers.</p>
--
structure/index.html-133-
structure/index.html-134-<h2 title="**Tegernsee to Schliersee**, *Richard Foley*" style="page-break-before:always"><b>Tegernsee to Schliersee</b></h2>
structure/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
structure/index.html:136:<p>TaniaandIstartedfromtheTegernseeBahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
structure/index.html-137-<p>17</p>
structure/index.html-138-<p><img src="index-18_1.png"/></p>
structure/index.html-139-<p>We made a steady pace, and ahead of us could see the small party which had passed us earlier, stopping for a short break. We passed them with the usual pleasantries, the smiles and the have*a*niceday's, which one commonly exchanges with fellow alpine hikers.</p>
pub@thpad ~/test>grep -nriC1 TaniaandI *
input/index.html-340-<i>Richard*Foley</i><br>
input/index.html:341:TaniaandIstartedfromtheTegernseeBahnhofandtramped<br>
input/index.html-342-along*the*easy*trail*up*into*the*base*of*the*steep*forest*above<br>
--
parsed/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
parsed/index.html:136:<p>TaniaandIstartedfromtheTegernseeBahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
parsed/index.html-137-<p>17</p>
--
processed/index.html-137-<h3 class="sigilNotInTOC"><i class="calibre4">Richard Foley</i></h3>
processed/index.html:138:<p class="calibre3">TaniaandIstartedfromtheTegernseeBahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
processed/index.html-139-<p class="calibre3">17</p>
--
structure/index.html-135-<h3 class="sigilNotInTOC"><i>Richard Foley</i></h3>
structure/index.html:136:<p>TaniaandIstartedfromtheTegernseeBahnhofandtramped along the easy trail up into the base of the steep forest above the lake. After a short while, we reached the point where a smaller trail headed up the hill and we stopped here briefly to allow a small group to pass before removing our clothes and putting them in our rucksacks. It was refreshing to feel the forest air over our entire skin, and we followed the main trail as it zig*zagged up through the trees.</p>
structure/index.html-137-<p>17</p>
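For anyone who wants to hunt for lines like this without knowing the text in advance, here is a rough Python sketch of the same check the grep does by hand. It only flags suspicious runs, it does not split them back into words. The function name `find_squashed_runs` and the 30-letter threshold are my own assumptions for illustration, not anything calibre provides.

```python
import re

# Assumption: an unbroken run of more than 30 letters is a squashed line,
# not a real word. The threshold is arbitrary and chosen for illustration.
MAX_WORD_LEN = 30

TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper, good enough for a quick check
RUN_RE = re.compile(r"[A-Za-z]{%d,}" % (MAX_WORD_LEN + 1))


def find_squashed_runs(html_line):
    """Return the letter runs in html_line that are too long to be single words."""
    text = TAG_RE.sub(" ", html_line)  # drop tags so class names are not flagged
    return RUN_RE.findall(text)


# Three lines taken from the input/index.html grep output above.
lines = [
    "<i>Richard*Foley</i><br>",
    "TaniaandIstartedfromtheTegernseeBahnhofandtramped<br>",
    "along*the*easy*trail*up*into*the*base*of*the*steep*forest*above<br>",
]

for number, line in enumerate(lines, start=1):
    for run in find_squashed_runs(line):
        print(number, run)
```

Only the second sample line is reported, because the other two have a separator between every word.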
Quote:
Originally Posted by theducks
I use the Editor to fix whatever ills.

In this case, you have a line-height: <some value less than 1.2 (a typical value)>.
This was probably inherited from a drop cap or other decoration in the original. Find the paragraph class in the CSS and fix (or remove) that line.
PDF is a terrible source, as it is a paste-up format where the commands can be anywhere on the page.
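To make the line-height suggestion concrete, here is a small Python sketch that lists CSS rules with a unitless line-height below 1.2. The helper name `tight_line_heights` and the sample stylesheet values are hypothetical; only the class names come from the processed output above. Values with units (px, %, em) are deliberately ignored.

```python
import re

# Assumption: 1.2 is the "typical" line-height mentioned in the quote;
# anything lower is flagged.
TYPICAL_LINE_HEIGHT = 1.2

RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")  # selector { declarations }
LINE_HEIGHT_RE = re.compile(r"line-height\s*:\s*([0-9]*\.?[0-9]+)\s*(?:;|$)")


def tight_line_heights(css_text):
    """Return (selector, value) pairs whose unitless line-height is below 1.2."""
    hits = []
    for selector, body in RULE_RE.findall(css_text):
        match = LINE_HEIGHT_RE.search(body)
        if match and float(match.group(1)) < TYPICAL_LINE_HEIGHT:
            hits.append((selector.strip(), float(match.group(1))))
    return hits


# Hypothetical stylesheet: the line-height values are made up for the example.
sample_css = """
.calibre3 { display: block; line-height: 0.9; margin: 0 }
.calibre1 { line-height: 1.2 }
"""

print(tight_line_heights(sample_css))
```

Any selector it reports is a candidate for the fix described in the quote: raise the value or remove the declaration.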