Quote:
Originally Posted by exaltedwombat
The point is, the export from PDF to docx retains ALL the paragraph indentations. If they could only be marked in some way...?
|
By "retain paragraph indentation", do you mean this in your example image?
paragraph 1+2 + "false" paragraph 3
Code:
<p>[...] in front of the gas fire.</p>
INDENT-IS-HERE -----> <p>After a 'famous' Sunday lunch, [...]</p>
NO-INDENT-HERE -> <p>usual. Pete struggled with the rucksack [...]</p>
You can visually SEE the indent, so you believe that all those are correct (paragraph 1+2)?
And the non-indented paragraphs (paragraph 3) should be merged?
* * *
No real way to tell unless you share the DOCX.
If you are using Word, you may be able to apply a Style to all the non-indented paragraphs by using Select > Select All Text With Similar Formatting.
See my Side Note in the middle of this post.
When you export to HTML, you may then be able to use that Style to easily merge paragraphs together.
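As a rough sketch of that merge step (not a tested recipe for your book): assume the Style you applied in Word comes out in the HTML as a class on the false paragraphs. The class name "noindent" below is made up for illustration; use whatever your export actually produces. The idea is just to replace the paragraph boundary in front of each flagged paragraph with a space:

```python
import re

# Hypothetical class name: the Style applied in Word to the non-indented
# paragraphs will appear under some class in the exported HTML. Adjust to match.
FALSE_PARA_CLASS = "noindent"


def merge_false_paragraphs(html: str, cls: str = FALSE_PARA_CLASS) -> str:
    """Join every <p class="cls"> onto the end of the paragraph before it."""
    # Match the close of the previous paragraph plus the opening tag of the
    # flagged paragraph, and collapse the pair into a single space.
    pattern = re.compile(
        r'</p>\s*<p\s+class="' + re.escape(cls) + r'"\s*>',
        flags=re.IGNORECASE,
    )
    return pattern.sub(" ", html)


sample = (
    "<p>[...] in front of the gas fire.</p>\n"
    "<p>After a 'famous' Sunday lunch, [...]</p>\n"
    '<p class="noindent">usual. Pete struggled with the rucksack [...]</p>\n'
)

merged = merge_false_paragraphs(sample)
print(merged)
```

The same pattern should work as a regex Search & Replace in Sigil (replace with a single space). It only handles the simple case where the class is the sole attribute on the tag, so spot-check the result before trusting it on 500 pages.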
PS. Why not attach your images using MobileRead attachments? Hosting them on an external site just asks for trouble (so many dead links/images in old threads).
Quote:
Originally Posted by exaltedwombat
Yes, I have Toxaris's EPUB Tools. I'm afraid the Postprocess OCR function adds a lot of spurious scenebreaks.
|
It shouldn't.
In EPUBTools, under Options > Postprocess OCR, there's a checkbox for:
Scene detection (whitelines) + Detection Distance (pt)
Perhaps you have that accidentally checked? By default it's off.
Quote:
Originally Posted by exaltedwombat
(And then, with this 500 page book, attempting to Generate EPUB fails with an 'out of memory' error on this powerful PC with 24GB RAM, but that's another problem.)
|
... PM me the PDF+DOCX and I could take a look.
Also, how exactly are you getting this PDF into DOCX?
Calibre convert? Using Word's built-in method?
Quote:
Originally Posted by exaltedwombat
Calibre, with Heuristic Processing turned on, does a rather better job, but there will still be a dozen false paragraph breaks in each chapter needing manual intervention.
|
As you know, PDF is absolutely awful as an Input format.
Any of these methods is going to bring a ton of false positives and will require manual cleanup...
(See Hitch's discussion a few weeks ago about converting PDF professionally.)
Quote:
Originally Posted by exaltedwombat
Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.
|
It should be in the Workshop Forum... it's not dealing with Sigil, or EPUB really.
No big deal. Everyone important who would have seen it here will see it there.