12-10-2018, 10:39 PM   #7
Tex2002ans
Wizard
Quote:
Originally Posted by exaltedwombat
The point is, the export from PDF to docx retains ALL the paragraph indentations. If they could only be marked in some way...?
By "retain paragraph indentation", do you mean this in your example image?

paragraph 1+2 + "false" paragraph 3

Code:
<p>[...] in front of the gas fire.</p>

INDENT-IS-HERE -----> <p>After a 'famous' Sunday lunch, [...]</p>

NO-INDENT-HERE -> <p>usual. Pete struggled with the rucksack [...]</p>
You can visually SEE the indent, so you believe that all those are correct (paragraph 1+2)?

And the non-indented paragraphs (paragraph 3) should be merged?

* * *

No real way to tell unless you share the DOCX.

If you are using Word, you may be able to apply a Style to all the non-indented paragraphs by using Select > Select All Text With Similar Formatting. See my Side Note in the middle of this post.

When you export to HTML, you may then be able to use that Style to easily merge paragraphs together.
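
As a rough sketch of that merge step (not a tested workflow): assume the Style you applied to the non-indented paragraphs comes out of the HTML export as class "NoIndent". That class name is hypothetical, so substitute whatever your export actually produces. A small Python script could then join each of those paragraphs onto the one before it:

```python
import re

# Assumption: the "false" (non-indented) paragraphs carry class="NoIndent"
# after export. The class name is hypothetical; change it to match your file.
# The pattern matches the end of one paragraph plus the opening tag of a
# following NoIndent paragraph, with or without quotes around the class.
CONTINUATION = re.compile(
    r'</p>\s*<p\s+class=["\']?NoIndent["\']?(?:\s[^>]*)?>'
)


def merge_false_paragraphs(html: str) -> str:
    """Join every NoIndent paragraph onto the paragraph before it."""
    # Replacing "</p> <p class=NoIndent>" with a single space glues the
    # two halves of the sentence back together.
    return CONTINUATION.sub(" ", html)


sample = (
    "<p>After a 'famous' Sunday lunch, [...]</p>\n"
    '<p class="NoIndent">usual. Pete struggled with the rucksack [...]</p>'
)
print(merge_false_paragraphs(sample))
```

The same search/replace idea works as a regex in Sigil or any text editor. Words hyphenated across the false break will still need checking by hand, since this just inserts a space.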

PS. Why not attach your images using MobileRead attachments? Hosting them on an external site just asks for trouble (so many dead links/images in old threads).

Quote:
Originally Posted by exaltedwombat
Yes, I have Toxaris's EPUB Tools. I'm afraid the Postprocess OCR function adds a lot of spurious scenebreaks.
It shouldn't.

In EPUBTools, under Options > Postprocess OCR, there's a checkbox for:

Scene detection (whitelines) + Detection Distance (pt)

Perhaps you have that accidentally checked? By default it's off.

Quote:
Originally Posted by exaltedwombat
(And then, with this 500 page book, attempting to Generate EPUB fails with an 'out of memory' error on this powerful PC with 24GB RAM, but that's another problem.)
... PM me the PDF+DOCX and I could take a look.

Also, how exactly are you getting this PDF into DOCX?

Calibre convert? Using Word's built-in method?

Quote:
Originally Posted by exaltedwombat
Calibre, with Heuristic Processing turned on, does a rather better job, but there will still be a dozen false paragraph breaks in each chapter needing manual intervention.
As you know, PDF is absolutely awful as an Input format.

Any of these methods is going to bring a ton of false positives (false paragraph breaks) and will require manual cleanup...

(See Hitch's discussion a few weeks ago about converting PDF professionally.)

Quote:
Originally Posted by exaltedwombat
Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.
It should be in the Workshop Forum... it's not dealing with Sigil, or EPUB really.

No big deal. Everyone important who would have seen it here will see it there.

Last edited by Tex2002ans; 12-10-2018 at 10:49 PM.