Old 12-10-2018, 08:14 PM   #1
exaltedwombat
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
Converting from PDF

I'm given a PDF to turn into EPUB.

Usual problem. Nice clean text transfer. But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts.

But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?

https://imgur.com/a/dHa4HoM

Old 12-10-2018, 09:39 PM   #2
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by exaltedwombat View Post
But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts.

But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?
One word:

Toxaris's EPUB Tools

Postprocess OCR should combine broken sentences.

Note: Although it looks like you have Fiction, quotation marks at the end of lines may be tricky.

* * *

Alternatively, I use these regexes to combine broken paragraphs:

Note: DO NOT Replace All. Only do these one-by-one.

Regex #1

This searches for a hyphen at the end of a line:

Search: -</p>\s+<p>
Replace: (nothing)

Before:

Spoiler:
Code:
<p>This is a sen-</p>

<p>tence.</p>


After:

Spoiler:
Code:
<p>This is a sentence.</p>
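
For anyone who would rather script this than click through an editor, here is a minimal Python sketch of Regex #1. The function name is made up for illustration, and `count=1` is used so that each call joins only one break, in keeping with the "DO NOT Replace All" note.

```python
import re

# Regex #1 from the post: a hyphen directly before a paragraph break.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")

def join_next_hyphen_break(html: str) -> str:
    """Join only the first hyphenated break, mirroring a one-by-one replace."""
    return HYPHEN_BREAK.sub("", html, count=1)

sample = "<p>This is a sen-</p>\n\n<p>tence.</p>"
print(join_next_hyphen_break(sample))
```

A genuine hyphen at a line end (for example "well-" followed by "known") would lose its hyphen the same way, which is why each match still needs a human look.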


Regex #2

This searches for paragraphs that DON'T end on a closing punctuation mark (period, exclamation point, question mark, [...]).

Search: ([^>””\?\!\.])</p>\s+<p>
Replace: \1(put-a-space-here)

Before:

Spoiler:
Code:
<p>This is a</p>

<p>sentence.</p>

<p>One, Two, Three,</p>

<p>Four.</p>


After:

Spoiler:
Code:
<p>This is a sentence.</p>

<p>One, Two, Three, Four.</p>
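
A rough Python sketch of the same idea, for review rather than blind replacement: it lists each suspicious break with some context and merges only the one you pick. The pattern is Regex #2 copied verbatim; the helper names and the plain `<p>` tags without attributes are assumptions.

```python
import re

# Regex #2 from the post, copied verbatim.
# Assumption: paragraphs are plain <p> tags with no attributes.
NO_END_PUNCT = re.compile(r"([^>””\?\!\.])</p>\s+<p>")

def candidates(html: str, context: int = 20):
    """Yield (offset, snippet) for each suspicious break, for manual review."""
    for m in NO_END_PUNCT.finditer(html):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(html))
        yield m.start(), html[start:end]

def merge_at(html: str, offset: int) -> str:
    """Merge the single break whose match starts at `offset`."""
    m = NO_END_PUNCT.match(html, offset)
    if m is None:
        return html  # nothing to merge at this position
    # Keep the captured character and put one space where the break was.
    return html[:m.start()] + m.group(1) + " " + html[m.end():]

sample = (
    "<p>This is a</p>\n\n<p>sentence.</p>\n\n"
    "<p>One, Two, Three,</p>\n\n<p>Four.</p>"
)
for offset, snippet in candidates(sample):
    print(offset, repr(snippet))
```

If you merge several breaks in one pass, work from the last offset to the first so that earlier offsets stay valid.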


Regex #3

This searches for a lowercase letter at the beginning of a paragraph.

Search: <p>[a-z]

Before:

Spoiler:
Code:
<p>this is an example.</p>

<p>of broken paragraph.</p>
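
Since Regex #3 is search-only, a script can simply report where the hits are. A small sketch, assuming plain `<p>` tags and one paragraph per line (the function name is illustrative):

```python
import re

# Regex #3 from the post: a paragraph that opens with a lowercase letter.
LOWERCASE_START = re.compile(r"<p>[a-z]")

def lowercase_starts(html: str):
    """Return 1-based line numbers of paragraphs starting with a lowercase letter."""
    return [
        html.count("\n", 0, m.start()) + 1
        for m in LOWERCASE_START.finditer(html)
    ]

sample = "<p>This is an example</p>\n<p>of broken paragraph.</p>\n<p>Next one.</p>"
print(lowercase_starts(sample))
```

The line numbers give you a checklist to walk through by hand in the editor.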


That'll lead you most of the way there, but the rest will have to be manually checked/corrected. And edge cases which can be either/or (like :) will have to be checked.

But like I mentioned in the Note, Fiction + quotation marks gets a bit trickier. You can tweak Regex #2 above to catch some of those cases, but Fiction requires a lot more manual checking.

Last edited by Tex2002ans; 12-10-2018 at 09:42 PM.
Old 12-10-2018, 09:57 PM   #3
exaltedwombat
Guru
@Tex2002ans, Thanks for the response.

Yes, I have Toxaris's EPUB Tools. I'm afraid the Postprocess OCR function adds a lot of spurious scenebreaks. (And then, with this 500 page book, attempting to Generate EPUB fails with an 'out of memory' error on this powerful PC with 24GB RAM, but that's another problem.) Calibre, with Heuristic Processing turned on, does a rather better job, but there will still be a dozen false paragraph breaks in each chapter needing manual intervention.

The point is, the export from PDF to docx retains ALL the paragraph indentations. If they could only be marked in some way...?
Old 12-10-2018, 10:06 PM   #4
DiapDealer
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
I'll give you a little more time, but unless something that ties this to Sigil comes up pretty soon (something more than "but I'm making an epub"), I'm going to move this thread elsewhere.

Last edited by DiapDealer; 12-10-2018 at 10:21 PM.
Old 12-10-2018, 10:30 PM   #5
exaltedwombat
Guru
Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.
Old 12-10-2018, 10:37 PM   #6
DiapDealer
Grand Sorcerer
We suggest the tools all the time, but in passing. Converting PDFs to DOCX to epub utilizing Tox's epub tools simply doesn't have even a tenuous connection to Sigil. You're welcome to continue the discussion in the Workshop forum, where it belongs.
Old 12-10-2018, 10:39 PM   #7
Tex2002ans
Wizard
Quote:
Originally Posted by exaltedwombat View Post
The point is, the export from PDF to docx retains ALL the paragraph indentations. If they could only be marked in some way...?
By "retain paragraph indentation", do you mean this in your example image?

paragraph 1+2 + "false" paragraph 3

Code:
<p>[...] in front of the gas fire.</p>

INDENT-IS-HERE -----> <p>After a 'famous' Sunday lunch, [...]</p>

NO-INDENT-HERE -> <p>usual. Pete struggled with the rucksack [...]</p>
You can visually SEE the indent, so you believe that all those are correct (paragraph 1+2)?

And the non-indented paragraphs (paragraph 3) should be merged?

* * *

No real way to tell unless you share the DOCX.

If you are using Word, you may be able to apply a Style to all the non-indented paragraphs by using Select > Select All Text With Similar Formatting. See my Side Note in the middle of this post.

When you export to HTML, you may then be able to use that Style to easily merge paragraphs together.
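
As a sketch of what that merge could look like once the Style is in the HTML: the class name `noindent` below is purely hypothetical (whatever name the export gives your Style goes there), and the pattern assumes the class is the only attribute on the `<p>`.

```python
import re

# Hypothetical class name: assumes the Word style applied to the non-indented
# paragraphs is exported as class="noindent" on otherwise plain <p> elements.
CONTINUATION = re.compile(r'</p>\s*<p class="noindent">')

def merge_continuations(html: str) -> str:
    """Fold every continuation paragraph into the paragraph before it."""
    return CONTINUATION.sub(" ", html)

sample = (
    "<p>This is a</p>\n"
    '<p class="noindent">sentence.</p>\n'
    "<p>A new paragraph.</p>"
)
print(merge_continuations(sample))
```

Watch for paragraphs that are legitimately not indented, such as the first paragraph after a scene break; those would need a different Style before running something like this.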

PS. Why not attach your images using MobileRead attachments? Hosting them on an external site just asks for trouble (so many dead links/images in old threads).

Quote:
Originally Posted by exaltedwombat View Post
Yes, I have Toxaris's EPUB Tools. I'm afraid the Postprocess OCR function adds a lot of spurious scenebreaks.
It shouldn't.

In EPUBTools, under Options > Postprocess OCR, there's a checkbox for:

Scene detection (whitelines) + Detection Distance (pt)

Perhaps you have that accidentally checked? By default it's off.

Quote:
Originally Posted by exaltedwombat View Post
(And then, with this 500 page book, attempting to Generate EPUB fails with an 'out of memory' error on this powerful PC with 24GB RAM, but that's another problem.)
... PM me the PDF+DOCX and I could take a look.

Also, how exactly are you getting this PDF into DOCX?

Calibre convert? Using Word's built-in method?

Quote:
Originally Posted by exaltedwombat View Post
Calibre, with Heuristic Processing turned on, does a rather better job, but there will still be a dozen false paragraph breaks in each chapter needing manual intervention.
As you know, PDF is absolutely awful as an Input format.

Any of these methods are going to bring a ton of false positives and will require manual cleanup...

(See Hitch's discussion a few weeks ago about converting PDF professionally.)

Quote:
Originally Posted by exaltedwombat View Post
Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.
It should be in the Workshop Forum... it's not dealing with Sigil, or EPUB really.

No big deal. Everyone important who would have seen it here will see it there.

Last edited by Tex2002ans; 12-10-2018 at 10:49 PM.
Old 12-11-2018, 08:25 PM   #8
Hitch
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by exaltedwombat View Post
Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.
Everything that Tex has told you is accurate. There is, quite simply, no way that we have ever found that will do what you want. You're asking the computer to use human reasoning, and I've yet to see anything that even comes close; Toxaris' plugin is the best thing I've seen. And I've found nothing else that compares to it, either.

Hitch
Old 12-12-2018, 12:30 PM   #9
exaltedwombat
Guru
Quote:
Originally Posted by Tex2002ans View Post
You can visually SEE the indent, so you believe that all those are correct (paragraph 1+2)?

And the non-indented paragraphs (paragraph 3) should be merged?

* * *

No real way to tell unless you share the DOCX.
Thank you @Tex2002ans for your interest.

Sorry if I was not clear. Yes, the indented paragraphs, as seen in the docx, are the correct ones. Because I HAVE checked! All I need is a way of getting the information about where they occur across into an EPUB.

This is a very simple (though long) narrative text. The only problem with conversion is isolating the paragraphs that Word indents from those it doesn't.

Thanks for any further ideas!

Last edited by exaltedwombat; 12-12-2018 at 03:06 PM.
Old 12-13-2018, 06:26 AM   #10
Vroni
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
I'm using this regex:

([^\.\?!«>…])</p>\s*<p[^>]*>

and replace it with

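\1 followed by a single space (otherwise the last word of one paragraph runs into the first word of the next)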

But... this is NOT a replace-all magic regex; you have to go through the matches one at a time. You may need to add other characters to the allowed ones (the characters that can legitimately end a paragraph), such as : or other quotation marks.
Old 12-15-2018, 09:27 AM   #11
exaltedwombat
Guru
Thank you @Vroni. Again, this is an answer to a different question.

The correct paragraphs are visible on-screen, in the docx version. I need a method of conversion to EPUB that recognises those indents, as different to the other paragraph marks, like the one splitting 'than usual' in my example, which I'll repost.

https://imgur.com/a/dHa4HoM
Old 12-15-2018, 06:03 PM   #12
Hitch
Bookmaker & Cat Slave
Quote:
Originally Posted by exaltedwombat View Post
Thank you @Vroni. Again, this is an answer to a different question.

The correct paragraphs are visible on-screen, in the docx version. I need a method of conversion to EPUB that recognises those indents, as different to the other paragraph marks, like the one splitting 'than usual' in my example, which I'll repost.

https://imgur.com/a/dHa4HoM
Well, as I said previously, in our decade of trying to do this, there is no method that does that. No method that emulates human intelligence. Any conversion recognizes all the paragraphs--right or wrong. The indent means nothing to anything that's not a human, IME.

Hitch
Old 12-15-2018, 07:55 PM   #13
exaltedwombat
Guru
I'm still not getting this across, am I!

A computer HAS recognised the indented paragraphs. Acrobat has OCR'd the PDF and spat out a DOCX that has the correct paragraphs indented. Word knows where the correct paragraphs are. But it won't share this information across the next stage of conversion to EPUB.
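
One way to test that idea is to read the indent straight out of the DOCX, which is a zip archive containing `word/document.xml`. The sketch below uses only the Python standard library. It assumes the indent is stored directly on each paragraph as a `w:ind` element with a `w:firstLine` attribute; an indent that comes from a named Style or from a tab character would not be detected. The function names and the tiny demo file are illustrative only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def paragraphs_with_indent(docx_file):
    """Return (has_first_line_indent, text) for each paragraph.

    Assumption: the indent is direct formatting, i.e. <w:ind w:firstLine="...">
    inside the paragraph's own <w:pPr>. Style-based indents are not seen.
    """
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W}}}p"):
        ind = p.find("w:pPr/w:ind", NS)
        first_line = ind.get(f"{{{W}}}firstLine") if ind is not None else None
        indented = first_line not in (None, "0")
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        result.append((indented, text))
    return result

def merge_unindented(paras):
    """Join each non-indented paragraph onto the previous paragraph."""
    merged = []
    for indented, text in paras:
        if indented or not merged:
            merged.append(text)
        else:
            merged[-1] = merged[-1].rstrip() + " " + text.lstrip()
    return merged

def demo_docx() -> io.BytesIO:
    """Build a tiny in-memory archive holding only document.xml (demo only)."""
    ind = '<w:pPr><w:ind w:firstLine="720"/></w:pPr>'
    xml = (
        f'<w:document xmlns:w="{W}"><w:body>'
        f"<w:p>{ind}<w:r><w:t>This is a</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>sentence.</w:t></w:r></w:p>"
        f"<w:p>{ind}<w:r><w:t>Next.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf

for paragraph in merge_unindented(paragraphs_with_indent(demo_docx())):
    print(f"<p>{paragraph}</p>")
```

Whether this works on a real Acrobat export depends on how it writes the indents, and paragraphs that are legitimately flush left (chapter openers, for instance) would be merged wrongly, so the output still needs checking.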
Old 12-15-2018, 09:18 PM   #14
BetterRed
null operator (he/him)
Posts: 20,550
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Have you tried the Mammoth DOCX->HTML converter, either directly or via DiapDealer's DOCXImport wrapper plugin for Sigil? It provides the wherewithal to map Word's Style definitions to CSS definitions, but it is not much use if the DOCX is full of in-line styles.

Or have you tried opening the PDF in Word itself? I am often pleasantly surprised at the quality of the DOCX it (Word 16) creates. That is, the DOCX will have a set of sensible Styles.

BR

Last edited by BetterRed; 12-15-2018 at 09:21 PM.
Old 12-15-2018, 09:52 PM   #15
Hitch
Bookmaker & Cat Slave
Quote:
Originally Posted by exaltedwombat View Post
I'm still not getting this across, am I!

A computer HAS recognised the indented paragraphs. Acrobat has OCR'd the PDF and spat out a DOCX that has the correct paragraphs indented. Word knows where the correct paragraphs are. But it won't share this information across the next stage of conversion to EPUB.
I give up. Everyone here has done this hundreds or thousands of times, right? YES, I get it, Word recognizes the indents. So? When you import the HTML to ePUB, the style SHOULD remain. Are you saying that it doesn't????

Your original post said that:

Quote:
The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?
Right? When you import the exported HTML, from the Docx, the styles WILL remain. But that doesn't solve the spurious excess, wrong-location paragraph marks.

If that's NOT the problem--if for some reason, your styles are NOT exporting, then I don't understand the issue. HOW are you getting the content from docx to ePUB, then??

Hitch