Converting from PDF

exaltedwombat · 12-10-2018, 08:14 PM

I'm given a PDF to turn into EPUB.

Usual problem. Nice clean text transfer. But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts.

But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?

https://imgur.com/a/dHa4HoM

Tex2002ans · 12-10-2018, 09:39 PM

Quote:

Originally Posted by exaltedwombat

But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts.

But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?

One word:

Toxaris's EPUB Tools

Postprocess OCR should combine broken sentences.

Note: Since it looks like you have Fiction, quotation marks at the end of lines may be tricky.

* * *

Or I use these regexes to combine the broken paragraphs:

Note: DO NOT Replace All. Only do these one-by-one.

Regex #1

This searches for a hyphen at the end of a paragraph:

Search: -</p>\s+<p>
Replace: (nothing)

Regex #2

This searches for paragraphs that DON'T end on a closing punctuation mark (period, exclamation point, question mark, [...]).

Search: ([^>””\?\!\.])</p>\s+<p>
Replace: \1(put-a-space-here)

Regex #3

This searches for a lowercase letter in the beginning of a paragraph.

Search: <p>[a-z]

That'll lead you most of the way there, but the rest will have to be manually checked/corrected. And edge cases which can be either/or (like :) will have to be checked.

But like I mentioned in the Note, Fiction + quotation marks gets a bit trickier. You can tweak Regex #2 above to catch some of those cases, but Fiction requires a lot more manual checking.
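For anyone who would rather script the three searches than run them in an editor, here is a minimal Python sketch. It assumes every paragraph is wrapped in plain <p>...</p> tags, and the sample text is made up for illustration. Each merge is applied one match at a time, in keeping with the "DO NOT Replace All" note.

```python
import re

# Assumption: paragraphs are plain <p>...</p> elements, one per line.
# The sample text below is invented for illustration.
html = (
    "<p>Pete struggled with the ruck-</p>\n"
    "<p>sack for longer than</p>\n"
    "<p>usual. Then he gave up.</p>\n"
    "<p>\u201cFine,\u201d he said.</p>"
)

# Regex #1: a paragraph ending in a hyphen is joined to the next one.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")
# Regex #2: a paragraph NOT ending in closing punctuation is joined with a space.
OPEN_ENDED = re.compile(r'([^>\u201d"?!.])</p>\s+<p>')
# Regex #3: paragraphs that start with a lowercase letter (review by hand).
LOWERCASE_START = re.compile(r"<p>[a-z]")

def merge_one(pattern, replacement, text):
    """Apply a single replacement only, so each change can be checked."""
    return pattern.sub(replacement, text, count=1)

step1 = merge_one(HYPHEN_BREAK, "", html)
suspects_before = LOWERCASE_START.findall(step1)  # lowercase starts still present
step2 = merge_one(OPEN_ENDED, r"\1 ", step1)
```

Regex #1 also removes hyphens that belong in the word (for example "well-known" split across lines), which is one more reason to step through the matches rather than replace them all.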

exaltedwombat · 12-10-2018, 09:57 PM

@Tex2002ans, Thanks for the response.

Yes, I have Toxaris's EPUB Tools. I'm afraid the Postprocess OCR function adds a lot of spurious scenebreaks. (And then, with this 500 page book, attempting to Generate EPUB fails with an 'out of memory' error on this powerful PC with 24GB RAM, but that's another problem.) Calibre, with Heuristic Processing turned on, does a rather better job, but there will still be a dozen false paragraph breaks in each chapter needing manual intervention.

The point is, the export from PDF to docx retains ALL the paragraph indentations. If they could only be marked in some way...?

DiapDealer · 12-10-2018, 10:06 PM

I'll give you a little more time, but unless something that ties this to Sigil comes up pretty soon (something more than "but I'm making an epub"), I'm going to move this thread elsewhere.

exaltedwombat · 12-10-2018, 10:30 PM

Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.

DiapDealer · 12-10-2018, 10:37 PM

We suggest the tools all the time, but in passing. Converting PDFs to DOCX to epub utilizing Tox's epub tools simply doesn't have even a tenuous connection to Sigil. You're welcome to continue the discussion in the Workshop forum, where it belongs.

Tex2002ans · 12-10-2018, 10:39 PM

Quote:

Originally Posted by exaltedwombat

The point is, the export from PDF to docx retains ALL the paragraph indentations. If they could only be marked in some way...?

By "retain paragraph indentation", do you mean this in your example image?

paragraph 1+2 + "false" paragraph 3

Code:

<p>[...] in front of the gas fire.</p>

INDENT-IS-HERE -----> <p>After a 'famous' Sunday lunch, [...]</p>

NO-INDENT-HERE -> <p>usual. Pete struggled with the rucksack [...]</p>

You can visually SEE the indent, so you believe that all those are correct (paragraph 1+2)?

And the non-indented paragraphs (paragraph 3) should be merged?

* * *

No real way to tell unless you share the DOCX.

If you are using Word, you may be able to apply a Style to all the non-indented paragraphs by using Select > Select All Text With Similar Formatting. See my Side Note in the middle of this post.

When you export to HTML, you may then be able to use that Style to easily merge paragraphs together.
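A rough sketch of that merge step in Python, assuming the exported HTML has been cleaned up so that the Style shows up as a class attribute. The class names "indent" and "noindent" are hypothetical, and the sample text only loosely follows the example image.

```python
import re

# Hypothetical: "noindent" is the class produced by the Style applied in Word
# to the non-indented (false) paragraphs. A real export will use another name.
CONTINUATION = re.compile(r'</p>\s*<p class="noindent">')

def merge_continuations(html: str) -> str:
    """Join every 'noindent' paragraph onto the paragraph before it."""
    return CONTINUATION.sub(" ", html)

sample = (
    '<p class="indent">After lunch they left later than</p>\n'
    '<p class="noindent">usual. Pete struggled with the rucksack.</p>\n'
    '<p class="indent">Nobody spoke.</p>'
)
merged = merge_continuations(sample)
```

Two things still need a manual check: words hyphenated across the break get a stray space, and paragraphs that are legitimately unindented (often the first one in a chapter or after a scene break) would be merged by mistake.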

PS. Why not attach your images using MobileRead attachments? Hosting them on an external site just asks for trouble (so many dead links/images in old threads).

Quote:

Originally Posted by exaltedwombat

Yes, I have Toxaris's EPUB Tools. I'm afraid the Postprocess OCR function adds a lot of spurious scenebreaks.

It shouldn't.

In EPUBTools, under Options > Postprocess OCR, there's a checkbox for:

Scene detection (whitelines) + Detection Distance (pt)

Perhaps you have that accidentally checked? By default it's off.

Quote:

Originally Posted by exaltedwombat

(And then, with this 500 page book, attempting to Generate EPUB fails with an 'out of memory' error on this powerful PC with 24GB RAM, but that's another problem.)

... PM me the PDF+DOCX and I could take a look.

Also, how exactly are you getting this PDF into DOCX?

Calibre convert? Using Word's built-in method?

Quote:

Originally Posted by exaltedwombat

Calibre, with Heuristic Processing turned on, does a rather better job, but there will still be a dozen false paragraph breaks in each chapter needing manual intervention.

As you know, PDF is absolutely awful as an Input format.

Any of these methods are going to bring a ton of false positives and will require manual cleanup...

(See Hitch's discussion a few weeks ago about converting PDF professionally.)

Quote:

Originally Posted by exaltedwombat

Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.

It should be in the Workshop Forum... it's not dealing with Sigil, or EPUB really.

No big deal. Everyone important who would have seen it here will see it there.

Hitch · 12-11-2018, 08:25 PM

Quote:

Originally Posted by exaltedwombat

Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.

Everything that Tex has told you is accurate. There is, quite simply, no way that we have ever found that will do what you want. You're asking the computer to use human reasoning, and I've yet to see anything that even comes close; Toxaris' plugin is the best thing I've seen. And I've found nothing else that compares to it, either.

Hitch

exaltedwombat · 12-12-2018, 12:30 PM

Quote:

Originally Posted by Tex2002ans

You can visually SEE the indent, so you believe that all those are correct (paragraph 1+2)?

And the non-indented paragraphs (paragraph 3) should be merged?

* * *

No real way to tell unless you share the DOCX.

Thank you @Tex2002ans for your interest.

Sorry if I was not clear. Yes, the indented paragraphs, as seen in the docx, are the correct ones. Because I HAVE checked! All I need is a way of getting the information about where they occur across into an EPUB.

This is a very simple (though long) narrative text. The only problem with conversion is isolating the paragraphs that Word indents from those it doesn't.

Thanks for any further ideas!

Vroni · 12-13-2018, 06:26 AM

I'm using this regex

([^\.\?!«>…])</p>\s*<p[^>]*>

and replace it with

\1

But... this is NOT a replace-all magic regex; you have to go through the matches one at a time. You may need to add several other characters to the excluded ones, like : or other quotation marks.
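To make the one-at-a-time review less painful, a small Python sketch can list every place the pattern would join, with some context around it, before anything is changed. It assumes the pattern is meant to span a closing </p> and the next opening <p ...> tag. The function name and sample are only for illustration.

```python
import re

# Assumed form of the pattern: last character of a paragraph that is not
# . ? ! « > or … (written as \u00ab and \u2026), then </p>, then the next <p ...>.
CANDIDATE = re.compile(r"([^.?!\u00ab>\u2026])</p>\s*<p[^>]*>")

def list_candidates(html, context=20):
    """Return (offset, snippet) for each join the pattern would make."""
    found = []
    for m in CANDIDATE.finditer(html):
        start = max(0, m.start() - context)
        end = min(len(html), m.end() + context)
        found.append((m.start(), html[start:end].replace("\n", " ")))
    return found

sample = (
    '<p>later than</p>\n<p class="x">usual.</p>\n'
    "<p>He said:</p>\n<p>Go.</p>"
)
candidates = list_candidates(sample)
```

In the sample, the colon case is reported as a candidate even though it is probably a real paragraph end, which is exactly the kind of match that has to be judged by eye.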

exaltedwombat · 12-15-2018, 09:27 AM

Thank you @Vroni. Again, this is an answer to a different question.

The correct paragraphs are visible on-screen, in the docx version. I need a method of conversion to EPUB that recognises those indents, as different to the other paragraph marks, like the one splitting 'than usual' in my example, which I'll repost.

https://imgur.com/a/dHa4HoM

Hitch · 12-15-2018, 06:03 PM

Quote:

Originally Posted by exaltedwombat

Thank you @Vroni. Again, this is an answer to a different question.

The correct paragraphs are visible on-screen, in the docx version. I need a method of conversion to EPUB that recognises those indents, as different to the other paragraph marks, like the one splitting 'than usual' in my example, which I'll repost.

https://imgur.com/a/dHa4HoM

Well, as I said previously, in our decade of trying to do this, there is no method that does that. No method that emulates human intelligence. Any conversion recognizes all the paragraphs--right or wrong. The indent means nothing to anything that's not a human, IME.

Hitch

exaltedwombat · 12-15-2018, 07:55 PM

I'm still not getting this across, am I!

A computer HAS recognised the indented paragraphs. Acrobat has OCR'd the PDF and spat out a DOCX that has the correct paragraphs indented. Word knows where the correct paragraphs are. But it won't share this information across the next stage of conversion to EPUB.
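One way to get at that information directly is to read the paragraph properties out of the DOCX, which is a zip file containing word/document.xml. The sketch below uses only the Python standard library and rests on a big assumption: that the indent was stored as direct formatting, as a w:firstLine attribute on each paragraph's w:ind element. If the indent comes from a paragraph Style, a tab, or leading spaces instead, this will find nothing and the check would have to be adapted. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def paragraphs_with_indent(docx_file):
    """Yield (text, has_first_line_indent) for each paragraph in the DOCX."""
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        ind = p.find("w:pPr/w:ind", NS)
        first_line = ind.get(f"{{{W}}}firstLine") if ind is not None else None
        # Assumption: any non-zero firstLine value marks a true paragraph start.
        yield text, first_line not in (None, "", "0")

def merge_by_indent(docx_file):
    """Start a new paragraph only where a first-line indent was recorded."""
    merged = []
    for text, indented in paragraphs_with_indent(docx_file):
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        if indented or not merged:
            merged.append(text)
        else:
            merged[-1] += " " + text  # continuation of the previous paragraph
    return merged
```

Calling merge_by_indent("book.docx") (file name hypothetical) returns a list of paragraph strings that can then be wrapped in <p> tags. Hyphenated line ends and legitimately unindented paragraphs, such as chapter openers, would still need a manual pass.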

BetterRed · 12-15-2018, 09:18 PM

Have you tried the Mammoth DOCX->HTML converter, either directly or via DiapDealer's DOCXImport wrapper plugin for Sigil? It provides the wherewithal to map Word's Style definitions to CSS definitions, though it's not much use if the DOCX is full of in-line styles.

Or have you tried opening the PDF in Word itself? I am often pleasantly surprised at the quality of the DOCX it (Word 16) creates. That is, the DOCX will have a set of sensible Styles.

BR

Hitch · 12-15-2018, 09:52 PM

Quote:

Originally Posted by exaltedwombat

I'm still not getting this across, am I!

A computer HAS recognised the indented paragraphs. Acrobat has OCR'd the PDF and spat out a DOCX that has the correct paragraphs indented. Word knows where the correct paragraphs are. But it won't share this information across the next stage of conversion to EPUB.

I give up. Everyone here has done this hundreds or thousands of times, right? YES, I get it, Word recognizes the indents. So? When you import the HTML to ePUB, the style SHOULD remain. Are you saying that it doesn't????

Your original post said that:

Quote:

The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks?

Right? When you import the exported HTML, from the Docx, the styles WILL remain. But that doesn't solve the spurious excess, wrong-location paragraph marks.

If that's NOT the problem--if for some reason, your styles are NOT exporting, then I don't understand the issue. HOW are you getting the content from docx to ePUB, then??

Hitch
