12-10-2018, 08:14 PM | #1 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
Converting from PDF
I'm given a PDF to turn into EPUB.
Usual problem. Nice clean text transfer. But extra paragraph markers appear. The laborious job of checking against the original and removing the spurious ones starts. But I can export to docx. The correct paragraph indents survive! How do I tell a conversion process that THESE are the paragraphs I want, ignore those excess paragraph marks? https://imgur.com/a/dHa4HoM |
12-10-2018, 09:39 PM | #2 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Toxaris's EPUB Tools Postprocess OCR should combine broken sentences.
Note: Although it looks like you have Fiction, quotation marks at the end of lines may be tricky.
* * *
Or I use these Regex to combine:
Note: DO NOT Replace All. Only do these one-by-one.
Regex #1
This searches for a hyphen at the end of a line:
Search: -</p>\s+<p>
Replace: (nothing)
Regex #2
This searches for paragraphs that DON'T end on a closing punctuation mark (period, exclamation point, question mark, [...]).
Search: ([^>””\?\!\.])</p>\s+<p>
Replace: \1(put-a-space-here)
Regex #3
This searches for a lowercase letter at the beginning of a paragraph.
Search: <p>[a-z]
That'll lead you most of the way there, but the rest will have to be manually checked/corrected. And edge cases which can be either/or (like :) will have to be checked.
But like I mentioned in the Note, Fiction + quotation marks gets a bit trickier. You can tweak Regex #2 above to catch some of those cases, but Fiction requires a lot more manual checking.
Last edited by Tex2002ans; 12-10-2018 at 09:42 PM. |
|
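For anyone who would rather script the three searches than click through them in an editor, here is a minimal Python sketch. The patterns are copied from the post above; the function name `join_next` and the sample text are made up for illustration. It applies only one join per call, in keeping with the "do not Replace All" warning, so each change can be inspected before the next one.

```python
import re

# Regex #1: hyphen at the end of a paragraph (the hyphen is dropped on join).
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")
# Regex #2: paragraph that does not end on closing punctuation.
OPEN_BREAK = re.compile(r"([^>””\?\!\.])</p>\s+<p>")
# Regex #3: paragraph starting with a lowercase letter (search only).
LOWERCASE_START = re.compile(r"<p>[a-z]")


def join_next(html):
    """Apply a single join and return (new_html, changed).

    Regex #1 is tried first; Regex #2 is used only if #1 finds nothing.
    """
    joined, count = HYPHEN_BREAK.subn("", html, count=1)
    if count:
        return joined, True
    joined, count = OPEN_BREAK.subn(r"\1 ", html, count=1)
    return joined, bool(count)


if __name__ == "__main__":
    sample = (
        "<p>After lunch, later than</p>\n"
        "<p>usual. Pete strug-</p>\n"
        "<p>gled with the rucksack.</p>\n"
        "<p>Next paragraph.</p>"
    )
    changed = True
    while changed:
        sample, changed = join_next(sample)
        print(sample, "\n---")
```

Note that Regex #1 removes the hyphen, so a genuine compound split across the break (a hypothetical "well-" / "known") would be glued together wrongly. That is one reason to review each change rather than loop blindly as the demo does.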
12-10-2018, 09:57 PM | #3 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
@Tex2002ans, Thanks for the response.
Yes, I have Toxaris's EPUB Tools. I'm afraid the Postprocess OCR function adds a lot of spurious scenebreaks. (And then, with this 500 page book, attempting to Generate EPUB fails with an 'out of memory' error on this powerful PC with 24GB RAM, but that's another problem.) Calibre, with Heuristic Processing turned on, does a rather better job, but there will still be a dozen false paragraph breaks in each chapter needing manual intervention. The point is, the export from PDF to docx retains ALL the paragraph indentations. If they could only be marked in some way...? |
12-10-2018, 10:06 PM | #4 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I'll give you a little more time, but unless something that ties this to Sigil comes up pretty soon (something more than "but I'm making an epub"), I'm going to move this thread elsewhere.
Last edited by DiapDealer; 12-10-2018 at 10:21 PM. |
12-10-2018, 10:30 PM | #5 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
Look on it as a thread about Toxaris's EPUB Tools? They aren't directly connected with Sigil, but we give discussion of them houseroom. I'm looking for something that does their job, only better. Something that detects the indents I can SEE in a docx file and gets them into Sigil.
|
12-10-2018, 10:37 PM | #6 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
We suggest the tools all the time, but in passing. Converting PDFs to DOCX to epub utilizing Tox's epub tools simply doesn't have even a tenuous connection to Sigil. You're welcome to continue the discussion in the Workshop forum, where it belongs.
|
12-10-2018, 10:39 PM | #7 | |||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
paragraph 1+2 + "false" paragraph 3
Code:
<p>[...] in front of the gas fire.</p>
INDENT-IS-HERE -----> <p>After a 'famous' Sunday lunch, [...]</p>
NO-INDENT-HERE -> <p>usual. Pete struggled with the rucksack [...]</p>
And the non-indented paragraphs (paragraph 3) should be merged?
* * *
No real way to tell unless you share the DOCX.
If you are using Word, you may be able to apply a Style to all the non-indented paragraphs by using Select > Select All Text With Similar Formatting. See my Side Note in the middle of this post. When you export to HTML, you may then be able to use that Style to easily merge paragraphs together.
PS. Why not attach your images using MobileRead attachments? Hosting them on an external site just asks for trouble (so many dead links/images in old threads).
Quote:
In EPUBTools, under Options > Postprocess OCR, there's a checkbox for: Scene detection (whitelines) + Detection Distance (pt). Perhaps you have that accidentally checked? By default it's off.
Quote:
Also, how exactly are you getting this PDF into DOCX? Calibre convert? Using Word's built-in method?
Quote:
Any of these methods are going to bring a ton of false positives and will require manual cleanup... (See Hitch's discussion a few weeks ago about converting PDF professionally.)
Quote:
No big deal. Everyone important who would have seen it here will see it there.
Last edited by Tex2002ans; 12-10-2018 at 10:49 PM. |
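If the Style trick works, the merge step after the HTML export could look something like the Python sketch below. It assumes the non-indented paragraphs end up with a class named `NoIndent`; that name is hypothetical, so use whatever class Word actually writes for the Style you applied. Word's HTML export does not always quote attribute values, so the pattern accepts both `class=NoIndent` and `class="NoIndent"`.

```python
import re


def merge_by_class(html, cls="NoIndent"):
    """Glue every <p class=cls> onto the paragraph before it.

    The closing </p>, any whitespace, and the opening tag of the
    non-indented paragraph are replaced by a single space.
    """
    pattern = re.compile(
        r'</p>\s*<p\s+class="?%s"?(?=[\s>])[^>]*>' % re.escape(cls)
    )
    return pattern.sub(" ", html)


if __name__ == "__main__":
    exported = (
        "<p class=Indent>After a 'famous' Sunday lunch, later than</p>\n"
        "<p class=NoIndent>usual. Pete struggled with the rucksack.</p>\n"
        "<p class=Indent>Next paragraph.</p>"
    )
    print(merge_by_class(exported))
```

Because the class marks exactly the paragraphs to merge, this is one case where a replace-all is reasonable, provided the Style was applied to the right paragraphs in Word.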
|||||
12-11-2018, 08:25 PM | #8 | |
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Hitch |
|
12-12-2018, 12:30 PM | #9 | |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
Quote:
Sorry if I was not clear. Yes, the indented paragraphs, as seen in the docx, are the correct ones. Because I HAVE checked! All I need is a way of getting the information about where they occur across into an EPUB. This is a very simple (though long) narrative text. The only problem with conversion is isolating the paragraphs that Word indents from those it doesn't. Thanks for any further ideas! Last edited by exaltedwombat; 12-12-2018 at 03:06 PM. |
|
12-13-2018, 06:26 AM | #10 |
Banned
Posts: 168
Karma: 10010
Join Date: Oct 2018
Device: Tolino/PRS 650/Tablet
|
I'm using this regex
([^\.\?!«>…])</p>\s*<p[^>]*> and replace it with \1 But... this is NOT a replace-all magic regex; you have to go through the matches one at a time. You may need to add several other characters to the allowed ones, like : or other quotation marks. |
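To go through the matches one at a time outside an editor, a small Python sketch like this can list each suspect break with some context on either side. The pattern is the one from the post; the function name `suspect_breaks` and the `context` parameter are made up for illustration. It only reports candidates and changes nothing.

```python
import re

# Paragraph end that is not preceded by closing punctuation,
# followed by an opening <p> tag with or without attributes.
SUSPECT = re.compile(r"([^\.\?!«>…])</p>\s*<p[^>]*>")


def suspect_breaks(html, context=25):
    """Return (offset, text_before, text_after) for each possible false break."""
    hits = []
    for m in SUSPECT.finditer(html):
        before = html[max(0, m.start() - context): m.start() + 1]
        after = html[m.end(): m.end() + context]
        hits.append((m.start(), before, after))
    return hits


if __name__ == "__main__":
    page = (
        '<p>later than</p><p class="calibre1">usual.</p>\n'
        "<p>Next one.</p>"
    )
    for offset, before, after in suspect_breaks(page):
        print(f"{offset}: ...{before} | {after}...")
```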
12-15-2018, 09:27 AM | #11 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
Thank you @Vroni. Again, this is an answer to a different question.
The correct paragraphs are visible on-screen, in the docx version. I need a method of conversion to EPUB that recognises those indents, as different to the other paragraph marks, like the one splitting 'than usual' in my example, which I'll repost. https://imgur.com/a/dHa4HoM |
12-15-2018, 06:03 PM | #12 | |
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Hitch |
|
12-15-2018, 07:55 PM | #13 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
I'm still not getting this across, am I!
A computer HAS recognised the indented paragraphs. Acrobat has OCR'd the PDF and spat out a DOCX that has the correct paragraphs indented. Word knows where the correct paragraphs are. But it won't share this information across the next stage of conversion to EPUB. |
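One way to get at what Word knows is to read the indents straight out of the DOCX and skip the HTML export altogether. The following is a rough Python sketch, not something tested on this book. It assumes that Acrobat's DOCX stores each indent as a first-line indent in the paragraph's own properties (`w:ind` with a `w:firstLine` attribute in `word/document.xml`). If the indents come from a named Style, a tab, or leading spaces, this will not see them. The function names are invented for illustration, and only the Python standard library is used.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_with_indent(docx_file):
    """Return a list of (text, is_indented), one entry per paragraph."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        ind = p.find(f"{{{W}}}pPr/{{{W}}}ind")
        first_line = ind.get(f"{{{W}}}firstLine") if ind is not None else None
        result.append((text, first_line not in (None, "0")))
    return result


def merge_unindented(paras):
    """Start a new paragraph only at an indent; append everything else."""
    merged = []
    for text, indented in paras:
        text = text.strip()
        if not text:
            continue
        if indented or not merged:
            merged.append(text)
        else:
            merged[-1] += " " + text
    return merged


def to_html(paragraphs):
    """Wrap each merged paragraph in <p> tags, ready to paste into Sigil."""
    return "\n".join(f"<p>{escape(t)}</p>" for t in paragraphs)


if __name__ == "__main__":
    import sys
    print(to_html(merge_unindented(paragraphs_with_indent(sys.argv[1]))))
```

Chapter openings and paragraphs that follow a heading are often set without an indent, so those would be glued to the preceding paragraph and still need a manual check. Italics and other inline formatting are also dropped by this sketch.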
12-15-2018, 09:18 PM | #14 |
null operator (he/him)
Posts: 20,553
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Have you tried the Mammoth DOCX->HTML converter, either direct or via DiapDealer's DOCXImport wrapper plugin for Sigil? It provides the wherewithal to map Word's Style definitions to CSS definitions, but it's not much use if the DOCX is full of in-line styles.
Or have you tried opening the PDF in Word itself? I am often pleasantly surprised at the quality of the DOCX it (Word 16) creates. That is, the DOCX will have a set of sensible Styles. BR Last edited by BetterRed; 12-15-2018 at 09:21 PM. |
12-15-2018, 09:52 PM | #15 | ||
Bookmaker & Cat Slave
Posts: 11,460
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Your original post said that: Quote:
If that's NOT the problem--if for some reason, your styles are NOT exporting, then I don't understand the issue. HOW are you getting the content from docx to ePUB, then?? Hitch |
||
|