12-16-2018, 01:18 AM | #16 |
null operator (he/him)
Posts: 20,569
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@exaltedwombat - the screenshot in the first post, presumably from Word, is riddled with anomalies. I've marked those I can see at a glance; AFAIK, conversion by almost any means will faithfully carry them over to the EPUB:
Given you have the document in Word, it should only take a short while to fix most of the anomalies with simple Word macros and Tox's ePub Tools. I'd hazard a guess that the source is a scanned PDF. As has so often been said, getting a perfect conversion of a scanned PDF is tedious. Hitch has explained on numerous occasions how they do it in her business - she probably has it tucked away in her paste buffer. BR |
12-16-2018, 01:34 PM | #17 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I am struggling to figure out what's being asked. My last take on this is that when he puts the text into Sigil (?), he's NOT seeing indents. If that's the case, it's simply the Styles. (Or, if he's copy-pasting into Sigil, same thing--the styles have to be created/set up in the CSS.) If that's not it, then I still don't understand the question. I thought it was "how do I NOT import the broken paragraph pilcrows/codes," but his last post indicates that isn't the question. So....at this point, your guess is as good as mine. Hitch |
|
12-16-2018, 03:42 PM | #18 | |
null operator (he/him)
Posts: 20,569
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
I suspect he's missed the tiny second paragraph break (pilcrow) where I have written "scene break?"; that would explain his complaint about EPub Tools' OCR Postprocess inserting spurious scene breaks. And those triple spaces (···) are probably missing paragraph breaks. The one I highlighted, which reads ". . . one."···"Oh, shit!", almost certainly is. BR Last edited by BetterRed; 12-29-2018 at 05:25 PM. |
|
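The triple-space observation above can be turned into a quick check. What follows is only a minimal sketch in Python, not anything EPub Tools or Word does: the function name `split_lost_breaks` and the rule itself (three or more spaces or middle dots between closing punctuation and an opening quote or capital letter) are assumptions made for illustration, and the rule will need tuning for any real manuscript.

```python
import re

# Assumption: a run of 3+ spaces (or the middle dots Word shows for spaces
# when formatting marks are on) that follows closing punctuation and precedes
# an opening quote or a capital letter stands in for a lost paragraph break.
LOST_BREAK = re.compile(r'(?<=[.!?"\u201d])[ \u00b7]{3,}(?=["\u201cA-Z])')


def split_lost_breaks(text):
    """Split one paragraph wherever a suspected lost paragraph break occurs."""
    return LOST_BREAK.split(text)


if __name__ == "__main__":
    sample = '"Not that one."   "Oh, no!"'
    for piece in split_lost_breaks(sample):
        print(piece)
```

Ordinary double spaces after a full stop are left alone, because the pattern only fires on runs of three or more.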
12-16-2018, 04:01 PM | #19 |
Imperfect Perfectionist
Posts: 464
Karma: 724664
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
I think the OP wants to join up the text that has been split by invalid line breaks/paragraph breaks. If that assumption is correct, I don't think there is any way of coercing Word into doing that while exporting. But there are some tools available to do some of the heavy work (though not necessarily for free):
TransTools for Word has an "UnBreaker" tool. It lists all the line breaks that are (probably) not valid, and the user can go through the list and mark those he thinks should be rectified. (TransTools is a suite of VBA macros, and as such rather slow, so it should be tried on a small subset of the file in question first to see if it's usable.)
Pepito Cleaner (for OpenOffice/LibreOffice) has something similar. (It's broken on LibreOffice 6.1, but should work on 6.0.)
If the PDF has a text layer, the OP might try SoftMaker's FlexiPDF (it must be the Pro version!) to export the text layer. It does a hell of a job figuring out the correct paragraph breaks. It has a somewhat steep price (I've never regretted buying it, though), but another software vendor, Ashampoo, has licensed it from SoftMaker and sometimes sells it for $20 or so.
Just my $0.02 worth of suggestions - hope it might help. Regards, Kim |
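For anyone curious what a "list the doubtful breaks, then let the user approve them" workflow looks like in principle, here is a rough Python sketch. It is not how TransTools or Pepito Cleaner work internally; the function names `suspect_breaks` and `join_breaks`, the `TERMINATORS` set, and the heuristic (a paragraph that ends without closing punctuation, or a following paragraph that starts with a lower-case letter) are all assumptions for illustration.

```python
# Characters that plausibly end a paragraph (an assumption; adjust to taste).
TERMINATORS = '.!?:"\u201d\u2019)'


def suspect_breaks(paragraphs):
    """Return each index i where the break after paragraphs[i] looks invalid."""
    suspects = []
    for i in range(len(paragraphs) - 1):
        current = paragraphs[i].rstrip()
        following = paragraphs[i + 1].lstrip()
        if not current or not following:
            continue  # empty paragraphs are left for a human to judge
        ends_open = current[-1] not in TERMINATORS
        starts_lower = following[0].islower()
        if ends_open or starts_lower:
            suspects.append(i)
    return suspects


def join_breaks(paragraphs, approved):
    """Merge paragraph i with paragraph i + 1 for every approved index i."""
    merged = []
    carry = False  # True when the previous break was approved for removal
    for i, para in enumerate(paragraphs):
        if carry:
            merged[-1] = merged[-1].rstrip() + " " + para.lstrip()
        else:
            merged.append(para)
        carry = i in approved
    return merged


if __name__ == "__main__":
    paras = ["It was a dark and", "stormy night.", "The rain fell.", "Chapter Two"]
    flagged = suspect_breaks(paras)
    print(flagged)
    print(join_breaks(paras, set(flagged)))
```

The heuristic will also flag headings and lines of verse that are followed by more text, which is why a review step between `suspect_breaks` and `join_breaks` matters. It makes no attempt to rejoin words hyphenated across a break.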
12-16-2018, 04:17 PM | #20 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Long time no see. If FlexiPDF works, then $80 is cheap, at least for anyone working with PDFs regularly. Hell, I'd be willing to try it, even if it can only be used on, let's say, fiction PDFs. You have the Pro version, presumably? (FYI, I think the OP is saying, in a roundabout way, that the export to ePUB is resulting in block-style paras. I know, it sounds daft, but that's my latest interpretation. The last time I tried to discuss broken paras, he sounded exasperated ["I'm still not getting this across, am I?"], so I don't think it's the broken paras. I hope we find out soon!) Hitch |
|
12-16-2018, 04:30 PM | #21 | |
Imperfect Perfectionist
Posts: 464
Karma: 724664
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
Regarding the question from the OP, it would be SO nice to have a sample of the actual file, since none of us here is clairvoyant (I'm not, AFAIK). Regards, Kim |
|
12-16-2018, 05:01 PM | #22 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
ETA: Uh, FlexiPDF exports every PAGE in a PDF, as a Word file? (update: figured out how to work around that.) No, thanks. I downloaded and tried the export...hopefully, I've missed something obvious, but who the hell would want that? And--can't say I've ever seen THIS before, it puts a paragraph mark BEFORE each paragraph, in addition to after each paragraph. Kim, have you actually used this? Honestly, at the moment, it seems that the native Adobe export would be far, far cleaner than this. Are you using this? ETA2: and every paragraph came out, in a TABLE? Oy, run, do not walk, from that product. What a disaster! Maybe I'm a dimwit, and I used it wrongly, but so far, it's a bloody mess. Hitch Last edited by Hitch; 12-16-2018 at 09:14 PM. Reason: Tried FlexiPDF... |
|
12-17-2018, 11:43 AM | #23 |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Update 2 on the FlexiPDF program
So, sports fans:
I ran a test using FlexiPDF, and the short answer is: however Kim is using it, it has to be 100% different from what we do here. Not only did I have to screw around to get it to export the entire document, but every time I do, the resulting 62 MB Word file (from a 400-page, super-clean PDF novel) crashes Word, causing serious errors. Maybe it works better in simpler environments, but...
On those pages that I did manage to export without crashing, as a test, every paragraph is put inside a TABLE. I also did a "save as...Word" export of the same PDF from Acrobat, and sure, I got broken paragraphs, but I also got the body text styles, etc. The Acrobat export was far, far superior to the FlexiPDF export.
In short, again--Kim must be using it very, very, VERY differently than we do, because this is not a program that I'd use, having seen the results. I've tried probably 10 different "convert from PDF" programs, and honestly, this ranks amongst the worst. None of them are good, of course; that's the bottom line. But this one, putting each paragraph in a table? That's a whole new low.
I did buy the $25 TransTools suite; I figure if it's a disaster, I can afford to lose the money. I'm going to test it against the broken paras in the from-Acrobat export I did as part of the test. I'll report back on how it does with broken paragraphs, because that is a feature I could really use. Hitch |
12-17-2018, 01:34 PM | #24 | |
Imperfect Perfectionist
Posts: 464
Karma: 724664
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
That said, I just tried to export the text layer from an OCR'ed PDF I've got from the Royal Library of Copenhagen to Word, and using the standard settings, I'll admit it stinks. But if you press "Format" at the bottom right of the export dialogue, you can alter the standard settings. I removed everything except "Text Output" and "De-hyphenate" for the Word export, and got a nice Word doc with none of the issues you mention. Not perfect (because the OCR from the Royal Library is not perfect), but very, very usable. You'll probably have to fiddle with the settings to get exactly what you want, but I think you might be a little too quick to condemn FlexiPDF. Regards, Kim |
|
12-17-2018, 04:30 PM | #25 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Well, yes, exporting to plain text would certainly make things less complicated, but even the HTML export I tried was lame-ish. I shall try it again, to see what I get. I certainly don't want to send it to the Guillotine prematurely, but...we'll see! Hitch |
|
12-29-2018, 03:34 PM | #26 | |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Quote:
Also, I *never* use the export function for PDFs. I'd rather re-OCR them; the results are much better than an export. It was also mentioned that you can fine-tune the scene detection. |
|
12-29-2018, 04:29 PM | #27 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I did want to say, the Unbreaker in TransTools is simply AMAZEBALLS, and we all love it here. So, well done on that one! We are appreciative for the referral. LOVE IT. (Still not wild about the FlexiPDF, but that may just be a matter of taste.) Hitch |
|
12-30-2018, 03:22 PM | #28 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
It sounds very interesting, so I am going to take a look at it to see if I can take some of the ideas from it and implement them in the tools. Their language tool also looks interesting, and I wonder how they get certain information from Word.
|
12-30-2018, 03:44 PM | #29 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Typety-type-type-type (continuing sentence) ENTER ENTER typety-type-type-type-ENTER ENTER between lines--not paragraphs, and then does NOT hit another "enter" between paragraphs, it's deaf dumb and mute. (People who treat Word like a typewriter and type "manuscript style" with 2-line spacing, manually.) It's helpless, but so is every other automated program, even the ones we've written in-house. {shrug}. Without true AI, I don't know how anyone would address that one. Sure, you could look for lower-case letters, etc., but then, figuring out new paragraphs would be a mother. Ya know? (Well, yes, YOU do know, better than most.) Hitch |
|
12-31-2018, 06:36 PM | #30 | |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Quote:
Pepito Cleaner is not broken. The extension icon is not displayed in the toolbar, but you can open it with "Edition/Pepito cleaner" and it works as usual. 0.01 cent. |
|