Old 08-19-2019, 09:42 PM   #1
BetterRed
null operator (he/him)
Posts: 20,553
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
PDF to ePub conversion

Quote:
Originally Posted by DNSB View Post
. . . since PDF is one of the worst formats to convert from . . .


In the OP's last post he wrote "I converted the book from PDF with Sigil." Is that possible, and if so, how?


@joebob2 - most PDFs are created from something else - I've only known one person who wrote PostScript on a clean slate. They typically start life as WP or DTP files from programs such as Word, InDesign, Writer etc. If you can get hold of such a file, that might be a better place to start.

BR

Last edited by BetterRed; 08-20-2019 at 01:00 AM.
Old 08-19-2019, 11:37 PM   #2
DNSB
Bibliophagist
Posts: 35,311
Karma: 145435140
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by BetterRed View Post


In the OPs last post he wrote "I converted the book from PDF with Sigil." Is that possible - if so how?
I don't think it is possible to convert PDF to epub using Sigil. I did run into one author who attempted to copy/paste pages from one of her old books into BookView as that was the only electronic format for that book she was able to obtain when the rights reverted. The output of that was a right mess.
Old 08-20-2019, 12:59 AM   #3
BetterRed
null operator (he/him)
Posts: 20,553
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by DNSB View Post
I don't think it is possible to convert PDF to epub using Sigil. I did run into one author who attempted to copy/paste pages from one of her old books into BookView as that was the only electronic format for that book she was able to obtain when the rights reverted. The output of that was a right mess.
Ah yes, that one came up in a recent discussion. I don't think of page-by-page coffee/pasta as a conversion technique; I think of it as a 'there must be a better way than this' technique.

PDF conversion must be amongst the top 5 topics at MR.

BR
Old 08-20-2019, 11:36 AM   #4
joebob2a
Member
joebob2a began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jun 2019
Device: epub
Conversion maze

Quote:
Originally Posted by DNSB View Post
I don't think it is possible to convert PDF to epub using Sigil. I did run into one author who attempted to copy/paste pages from one of her old books into BookView as that was the only electronic format for that book she was able to obtain when the rights reverted. The output of that was a right mess.
This book has come to be through a pretty roundabout process, as you might suspect. It was originally written in M$ Word/OpenOffice, sucked into QuarkXPress, and then output as PDF in print form. Many corrections had happened between the original word processing files and the Quark files. I used a web utility (https://www.online-convert.com/) to get from PDF back to Word, but then I had the page headers and footers to worry about, not to mention typesetting issues like no space after periods and embedded hyphens. I seriously don't want to go back to that! I have the original Quark source, but I haven't found a conversion tool to get it out of that format.

On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it. I'm tinkering enough right now, I may go that direction.
Old 08-20-2019, 12:35 PM   #5
lumpynose
Wizard
Posts: 1,086
Karma: 6719822
Join Date: Jul 2012
Device: Palm Pilot M105
Quote:
Originally Posted by joebob2a View Post
On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it. I'm tinkering enough right now, I may go that direction.
That's what I've done when I've "transcribed" a short story from an old magazine when the PDF scans are on archive.org. Exceedingly tedious. In that case it probably has more errors, since the magazine has faded, the paper's brown, and the typesetting can be dodgy.
Old 08-20-2019, 12:59 PM   #6
jackie_w
Grand Sorcerer
Posts: 6,208
Karma: 16534692
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
Quote:
Originally Posted by BetterRed View Post
I don't think of page-by-page coffee/pasta as a conversion technique
This made me smile (it's been a slow day). Are you using predictive text by any chance?
Old 08-20-2019, 07:40 PM   #7
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by joebob2a View Post
This book has come to be through a pretty roundabout process, as you might suspect. It was originally written in M$ Word/OpenOffice, sucked into Quark Express, and then output as PDF in print form. Many corrections had happened between the original word processing files and the Quark files.
So the Quark file is the up-to-date version?

Quote:
Originally Posted by joebob2a View Post
I used a web utility [...] to get from PDF back to Word, but then I had the page header and footers to worry about, not to mention typesetting issues like no space after periods and embedded hyphens.
A more robust OCR program (like FineReader) would avoid most of those issues.

Quote:
Originally Posted by joebob2a View Post
I have the original Quark source, but I haven't found a conversion tool to get it out of that format.
What's the file extension on the Quark file? QXD?

Do you happen to know which version of Quark it used?

(And ~ when this book was published?)

I only worked on one QXD file many years ago, and surprisingly, LibreOffice was able to open it. It still required a lot of elbow grease, but it was a huge step up from having to OCR from scratch.

Quote:
Originally Posted by joebob2a View Post
On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it.
... no. Just no.

You lose all the important formatting information (bold/italics/superscript), and what's underneath the surface is just as important as the text itself.

And depending on how the PDF was put together, that copy/paste itself might introduce a massive amount of issues as well (like the hard hyphens issue you mentioned).

You'll spend more time cleaning up all those errors than if you just worked from much cleaner OCR in the first place.

Last edited by Tex2002ans; 08-20-2019 at 07:52 PM.
Old 08-20-2019, 07:45 PM   #8
BetterRed
null operator (he/him)
Posts: 20,553
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by joebob2a View Post
This book has come to be through a pretty roundabout process, as you might suspect. It was originally written in M$ Word/OpenOffice, sucked into Quark Express, and then output as PDF in print form. Many corrections had happened between the original word processing files and the Quark files. I used a web utility (https://www.online-convert.com/) to get from PDF back to Word, but then I had the page header and footers to worry about, not to mention typesetting issues like no space after periods and embedded hyphens. I seriously don't want to go back to that! I have the original Quark source, but I haven't found a conversion tool to get it out of that format.

On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it. I'm tinkering enough right now, I may go that direction.
I had QuarkXPress in my "Word, InDesign, Writer" list - but I took it out on the basis of 'surely not'.

You can open PDF files directly in MS Word 2016/19, and the result can be surprisingly good - but I suspect that's because the documents I'm thinking of were originally typed into Word by someone who didn't regard it as a Remington portable. An ex-QuarkXPress PDF might not fare so well.

BR
Old 08-20-2019, 07:50 PM   #9
BetterRed
null operator (he/him)
Posts: 20,553
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by jackie_w View Post
This made me smile (it's been a slow day). Are you using predictive text by any chance?
No, none of that AI crap - I turn it off. IIRC the coffee/pasta wordplay comes from my MIT/DEC days, along with bang, crunch, snail and hat.

BR
Old 08-22-2019, 12:16 PM   #10
joebob2a
Member
joebob2a began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jun 2019
Device: epub
Grinding through it

Quote:
Originally Posted by Tex2002ans View Post
So the Quark file is the up-to-date version?
No, I've already invested significant time in cleaning up the PDF. The epub version is pretty close to where I want it, but it has all these technical issues that the validators don't like.

Quote:
What's the file extension on the Quark file? QXD?

Do you happen to know which version of Quark it used?
The source files are .qxd files. I know it was generated on a Mac. Unknown as to version, but it's more than ten years old.

Quote:
(And ~ when this book was published?)
It went to print in 2009, just as the e-book revolution was turning the corner. I'm working on an e-book version because there's a surge in demand, and I just want it out there.


Quote:
I only worked on one QXD file many years ago, and surprisingly, LibreOffice was able to open it. It still required a lot of elbow grease, but it was a huge step up from having to OCR from scratch.

... no. Just no.
Amen to the No. LibreOffice wanted to turn the PDF files into graphics -- each page an image. The QXP files looked like random bits in LibreOffice.

Quote:
And depending on how the PDF was put together, that copy/paste itself might introduce a massive amount of issues as well (like the hard hyphens issue you mentioned).

You'll spend more time cleaning up all those errors than if you just worked from much cleaner OCR in the first place.
At this point, I'm just looking for something to fix the validation errors. I'm tempted to edit the html files in a text editor with group replace to correct the flagged errors, but I need to know the correct replacement for each of those errors. I had an earlier post talking about how Sigil was having trouble consolidating HTML files. Calibre was able to merge the files without breaking things, so that's now a viable option. I now have one html file for each of the eight major chapters, as opposed to dozens.

What's surprising to me is that there are all these great conversion utilities, yet nothing that addresses the validator errors.

Thanks again for all the help. I'll keep plugging on this.
Old 08-22-2019, 02:31 PM   #11
joebob2a
Member
joebob2a began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jun 2019
Device: epub
Wading through the errors

I'm still awaiting moderation for another post, but in the meantime I've tried tackling the errors as a group. Here's what I've found, plus a question:

I was getting errors because the html files were named html rather than xhtml. Consolidating the files (thanks, Calibre) made that a simple matter. So those are gone.
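For anyone making the same html-to-xhtml rename by hand, the references in the OPF manifest, the toc.ncx, and any internal links have to follow the renamed files. A minimal sketch in Python, assuming all local references appear as plain `href="..."` or `src="..."` attributes; the function name and the regular expression are illustrative, not part of Sigil or Calibre:

```python
import re

# Matches href/src attributes that point at a local .html file, with or
# without a #fragment. Values that start with a URL scheme (http:, mailto:)
# are skipped so external links are left alone.
LOCAL_HTML_REF = re.compile(
    r'((?:href|src)="(?![a-zA-Z][a-zA-Z0-9+.-]*:)[^"#]*?)\.html(?=["#])'
)

def point_refs_at_xhtml(text: str) -> str:
    """Rewrite local references ending in .html so they end in .xhtml."""
    return LOCAL_HTML_REF.sub(r"\1.xhtml", text)

manifest_line = '<item id="c1" href="Text/ch1.html" media-type="application/xhtml+xml"/>'
print(point_refs_at_xhtml(manifest_line))
```

This only rewrites the references; the files themselves still have to be renamed to match.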

Next, I have many instances of this error:
Error while parsing file: element "h3" not allowed here; expected the element end-tag, text or element "a", "abbr", "area", "audio", "b", "bdi", "bdo", "br", "button", "canvas", "cite", "code", "command", "datalist", "del", "dfn", "em", "embed", "epub:switch", "i", "iframe", "img", "input", "ins", "kbd", "keygen", "label", "link", "map", "mark", "meta", "meter", "ns1:math", "ns2:svg", "object", "output", "progress", "q", "ruby", "s", "samp", "script", "select", "small", "span", "strong", "sub", "sup", "textarea", "time", "u", "var", "video" or "wbr" (with xmlns:ns1="http://www.w3.org/1998/Math/MathML" xmlns:ns2="http://www.w3.org/2000/svg")
In the code, I saw that Italic tags were outside the H3 tags like this:

Code:
<i> 
<h3 id="sigil_toc_id_106">August 17, 1984 </h3>
</i>
Moving the Italic tags inside the H3 made the error disappear. I believe Sigil generated the code. That would seem to be a pretty easy fix in the Sigil code, but it's going to be a chore doing it manually. But at least I know what it is.
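The manual fix described above can be scripted. This is a rough sketch in Python, assuming every problem case looks like the snippet shown (an `<i>` element whose only content is a single heading); the function name and pattern are made up for illustration, and it is worth running on a copy of the files first:

```python
import re

# Matches an <i> element whose only content is a single heading (h1-h6).
WRAPPED_HEADING = re.compile(
    r"<i>\s*<(h[1-6])([^>]*)>(.*?)</\1>\s*</i>",
    re.DOTALL,
)

def move_italics_inside_headings(xhtml: str) -> str:
    """Rewrite <i><hN ...>text</hN></i> as <hN ...><i>text</i></hN>."""
    def fix(match):
        tag, attrs, inner = match.group(1), match.group(2), match.group(3)
        return f"<{tag}{attrs}><i>{inner.strip()}</i></{tag}>"
    return WRAPPED_HEADING.sub(fix, xhtml)

sample = '<i> \n<h3 id="sigil_toc_id_106">August 17, 1984 </h3>\n</i>'
fixed = move_italics_inside_headings(sample)
print(fixed)  # <h3 id="sigil_toc_id_106"><i>August 17, 1984</i></h3>
```

The heading keeps its `id`, so existing TOC links to it still resolve. Italic tags that wrap anything other than a lone heading are left untouched.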
Old 08-22-2019, 06:44 PM   #12
JSWolf
Resident Curmudgeon
Posts: 73,897
Karma: 128597114
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by joebob2a View Post
This book has come to be through a pretty roundabout process, as you might suspect. It was originally written in M$ Word/OpenOffice, sucked into Quark Express, and then output as PDF in print form. Many corrections had happened between the original word processing files and the Quark files. I used a web utility (https://www.online-convert.com/) to get from PDF back to Word, but then I had the page header and footers to worry about, not to mention typesetting issues like no space after periods and embedded hyphens. I seriously don't want to go back to that! I have the original Quark source, but I haven't found a conversion tool to get it out of that format.

On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it. I'm tinkering enough right now, I may go that direction.
What you need to do is go back to whoever exported that PDF from QuarkXPress and get that exported as ePub (Quark does export to ePub). Then you use Sigil or Calibre to fix it up from there. That's the best solution possible.
Old 08-22-2019, 07:13 PM   #13
joebob2a
Member
joebob2a began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jun 2019
Device: epub
Technical cleanup?

Quote:
Originally Posted by JSWolf View Post
What you need to do is go back to whoever exported that PDF from QuarkExpress and get that exported as ePub (Qurk does export to ePub). Then you use Sigil or Calibre to fix it up from there. That's the best solution possible.
Unfortunately, that's not really an option at this point. I also have little faith that the niggling technical errors will be corrected. It appears that Sigil and Calibre both focus on the visual presentation, which is great as far as it goes, but the technical issues that will keep a book out of Smashwords and Google Play relate to structure. If there is a tool -- commercial is an option -- to clean those up, I'm interested. It appears that the EPUB validator stops reading when it hits some errors, so the list expands and contracts based on what kind of errors show up.

Ideas welcome!
Old 08-22-2019, 09:05 PM   #14
joebob2a
Member
joebob2a began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jun 2019
Device: epub
Just following up. It appears that I've resolved many of those errors just looking at the HTML format. However, now I'm getting errors in files related to the TOC. This is the only error type I have:

Code:
Type   File           Line  Position  Message
ERROR  OEBPS/toc.ncx  23    63        Fragment identifier is not defined.
The toc.ncx is an xml file. This is what that line looks like:
Code:
        </navLabel>
        <content src="Text/Pt0_Intro.xhtml#sigil_toc_id_282"/>
      </navPoint>
There is a similar line for every TOC entry. I've done everything I can think of, including blowing away the TOC and regenerating it. These are the only errors that the EPUB Validator throws. So I guess I'm closer.
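That message usually means the part after the `#` in a `src` has no matching `id` in the target file. One way to list the offending entries is sketched below in Python. It assumes the NCX text and the content files have already been read into memory, and the helper name `undefined_fragments` is hypothetical. The `id` lookup is a simple regex rather than a full XHTML parse.

```python
import re
import xml.etree.ElementTree as ET

NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"

def undefined_fragments(ncx_text, files):
    """Return NCX src values whose #fragment has no matching id in the target file.

    `files` maps paths as written in the NCX (e.g. "Text/Pt0_Intro.xhtml")
    to the text of that file.
    """
    missing = []
    root = ET.fromstring(ncx_text)
    for content in root.iter(NCX_NS + "content"):
        src = content.get("src", "")
        path, _, fragment = src.partition("#")
        if not fragment:
            continue  # entry points at the whole file, nothing to check
        ids = set(re.findall(r'\bid="([^"]+)"', files.get(path, "")))
        if fragment not in ids:
            missing.append(src)
    return missing

ncx = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="navPoint-1" playOrder="1">
      <navLabel><text>Introduction</text></navLabel>
      <content src="Text/Pt0_Intro.xhtml#sigil_toc_id_282"/>
    </navPoint>
  </navMap>
</ncx>"""
files = {"Text/Pt0_Intro.xhtml": '<h3 id="sigil_toc_id_106">August 17, 1984</h3>'}
print(undefined_fragments(ncx, files))  # ['Text/Pt0_Intro.xhtml#sigil_toc_id_282']
```

If every entry shows up as missing, the heading ids were probably dropped or renumbered during an earlier cleanup step, and the fix belongs in the XHTML files rather than in the toc.ncx.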

Good night!
Old 08-22-2019, 11:21 PM   #15
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by joebob2a View Post
If there is a tool -- commercial is an option -- to clean those up, I'm interested.
Contact somebody who has Quark and can deal with QXD files directly... or contact somebody who can convert the PDF more cleanly than your current methods...

There are professionals around here...

Did you try opening the QXD with LibreOffice? Does it open it? Or is the QXD created in a newer Quark?