MobileRead Forums > E-Book Formats > ePub

Old 05-16-2020, 06:38 AM   #1
Shohreh
Connoisseur
Posts: 50
Karma: 42650
Join Date: Jan 2016
Device: none
Recommended clean-up before HTML → EPUB?

Hello,

I'd like to concat a bunch of web pages into a single EPUB to read on my e-reader.

I tried pandoc, but it's very slow and pretty much freezes my computer, so I tried Calibre which at least kept my computer responsive:

Code:
copy /b *.html full.html

pandoc -o full.epub  full.html

"C:\Program Files\Calibre2\ebook-convert.exe" full.html full.epub
Regardless, how do you clean up HTML files before joining them into a single file? Any good practices?

"-h" returns a bewildering number of options.

Alternatively, what about first converting HTML files into simpler layouts (Markdown?) before joining them into a single file, and calling an HTML to EPUB converter?

Thank you.

Last edited by Shohreh; 05-16-2020 at 06:48 AM.
Old 05-16-2020, 08:08 AM   #2
Turtle91
A Hairy Wizard
Posts: 2,013
Karma: 12681704
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2 & Air/Surface Pro/Kindle PW
I think the answer would depend on how comfortable you are working with the raw HTML code...

If you are OK with it, then Sigil (and I'm pretty sure Calibre) has a 'merge' feature that will remove the separate headers/footers and leave the pages combined into a single file. That process works well if the CSS is similar, or if you make the CSS similar before merging.

If you are saying "what's raw HTML code?" then I would suggest leaving the pages separate. You can still bundle them into an ePub; there is no requirement to have everything as a single page. It is actually preferable to keep the files in an ePub separated logically, such as by chapter. Both Sigil and the Calibre editor handle this admirably. When you read the ePub with the pages as separate files, moving from one file to the next just takes a swipe/tap.
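
For anyone scripting this rather than using Sigil or the Calibre editor, here is a minimal sketch of what "separate files in one ePub" looks like under the hood. It assumes EPUB 3 and uses only Python's standard library. The function name build_epub, the OEBPS folder name and the placeholder identifier are illustrative choices, not anything from the tools mentioned above, and the output has not been checked with a validator.

```python
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

# Points the reader at the package document inside the archive.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(chapters, out_path, title="Untitled",
               book_id="urn:uuid:00000000-0000-0000-0000-000000000000"):
    """Bundle chapters into one EPUB, one XHTML file per chapter.

    chapters: list of (file_name, heading, xhtml_text) in reading order.
    Each xhtml_text is assumed to be a complete, well-formed XHTML page
    and each file_name a simple name such as "ch1.xhtml".
    """
    manifest = ['<item id="nav" href="nav.xhtml" '
                'media-type="application/xhtml+xml" properties="nav"/>']
    spine, toc = [], []
    for i, (name, heading, _) in enumerate(chapters):
        manifest.append(f'<item id="c{i}" href="{escape(name)}" '
                        'media-type="application/xhtml+xml"/>')
        spine.append(f'<itemref idref="c{i}"/>')  # spine order = reading order
        toc.append(f'<li><a href="{escape(name)}">{escape(heading)}</a></li>')

    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">{escape(book_id)}</dc:identifier>
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>{''.join(manifest)}</manifest>
  <spine>{''.join(spine)}</spine>
</package>
"""
    nav = f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol>{''.join(toc)}</ol></nav></body>
</html>
"""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry has to come first and be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/nav.xhtml", nav)
        for name, _, xhtml in chapters:
            zf.writestr(f"OEBPS/{name}", xhtml)
```

Calling build_epub([("ch1.xhtml", "Chapter 1", page1), ("ch2.xhtml", "Chapter 2", page2)], "full.epub") would write both pages as separate files, and the reader moves between them in spine order.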
Old 05-16-2020, 08:25 AM   #3
Shohreh
Connoisseur
Thanks. I'm used to working with HTML with Python.

I'm looking for a way to automate the process, and end up with pages that are as clean as possible on e-readers.

Can an EPUB contain multiple HTML pages?

--
Edit: Yup.

https://www.reddit.com/r/Calibre/com...a_single_epub/

https://manual.calibre-ebook.com/faq...specific-order

Last edited by Shohreh; 05-16-2020 at 08:35 AM.
Old 05-18-2020, 05:33 AM   #4
najgori
Klak
Posts: 140
Karma: 148812
Join Date: Sep 2011
Location: Belgrade, Serbia
Device: many
Before starting work on an EPUB, I prefer to clean the HTML down to basic tags without any styling.
Step 1: open the EPUB in the Calibre editor, delete all CSS files, then run the Remove unused CSS rules tool.
Step 2: use Custom cleaner plus in Sigil for the remaining bookmarks, ids...
From Sigil you can export the (X)HTML to an editor of your choice or continue working in the Sigil editor.
Old 05-19-2020, 07:02 AM   #5
Shohreh
Connoisseur
Thanks. I'll look into automating the process with a Python script.
Old 05-19-2020, 07:32 AM   #6
Doitsu
Wizard
Posts: 4,902
Karma: 16234093
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by Shohreh
Thanks. I'll look into automating the process with a Python script.
Have a look at BeautifulSoup. It's a Python library for manipulating (X)HTML files.
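
A rough sketch of what such a clean-up pass could look like with BeautifulSoup, assuming the bs4 package is installed. The function name clean_html and the DROP_TAGS and KEEP_ATTRS lists are hypothetical choices for illustration and would need adjusting to the pages being converted.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical lists for illustration; tune them for the site in question.
DROP_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "iframe"]
# Dropping "class" and "style" also discards any formatting done through CSS.
KEEP_ATTRS = {"href", "src", "alt", "title"}


def clean_html(html: str) -> str:
    """Return the body of a page with page chrome and most attributes removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove whole elements that rarely belong in a book.
    # Searching again after each removal avoids touching nested, already removed tags.
    tag = soup.find(DROP_TAGS)
    while tag is not None:
        tag.decompose()
        tag = soup.find(DROP_TAGS)

    # Remove HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Keep only a small whitelist of attributes on the remaining tags.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}

    # Return only what was inside <body>, or everything if there is no <body>.
    body = soup.body or soup
    return body.decode_contents()
```

The returned fragment can then be wrapped in a fresh XHTML skeleton, one per page, before handing the files to a converter.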
Old 05-19-2020, 10:10 AM   #7
BobC
Guru
Posts: 650
Karma: 2902178
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro +
Quote:
Originally Posted by najgori
Before starting work on an EPUB, I prefer to clean the HTML down to basic tags without any styling.
Step 1: open the EPUB in the Calibre editor, delete all CSS files, then run the Remove unused CSS rules tool.
Step 2: use Custom cleaner plus in Sigil for the remaining bookmarks, ids...
From Sigil you can export the (X)HTML to an editor of your choice or continue working in the Sigil editor.
If you delete all CSS files and then remove CSS rules, you may well have got rid of important formatting. Some books, instead of
Code:
<i> some italics </i>
will instead use something like
Code:
<span class="it"> some italics </span>
and use
Code:
.it {
 font-style: italic;
}
in styles.css or in css defined in the file header.

With a brute-force technique like the one suggested, all italics would be lost. There may also be "headers" that are just paragraphs styled to be centred, bold and larger than the main text.

Look before you carry out drastic surgery.
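
One way to "look first" in a script is to turn class-based italics into real <i> tags before the CSS is thrown away. The following is only a sketch, assuming simple class selectors like the .it rule above. The names italic_classes and spans_to_italic are made up for this example, and it uses Python's standard html.parser rather than any of the tools mentioned in this thread.

```python
import re
from html.parser import HTMLParser


def italic_classes(css):
    """Return class names whose rule declares font-style: italic (simple CSS only)."""
    found = set()
    for selector, body in re.findall(r"([^{}]+)\{([^}]*)\}", css):
        if re.search(r"font-style\s*:\s*italic", body):
            found.update(re.findall(r"\.([\w-]+)", selector))
    return found


class SpanToItalic(HTMLParser):
    """Re-emit the markup unchanged, except italic-class spans become <i>."""

    def __init__(self, classes):
        super().__init__(convert_charrefs=False)
        self.classes = classes
        self.out = []
        self.stack = []  # one flag per open <span>: True if rewritten to <i>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            names = (dict(attrs).get("class") or "").split()
            italic = any(n in self.classes for n in names)
            self.stack.append(italic)
            if italic:
                self.out.append("<i>")
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "span" and self.stack and self.stack.pop():
            self.out.append("</i>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def spans_to_italic(html, css):
    """Replace spans that the stylesheet renders in italics with <i> elements."""
    parser = SpanToItalic(italic_classes(css))
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The same idea would extend to bold text or to the paragraphs styled as headers, by mapping other declarations to <b> or heading tags before the stylesheet is deleted.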

BobC
Old 05-19-2020, 04:47 PM   #8
najgori
Klak
You are right about italics. I don't care about the headers, though.
All times are GMT -4.