02-19-2010, 04:24 AM   #1
iodine9176
Junior Member
iodine9176 began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Feb 2010
Device: stanza
Make chapters for a document/HTML

Does anyone have ideas on how to make chapters for a master document? That is, if I have a PDF or DOC file, how can I split it into separate TXT files, one per chapter?

02-19-2010, 07:36 AM   #2
charleski
Wizard
charleski ought to be getting tired of karma fortunes by now.
 
Posts: 1,188
Karma: 727236
Join Date: Sep 2009
Device: PRS-505
What are you using to create your ePubs?

Atlantis Word Processor automatically splits the output ePub files when it comes across a paragraph with a Heading style (and will insert that paragraph into the ToC). Sigil allows you to choose exactly where to split the file by inserting a chapter break mark (elements tagged with <h1>, <h2>, etc. will be inserted into the ToC independently of the chapter breaks).
02-19-2010, 07:28 PM   #3
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Quote:
Originally Posted by iodine9176
Does anyone have ideas on how to make chapters for a master document? That is, if I have a PDF or DOC file, how can I split it into separate TXT files, one per chapter?

OK, I'm rather confused. The title of the thread is about HTML documents, and it's in the ePub forum. But then in your post, you ask about splitting PDF and DOC files into TXT files? Why would you want to change PDF or DOC to TXT? What does this have to do with HTML or ePub? Please be more specific about what you're trying to do.

I usually use <H2>...</H2> tags for chapter headings in my HTML code (or perhaps <H2 class="chaptertitle">...</H2>, etc.). Calibre allows you to set the XPath expression for chapter detection, but if memory serves, its default setting will pick up H2 tags. It'll do the splitting for you, at least with normal settings.
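
For anyone who wants to see the idea of splitting on heading tags outside Calibre, here is a minimal Python sketch. It is not Calibre's method: the function name `split_at_headings` is made up for illustration, it uses a plain regular expression rather than a real HTML parser, and it assumes every chapter starts with an `<h2>` tag (Python 3.7 or later is needed for splitting on an empty match).

```python
import re

# Matches the empty position just before any opening <h2 ...> tag,
# regardless of case. This is an illustrative assumption about the markup.
CHAPTER_START = re.compile(r"(?=<h2\b)", re.IGNORECASE)


def split_at_headings(body_html):
    """Split an HTML fragment into chunks, one per <h2> heading.

    Anything before the first <h2> (front matter) becomes its own chunk
    if it is not blank.
    """
    chunks = CHAPTER_START.split(body_html)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = (
    "<p>Title page</p>"
    '<H2 class="chaptertitle">Chapter 1</H2><p>First text.</p>'
    "<h2>Chapter 2</h2><p>Second text.</p>"
)

for number, chunk in enumerate(split_at_headings(sample)):
    print(number, chunk)
```

Each chunk could then be wrapped in its own `<html>`/`<body>` shell and saved as a separate file. A regex split like this only suits simple, predictable markup.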
02-20-2010, 05:48 AM   #4
iodine9176
Junior Member
iodine9176 began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Feb 2010
Device: stanza
Quote:
Originally Posted by frabjous
OK, I'm rather confused. The title of the thread is about HTML documents, and it's in the ePub forum. But then in your post, you ask about splitting PDF and DOC files into TXT files? Why would you want to change PDF or DOC to TXT? What does this have to do with HTML or ePub? Please be more specific about what you're trying to do.

I usually use <H2>...</H2> tags for chapter headings in my HTML code (or perhaps <H2 class="chaptertitle">...</H2>, etc.). Calibre allows you to set the XPath expression for chapter detection, but if memory serves, its default setting will pick up H2 tags. It'll do the splitting for you, at least with normal settings.

Thx for your reply.
Actually, I am using eCub as a compiler to generate the ePub file. eCub can only import plain TXT or HTML files, so I wondered whether there is a way to mark chapters in a PDF or DOC so that I can then use eCub to compile the ePub. What I am really more interested in, though, is how to split a PDF or DOC into separate XHTML files, one per chapter.
02-20-2010, 08:15 AM   #5
charleski
Wizard
charleski ought to be getting tired of karma fortunes by now.
 
Posts: 1,188
Karma: 727236
Join Date: Sep 2009
Device: PRS-505
Here's a very simple Word macro to select chapters and paste them into a new document. Place the cursor at the start of the first chapter, run it, save the new doc, run again, etc.

It assumes that chapters are properly marked with one paragraph at the start which has the Heading 1 style. You may need to edit it to adapt it to the particular formatting your document uses (if you have problems with that, then you're probably better off just using cut-and-paste).

Code:
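Sub Macro6()
'
' Extract Chapter
'
'
    ' Select past the current heading so the search below finds the NEXT one.
    Selection.MoveRight Unit:=wdSentence, Count:=2, Extend:=wdExtend
    ' Turn on extend mode so the Find is meant to grow the selection.
    Selection.Extend
    Selection.Find.ClearFormatting
    Selection.Find.Style = ActiveDocument.Styles("Heading 1")

    ' Search forward for any text in the Heading 1 style.
    With Selection.Find
        .Text = ""
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute
    Selection.ExtendMode = False
    ' Pull the end of the selection back so the next heading is left out.
    Selection.MoveLeft Unit:=wdSentence, Count:=2, Extend:=wdExtend
    Selection.Copy
    ' Collapse the selection and step forward, ready for the next run.
    Selection.MoveRight Unit:=wdSentence, Count:=1
    ' Paste the copied chapter into a new, empty document.
    Application.Documents.Add
    Selection.Paste
    
End Sub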
02-20-2010, 11:25 AM   #6
brewt
Boo-Frickety-Hoo-Erizer
brewt will become famous soon enough
 
Posts: 254
Karma: 686
Join Date: Oct 2007
Device: SONY PRS 350!
Yay! Someone else brought up Word!

My turn:

This macro will detect the word "Chapter", change it to H1, and add a line above it:

Code:
Sub Chapters()
'
' Chapters Macro
' Macro recorded 7/30/2008 by brewt
'
    ' Find the next paragraph that begins with "Chapter ".
    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = "^pChapter "
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute
    ' Go to the start of the chapter line and insert an empty paragraph above it.
    Selection.MoveRight Unit:=wdCharacter, Count:=1
    Selection.HomeKey Unit:=wdLine
    Selection.MoveLeft Unit:=wdCharacter, Count:=1
    Selection.MoveRight Unit:=wdCharacter, Count:=1
    Selection.TypeParagraph
    ' Put a horizontal line in the new empty paragraph.
    Selection.MoveUp Unit:=wdLine, Count:=1
    Selection.ClearFormatting
    Selection.InlineShapes.AddHorizontalLineStandard
    ' Style the chapter line as Heading 1.
    Selection.MoveDown Unit:=wdLine, Count:=1
    Selection.Style = ActiveDocument.Styles("Heading 1")
    ' Recorded settings for the Heading 1 style itself (re-applied on every run).
    With ActiveDocument.Styles("Heading 1").ParagraphFormat
        .LeftIndent = InchesToPoints(0)
        .RightIndent = InchesToPoints(0)
        .SpaceBefore = 12
        .SpaceBeforeAuto = False
        .SpaceAfter = 3
        .SpaceAfterAuto = False
        .LineSpacingRule = wdLineSpaceSingle
        .Alignment = wdAlignParagraphJustify
        .WidowControl = True
        .KeepWithNext = True
        .KeepTogether = False
        .PageBreakBefore = True
        .NoLineNumber = False
        .Hyphenation = False
        .FirstLineIndent = InchesToPoints(0)
        .OutlineLevel = wdOutlineLevel1
        .CharacterUnitLeftIndent = 0
        .CharacterUnitRightIndent = 0
        .CharacterUnitFirstLineIndent = 0
        .LineUnitBefore = 0
        .LineUnitAfter = 0
    End With
    ActiveDocument.Styles("Heading 1").NoSpaceBetweenParagraphsOfSameStyle = _
        False
    With ActiveDocument.Styles("Heading 1")
        '.AutomaticallyUpdate = True
        .BaseStyle = "Normal"
        .NextParagraphStyle = "Normal"
    End With
End Sub
Once you've gotten through the content with that, use Word's "Table of Contents" function to put the TOC where you want it.

Calibre, eCub, and Sigil will recognize and use the Word-generated TOC.

This proves quick & easy if you assign the macro to a keyboard shortcut.

-bjc
02-20-2010, 01:57 PM   #7
iodine9176
Junior Member
iodine9176 began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Feb 2010
Device: stanza
Quote:
Originally Posted by charleski
Here's a very simple Word macro to select chapters and paste them into a new document. Place the cursor at the start of the first chapter, run it, save the new doc, run again, etc.

It assumes that chapters are properly marked with one paragraph at the start which has the Heading 1 style. You may need to edit it to adapt it to the particular formatting your document uses (if you have problems with that, then you're probably better off just using cut-and-paste).

Thx for your macro!!!!
It is really cool. I had also written a similar macro, but yours is better than mine. Really, thx for your macro.

But I have run into a new problem: the document I am dealing with has images in it, so I can't separate the chapters and save them as TXT; I need to save them as XHTML.
But Word doesn't seem to be able to save XHTML files. I've tried saving as HTML, but I can't import those .htm files into eCub.

Is there any way to save a DOC as XHTML?
02-20-2010, 02:47 PM   #8
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Quote:
Originally Posted by iodine9176

Is there any way to save a DOC as XHTML?

You can save it as (filtered) HTML, which should be good enough for ePub. If for some reason you need it as XHTML, just open the HTML and change the doctype at the top:

e.g., from

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

to
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

and rename the file to end in .xhtml.

There's nothing in HTML that can't be used in XHTML.
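
One caveat worth checking after the doctype swap: XHTML has to be well-formed XML, so markup that browsers tolerate in HTML (unquoted attribute values, unclosed tags) will be rejected by a strict parser. The Python sketch below is only an illustration under that assumption; `swap_doctype` and `is_well_formed` are hypothetical helper names, and the check uses the standard library XML parser, not an ePub validator.

```python
import re
import xml.etree.ElementTree as ET

XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)


def swap_doctype(html_text):
    """Replace the first DOCTYPE declaration with the XHTML 1.0 Transitional one."""
    return re.sub(
        r"<!DOCTYPE[^>]*>", XHTML_DOCTYPE, html_text, count=1, flags=re.IGNORECASE
    )


def is_well_formed(xhtml_text):
    """Return True if the text parses as XML (well-formed, not necessarily valid)."""
    try:
        ET.fromstring(xhtml_text)
        return True
    except ET.ParseError:
        return False


good = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
    '"http://www.w3.org/TR/html4/loose.dtd">\n'
    '<html><body><p class="x">Hi</p></body></html>'
)
# Same document, but with an unquoted attribute value.
bad = good.replace('class="x"', "class=x")

print(is_well_formed(swap_doctype(good)))
print(is_well_formed(swap_doctype(bad)))
```

If the second kind of file turns up, the markup needs repairing before the renamed .xhtml file will parse.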
02-20-2010, 03:08 PM   #9
charleski
Wizard
charleski ought to be getting tired of karma fortunes by now.
 
Posts: 1,188
Karma: 727236
Join Date: Sep 2009
Device: PRS-505
Saving them as filtered HTML should be fine, but you'll need to make a few changes afterwards. Open the files in Notepad++ and check the head at the top of the file. If you have
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
there, then Word saved it in ANSI encoding. Go to the Format menu and select 'Convert to UTF-8', then change 'windows-1252' to 'utf-8' in that meta tag. You'll also probably want to delete all the unneeded @font-face definitions Word will have inserted at the top.
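
The same two steps (re-encode the file, then fix the charset in the meta tag) can be sketched in Python for anyone who prefers a script to Notepad++. This is a minimal sketch, assuming the file really is windows-1252 as the meta tag claims; the function names `to_utf8_text` and `convert_file` are made up for illustration.

```python
import re
from pathlib import Path

# The charset declaration Word writes into the meta tag.
OLD_CHARSET = re.compile(r"charset=windows-1252", re.IGNORECASE)


def to_utf8_text(raw_bytes):
    """Decode windows-1252 bytes and point the meta tag at utf-8."""
    text = raw_bytes.decode("cp1252")
    return OLD_CHARSET.sub("charset=utf-8", text)


def convert_file(source, target):
    """Read a windows-1252 HTML file and write a UTF-8 copy to `target`."""
    fixed = to_utf8_text(Path(source).read_bytes())
    Path(target).write_text(fixed, encoding="utf-8")


raw = (
    '<meta http-equiv=Content-Type content="text/html; charset=windows-1252">'
    "<p>caf\u00e9</p>"
).encode("cp1252")
print(to_utf8_text(raw))
```

Writing to a separate target file keeps the original export untouched in case the encoding guess was wrong.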
02-21-2010, 01:16 AM   #10
iodine9176
Junior Member
iodine9176 began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Feb 2010
Device: stanza
Quote:
Originally Posted by charleski
Saving them as filtered HTML should be fine, but you'll need to make a few changes afterwards. Open the files in Notepad++ and check the head at the top of the file. If you have
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
there, then Word saved it in ANSI encoding. Go to the Format menu and select 'Convert to UTF-8', then change 'windows-1252' to 'utf-8' in that meta tag. You'll also probably want to delete all the unneeded @font-face definitions Word will have inserted at the top.

That's exactly the problem I faced: I saved the files as filtered HTML, but they couldn't be imported into eCub. Thx for the solution. But is there any way to tidy up the HTML files? They are too messy, with too much unwanted information.
02-21-2010, 09:33 AM   #11
charleski
Wizard
charleski ought to be getting tired of karma fortunes by now.
 
Posts: 1,188
Karma: 727236
Join Date: Sep 2009
Device: PRS-505
Actually, I realised that Word's HTML export has a few other flaws. You'll probably want to run the HTML through HTML Tidy or something similar to fix them (mostly, Word fails to put quotes around attribute values). Notepad++'s TextFX plugin can do the HTML Tidy job for you.

Word does add a lot of needless fluff, like spans to define the language. Those are a pain to remove in Notepad++, as its regex engine doesn't handle newlines or non-greedy matches. Sigil, OTOH, has a regex engine that will remove them easily: set regular expression and minimal matching, with the Find string
<span xml:lang="EN-US" lang="EN-US">(.*)</span>
Replace string
\1
Sigil also automatically does the HTML Tidy xhtml conversion for you.
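
The same find-and-replace can be approximated in Python, shown here as a sketch rather than a description of what Sigil does internally. The non-greedy `(.*?)` stands in for "minimal matching", and `re.DOTALL` lets the match run across newlines; the function name `strip_lang_spans` is invented for this example.

```python
import re

# Non-greedy capture so each opening span pairs with the nearest </span>;
# DOTALL lets "." match newlines inside the span.
LANG_SPAN = re.compile(
    r'<span xml:lang="EN-US" lang="EN-US">(.*?)</span>',
    re.DOTALL,
)


def strip_lang_spans(html_text):
    """Remove the language spans but keep the text they wrapped."""
    return LANG_SPAN.sub(r"\1", html_text)


sample = (
    '<p><span xml:lang="EN-US" lang="EN-US">One\ntwo</span> and '
    '<span xml:lang="EN-US" lang="EN-US">three</span></p>'
)
print(strip_lang_spans(sample))
```

Note that if a language span contains another span, the nearest `</span>` belongs to the inner one, so a regex like this would leave mismatched tags. It is only safe when the language spans wrap plain text.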
02-21-2010, 09:37 AM   #12
frabjous
Wizard
frabjous can solve quadratic equations while standing on his or her head reciting poetry in iambic pentameter
 
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Also consider installing AbiWord or OpenOffice and using one of those to convert the .doc/.docx to .html. You can also convert through Google Docs. All are free, and do a better job. Switching away from reliance on Microsoft is always good for society. That company has too much power for anyone's good.
02-23-2010, 03:24 PM   #13
iodine9176
Junior Member
iodine9176 began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Feb 2010
Device: stanza
Quote:
Originally Posted by charleski
Actually, I realised that Word's HTML export has a few other flaws. You'll probably want to run the HTML through HTML Tidy or something similar to fix them (mostly, Word fails to put quotes around attribute values). Notepad++'s TextFX plugin can do the HTML Tidy job for you.

Word does add a lot of needless fluff, like spans to define the language. Those are a pain to remove in Notepad++, as its regex engine doesn't handle newlines or non-greedy matches. Sigil, OTOH, has a regex engine that will remove them easily: set regular expression and minimal matching, with the Find string
<span xml:lang="EN-US" lang="EN-US">(.*)</span>
Replace string
\1
Sigil also automatically does the HTML Tidy xhtml conversion for you.

I agree that Sigil is good for tidying up and converting the HTML files. However, the problem is that I have many chapters in my doc, and it is hard to convert and tidy up the HTML files one by one. Are there any programmatic methods that can do all the conversion and tidying work together?
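
One possible shape for such a batch job, offered as a sketch and not as anything posted in this thread: a small Python driver that walks a folder of Word's .htm exports and applies a list of clean-up functions to each file. The name `clean_folder`, the `*.htm` pattern, and the windows-1252 input encoding are all assumptions; the clean-up steps themselves are supplied by the caller.

```python
import tempfile
from pathlib import Path


def clean_folder(folder, steps, pattern="*.htm"):
    """Apply each text -> text function in `steps` to every matching file.

    Files are read as windows-1252 (assumed, as for Word's filtered HTML)
    and written next to the originals as UTF-8 files ending in .xhtml.
    Returns the list of files written.
    """
    written = []
    for source in sorted(Path(folder).glob(pattern)):
        text = source.read_bytes().decode("cp1252")
        for step in steps:
            text = step(text)
        target = source.with_suffix(".xhtml")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


def fix_charset(text):
    """Example step: update the charset named in the meta tag."""
    return text.replace("charset=windows-1252", "charset=utf-8")


# Demonstration in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "chapter1.htm"
    demo.write_bytes(b"<p>charset=windows-1252</p>")
    for path in clean_folder(tmp, [fix_charset]):
        print(path.name, path.read_text(encoding="utf-8"))
```

Renaming to .xhtml does not by itself make the markup well-formed; a repair pass such as HTML Tidy would still be needed as one of the steps or as a separate run.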

All times are GMT -4.