Make chapters for a document/HTML


iodine9176
02-19-2010, 03:24 AM
Does anyone have ideas on how to make chapters for a master document? That is, if I have a PDF or DOC, how can I split it into separate TXT files, one per chapter?

:):)

charleski
02-19-2010, 06:36 AM
What are you using to create your ePubs?

Atlantis Word Processor automatically splits the output ePub files when it comes across a paragraph with a Heading style (and will insert that paragraph into the ToC). Sigil allows you to choose exactly where to split the file by inserting a chapter break mark (elements tagged with <h1>, <h2>, etc. will be inserted into the ToC independently of the chapter breaks).

frabjous
02-19-2010, 06:28 PM
OK, I'm rather confused. The title of the thread is about HTML documents, and it's in the ePub forum. But then in your post, you ask about splitting PDF and DOC files into TXT files? Why would you want to change PDF or DOC to TXT? What does this have to do with HTML or ePub? Please be more specific about what you're trying to do.

I usually use <H2>...</H2> tags for chapter headings in my HTML code (or perhaps <H2 class="chaptertitle">...</H2>, etc.). Calibre allows you to set the XPath expression for chapter detection, but if memory serves, its default setting will pick up H2 tags. It'll do the splitting for you, at least with normal settings.
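
If you would rather do the split yourself than leave it to a converter, the idea can be sketched in a few lines of Python. This is only an illustration, not how Calibre does it: the function name split_at_h2 is made up, Python is not mentioned anywhere in this thread, and the sketch assumes every chapter starts with an <h2> tag and that no <h2> appears inside a comment or script.

```python
import re

# Matches the opening of any <h2> tag, with or without attributes, any case.
H2_OPEN = re.compile(r"<h2\b", re.IGNORECASE)

def split_at_h2(body_html):
    """Return a list of chunks, each one starting at an <h2> tag.

    Anything before the first <h2> (title page, front matter) is kept
    as its own leading chunk unless it is only whitespace.
    """
    starts = [m.start() for m in H2_OPEN.finditer(body_html)]
    bounds = [0] + starts + [len(body_html)]
    chunks = [body_html[a:b] for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk.strip()]
```

Each chunk still needs its own <html>/<head>/<body> wrapper before it is a usable file; that part is left out here.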

iodine9176
02-20-2010, 04:48 AM
Thanks for your reply.
Actually, I am using eCub as a compiler to generate the ePub file. eCub can only import plain TXT files or HTML files, so I wondered whether there is a way to make chapters from a PDF or DOC that I can then compile with eCub. What I am really interested in, though, is how to split a PDF or DOC into separate XHTML files, one per chapter.
:):)

charleski
02-20-2010, 07:15 AM
Here's a very simple Word macro to select chapters and paste them into a new document. Place the cursor at the start of the first chapter, run it, save the new doc, run again, etc.

It assumes that chapters are properly marked with one paragraph at the start which has the Heading 1 style. You may need to edit it to adapt it to the particular formatting your document uses (if you have problems with that, then you're probably better off just using cut-and-paste).


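Sub Macro6()
'
' Extract Chapter
'
' Run with the cursor at the start of a chapter whose first
' paragraph has the Heading 1 style.
'
    ' Select past the current heading, then switch on extend mode.
    Selection.MoveRight Unit:=wdSentence, Count:=2, Extend:=wdExtend
    Selection.Extend
    Selection.Find.ClearFormatting
    Selection.Find.Style = ActiveDocument.Styles("Heading 1")

    ' Empty search text with Format = True: search by style only.
    With Selection.Find
        .Text = ""
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute
    Selection.ExtendMode = False
    ' Pull the selection back so the next heading is left out, then copy.
    Selection.MoveLeft Unit:=wdSentence, Count:=2, Extend:=wdExtend
    Selection.Copy
    Selection.MoveRight Unit:=wdSentence, Count:=1
    ' Paste the copied chapter into a new, empty document.
    Application.Documents.Add
    Selection.Paste

End Sub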

brewt
02-20-2010, 10:25 AM
Yay! Someone else brought up Word!

My turn:

This macro will detect the word "Chapter", change it to H1, and add a line above it:


Sub Chapters()
'
' Chapters Macro
' Macro recorded 7/30/2008 by brewt
'
    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = "^pChapter "
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute
    Selection.MoveRight Unit:=wdCharacter, Count:=1
    Selection.HomeKey Unit:=wdLine
    Selection.MoveLeft Unit:=wdCharacter, Count:=1
    Selection.MoveRight Unit:=wdCharacter, Count:=1
    Selection.TypeParagraph
    Selection.MoveUp Unit:=wdLine, Count:=1
    Selection.ClearFormatting
    Selection.InlineShapes.AddHorizontalLineStandard
    Selection.MoveDown Unit:=wdLine, Count:=1
    Selection.Style = ActiveDocument.Styles("Heading 1")
    With ActiveDocument.Styles("Heading 1").ParagraphFormat
        .LeftIndent = InchesToPoints(0)
        .RightIndent = InchesToPoints(0)
        .SpaceBefore = 12
        .SpaceBeforeAuto = False
        .SpaceAfter = 3
        .SpaceAfterAuto = False
        .LineSpacingRule = wdLineSpaceSingle
        .Alignment = wdAlignParagraphJustify
        .WidowControl = True
        .KeepWithNext = True
        .KeepTogether = False
        .PageBreakBefore = True
        .NoLineNumber = False
        .Hyphenation = False
        .FirstLineIndent = InchesToPoints(0)
        .OutlineLevel = wdOutlineLevel1
        .CharacterUnitLeftIndent = 0
        .CharacterUnitRightIndent = 0
        .CharacterUnitFirstLineIndent = 0
        .LineUnitBefore = 0
        .LineUnitAfter = 0
    End With
    ActiveDocument.Styles("Heading 1").NoSpaceBetweenParagraphsOfSameStyle = _
        False
    With ActiveDocument.Styles("Heading 1")
        '.AutomaticallyUpdate = True
        .BaseStyle = "Normal"
        .NextParagraphStyle = "Normal"
    End With
End Sub


Once you've gotten through the content with that, use the Word "Table of Contents" function to put the TOC where you want it.

Calibre, eCub, and Sigil will recognize and use the Word-generated TOC.

This proves quick & easy if you assign the macro to an alternate keystroke.

-bjc

iodine9176
02-20-2010, 12:57 PM
Thanks for your macro!
It is really cool. I have also written a similar macro, but yours is better than mine.

But I have run into a new problem: the document I am dealing with has images in it, so I can't separate the chapters and save them as TXT; I need to save them as XHTML.
Word doesn't seem to be able to save as XHTML, though. I've tried saving as HTML, but I can't import those .htm files into eCub.

Is there any way to save a DOC as XHTML?

frabjous
02-20-2010, 01:47 PM
You can save it as (filtered) HTML, which should be good enough for ePub. If for some reason you need it as XHTML, just open the HTML and change the doctype at the top:

e.g., from

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

to
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

and rename the file to end in .xhtml.

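Almost nothing in HTML is off-limits in XHTML, but XHTML is stricter about syntax: tag names must be lowercase, every element must be closed, and attribute values must be quoted.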
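
If there are many files, the doctype swap can be scripted. Below is a minimal Python sketch of just that step; Python and the function name swap_doctype are my own choices, not something used elsewhere in this thread. It handles the case where the export has no doctype at all by adding one at the top, and it does nothing to make the rest of the markup well-formed.

```python
import re

# The XHTML 1.0 Transitional doctype quoted above.
XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)

# Matches an existing doctype declaration, even one split over two lines.
ANY_DOCTYPE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)

def swap_doctype(html_text):
    """Replace the first doctype with the XHTML one, or add it if missing."""
    if ANY_DOCTYPE.search(html_text):
        return ANY_DOCTYPE.sub(lambda m: XHTML_DOCTYPE, html_text, count=1)
    return XHTML_DOCTYPE + "\n" + html_text
```

Renaming the file to .xhtml is a separate step (for example with pathlib's Path.rename).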

charleski
02-20-2010, 02:08 PM
Saving them as filtered html should be fine, but you'll need to make a few changes afterwards. Open the files in Notepad++ and check the head at the top of the file. If you have
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
there, then Word saved it in ANSI (windows-1252) encoding. Go to the Format menu, select 'Convert to UTF-8', and change 'windows-1252' to 'utf-8' in that meta tag. You'll also probably want to delete all the unneeded @font-face definitions Word will have inserted at the top.
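
For more than a handful of files, the same two fixes can be scripted instead of done by hand in Notepad++. The following is a rough Python sketch, assuming the file really is windows-1252 as its meta tag claims; the function names are invented for illustration and Python itself is my substitution, not a tool anyone in this thread is using.

```python
import re

# The charset value inside Word's Content-Type meta tag.
CHARSET = re.compile(r"(charset\s*=\s*)windows-1252", re.IGNORECASE)

# One @font-face rule: the keyword, optional whitespace, then a {...} block.
FONT_FACE = re.compile(r"@font-face\s*\{[^}]*\}\s*", re.IGNORECASE)

def cp1252_bytes_to_utf8(raw):
    """Decode windows-1252 bytes, fix the declared charset, return UTF-8 bytes."""
    text = raw.decode("cp1252")
    text = CHARSET.sub(r"\g<1>utf-8", text)
    return text.encode("utf-8")

def strip_font_faces(html_text):
    """Remove every @font-face rule from the embedded stylesheet."""
    return FONT_FACE.sub("", html_text)
```

Read the file in binary mode, pass the bytes through cp1252_bytes_to_utf8, and write the result back in binary mode so nothing gets re-encoded a second time.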

iodine9176
02-21-2010, 12:16 AM
That's exactly the problem I faced: I saved the files as filtered HTML, but they couldn't be imported into eCub. Thanks for the solution. But is there any way to tidy up the HTML files? They are too messy, with too much unwanted information.

charleski
02-21-2010, 08:33 AM
Actually, I realised that Word's HTML export has a few other flaws. You'll probably want to run the HTML through HTML Tidy (http://www.ibm.com/developerworks/library/x-tiptidy.html) or something similar to fix them all (mostly, Word fails to put quotes around attribute values). Notepad++'s TextFX plugin can do the HTML Tidy job for you.

Word does add a lot of needless fluff, like spans to define the language. Those are a pain to remove in Notepad++, as its regex engine doesn't handle newlines or non-greedy matches. Sigil, OTOH, has a regex engine that will remove them easily: set the regular expression and minimal matching options, then use the Find string
<span xml:lang="EN-US" lang="EN-US">(.*)</span>
Replace string
\1
Sigil also automatically does the HTML Tidy xhtml conversion for you.
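
The same find/replace can be written outside Sigil too. Here is a small Python sketch of it (Python and the name strip_lang_spans are my additions, not part of the Sigil workflow above). The (.*?) is the non-greedy counterpart of Sigil's minimal matching, and re.DOTALL lets the match run across newlines. Like the Sigil version, it will pair the wrong closing tag if another span is nested inside a language span.

```python
import re

# Non-greedy match of one language span; DOTALL lets "." cross line breaks.
LANG_SPAN = re.compile(
    r'<span xml:lang="EN-US" lang="EN-US">(.*?)</span>',
    re.DOTALL,
)

def strip_lang_spans(html_text):
    """Drop the EN-US language spans but keep the text inside them."""
    return LANG_SPAN.sub(r"\1", html_text)
```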

frabjous
02-21-2010, 08:37 AM
Also consider installing AbiWord (http://www.abisource.org) or OpenOffice (http://openoffice.org) and using one of those to convert the .doc/.docx to .html. You can also convert through Google Docs (http://docs.google.com). All are free, and do a better job. Switching away from reliance on Microsoft is always good for society. That company has too much power for anyone's good.

iodine9176
02-23-2010, 02:24 PM
I agree that Sigil is good for tidying up and converting the HTML files. However, the problem is that I have many chapters in my doc, and it is hard to convert and tidy the HTML files one by one. Is there any programmatic method that can do all the conversion and tidying work together?
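
One possible shape for a batch run, sketched in Python (not a tool from this thread; the function name clean_folder and its parameters are hypothetical). It loops over every .htm file in a folder, passes the text through whatever cleaning function you supply, and writes a UTF-8 .xhtml copy into a second folder. It assumes the input files are windows-1252, as Word's filtered HTML was reported to be above, and it does not do the HTML Tidy step itself.

```python
from pathlib import Path

def clean_folder(src_dir, dst_dir, clean, pattern="*.htm"):
    """Run `clean` on every matching file in src_dir and write .xhtml copies.

    `clean` is any function that takes HTML text and returns cleaned text;
    it stands in for whichever tidying steps you decide to apply.
    Input is read as windows-1252 (an assumption) and written as UTF-8.
    Returns the names of the files written, in sorted order.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(pattern)):
        text = path.read_text(encoding="cp1252")
        out_path = dst / (path.stem + ".xhtml")
        out_path.write_text(clean(text), encoding="utf-8")
        written.append(out_path.name)
    return written
```

For example, clean_folder("chapters", "chapters_clean", my_cleaner) would process chapters/*.htm, where my_cleaner is a function you write yourself. The originals are left untouched, so a bad cleaning rule costs nothing but a rerun.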