MobileRead Forums - View Single Post

kacir · 02-01-2009, 02:52 PM

For such books I use a Vim script
(www.vim.org - Vim is a very powerful text editor)

You can write a command in Vim that says
"find every line NOT ending with a period, question mark, exclamation point or closing quote (optionally followed by whitespace) and join it with the next line":
:vglobal/[.!?"']\s*$/join
I often abbreviate the above command this way:
:v/[.!?"']\s*$/j
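That is it. If a paragraph is broken across more than two lines, repeat the command until nothing changes: :global drops the mark on a line once it has been joined into the line above, so a single pass only joins lines in pairs.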
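For anyone who wants to see the rule outside Vim, here is a minimal Python sketch of the same idea. It reuses the character class from the command above; the function name join_broken_lines is made up for this illustration, and unlike the Vim command it merges all the way to the next sentence-ending line in one pass.

```python
import re

# Same test as the Vim pattern: the line ends with . ! ? " or '
# optionally followed by whitespace.
SENTENCE_END = re.compile(r"""[.!?"']\s*$""")


def join_broken_lines(text):
    """Merge every line that does not end a sentence into the line after it."""
    result = []
    pieces = []  # parts of a line still waiting for sentence-ending punctuation
    for line in text.splitlines():
        pieces.append(line.strip())
        if SENTENCE_END.search(line):
            # Skip empty pieces so blank lines do not leave stray spaces.
            result.append(" ".join(p for p in pieces if p))
            pieces = []
    if pieces:
        # Trailing text that never reached closing punctuation is kept as-is.
        result.append(" ".join(p for p in pieces if p))
    return "\n".join(result)


if __name__ == "__main__":
    sample = 'It was a dark\nand stormy\nnight.\n"Indeed," she\nsaid.\n'
    print(join_broken_lines(sample))
```

Like the Vim command, this treats an empty line as "not ending a sentence" and merges it into what follows, so it only suits text where blank lines are not meaningful.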

You can also say:
"find every line ending with . ! ? or a closing quote and insert an empty line after it"
"find every line shorter than (let's say) 50 characters and insert an empty line after it"
"find two consecutive empty lines and replace them with one empty line"
"Join paragraphs"
"delete empty lines"
That should take care of 99 percent of the excessive newline characters.
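The five steps above can be sketched as one small Python function. This is only an illustration under assumptions: the name reflow and the short_line parameter are invented here, the 50-character threshold is the "let's say" value from the list, and the heuristics will need the same per-book tweaking as the Vim version.

```python
import re

# A line "ends a sentence" if it ends with . ! ? " or ' plus optional whitespace.
SENTENCE_END = re.compile(r"""[.!?"']\s*$""")


def reflow(text, short_line=50):
    """Apply the five clean-up steps and return one paragraph per line."""
    lines = [ln.rstrip() for ln in text.splitlines()]

    # Steps 1 and 2: put an empty line after every line that ends a sentence
    # or is shorter than short_line characters (both hint at a paragraph end).
    marked = []
    for ln in lines:
        marked.append(ln)
        if ln and (SENTENCE_END.search(ln) or len(ln) < short_line):
            marked.append("")
    text = "\n".join(marked)

    # Step 3: collapse two or more consecutive empty lines into one.
    text = re.sub(r"\n{3,}", "\n\n", text)

    # Step 4: join each paragraph into a single line
    # (this also normalizes runs of whitespace to one space).
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n")]

    # Step 5: delete the empty lines that separated the paragraphs.
    return "\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = (
        "The old house stood at the end of the lane, its windows\n"
        "dark and shuttered.\n"
        "Nobody had lived there for years, or so the villagers\n"
        "liked to say.\n"
    )
    print(reflow(sample))
```

Note that steps 1 and 2 treat every line that happens to end with a period as the end of a paragraph, so a book with sentences ending exactly at the wrap column will come out with too many paragraph breaks.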

You have to tweak the above steps for each particular book, because every misformatted book is unique.

You can also look at the HTML file and try to distinguish between wanted and unwanted line breaks. Unfortunately, the HTML file is most often generated by MS Word, which is THE most horrible tool for producing HTML.
You can also try processing the HTML file with the program HTML Tidy: http://www.w3.org/People/Raggett/tidy/
