Beginner: Converting lit to epub-linefeed h*ll

PhyrePhox · 12-28-2009, 03:17 PM

I'm a bit overwhelmed at the discussions here, and I've not been able to find a "how-to" guide that starts from a fundamental point, so I'll try a thread here and hope I'm not too far off topic.
I got a Sony PRS-300 for Christmas. I have a series of books that are in .lit format. I added these files to Calibre and let it do whatever to them to let them work on my Reader. I find that there is a full space after each line, hard hyphenations, and occasional blank pages. The TOC is empty, and the metadata appears weird, i.e. it's sorting by the author's first name. I tried opening the file in Sigil, and it apparently doesn't like it when you "select all" on my iMac 2.16GHz with Snow Leopard. I installed eCub, noticed that every line has its own entry in the XHTML files, so I tried to remove all the

Code:

<p class="MsoPlainText">

lines for an entire paragraph, and compiled the epub. Then, calibre couldn't read it.
What I'm trying to do is see what the book would look like without the linefeeds, except perhaps at the paragraph. I don't care too much about justification at this time. I would like to build the TOC, or find it in the original lit and migrate it, but I can't seem to open the lit file with anything that runs on Mac. I know that Sony only sees the metadata if it's added in a certain way, do I need to jiggle the settings in Calibre's converter? Or, do I need to go Windows to fix this in the lit file beforehand?

Slash5 · 12-29-2009, 09:02 AM

I have a Windows VB app I wrote to remove extra breaks in Epubs. It creates paragraphs at a user set interval - not as good as the proper paragraphs but at least it is readable.
If you are interested I can send it to you.

jackie_w · 12-29-2009, 10:58 AM

Hi PhyrePhox,

I may be wrong but it looks as if your source LIT file is poor. It looks as if it's been created from a plain text file in MSWord which has not had its "hard line breaks" removed before conversion to LIT. My advice would be to try to clean up the source rather than editing the EPUB.

This is the approach I would take.

Extract the HTML from the LIT by switching on the Debug option during the Calibre conversion of LIT to EPUB:-
[Convert] - [Debug] and specify a directory to receive the Debug files.
Once the conversion is finished ignore the EPUB it created but go to the Debug directory you specified and look in the Input subdir. The extracted source HTML will be in there.
You can then use your editor-of-choice to tidy up the source before re-importing to Calibre.
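
For anyone who prefers the command line, the same debug extraction can probably be driven from a script. This is a minimal sketch, assuming calibre's command-line tools are installed and on the PATH, and that `--debug-pipeline` is the command-line counterpart of the Debug option described above. The file and folder names are hypothetical placeholders.

```python
import subprocess


def build_debug_convert_command(lit_path, epub_path, debug_dir):
    # Argument list for calibre's ebook-convert tool with debug output
    # enabled. All three paths are placeholders chosen by the caller.
    return ["ebook-convert", lit_path, epub_path, "--debug-pipeline", debug_dir]


def run_debug_conversion(lit_path, epub_path, debug_dir):
    # Runs the conversion; this only works if ebook-convert is on the PATH.
    subprocess.run(
        build_debug_convert_command(lit_path, epub_path, debug_dir),
        check=True,
    )


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_debug_convert_command("book.lit", "book.epub", "debug")))
```

After a real run, the extracted HTML should be in the input subfolder of the debug directory, as described above.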

If you need more help I would be happy to take a look at your LIT file and give more specific help based on what I see.

PhyrePhox · 12-31-2009, 09:54 AM

Quote:

Originally Posted by Slash5

I have a Windows VB app I wrote to remove extra breaks in Epubs.

Thank you for this. If I give up on editing on the Mac I will contact you for this script.

Quote:

Originally Posted by jackie_w

Hi PhyrePhox,

I may be wrong but it looks as if your source LIT file is poor...

Thank you jackie_w, I was able to get at the "raw" html for this lit using your suggestion. It appears that the original HTML was produced by Word, which has a reputation for producing gnarly code. A poor source indeed!

I now have a single html file, the "title" is set to the first chapter name, with very short lines (probably page width from a Word doc), with "</p>" and a hard return on each line. Not fun, but at least I have a place to work from now.

Is there a summary somewhere here of what html tags are meaningful for ebooks? Also, how can I feed the resulting html back into Calibre to convert to epub?

jackie_w · 12-31-2009, 05:24 PM

Hi PhyrePhox, Me again ...

Quote:

Originally Posted by PhyrePhox

It appears that the original HTML was produced by Word, which has a reputation for producing gnarly code. A poor source indeed!

In my opinion, MSWord only produces poor HTML if you let it. HTML output can be greatly improved by

Using MSWord styles correctly.
Removing any incorrect hard line breaks before saving.
Making sure the file is saved as type "WebPage-Filtered" to get simpler HTML without some of the MS "excess baggage".

Quote:

Originally Posted by PhyrePhox

Is there a summary somewhere here of what html tags are meaningful for ebooks? Also, how can I feed the resulting html back into Calibre to convert to epub?

I'm afraid I know nothing about editing on a Mac as I have PC/Windows setup, but if you used MSWord as your editor-of-choice these would be the steps I'd take. Perhaps some of it can be "translated" into Mac steps.

Open a new blank Word doc and import the Calibre-output HTML file you've already got.
Try to remove the hard line breaks using the editor's Find-and-Replace for mass changes. If you're lucky, the "real" end-of-paragraphs may have a blank line immediately following, or the "real" start-of-paragraphs may have some leading blank spaces. I could elaborate on this if it was relevant to your particular file.
Use one (or more) of the Word built-in Heading styles (e.g. Heading 2) to mark your chapter headings. Any paragraphs styled as "Heading 2" in Word are created with
<h2> ... </h2> tags in the HTML output.
Similarly, "Heading 1" creates <h1>...</h1> tags etc. Calibre can use these <h1>, <h2> etc tags during conversion to EPUB to specify the TOC.

Any paragraph styled as "Normal" in Word outputs as
<p class=MsoNormal>...</p> in the HTML output.

Any paragraph styled as "Normal (Web)" in Word outputs as
<p>...</p> in the HTML output.

Any paragraph styled as "Plain Text" in Word outputs as
<p class="MsoPlainText">...</p> in the HTML output -- which you've already come across. I'd restyle all of these as "Normal" or "Normal (Web)"

Any text marked as Italic or Bold in Word is output as
<i>...</i> or <b>...</b> in the HTML output.

I tend to use <h1> for Book Title and Author and <h2>, <h3> for Chapters, Sub-titles.
Save the doc as HTML (as detailed above)
If you're proficient with CSS files I'd then open up the HTML file in a text editor and remove everything between the <style>...</style> tags and put in a link to an external CSS file which would contain all the styling I wanted, e.g. lines like :-
body {font-size: 100%; font-family: serif; ... ...}
h1 {...}
h2 {...}
p {...}
.MsoNormal {text-indent: 1.5em; ...}

If you're not good with CSS then leave the HTML alone.
Once you're happy with the HTML then reimport to Calibre by drag-and-drop in the normal way or via the Edit-Metadata feature if you've already set up the book's metadata. Calibre will zip up the HTML file with any linked CSS file and/or images.
Convert away ... Don't forget to specify the appropriate h1 h2 h3 levels in the "Structure detection" option.
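
The step of swapping Word's embedded styles for an external stylesheet can also be scripted. This is only a sketch, assuming the saved HTML keeps its styling in ordinary <style>...</style> blocks; the stylesheet name book.css and the function name are made up for illustration.

```python
import re

# Matches a whole <style ...> ... </style> block, across line breaks.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)


def link_external_css(html, css_href="book.css"):
    # Replace the first <style> block with a link to an external CSS file,
    # then drop any remaining <style> blocks.
    link = '<link rel="stylesheet" type="text/css" href="%s" />' % css_href
    html = STYLE_BLOCK.sub(lambda m: link, html, count=1)
    return STYLE_BLOCK.sub("", html)


if __name__ == "__main__":
    sample = '<head><style type="text/css">p.MsoNormal {margin:0in;}</style></head>'
    print(link_external_css(sample))
```

The external CSS file itself still has to be written by hand and kept next to the HTML file so that Calibre picks it up on import.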

Anyway, that's enough from me for the time being. I don't know how much is relevant for your circumstances but feel free to ask if you think I could help.

Happy New Year

tyche · 01-01-2010, 10:50 AM

Jackie_w, that is some good info. I'll just add a few of my usual tricks for fixing messed up formatting. Even a good .lit file needs some massaging to make a good epub. I find it's worth a little effort to fix up a text before reading it. And while I'm reading one book I can be working on another. Once you get better at it, you can fix even the most messed up text in about 30 mins.

Do a search for " " (a closing quote, a space, then an opening quote) or whatever is used to mark spoken text. This will find lines with two different people talking (the end of the first person's speech and the beginning of the second's). Then break the lines up so the story flows better. I hate when two people are talking on the same line :0

Another big one is attaching broken paragraphs. The obvious sign is a hard return followed by a lowercase letter (or a lowercase letter followed by a hard return). Using Word's wildcard search, which works like a simple regex, you can find these and either mass-change the results or just find and fix them one at a time. Going one at a time matters for things like lyrics or special messages, which would get caught in this search but whose lines you wouldn't want to join.

ex. a regex search for a hard return, ^13, and then a lowercase letter [a-z] would be ^13([a-z])

With the parentheses you can reuse what it found in the replacement. The replacement here would be ' \1' without the quotes, i.e. a space then \1. This replaces the return with a space, and the lowercase letter it found is added back so the lines join up.

Doing the search the other way, ([a-z])^13 helps find broken lines as well as missing endings like periods. Its replace format would be '\1 ' without the quotes.

Then clean up the extra spacing by searching for ^p^p (2 returns, or as many as you are looking for) and replace with ^p. You can then select the whole text and do margin and line spacing as well to something you like.
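
For those working outside Word, the same three searches can be sketched in Python on a plain-text copy of the book. This is an illustration under assumptions, not a drop-in tool: it assumes the text uses "\n" line endings, and the function names are invented here. As with the Word version, lyrics and similar deliberately short lines would be joined too, so check the results.

```python
import re


def join_broken_lines(text):
    # Same idea as searching for ^13([a-z]) and replacing with " \1":
    # a line break followed by a lowercase letter becomes a space.
    return re.sub(r"\n([a-z])", r" \1", text)


def lines_missing_an_ending(text):
    # Same idea as searching for ([a-z])^13: list the lines that end in a
    # lowercase letter, so they can be reviewed by hand.
    return re.findall(r"^(.*[a-z])\n", text, flags=re.MULTILINE)


def collapse_blank_lines(text):
    # Same idea as replacing ^p^p with ^p until no doubled returns remain.
    return re.sub(r"\n{2,}", "\n", text)


if __name__ == "__main__":
    sample = "It was a dark\nand stormy night.\n\nThe rain fell\nIn torrents"
    print(lines_missing_an_ending(sample))
    print(collapse_blank_lines(join_broken_lines(sample)))
```

Running the review step before the join step shows which lines would be affected, which is the "find and fix" approach rather than the mass change.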

In Word I find it better to save a copy as HTML (filtered). Even with the crappy MS additions, Calibre will build a very accurate result in epub. You can also copy & paste it into OpenOffice and save the result as .html and it will have even less baggage, but I've not seen any benefit in the resulting epub. Or even use something like Notepad++ and, with some experience, wipe out all the extraneous HTML tagging. I usually leave it as the MS Word filtered version unless I want a standard .html file.

Be sure to look at the Calibre options for removing spaces between paragraphs. Even with your html page looking right, this can help fix extra spaces from creeping in.

Good luck!

rogue_ronin · 01-02-2010, 04:49 PM

You could try something like this...

m a r

ToutSuite · 04-07-2010, 04:03 AM

Hi All -

I'm also a beginner, and trying to get my head around this. I have a problem almost exactly the opposite of the original poster - no line breaks seem to be translated over when converting from LIT to ePub with one of my files. Following the advice to save debug files, I opened the .htm file in the Input directory with Dreamweaver and saw that while the text appeared correctly, the entire body of the text was encapsulated in a single <pre> tag. I assume this can't possibly be right.

How would you suggest I go about correcting this?

Thanks in advance!

Alex

charleski · 04-07-2010, 11:18 AM

Quote:

Originally Posted by PhyrePhox

I'm a bit overwhelmed at the discussions here, and I've not been able to find a "how-to" guide that starts from a fundamental point

The first step is to learn xhtml (html with a few additional strictures) and css 2.0, which are really very easy. I picked up everything I needed to know by reading a few articles at W3Schools and looking at ePubs I downloaded from here and feedbooks.

Quote:

I added these files to Calibre and let it do whatever to them to let them work on my Reader. I find that there is a full space after each line, hard hyphenations, and occasional blank pages. The TOC is empty, and the metadata appears weird, i.e. it's sorting by the author's first name.

Try ticking the 'Remove spacing between paragraphs' box on the 'Look and Feel' page of the conversion dialog. Calibre's automatic conversion system is the best of a bad lot but it needs a lot of work to produce truly acceptable output and it's highly dependent on the quality of the source.

Quote:

every line has its own entry in the XHTML files, so I tried to remove all the

Code:

<p class="MsoPlainText">

lines for an entire paragraph, and compiled the epub. Then, calibre couldn't read it.

The paragraph tags aren't causing extra line-feeds, they're an essential part of the markup and you can't remove them. You need to take a look at the css styles and set the top and bottom margins of MsoPlainText to 0.
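
As a rough illustration of that margin fix, the override can be appended to the end of the stylesheet with a few lines of Python. This is a sketch under assumptions: it assumes the class is still called MsoPlainText in the CSS you are editing (a converter may rename classes in its output) and that the existing rule uses a selector no more specific than p.MsoPlainText, so that a later rule wins.

```python
# Override rule: no extra vertical space above or below these paragraphs.
MARGIN_FIX = "p.MsoPlainText { margin-top: 0; margin-bottom: 0; }\n"


def add_margin_fix(css_text):
    # Append the override to the end of the stylesheet text so it comes
    # after any earlier rule for the same selector.
    if css_text and not css_text.endswith("\n"):
        css_text += "\n"
    return css_text + MARGIN_FIX


if __name__ == "__main__":
    original = "p.MsoPlainText { margin: 12pt 0; font-family: serif; }"
    print(add_margin_fix(original))
```

Editing the existing rule by hand does the same job; appending just avoids having to find it first.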

Quote:

I would like to build the TOC, or find it in the original lit and migrate it, but I can't seem to open the lit file with anything that runs on Mac. I know that Sony only sees the metadata if it's added in a certain way, do I need to jiggle the settings in Calibre's converter? Or, do I need to go Windows to fix this in the lit file beforehand?

Your best bet is to enable debug output in calibre - go to the 'Debug' panel of the conversion dialog and enter a folder for it to place debugging information. Calibre will then decode the lit and write out the html for you (several versions of it, corresponding to the various stages of the pipeline). Load the input html into Sigil and go to code view, find the css file (that Calibre will also have written out for you) and paste that code into the style section in the header. Search for the chapters and set the tags to a heading style (h1, h2 etc) and Sigil will automatically compile the ToC for you. Go to Book view, place the cursor before each chapter and press the Chapter Break button to tell Sigil where to split the file. You set the metadata in Sigil by using the Metadata Editor from the Tools menu. Then edit the css to clean up the errors in the source that are causing these linebreaks.
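
The "search for the chapters and set the tags to a heading style" step can be done by hand, but if the chapter titles follow a pattern it can be sketched as a search-and-replace. The example below assumes each chapter title sits alone in a paragraph of class MsoPlainText and starts with the word "Chapter"; both are guesses about the file, and the function name is hypothetical.

```python
import re

# A paragraph whose whole content is "Chapter" plus a number or word.
CHAPTER_PARA = re.compile(
    r'<p class="MsoPlainText">\s*(Chapter\s+[0-9A-Za-z]+)\s*</p>',
    re.IGNORECASE,
)


def promote_chapter_headings(html, level=2):
    # Rewrite matching paragraphs as heading elements (h2 by default) so
    # that a TOC can be built from them.
    tag = "h%d" % level
    return CHAPTER_PARA.sub(
        lambda m: "<%s>%s</%s>" % (tag, m.group(1), tag), html
    )


if __name__ == "__main__":
    sample = (
        '<p class="MsoPlainText">Chapter 1</p>\n'
        '<p class="MsoPlainText">It began in spring.</p>'
    )
    print(promote_chapter_headings(sample))
```

Titles that do not fit the pattern (a prologue, named chapters) would still need to be tagged manually.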

You can get halfway there if you just want to stick with Calibre. Set the metadata correctly for the lit file before performing the conversion, and Calibre will use the corrected values for the ePub. Tick the checkbox I referred to earlier and tick the 'Preprocess input file...' box on the structure detection page. If it's still not picking up the chapters properly then you'll have to look at the debug output and work out an XPath expression that will catch them.

charleski · 04-07-2010, 11:31 AM

Quote:

Originally Posted by ToutSuite

Following the advice to save debug files, I opened the .htm file in the Input directory with Dreamweaver and saw that while the text appeared correctly, the entire body of the text was encapsulated in a single <pre> tag. I assume this can't possibly be right.

It ain't, your source was coded by a monkey.

The only way to fix this is to search through the text and try to find what sort of marker was being used to indicate paragraphs instead of marking the text properly. This depends on the species of monkey who prepared the original. Old-World monkeys may do this by inserting a number of spaces at the start of a new paragraph to simulate a text indent (the use of a <pre> tag suggests this might be what you're dealing with). New-World monkeys may use <br> tags to indicate paragraphs.

Once you've discovered what sort of ridiculous scheme they used, you'll need to search for these elements and replace them with </p><p> pairs in order to reconstruct a proper markup (obviously you'll need to check the start and end of text blocks as well and add the proper opening or closing tags).
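
For the indent-marker case, the reconstruction might look something like this. It is a sketch that assumes a new paragraph starts on any line beginning with at least two spaces, and that the text has already been copied out of the <pre> element; the threshold and the function name are assumptions to adjust for the actual file.

```python
import re


def pre_to_paragraphs(pre_text, min_indent=2):
    # Split just before every line that starts with min_indent or more
    # spaces, which is taken to be the start of a new paragraph.
    pattern = r"\n(?= {%d,}\S)" % min_indent
    paragraphs = []
    for chunk in re.split(pattern, pre_text):
        # Collapse the line breaks and runs of spaces inside a paragraph.
        joined = " ".join(chunk.split())
        if joined:
            paragraphs.append("<p>" + joined + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        "    First line of one\nstill paragraph one.\n"
        "    Second paragraph\ncontinues here."
    )
    print(pre_to_paragraphs(sample))
```

If the file uses <br> tags as paragraph markers instead, the same approach applies with a different split pattern.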

Luckily, publishers have generally phased out the use of monkeys to encode their ebooks, but many still try to save money by employing chimpanzees and gorillas. A few are grudgingly starting to employ actual human beings to do this.

AbominableDavid · 04-07-2010, 03:14 PM

Quote:

Originally Posted by tyche

Another big one is attaching broken paragraphs. The obvious sign is a hard return followed by a lowercase letter (or a lowercase letter followed by a hard return). Using Word's wildcard search, which works like a simple regex, you can find these and either mass-change the results or just find and fix them one at a time. Going one at a time matters for things like lyrics or special messages, which would get caught in this search but whose lines you wouldn't want to join.

ex. a regex search for a hard return, ^13, and then a lowercase letter [a-z] would be ^13([a-z])

With the parentheses you can reuse what it found in the replacement. The replacement here would be ' \1' without the quotes, i.e. a space then \1. This replaces the return with a space, and the lowercase letter it found is added back so the lines join up.

Doing the search the other way, ([a-z])^13 helps find broken lines as well as missing endings like periods. Its replace format would be '\1 ' without the quotes.

I've often found badly formatted books that break paragraphs on commas, semicolons, and other punctuation marks. I use a regex something like this in Notepad++: [^."!?]</p>

This finds paragraphs that end in anything other than a period, a quote mark, an exclamation point, or a question mark. Of course, it has to be modified if the text uses curly quote marks, single quotes, or some other tag (like </span>, for instance) between the end of the text and the </p>
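
The same pattern works unchanged in Python, which is handy for listing every suspect paragraph ending in one pass. This is a small sketch; the function name and the amount of context shown are arbitrary choices, and the caveats above about curly quotes and tags before the </p> still apply.

```python
import re

# A closing </p> preceded by anything other than . " ! or ?
SUSPECT_END = re.compile(r'[^."!?]</p>')


def find_suspect_paragraph_ends(html, context=30):
    # Return each suspect ending with up to `context` characters of the
    # text leading up to it, for review by hand.
    hits = []
    for match in SUSPECT_END.finditer(html):
        start = max(0, match.start() - context)
        hits.append(html[start:match.end()])
    return hits


if __name__ == "__main__":
    sample = '<p>He paused,</p>\n<p>and then went on.</p>\n<p>"Really?"</p>'
    for hit in find_suspect_paragraph_ends(sample):
        print(hit)
```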
