Old 08-19-2010, 03:21 PM   #1
Wintersdark
Junior Member
Wintersdark began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Nov 2009
Device: iPhone 3G
Removing unnecessary line breaks in books.

I have a great many books that were converted from .lit to .epub to be Stanza-friendly - I read on my iPhone.

My problem is that a very large number of my books (hundreds) have line breaks scattered throughout paragraphs. Right now, I convert these epubs to another format and manually edit them to remove the line breaks, but this is very impractical given the number of books.

Is there a way to have Calibre apply a regex search/replace to remove these line breaks on conversion so I could bulk-convert everything at once?

I figure that searching for "^13([a-z])" and replacing with " \1" will work often enough to make the text readable at least. There are instances where it will miss, but it's really good enough - though I'm open to ways to do it better of course.

Could I use the header/footer removal regexes in calibre's Convert Books settings to achieve this? Or are those removal only, not replacement?

Thanks!!

Derrick
Old 08-19-2010, 03:29 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
There's no way to do this, short of writing a plugin to modify how calibre processes the files. Or use the command line tools: output to txt, run your regex, and convert to the final format via a script.
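
For anyone wanting to try the command line route, here is a minimal sketch of such a script in Python. It assumes calibre's `ebook-convert` tool is on the PATH and that the intermediate txt file is UTF-8. The function names and the `_clean` suffix are made up for illustration, and the regex is just the "line break followed by a lowercase letter" rule from the first post, not anything built into calibre.

```python
import re
import subprocess
from pathlib import Path

# Rule from the first post: a line break directly followed by a lowercase
# letter is assumed to be a mid-sentence break and becomes a space.
MID_SENTENCE_BREAK = re.compile(r"\r?\n(?=[a-z])")


def join_broken_lines(text: str) -> str:
    """Replace mid-sentence line breaks with a single space."""
    return MID_SENTENCE_BREAK.sub(" ", text)


def convert(src: Path, dst: Path) -> None:
    """Run calibre's command line converter (assumed to be on the PATH)."""
    subprocess.run(["ebook-convert", str(src), str(dst)], check=True)


def clean_book(book: Path, out_ext: str = ".epub") -> Path:
    """Hypothetical pipeline: book -> txt -> regex -> final format."""
    txt = book.with_suffix(".txt")
    convert(book, txt)
    # Assumption: the txt output is UTF-8 encoded.
    cleaned = join_broken_lines(txt.read_text(encoding="utf-8"))
    txt.write_text(cleaned, encoding="utf-8")
    out = book.with_name(book.stem + "_clean" + out_ext)
    convert(txt, out)
    return out
```

Calling `clean_book(Path("some_book.lit"))` for every file in a folder would give the bulk conversion asked about, at the cost of whatever formatting the txt step drops.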
Old 08-19-2010, 03:34 PM   #3
Wintersdark

Oh, dear. Right.

Ok, I extracted a sample epub, and it seems it's decided that each line is a paragraph. So, a sample of the raw text is:

Code:
      <p class="MsoPlainText">He stared at the warm blackness, half closing his eye, then opening it again, </p>
      <p class="MsoPlainText">wide. Over on his left, in front, was a narrow smear of murky light in the air, </p>
      <p class="MsoPlainText">which at first he could make no sense of. The light danced, a flickering glow. </p>
      <p class="MsoPlainText">Then gradually he began to sort out details of the room.</p>
      <p class="MsoPlainText">Or half room. It was big, high ceilinged. There was no furniture, but the floor </p>
      <p class="MsoPlainText">was carpeted. Across the room, from wall to wall, hung some kind of thick </p>
      <p class="MsoPlainText">curtain. Two curtains, actually, pulled together. Hence that chink of light in </p>
      <p class="MsoPlainText">the center where the inner folds of the two draperies didn't quite meet.</p>
This makes things much more difficult. Still, removing all the paragraph breaks except where the </p> occurs immediately after a " or a . would do the job. It's not perfect, as there'd still be a few paragraph breaks where there shouldn't be, but at least conversation would be split up nicely and there wouldn't be paragraph breaks mid-sentence.

Any advice?
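
The merging rule described above could be sketched roughly like this in Python. It is only an illustration: it assumes every paragraph uses exactly the `<p class="MsoPlainText">` markup from the sample, it returns just the rebuilt paragraphs (any other markup in the file is ignored), and the function name is made up.

```python
import re

# Matches the one-line "paragraphs" from the sample markup above.
P_TAG = re.compile(r'<p class="MsoPlainText">(.*?)</p>', re.DOTALL)


def merge_paragraphs(html: str) -> str:
    """Join consecutive <p> lines until one ends with '.' or '"'."""
    merged = []
    current = []
    for match in P_TAG.finditer(html):
        text = match.group(1).strip()
        if not text:
            continue
        current.append(text)
        # Only a line ending in a full stop or closing quote ends a paragraph.
        if text.endswith((".", '"')):
            merged.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing lines that never reached a sentence end.
        merged.append(" ".join(current))
    return "\n".join(f'<p class="MsoPlainText">{p}</p>' for p in merged)
```

As noted, a line that happens to end in a full stop mid-paragraph still produces a break, so the result is readable rather than perfect.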
Old 08-19-2010, 03:43 PM   #4
Wintersdark
Quote:
Originally Posted by kovidgoyal
There's no way to do this, short of writing a plugin to modify how calibre processes the files. Or use the command line tools. Output to txt run your regex and convert to final format via a script
Alas. Is there any chance that there's a somewhat central library of plugins anywhere, in case someone has already made one to do this? Writing my own is somewhat beyond me. It seems like it would be a fairly common problem, as it occurs in pretty much every lit -> epub conversion.

Edit: Or, failing that, at least to help clean up the lit>epub conversion in the future?

Last edited by Wintersdark; 08-19-2010 at 03:46 PM.
Old 08-19-2010, 03:45 PM   #5
kovidgoyal
No, it doesn't occur in every lit to epub conversion. It only occurs if your original lit file has text formatted like that. And no, as far as I'm aware no one has written such a plugin.
Old 08-19-2010, 03:47 PM   #6
Wintersdark
Quote:
Originally Posted by kovidgoyal
No it doesn't occur in every lit to epub. It only occurs if your original lit file has text formatted like that. And no, as far as I'm aware no one has written such a plugin.
Ok, thanks for your help!
Old 08-19-2010, 03:52 PM   #7
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
Posts: 11,731
Karma: 6690881
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Try converting the lit to txt, then the txt to epub. Calibre's txt input plugin knows how to construct paragraphs from long lines or sets of shorter lines.
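
To give a rough idea of what "constructing paragraphs from sets of shorter lines" can mean, here is a small Python sketch. It is not calibre's actual txt input code, just an assumed simple rule: blank lines separate paragraphs, and the lines within each block are joined with spaces. The function name is made up.

```python
import re


def unwrap_paragraphs(text: str) -> list[str]:
    """Split on blank lines, then join each block's lines into one paragraph."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    blocks = re.split(r"\n\s*\n", text)
    return [
        " ".join(line.strip() for line in block.splitlines() if line.strip())
        for block in blocks
        if block.strip()
    ]
```

This only helps when the txt file really does have blank lines between paragraphs and single line breaks inside them.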
Old 08-19-2010, 03:53 PM   #8
kovidgoyal
Quote:
Originally Posted by chaley
Try converting the lit to txt, then the txt to epub. Calibre's txt input plugin knows how to construct paragraphs from long lines or sets of shorter lines.
Doing that will mean he'll lose all character formatting (italics, bold, etc). IIRC the TXT output plugin doesn't preserve those.
Old 08-19-2010, 05:43 PM   #9
Wintersdark
Quote:
Originally Posted by kovidgoyal
Doing that will mean he'll lose all character formatting (italics, bold, etc). IIRC the TXT output plugin doesn't preserve those.
This isn't a problem. While character formatting is nice, having readable text at all is nicer.

You're right, it's not all .lit -> .epub, I'd (incorrectly) assumed that as several books I checked all suffered the same problem. Further investigation shows that's not the case - good news!

I tried converting to text and back, but the way it's formatted I basically get each paragraph followed by a pair of CR/LFs. So, converting directly back to epub doesn't help.

However, as it's not every book, I'm just addressing it on a case by case basis with Notepad++ as I go. If I were still running Linux, I'd mass convert them all to text and figure out how to script applying the regex replace to them, but I have no idea how to go about that in Windows.

Unfortunately, with Notepad++ you cannot use \r\n in regular expressions (who knows why), but you *can* replace (with "extended" searches) all the CRLF pairs with a unique identifier (I used QQQQ), then simply replace all .QQQQ and "QQQQ with \r\n\r\n, then all remaining QQQQs with spaces. It's sort of a pain in the ass to have to do it one at a time, but it works at least.
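
For reference, the same three-step placeholder trick can be written as a few plain string replacements in Python. This is only a sketch of the steps described above: the function name is made up, the marker defaults to the QQQQ used here, and it assumes the text uses single CRLF line endings.

```python
def fix_breaks(text: str, marker: str = "QQQQ") -> str:
    """Apply the placeholder trick: mark every CRLF, restore real breaks, fill the rest."""
    if marker in text:
        # The marker must be unique, otherwise real text would be altered.
        raise ValueError(f"marker {marker!r} already occurs in the text")
    # Step 1: replace every CRLF with the marker.
    text = text.replace("\r\n", marker)
    # Step 2: a marker right after a full stop or a quote was a real paragraph end.
    text = text.replace("." + marker, ".\r\n\r\n")
    text = text.replace('"' + marker, '"\r\n\r\n')
    # Step 3: every remaining marker was a mid-sentence break.
    return text.replace(marker, " ")
```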

If anyone knows a better tool to do this with - one that can macro the operations, or apply a regex directly, or better yet be applied in bulk, in Windows, with a minimum of hassle for one not used to dealing with these things - I'd love to hear about it. But, even if not, this does work.
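
One possible way to do this in bulk on Windows is a short Python script, sketched below. The assumptions are that Python is installed, that the books have already been converted to UTF-8 .txt files in a single folder, and that the rule "keep a line break only after a full stop, a quote or a blank line" is good enough. The function names are made up for illustration.

```python
import re
from pathlib import Path

# A line break is joined unless it follows '.', '"' or another line break,
# or is itself followed by another line break (i.e. part of a blank line).
SOFT_BREAK = re.compile(r'(?<![."\n])\n(?=[^\n])')


def clean_text(text: str) -> str:
    """Replace mid-sentence line breaks with a space."""
    return SOFT_BREAK.sub(" ", text)


def clean_folder(folder: Path, pattern: str = "*.txt") -> int:
    """Clean every matching file in place and return how many files changed."""
    changed = 0
    for path in sorted(folder.glob(pattern)):
        original = path.read_text(encoding="utf-8")
        cleaned = clean_text(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Since the files are overwritten in place, it is worth running something like `clean_folder(Path("C:/books/txt"))` on a copy of the folder first.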


As a feature request for Calibre I'd definitely like to see, for this and other formatting issues, the ability to apply a regex directly in the conversion options (or some such easily accessible place). It would really help people cleaning up poor source material when converting to their ereader format of choice.
Old 08-19-2010, 05:49 PM   #10
chaley
Quote:
Originally Posted by Wintersdark
If I were still running linux, I'd mass convert them all to text and figure out how to script applying the regex replace to them, but I have no idea of how to go about that in windows.
Two suggestions:

1) Use the Cygwin tools. You can get Windows versions of all the standard Unix stuff such as bash, sed, egrep, etc. Couple that with vim (Windows native) and you should have what you need.

2) Run Linux in a VM under Windows, and do the work there. I do this frequently with VirtualBox, but I am sure that the other VM managers work fine as well.
Old 08-19-2010, 11:26 PM   #11
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
This was done for PDF a while back, via the 'Preprocess input file to possibly improve structure detection' option. I just double-checked, and it does still appear to be working well in the latest version of Calibre. Is that function still in preprocess.py?

The regular expressions used for PDF don't apply to Lit/Txt files, but couldn't new regexes be defined for each input format, and then automatically used based on the input format when that box is checked? I've been running into more Lit/Txt files with this problem lately, so I could see if I could get it working if this is a good approach.
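
As a rough illustration of the per-format idea, something like the Python sketch below is what I have in mind. To be clear, this is not the code in preprocess.py: the rule table, the function name and the regexes themselves are all hypothetical, and the only heuristic used is "a break followed by a lowercase letter is mid-sentence".

```python
import re

# Hypothetical table: input format -> list of (pattern, replacement) rules.
UNWRAP_RULES = {
    # Plain text: a line break followed by a lowercase letter becomes a space.
    "txt": [(re.compile(r"[ \t]*\r?\n(?=[a-z])"), " ")],
    # Lit-derived HTML: a </p><p ...> boundary followed by a lowercase letter
    # is collapsed so the two lines end up in one paragraph.
    "lit": [(re.compile(r"\s*</p>\s*<p[^>]*>\s*(?=[a-z])"), " ")],
}


def preprocess(text: str, input_format: str) -> str:
    """Apply the rules registered for this input format, if any."""
    for pattern, replacement in UNWRAP_RULES.get(input_format.lower(), []):
        text = pattern.sub(replacement, text)
    return text
```

A format with no entry in the table is passed through unchanged, so checking the box would be harmless for formats nobody has written regexes for yet.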

There used to be an option to tweak the average line length detection logic, but that appears to be removed (from the GUI at least).
Old 08-19-2010, 11:43 PM   #12
Wintersdark
That would be fantastic. While this does not affect every lit file I have, it certainly does with many and it's quite tedious to manually edit each file.
Old 08-20-2010, 12:00 AM   #13
capidamonte
Not who you think I am...
capidamonte can even cheer up an android equipped with a defective Genuine Personality Prototype.
Posts: 374
Karma: 30283
Join Date: Jan 2010
Location: Honolulu
Device: PocketBook 360 -- Ivory
Could the text output have a "convert to Markdown" option?
Old 08-20-2010, 01:40 AM   #14
ldolse
Ah - I'd forgotten this discussion had already come up here:
https://www.mobileread.com/forums/sho...=preprocess.py
http://bugs.calibre-ebook.com/ticket/2359

I believe the 'Preprocess input file to possibly improve structure detection' option is a result of that feature/bugfix. As I recall, I started looking into how to implement the new function in other input plugins and shortly thereafter took a break from participating in Calibre/MobileRead. Looks like no one else has picked it up, but the logic is more or less ready to go. We just need to create the regexes and figure out how to add the preprocess_html method to the input format plugins.
Old 08-20-2010, 02:25 AM   #15
rollercoaster
Zealot
rollercoaster once ate a cherry pie in a record 7 seconds.
Posts: 126
Karma: 1826
Join Date: Jan 2010
Device: Kindle 2
That looks like MS Word's html.

Open the file in a text editor and copy and paste the content into a new .html file such that it is mostly valid HTML.

You can then open it in Word and save as text. That has worked for me more than a few times.
All times are GMT -4. The time now is 12:41 AM.

