Removing unnecessary line breaks in books.

Wintersdark · 08-19-2010, 03:21 PM

I have a great many books that were converted from .lit to .epub to be Stanza-friendly - I read on my iPhone.

My problem is that a very large number of my books (hundreds) have line breaks scattered throughout paragraphs. Now, I convert these epubs to another format and manually edit them to remove the line breaks, but this is very impractical given the number of books.

Is there a way to have Calibre apply a regex search/replace to remove these line breaks on conversion so I could bulk-convert everything at once?

I figure that searching for "^13([a-z])" and replacing with " \1" will work often enough to make the text readable at least. There are instances where it will miss, but it's really good enough - though I'm open to ways to do it better of course.
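For reference, here is a minimal Python sketch of that search/replace. It assumes the `^13` in the pattern stands for a carriage-return line ending and that the book has already been dumped to plain text. The function name is made up for illustration.

```python
import re

def join_lowercase_continuations(text: str) -> str:
    """Replace a line break that is followed by a lowercase letter with a space."""
    # \r?\n matches both Windows (CRLF) and Unix (LF) line endings.
    # [ \t]* swallows trailing blanks before the break to avoid double spaces.
    return re.sub(r"[ \t]*\r?\n([a-z])", r" \1", text)
```

As noted, this misses wrapped lines whose continuation starts with a capital letter or a quote mark.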

Could I use the header/footer removal regexes in calibre's Convert Books settings to achieve this? Or is that removal only, not replacing?

Thanks!!

Derrick

kovidgoyal · 08-19-2010, 03:29 PM

There's no way to do this, short of writing a plugin to modify how calibre processes the files. Or use the command line tools: output to txt, run your regex, and convert to the final format via a script.
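A rough sketch of what such a script could look like in Python, for anyone who wants a starting point. It assumes calibre's `ebook-convert` command is on the PATH and that the intermediate TXT file is UTF-8 encoded. The function name and the `clean` and `run` parameters are illustrative, not part of calibre.

```python
import subprocess
from pathlib import Path

def convert_with_cleanup(src, dst, clean, run=subprocess.run):
    """Convert src -> intermediate .txt -> clean(text) -> dst via ebook-convert."""
    src, dst = Path(src), Path(dst)
    txt = dst.with_suffix(".txt")  # intermediate file next to the output
    # ebook-convert infers the input and output formats from the file extensions.
    run(["ebook-convert", str(src), str(txt)], check=True)
    # Assumption: the TXT output is UTF-8; adjust if your setup differs.
    text = txt.read_text(encoding="utf-8")
    txt.write_text(clean(text), encoding="utf-8")
    run(["ebook-convert", str(txt), str(dst)], check=True)
    return dst
```

Calling `convert_with_cleanup("book.lit", "book.epub", my_regex_fix)` would run both conversions with the regex applied in between, where `my_regex_fix` is whatever text-cleaning function you supply. Wrapping the call in a loop over a folder gives the bulk conversion.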

Wintersdark · 08-19-2010, 03:34 PM

Oh, dear. Right.

Ok, I extracted a sample epub, and it seems it's decided that each line is a paragraph. So, a sample of the raw text is:

Code:

      <p class="MsoPlainText">He stared at the warm blackness, half closing his eye, then opening it again, </p>
      <p class="MsoPlainText">wide. Over on his left, in front, was a narrow smear of murky light in the air, </p>
      <p class="MsoPlainText">which at first he could make no sense of. The light danced, a flickering glow. </p>
      <p class="MsoPlainText">Then gradually he began to sort out details of the room.</p>
      <p class="MsoPlainText">Or half room. It was big, high ceilinged. There was no furniture, but the floor </p>
      <p class="MsoPlainText">was carpeted. Across the room, from wall to wall, hung some kind of thick </p>
      <p class="MsoPlainText">curtain. Two curtains, actually, pulled together. Hence that chink of light in </p>
      <p class="MsoPlainText">the center where the inner folds of the two draperies didn't quite meet.</p>

This makes things much more difficult. Still, removing all the paragraph breaks except where the </p> immediately follows a " or . would do the job. It's not perfect, as there'd still be a few paragraph breaks where there shouldn't be, but at least conversation would be split up nicely and there wouldn't be paragraph breaks mid-sentence.
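That rule can be sketched in Python roughly as follows. This is a regex applied to the raw HTML string, not a real HTML parser, so it assumes the markup looks like the sample above, with one plain `<p ...>` per line. The function name is hypothetical.

```python
import re

# A closing </p> whose last visible character is NOT a full stop or a double
# quote, followed by the next opening <p ...> tag: treat it as a false break.
FALSE_BREAK = re.compile(r'([^\s."])\s*</p>\s*<p\b[^>]*>')

def merge_split_paragraphs(html: str) -> str:
    """Join paragraphs that end mid-sentence into the paragraph that follows."""
    # Keep the last character of the line and replace the tag pair with a space.
    return FALSE_BREAK.sub(r"\1 ", html)
```

On the sample, the first three lines collapse into one paragraph ending at "glow.", which shows the remaining weakness: a line that happens to end on a full stop still gets its own paragraph.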

Any advice?

Wintersdark · 08-19-2010, 03:43 PM

Quote:

Originally Posted by kovidgoyal

There's no way to do this, short of writing a plugin to modify how calibre processes the files. Or use the command line tools: output to txt, run your regex, and convert to the final format via a script.

Alas. Is there any chance that there's a somewhat central library of plugins anywhere, in case someone has already made one to do this? Writing my own is somewhat beyond me. It seems like it would be a fairly common problem, as it occurs in pretty much every lit -> epub conversion.

Edit: Or, failing that, at least to help clean up the lit>epub conversion in the future?

kovidgoyal · 08-19-2010, 03:45 PM

No it doesn't occur in every lit to epub. It only occurs if your original lit file has text formatted like that. And no, as far as I'm aware no one has written such a plugin.

Wintersdark · 08-19-2010, 03:47 PM

Quote:

Originally Posted by kovidgoyal

No it doesn't occur in every lit to epub. It only occurs if your original lit file has text formatted like that. And no, as far as I'm aware no one has written such a plugin.

Ok, thanks for your help!

chaley · 08-19-2010, 03:52 PM

Try converting the lit to txt, then the txt to epub. Calibre's txt input plugin knows how to construct paragraphs from long lines or sets of shorter lines.

kovidgoyal · 08-19-2010, 03:53 PM

Quote:

Originally Posted by chaley

Try converting the lit to txt, then the txt to epub. Calibre's txt input plugin knows how to construct paragraphs from long lines or sets of shorter lines.

Doing that will mean he'll lose all character formatting (italics, bold, etc). IIRC the TXT output plugin doesn't preserve those.

Wintersdark · 08-19-2010, 05:43 PM

Quote:

Originally Posted by kovidgoyal

Doing that will mean he'll lose all character formatting (italics, bold, etc). IIRC the TXT output plugin doesn't preserve those.

This isn't a problem. While character formatting is nice, having readable text at all is nicer.

You're right, it's not all .lit -> .epub, I'd (incorrectly) assumed that as several books I checked all suffered the same problem. Further investigation shows that's not the case - good news!

I tried converting to text and back, but the way it's formatted I basically get each paragraph followed by a pair of CR/LF's. So, converting directly back to epub doesn't help.

However, as it's not every book, I'm just addressing it on a case by case basis with Notepad++ as I go. If I were still running linux, I'd mass convert them all to text and figure out how to script applying the regex replace to them, but I have no idea of how to go about that in windows.

Unfortunately with Notepad++ you cannot use \r\n in regexes (who knows why), but you *can* use "extended" searches to replace all the CRLF pairs with a unique identifier (I used QQQQ), then replace every .QQQQ with .\r\n\r\n and every "QQQQ with "\r\n\r\n, then replace all remaining QQQQ's with spaces. It's sort of a pain in the ass to have to do it one at a time, but it works at least.
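The same three passes can be written as a few lines of Python, in case that helps anyone script it. This sketch assumes each wrapped line ends in a single CRLF (pass `"\r\n\r\n"` as `linebreak` if your text output doubles them) and that QQQQ never appears in the book itself. The function name is made up.

```python
PLACEHOLDER = "QQQQ"  # any token that never occurs in the book's text

def unwrap_with_placeholder(text: str, linebreak: str = "\r\n") -> str:
    """Three-pass unwrap: mark breaks, restore real ones, blank out the rest."""
    # Pass 1: mark every line break with the placeholder.
    text = text.replace(linebreak, PLACEHOLDER)
    # Pass 2: a break right after a full stop or a double quote is treated as
    # a real paragraph end and becomes a blank line.
    for closer in (".", '"'):
        text = text.replace(closer + PLACEHOLDER, closer + "\r\n\r\n")
    # Pass 3: every placeholder still left was a mid-sentence break.
    return text.replace(PLACEHOLDER, " ")
```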

If anyone knows a better tool to do this with - one that can macro the operations; or apply a regex directly, or better yet be applied in bulk, in windows, with a minimum of hassle for one not used to dealing with these things, I'd love to hear about it. But, even if not, this does work.
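One possible way to do this in bulk on Windows is a short Python script run over a folder of exported .txt files. The sketch below is only that: the folder layout, the function name and the UTF-8 encoding are all assumptions. It writes cleaned copies to a separate folder and leaves the originals untouched.

```python
import re
from pathlib import Path

# One or more line breaks whose preceding visible character is NOT a full
# stop or a double quote: treated as a mid-sentence wrap.
MID_SENTENCE_BREAK = re.compile(r'([^."\s])[ \t]*(?:\r?\n)+')

def unwrap_folder(src_dir, out_dir, encoding="utf-8"):
    """Write an unwrapped copy of every .txt file in src_dir into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).glob("*.txt")):
        # Work on bytes so the original CRLF line endings are preserved.
        text = path.read_bytes().decode(encoding)
        fixed = MID_SENTENCE_BREAK.sub(r"\1 ", text)
        (out / path.name).write_bytes(fixed.encode(encoding))
        written.append(path.name)
    return written
```

Something like `unwrap_folder(r"C:\books\txt", r"C:\books\fixed")` would then process the whole batch in one go; both paths are placeholders.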

As a feature request for Calibre I'd definitely like to see, for this and other formatting issues, the ability to apply a regex directly in the conversion options (or some such easily accessible place). It would really help people cleaning up poor source material when converting to their ereader format of choice.

chaley · 08-19-2010, 05:49 PM

Quote:

Originally Posted by Wintersdark

If I were still running linux, I'd mass convert them all to text and figure out how to script applying the regex replace to them, but I have no idea of how to go about that in windows.

Two suggestions:

1) Use the cygwin tools. You can get windows versions of all the standard Unix stuff such as bash, sed, egrep, etc. Couple that with vim (windows native) and you should have what you need.

2) Run linux in a VM under windows, and do the work there. I do this frequently with virtualbox, but I am sure that the other VM managers work fine as well.

ldolse · 08-19-2010, 11:26 PM

This was done for pdf a while back, via the 'Preprocess input file to possibly improve structure detection' option. I just double-checked, and it does still appear to be working well in the latest version of Calibre. Is that function still in preprocess.py?

The regular expressions used for pdf don't apply to Lit/Txt files, but couldn't new regexes be defined for each input format, and then automatically used based on the input format when that box is checked? I've been running into more Lit/Txt files with this problem lately, so I could see if I could get it working if this is a good approach.

There used to be an option to tweak the average line length detection logic, but that appears to be removed (from the GUI at least).
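To illustrate the general idea of line-length detection (this is not calibre's actual code, just a sketch of the technique under my own assumptions): treat any line close to the typical line length as hard-wrapped and join the next line onto it, while a noticeably shorter line ends its paragraph. The function name and the `factor` parameter are hypothetical.

```python
from statistics import median

def unwrap_by_line_length(text: str, factor: float = 0.8) -> str:
    """Join a line onto the previous one when the previous one looks full-width."""
    lines = [line.strip() for line in text.splitlines()]
    lengths = [len(line) for line in lines if line]
    if not lengths:
        return text
    # Lines at least this long are assumed to have been cut by a hard wrap.
    threshold = factor * median(lengths)
    out = []
    join_next = False
    for line in lines:
        if join_next and line:
            out[-1] += " " + line
        else:
            out.append(line)
        # A short or empty line ends the paragraph; a long one continues it.
        join_next = bool(line) and len(line) >= threshold
    return "\n".join(out)
```

The `factor` value plays the role of the tweakable setting: lower it and more lines get joined, raise it and fewer do.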

Wintersdark · 08-19-2010, 11:43 PM

That would be fantastic. While this does not affect every lit file I have, it certainly does with many and it's quite tedious to manually edit each file.

capidamonte · 08-20-2010, 12:00 AM

Could the text output have a "convert to Markdown" option?

ldolse · 08-20-2010, 01:40 AM

Ah - I'd forgotten this discussion had already come up here:
https://www.mobileread.com/forums/sho...=preprocess.py
http://bugs.calibre-ebook.com/ticket/2359

I believe the 'Preprocess input file to possibly improve structure detection' option is a result of that feature/bugfix. As I recall I started looking into how to implement the new function in other input plugins and shortly thereafter took a break from participating in Calibre/Mobileread. Looks like no one else has picked it up, but the logic is more or less ready to go. Just need to create the regexes and figure out how to add the preprocess_html method to the input format plugins.

rollercoaster · 08-20-2010, 02:25 AM

That looks like MS Word's html.

Open the file in a text editor and copy and paste the content into a new .html file such that it is mostly valid html.

You can then open it in Word and save as text. That has worked for me more than a few times.
