#1 |
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2009
Device: iPhone 3G
|
Removing unnecessary line breaks in books.
I have a great many books that were converted from .lit's to .epubs to be Stanza friendly - I read on my iPhone.
My problem is that a very large number of my books (hundreds) have line breaks scattered throughout paragraphs. Right now I convert these epubs to another format and manually edit them to remove the line breaks, but this is very impractical given the number of books.
Is there a way to have Calibre apply a regex search/replace to remove these line breaks on conversion, so I could bulk-convert everything at once? I figure that searching for "^13([a-z])" and replacing with " \1" will work often enough to make the text readable at least. There are instances where it will miss, but it's really good enough, though I'm open to ways to do it better of course.
Could I use the header/footer removal regexes in calibre's Convert Books settings to achieve this? Or is that removal only, not replacing?
Thanks!!
Derrick
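For illustration, the search/replace idea in the post above can be sketched in Python. This is only a rough sketch under one assumption: a line break followed directly by a lowercase letter is a stray mid-paragraph break. The function name `join_stray_breaks` is made up for this example, and it also swallows trailing spaces before the break so the join does not leave a double space.

```python
import re

# Assumption: a line break followed directly by a lowercase letter is a
# stray break inside a paragraph. Trailing spaces before the break are
# consumed so the join leaves a single space.
STRAY_BREAK = re.compile(r"[ \t]*\r?\n([a-z])")


def join_stray_breaks(text: str) -> str:
    """Replace each stray break with a space, keeping the letter after it."""
    return STRAY_BREAK.sub(r" \1", text)


sample = "He stared at the warm blackness,\nhalf closing his eye.\nThen he slept."
print(join_stray_breaks(sample))
```

As the post notes, a rule like this will miss some cases, for example a wrapped line that continues with a capitalised name.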
#2 |
creator of calibre
Posts: 45,156
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
There's no way to do this, short of writing a plugin to modify how calibre processes the files. Alternatively, use the command line tools: output to txt, run your regex, and convert to the final format via a script.
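A minimal sketch of such a script, assuming calibre's `ebook-convert` command is on the PATH and that the text output is UTF-8. The helper names (`convert_cmd`, `fix_text`, `reflow_book`), the `_fixed` suffix, and the cleanup rule itself are hypothetical and only here for illustration.

```python
import re
import subprocess
from pathlib import Path


def convert_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ebook-convert call; the output format follows dst's extension."""
    return ["ebook-convert", str(src), str(dst)]


def fix_text(text: str) -> str:
    # Hypothetical cleanup rule: join a break that is followed by a
    # lowercase letter. Swap in whatever regex suits the books at hand.
    return re.sub(r"[ \t]*\r?\n([a-z])", r" \1", text)


def reflow_book(src: Path, out_format: str = "epub") -> Path:
    """Convert src to txt, apply the regex, convert the txt to out_format."""
    txt = src.with_suffix(".txt")
    dst = src.with_name(src.stem + "_fixed." + out_format)
    subprocess.run(convert_cmd(src, txt), check=True)
    # Assumption: the txt file is UTF-8 encoded.
    txt.write_text(fix_text(txt.read_text(encoding="utf-8")), encoding="utf-8")
    subprocess.run(convert_cmd(txt, dst), check=True)
    return dst


# Example call (needs calibre installed):
# reflow_book(Path("book.lit"))
```

Looping `reflow_book` over a folder of files would give the bulk conversion asked about in the first post.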
|
#3 |
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2009
Device: iPhone 3G
|
Oh, dear. Right.
Ok, I extracted a sample epub, and it seems it's decided that each line is a paragraph. So, a sample of the raw text is:
Code:
<p class="MsoPlainText">He stared at the warm blackness, half closing his eye, then opening it again, </p>
<p class="MsoPlainText">wide. Over on his left, in front, was a narrow smear of murky light in the air, </p>
<p class="MsoPlainText">which at first he could make no sense of. The light danced, a flickering glow. </p>
<p class="MsoPlainText">Then gradually he began to sort out details of the room.</p>
<p class="MsoPlainText">Or half room. It was big, high ceilinged. There was no furniture, but the floor </p>
<p class="MsoPlainText">was carpeted. Across the room, from wall to wall, hung some kind of thick </p>
<p class="MsoPlainText">curtain. Two curtains, actually, pulled together. Hence that chink of light in </p>
<p class="MsoPlainText">the center where the inner folds of the two draperies didn't quite meet.</p>
Any advice?
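One possible approach for markup like this, sketched in Python. It rests on an assumption drawn only from the sample above: wrapped lines leave a space before `</p>`, while real paragraph ends close with no trailing space. Other books may not follow that pattern, and the function name `merge_wrapped_paragraphs` is made up for this example.

```python
import re

# Assumption (from the sample only): whitespace right before </p> marks a
# wrapped line, so that </p> and the following <p ...> can be replaced by
# a single space. A </p> with no whitespace before it is left alone.
WRAPPED_BREAK = re.compile(r'\s+</p>\s*<p class="MsoPlainText">')


def merge_wrapped_paragraphs(html: str) -> str:
    """Join paragraphs that were split at a wrapped line."""
    return WRAPPED_BREAK.sub(" ", html)


sample = (
    '<p class="MsoPlainText">Then gradually he began to sort out details of the room.</p>\n'
    '<p class="MsoPlainText">Or half room. It was big, high ceilinged. There was no furniture, but the floor </p>\n'
    '<p class="MsoPlainText">was carpeted. Across the room, from wall to wall, hung some kind of thick </p>\n'
    '<p class="MsoPlainText">curtain.</p>'
)
print(merge_wrapped_paragraphs(sample))
```

Because this works on the markup directly, inline tags such as italics inside the paragraphs are left untouched.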
#4 | |
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2009
Device: iPhone 3G
|
Quote:
Edit: Or, failing that, at least to help clean up the lit>epub conversion in the future?
Last edited by Wintersdark; 08-19-2010 at 03:46 PM.
|
#5 |
creator of calibre
Posts: 45,156
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
No, it doesn't occur in every lit to epub conversion. It only occurs if your original lit file has text formatted like that. And no, as far as I'm aware, no one has written such a plugin.
|
#6 |
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2009
Device: iPhone 3G
|
|
#7 |
Grand Sorcerer
Posts: 12,321
Karma: 7975240
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Try converting the lit to txt, then the txt to epub. Calibre's txt input plugin knows how to construct paragraphs from long lines or sets of shorter lines.
|
#8 |
creator of calibre
Posts: 45,156
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Doing that will mean he'll lose all character formatting (italics, bold, etc). IIRC the TXT output plugin doesn't preserve those.
|
#9 | |
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2009
Device: iPhone 3G
|
Quote:
You're right, it's not all .lit -> .epub. I'd (incorrectly) assumed that, as several books I checked all suffered the same problem. Further investigation shows that's not the case - good news!
I tried converting to text and back, but the way it's formatted I basically get each paragraph followed by a pair of CR/LFs. So, converting directly back to epub doesn't help. However, as it's not every book, I'm just addressing it on a case by case basis with Notepad++ as I go. If I were still running Linux, I'd mass convert them all to text and figure out how to script applying the regex replace to them, but I have no idea how to go about that in Windows.
Unfortunately, with Notepad++ you cannot use \r\n in regular expressions (who knows why), but you *can* replace (with "extended" searches) all the CRLF pairs with a unique identifier (I used QQQQ), then simply replace all .QQQQ and "QQQQ with \r\n\r\n, then all remaining QQQQs with spaces. It's sort of a pain in the ass to have to do it one at a time, but it works at least.
If anyone knows a better tool to do this with - one that can macro the operations, or apply a regex directly, or better yet be applied in bulk, in Windows, with a minimum of hassle for someone not used to dealing with these things - I'd love to hear about it. But, even if not, this does work.
As a feature request for Calibre, I'd definitely like to see, for this and other formatting issues, the ability to apply a regex directly in the conversion options (or some such easily accessible place). It would really help people cleaning up poor source material when converting to their ereader format of choice.
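The placeholder steps described above can be scripted in Python, which runs on Windows as well. This is a sketch under a few assumptions: every line in the text file is followed by exactly two CR/LFs, the files are UTF-8, the period or quote before a real paragraph break is kept, and the marker QQQQ never occurs in the book itself. The function names are made up for this example.

```python
from pathlib import Path

MARK = "QQQQ"        # placeholder from the post; must not occur in the text
BREAK = "\r\n\r\n"   # assumption: each line is followed by a pair of CR/LFs


def rejoin_paragraphs(text: str) -> str:
    """Apply the three replacement passes described in the post."""
    text = text.replace(BREAK, MARK)
    # A break after a period or a closing quote is treated as a real
    # paragraph end and restored; the punctuation itself is kept.
    for closer in (".", '"'):
        text = text.replace(closer + MARK, closer + BREAK)
    # Every remaining marker was a mid-paragraph break.
    return text.replace(MARK, " ")


def fix_folder(folder: Path) -> None:
    """Rewrite every .txt file in folder in place (assumes UTF-8 files)."""
    for path in folder.glob("*.txt"):
        # Bytes are read and written so the CR/LF pairs are not translated.
        raw = path.read_bytes().decode("utf-8")
        path.write_bytes(rejoin_paragraphs(raw).encode("utf-8"))


# Example call on a folder of converted text files:
# fix_folder(Path("converted_txt"))
```

A wrapped line that happens to end in a period is still treated as a paragraph end here, which is the same limitation as the manual Notepad++ procedure.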
|
#10 | |
Grand Sorcerer
Posts: 12,321
Karma: 7975240
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
1) Use the Cygwin tools. You can get Windows versions of all the standard Unix stuff such as bash, sed, egrep, etc. Couple that with vim (Windows native) and you should have what you need.
2) Run Linux in a VM under Windows, and do the work there. I do this frequently with VirtualBox, but I am sure that the other VM managers work fine as well.
|
#11 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
This was done for PDF a while back, with the 'Preprocess input file to possibly improve structure detection' option. I just double-checked, and it does still appear to be working well in the latest version of Calibre. Is that function still in preprocess.py?
The regular expressions used for PDF don't apply to Lit/Txt files, but couldn't new regexes be defined for each input format, and then automatically used based on the input format when that box is checked? I've been running into more Lit/Txt files with this problem lately, so I could see if I could get it working if this is a good approach.
There used to be an option to tweak the average line length detection logic, but that appears to have been removed (from the GUI at least).
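To make the idea concrete, here is a rough Python sketch of a per-format rule table that is consulted only when the option is switched on. Everything in it is hypothetical: the table `UNWRAP_RULES`, the function `preprocess`, the `enabled` flag, and the patterns are illustrative and are not taken from calibre's preprocess.py.

```python
import re

# Hypothetical rule table keyed by input format. The patterns are
# illustrative only and are NOT calibre's actual preprocessing regexes.
UNWRAP_RULES = {
    # Markup input: a space before </p> is assumed to mark a wrapped line.
    "lit": [(re.compile(r"\s+</p>\s*<p[^>]*>"), " ")],
    # Plain text: join a break that does not follow sentence-ending
    # punctuation and is followed by a lowercase letter.
    "txt": [(re.compile(r'(?<![.!?"])\r?\n(?=[a-z])'), " ")],
}


def preprocess(markup: str, input_format: str, enabled: bool = True) -> str:
    """Apply the rules registered for input_format when the option is on."""
    if not enabled:
        return markup
    for pattern, replacement in UNWRAP_RULES.get(input_format.lower(), []):
        markup = pattern.sub(replacement, markup)
    return markup


print(preprocess("He stared at the warm blackness,\nhalf closing his eye.", "txt"))
```

Formats with no entry in the table pass through unchanged, so new rules could be added one format at a time.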
#12 |
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2009
Device: iPhone 3G
|
That would be fantastic. While this does not affect every lit file I have, it certainly affects many of them, and it's quite tedious to manually edit each file.
|
#13 |
Not who you think I am...
Posts: 374
Karma: 30283
Join Date: Jan 2010
Location: Honolulu
Device: PocketBook 360 -- Ivory
|
Could the text output have a "convert to Markdown" option?
|
#14 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Ah - I'd forgotten this discussion had already come up here:
https://www.mobileread.com/forums/sho...=preprocess.py
http://bugs.calibre-ebook.com/ticket/2359
I believe the 'Preprocess input file to possibly improve structure detection' option is a result of that feature/bugfix. As I recall, I started looking into how to implement the new function in other input plugins and shortly thereafter took a break from participating in Calibre/Mobileread. Looks like no one else has picked it up, but the logic is more or less ready to go. We just need to create the regexes and figure out how to add the preprocess_html method to the input format plugins.
#15 |
Zealot
Posts: 126
Karma: 1826
Join Date: Jan 2010
Device: Kindle 2
|
That looks like MS Word's html.
Open the file in a text editor and copy and paste the content into a new .html file such that it is mostly valid html. You can then open it in Word and save as text. That has worked for me more than a few times.