Old 07-17-2011, 08:37 AM   #1
scubaddictions
Member
scubaddictions began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
Find/Replace bogus line breaks in Text editor, w/Regular Expression

Hello all,

I'm trying to work within Calibre primarily but occasionally the formatting fails a bit and I try to drop back and fix the original text files manually. I'm working off some info from an older post on this forum and so this post is more about pre-Calibre editing.

Here's the thing: a number of the original texts I'm dealing with have hard coded line break/carriage returns that cause broken-up sentences in the final product. I've gathered enough information to create a functioning solution but I'm using demo software that will expire in a month. In brief, I'm trying to figure out what freeware text editor will allow me to use the same solution I've worked out on the expensive software that I don't want to purchase. I'm not truly cheap, I'm just certain that I can do this task without needing one particular commercial software.

From this forum post here:

https://www.mobileread.com/forums/showthread.php?t=47044

I learned that it was possible to do a relatively simple Find/Replace in a text editor to search for a line break followed directly by any lower-case letter, as would usually happen if you place a line break mid-sentence. I was successful using this technique in the recommended text editor (UltraEdit), but of course it costs money. I have a multitude of other free text editors and I believe I should be able to perform the same task in one of them. I have to admit that I only partially understand the syntax of the search parameters, which makes it difficult to translate it directly to another application.

First, what works: Open document in UltraEdit, pull up Replace window. Select Match Case and turn on Regular Expression, choose Perl as Expression Engine.

Find What: \r\n([a-z])

Replace With: " \1" (without the quotes) <--- There is a space before the backslash. (Space - Backslash - One)

This grabs most instances. For various reasons (capital letters, punctuation) I found that running a second pass using the inverse manages to catch almost all of the other instances, like this:

Find What: ([a-z])\r\n

Replace With: "\1 " (without the quotes) <--- There is a space after the One. (Backslash - One - Space)
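For readers who would rather script this than depend on one editor, the two passes above can be sketched in Python. This is only an illustration under assumptions: the function name `unwrap_lowercase_breaks` and the sample text are made up here, and the input is assumed to use Windows-style `\r\n` line endings. Python's `re` module is case-sensitive by default, which corresponds to the Match Case setting.

```python
import re


def unwrap_lowercase_breaks(text: str) -> str:
    """Apply the two Find/Replace passes described above to CRLF text."""
    # Pass 1: a line break directly followed by a lowercase letter
    # becomes a space followed by that letter.
    text = re.sub(r"\r\n([a-z])", r" \1", text)
    # Pass 2: a lowercase letter directly followed by a line break
    # becomes that letter followed by a space.
    text = re.sub(r"([a-z])\r\n", r"\1 ", text)
    return text


# Hypothetical sample: two unwanted breaks, the rest intended.
sample = (
    "This sentence, however\r\n"
    "is broken.\r\n"
    "He looked at\r\n"
    "James.\r\n"
    '"Done," Bill said.\r\n'
)

print(unwrap_lowercase_breaks(sample))
```

Note that the second pass also joins any intended break that happens to follow a lowercase letter (an unpunctuated heading, for example), so it is worth checking the result by eye.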

So, this works like a charm, but the demo expiration on UltraEdit (ver. 17.10.0.1010) will leave me stranded. The author of the information above also recommended a different text editor, TextPad, which I downloaded (ver. 5.4.2). In addition, I have access to Notepad++ (ver. 5.9.2), OpenOffice (ver. 3.2.1), and Windows' WordPad and Notepad. With the possible exception of OpenOffice and the built-in Windows stuff, the rest are all recent downloads and should be the newest available.

I've tried so many different versions of this syntax in the other text editors available to me, with no real success. It seems to be partly a problem with the different ways a text editor can interpret search parameters: as Normal text, as Extended characters, or as Regular Expression. Each has its own notation for a line break (^13 or ^p, \r\n, and $), and I'm reading websites that reference all of those and more. None of the other text editors accepts the exact syntax I've outlined above. Each one either erases characters that it shouldn't, pastes in characters that I don't want, or just leaves the extra line breaks intact. I think I've hit a brick wall and need help from people more experienced than I, and here I am. Can anybody help me?

Thanks!

Ryan
Old 07-17-2011, 10:06 AM   #2
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Have you tried enabling heuristics under the conversion options and enabling 'unwrap hard line breaks'? Most of that logic is hard-coded into the heuristic option.

Also, many text files have consistently formatted hard breaks with an empty line between paragraphs, indents, or some similar convention - for well formatted text files there are several different text input options to handle those formatting situations, and by default it does try to autodetect the formatting.
Old 07-19-2011, 06:32 AM   #3
scubaddictions
Member
scubaddictions began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
Quote:
Originally Posted by ldolse View Post
Have you tried enabling heuristics under the conversion options and enabling 'unwrap hard line breaks'? Most of that logic is hard-coded into the heuristic option.
Yeah, I enabled the heuristics and played around with unwrapping. Sometimes it works, other times...not so much! Occasionally I just like to do things manually and the end result of my e-book won't always be on my phone. I like having the flexibility of using a robust text editor and knowing how to fall back on it when Calibre doesn't quite get things right.

Any ideas on how to port this stuff over to another text editor?

Sorry for the delay in reply, Internet out here is kinda flaky.

Thanks!

Ryan
Old 07-19-2011, 09:46 AM   #4
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
 
Posts: 29,795
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by scubaddictions View Post
Yeah, I enabled the heuristics and played around with unwrapping. Sometimes it works, other times...not so much! Occasionally I just like to do things manually and the end result of my e-book won't always be on my phone. I like having the flexibility of using a robust text editor and knowing how to fall back on it when Calibre doesn't quite get things right.

Any ideas on how to port this stuff over to another text editor?

Sorry for the delay in reply, Internet out here is kinda flaky.

Thanks!

Ryan
Regex support is already built into many text editors used by programmers.
Notepad++ is a free one for Windows users.
Old 07-19-2011, 10:25 AM   #5
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by scubaddictions View Post
Yeah, I enabled the heuristics and played around with unwrapping. Sometimes it works, other times...not so much! Occasionally I just like to do things manually and the end result of my e-book won't always be on my phone. I like having the flexibility of using a robust text editor and knowing how to fall back on it when Calibre doesn't quite get things right.

Any ideas on how to port this stuff over to another text editor?

Sorry for the delay in reply, Internet out here is kinda flaky.

Thanks!

Ryan
You can post bugs with the non-working variants if they're consistently not unwrapping lines. Also note you might need to reduce the 'line unwrap factor' - the unwrap function looks at the median (or average, can't remember) line length, and only unwraps lines that exceed that length. That works well for a book with consistent breaks in roughly the same location for every line (OCR, pdf, many well formatted text files), but it will fail where the hard breaks are inconsistent/infrequent. Reducing the unwrap factor basically tells Calibre to look for shorter lines than the median. The fewer or more erratic the breaks the lower you need to go, sometimes all the way down to 0.05
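To make the idea of an unwrap factor concrete, here is a rough Python sketch of a length-based unwrapper. It is not Calibre's actual code: the function name `unwrap_by_length`, the default `factor` value, and the rule "join a line to the next one when it is longer than factor times the median length" are all assumptions based only on the description above.

```python
from statistics import median


def unwrap_by_length(lines: list[str], factor: float = 0.4) -> list[str]:
    """Join each sufficiently long line to the line that follows it."""
    lengths = [len(line) for line in lines if line.strip()]
    if not lengths:
        return list(lines)
    # Lines longer than this are assumed to have been hard-wrapped.
    threshold = median(lengths) * factor

    result: list[str] = []
    carry = ""  # text of wrapped lines waiting to be joined to the next line
    for line in lines:
        if not line.strip():
            # A blank line always ends the current paragraph.
            if carry:
                result.append(carry)
                carry = ""
            result.append(line)
            continue
        current = f"{carry} {line}" if carry else line
        if len(line) > threshold:
            carry = current  # long line: keep collecting
        else:
            result.append(current)  # short line: the paragraph ends here
            carry = ""
    if carry:
        result.append(carry)
    return result


wrapped = [
    "The quick brown fox jumps over the",
    "lazy dog and keeps on running far",
    "away.",
    "",
    "Short line.",
]
print(unwrap_by_length(wrapped))
```

In this sketch, lowering `factor` lowers the threshold, so shorter lines are also treated as wrapped. That matches the advice above to reduce the factor when the breaks are erratic.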

The only type of line breaks that aren't currently supported by the function are ones where the document abuses <br> tags - heuristics currently ignores those. There are also a couple of minor cases where it's conservative about unwrapping that some users have complained about, but I have a strong preference for false negatives vs. false positives.

EmEditor is another editor with regex support that I liked when I was in the Windows world. On the Mac, TextWrangler is the way to go.
Old 07-20-2011, 09:51 AM   #6
scubaddictions
Member
scubaddictions began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
Quote:
Originally Posted by theducks View Post
Regex support is already built into many text editors used by programmers.
Notepad++ is a free one for Windows users.
Yeah, I know that there are other text editors that do Regex, and I already have Notepad++. The problem is that the syntax of a Regex in one text editor doesn't work in Notepad++, or any of the other text editors I've tried. I'm almost certain that this particular Find/Replace function will work in Notepad++ if I can just get the syntax correct. Any help on that front would be appreciated.

Thanks!

Ryan
Old 07-20-2011, 09:58 AM   #7
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
 
Posts: 29,795
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by scubaddictions View Post
Any help on that front would be appreciated.

Thanks!

Ryan
Still learning to deal with that myself (I learned on Sigil).

I discovered NP++ color codes (LF) differently. Only the Darker one matches a \s+
Old 07-20-2011, 10:11 AM   #8
scubaddictions
Member
scubaddictions began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
Quote:
Originally Posted by ldolse View Post
EmEditor is another editor with regex support that I liked when I was in the Windows world. On the Mac, TextWrangler is the way to go.
I'm willing to give another text editor a try, but I wouldn't mind getting some help on the syntax. Unless I'm mistaken though EmEditor appears to be another commercial text editor and if I'm reading things right it's actually more expensive than UltraEdit! I already know that it works within UltraEdit so if anything I'd stick with that software. Most important though, this should all be standard Regex stuff. There just isn't a good reason why I shouldn't be able to perform this same function in one of the free text editors available. Plenty of them do Regex functions.

Quote:
Originally Posted by ldolse View Post
You can post bugs with the non-working variants if they're consistently not unwrapping lines. Also note you might need to reduce the 'line unwrap factor' - the unwrap function looks at the median (or average, can't remember) line length, and only unwraps lines that exceed that length. That works well for a book with consistent breaks in roughly the same location for every line (OCR, pdf, many well formatted text files), but it will fail where the hard breaks are inconsistent/infrequent. Reducing the unwrap factor basically tells Calibre to look for shorter lines than the median. The fewer or more erratic the breaks the lower you need to go, sometimes all the way down to 0.05.
Sounds like some good advice, I'll have to give it another chance on those problem files. I'd still like to have a good text editor as a fallback if needed.

Thanks!

Ryan
Old 07-20-2011, 10:14 AM   #9
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by scubaddictions View Post
Yeah, I know that there are other text editors that do Regex, and I already have Notepad++. The problem is that the syntax of a Regex in one text editor doesn't work in Notepad++, or any of the other text editors I've tried. I'm almost certain that this particular Find/Replace function will work in Notepad++ if I can just get the syntax correct. Any help on that front would be appreciated.

Thanks!

Ryan
I use UltraEdit, so I can't help a lot, but it's my understanding that Notepad++ does not support multiline regex searches. You have to use the extended mode to get multiline searches. Try replacing the CR and LF with unique character strings. Then do what you need to do with regex and finally fix it up by replacing the unique character strings.
Old 07-20-2011, 10:39 AM   #10
scubaddictions
Member
scubaddictions began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
Quote:
Originally Posted by Starson17 View Post
I use UltraEdit, so I can't help a lot, but it's my understanding that Notepad++ does not support multiline regex searches. You have to use the extended mode to get multiline searches. Try replacing the CR and LF with unique character strings. Then do what you need to do with regex and finally fix it up by replacing the unique character strings.
Hmmm, too late right now to poke around and try to learn up on that. What exactly is "multiline"? Is it just that you're looking at characters that are on two separate lines? If that's the case then I imagine you're right, Notepad++ might not work for me.

Regarding the replacement of CR and LF's...not sure how I can do that while leaving intact those line breaks that were intended by the author. The RegEx in my original post does the job I need and leaves the original line breaks intact. Brain turning mushy, time for sleep. Thanks!
Old 07-20-2011, 01:24 PM   #11
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by scubaddictions View Post
Hmmm, too late right now to poke around and try to learn up on that. What exactly is "multiline"?
It's searching for something that starts on one line and finishes on another or searching and replacing end of line characters, like CR and LF.

Quote:
Is it just that you're looking at characters that are on two separate lines? If that's the case then I imagine you're right, Notepad++ might not work for me.
I'm pretty sure it will work, but you may have to do it in more than one step.
Quote:
Regarding the replacement of CR and LF's...not sure how I can do that while leaving intact those line breaks that were intended by the author.
I'm not suggesting that you eliminate any line breaks you don't want. I'm suggesting you may want to try replacing the line breaks (which you can find with non-regex searches) with a unique character string. That avoids the multiline problem. Then you can do regex searches, then you can finish up by replacing the remaining unique strings with line breaks as needed. It's just one way to avoid having to do a regex multiline search.

I'm sure a user of Notepad++ will show up soon and give better advice on how to do this. In Ultraedit to fix text files, I often replace all double line breaks with the text string "parapara," then I replace all remaining single line breaks with a space, then go back and replace all the "parapara" strings with a single break. I was suggesting a variation of that approach for Notepad++.
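The placeholder routine described above needs no regex at all, so it translates directly into plain string replacements. The Python sketch below assumes `\r\n` line endings and a blank line between paragraphs. The function name `rejoin_paragraphs` is made up for illustration, and the check that the marker is unused is an extra safety step, not part of the original description.

```python
def rejoin_paragraphs(text: str, marker: str = "parapara") -> str:
    """Unwrap single line breaks while keeping one break per paragraph."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")
    # 1. Protect the paragraph breaks (double line breaks) with the marker.
    text = text.replace("\r\n\r\n", marker)
    # 2. Every remaining single line break becomes a space.
    text = text.replace("\r\n", " ")
    # 3. Turn each marker back into a single line break.
    return text.replace(marker, "\r\n")


wrapped = "Line one\r\nline two.\r\n\r\nNext para\r\ncontinues."
print(rejoin_paragraphs(wrapped))
```

This only helps when intended breaks are doubled. It cannot tell intended single breaks from unwanted ones.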
Old 07-20-2011, 06:58 PM   #12
scubaddictions
Member
scubaddictions began at the beginning.
 
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
Quote:
Originally Posted by Starson17 View Post
...I often replace all double line breaks with the text string "parapara," then I replace all remaining single line breaks with a space, then go back and replace all the "parapara" strings with a single break. I was suggesting a variation of that approach for Notepad++.
Ok, I understand that part. The problem I'm having though doesn't deal with double line breaks, I'm not trying to remove blank lines or put paragraphs back together. The problem texts I'm dealing with only have single line breaks. Some of them are original and required, some are not from the original text and stuck into the middle of a sentence. For example:

This is not a broken sentence. This sentence, however
is broken in the middle. I'd like to fix it if at all possible.
"Do you want broken sentences?" Bill asked.
Jimmy replied "No, I do not".

This paragraph has only single line breaks, four of them. Three of them are as the author intended, breaking up text onto different lines so it doesn't run together. One of them (after the word "however") is not what the author wanted; it was added during some later editing to fit the borders of some other format.

The Find/Replace searches from my first post fix this by finding lines that either begin with or end with a lower case letter. Seems to work near perfectly. I can't figure out any other way to Find/Replace only the unintended single line breaks.

Ideas? Thanks!
Old 07-20-2011, 07:46 PM   #13
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
 
Posts: 29,795
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by scubaddictions View Post
Ok, I understand that part. The problem I'm having though doesn't deal with double line breaks, I'm not trying to remove blank lines or put paragraphs back together. The problem texts I'm dealing with only have single line breaks. Some of them are original and required, some are not from the original text and stuck into the middle of a sentence. For example:

This is not a broken sentence. This sentence, however
is broken in the middle. I'd like to fix it if at all possible.
"Do you want broken sentences?" Bill asked.
Jimmy replied "No, I do not".

This paragraph has only single line breaks, four of them. Three of them are as the author intended, breaking up text onto different lines so it doesn't run together. One of them (after the word "however") is not what the author wanted; it was added during some later editing to fit the borders of some other format.

The Find/Replace searches from my first post fix this by finding lines that either begin with or end with a lower case letter. Seems to work near perfectly. I can't figure out any other way to Find/Replace only the unintended single line breaks.

Ideas? Thanks!
I fix these in a text editor or Sigil (code view), using a number of passes to carefully catch most of them.

Case sensitive mode: set

This replaces when a line ends in a-z or a comma and the next one starts with a-z (Replace All is fairly safe after testing):
([a-z,])</p>\s+<p class=.+>([a-z])

Replace: \1 \2

Next pass, I pick up a-z AND closing quotes
([a-z]\")</p>\s+<p class=.+>([a-z])
Replace: \1 \2
fairly safe Replace all

Now it gets iffy. I suggest using Find and choosing Replace or Skip for each match, rather than Replace All.

We are going to repeat the above, BUT with the next part beginning with a capital letter:
... he looked at</p>
<p>James and winked...

([a-z,])</p>\s+<p class=.+>([A-Z])

And now with the quote (be sure to use the same type of quote, straight or closing curly, as is used in your book):

([a-z]\")</p>\s+<p class=.+>([A-Z])


There may be a few odd ones that you will have to deal with by hand, such as a line that ends with an abbreviation or initial:

Mr.</p>
<p>Jones
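The two "fairly safe" passes above can be tried outside an editor with a short Python sketch. One deliberate change is made here and should be treated as an assumption rather than the poster's pattern: `<p class=.+>` is written as `<p[^>]*>`, so that a plain `<p>` also matches and the match cannot run past the end of the tag. The names `SAFE_PASSES` and `merge_split_paragraphs` are invented for the example.

```python
import re

# Passes that are reasonably safe to apply everywhere (case-sensitive).
SAFE_PASSES = [
    # Line ends in a-z or a comma, next paragraph starts with a-z.
    re.compile(r'([a-z,])</p>\s+<p[^>]*>([a-z])'),
    # Line ends in a-z plus a straight closing quote, next starts with a-z.
    re.compile(r'([a-z]")</p>\s+<p[^>]*>([a-z])'),
]


def merge_split_paragraphs(html: str) -> str:
    """Rejoin paragraphs that were split in the middle of a sentence."""
    for pattern in SAFE_PASSES:
        html = pattern.sub(r"\1 \2", html)
    return html


page = (
    '<p class="calibre1">This sentence, however</p>\n'
    '<p class="calibre1">is broken.</p>\n'
    '<p class="calibre1">Next one.</p>'
)
print(merge_split_paragraphs(page))
```

The capital-letter variants are left out on purpose, since the post recommends confirming those one match at a time.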
Old 07-21-2011, 05:26 AM   #14
Agama
Guru
Agama ought to be getting tired of karma fortunes by now.
 
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
Quote:
Originally Posted by scubaddictions View Post
I'm trying to work within Calibre primarily but occasionally the formatting fails a bit and I try to drop back and fix the original text files manually. I'm working off some info from an older post on this forum and so this post is more about pre-Calibre editing.
I format my plain text originals using Markdown syntax. Calibre has a Markdown option in text conversion, (TXT Input - Paragraph style: off, Formatting style: markdown), which automatically takes care of paragraphs which span multiple lines; and no regex needed.

If you do want to force a hard-break in a paragraph then you simply add 2 trailing spaces to a line.

Last edited by Agama; 07-21-2011 at 07:39 AM.
Old 07-21-2011, 08:36 AM   #15
jackie_w
Grand Sorcerer
jackie_w ought to be getting tired of karma fortunes by now.
 
Posts: 6,212
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
Quote:
Originally Posted by scubaddictions View Post
Yeah, I know that there are other text editors that do Regex, and I already have Notepad++. The problem is that the syntax of a Regex in one text editor doesn't work in Notepad++, or any of the other text editors I've tried. I'm almost certain that this particular Find/Replace function will work in Notepad++ if I can just get the syntax correct. Any help on that front would be appreciated.
Hi Ryan,
If you want a Notepad++ equivalent to your UltraEdit solution, I believe the following will work. It's a multi-pass operation.

First, choose a short text string which doesn't occur elsewhere in your text file. I'm using ~~~ in my example.
  1. Regex mode, with Match case checked
    Code:
    Find: ([a-z])$
    Replace: \1~~~
  2. Extended mode
    Code:
    Find: ~~~\r\n
    Replace: singlespace 
  3. Regex mode, with Match case checked
    Code:
    Find: ^([a-z])
    Replace: ~~~\1
  4. Extended mode
    Code:
    Find: \r\n~~~
    Replace: singlespace 

I believe Notepad++ also has a better macro system these days, so perhaps the above 4 commands can be wrapped up as a single macro. I haven't tried this, though.
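As a cross-check of the four steps, here is a minimal Python sketch that mimics them in order: the regex steps are applied one line at a time, and the Extended-mode steps are plain string replacements. The function name `four_pass_unwrap`, the marker check, and the final cleanup of leftover markers are assumptions added for the example; they are not part of the recipe above. Input is assumed to use `\r\n` line endings, and "singlespace" is taken to mean one space character.

```python
import re


def four_pass_unwrap(text: str, marker: str = "~~~") -> str:
    """Emulate the four Notepad++ steps on text with CRLF line endings."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")

    # Step 1 (regex, per line): tag lines that end in a lowercase letter.
    lines = [re.sub(r"([a-z])$", lambda m: m.group(1) + marker, line)
             for line in text.split("\r\n")]
    text = "\r\n".join(lines)
    # Step 2 (extended): marker followed by a line break becomes a space.
    text = text.replace(marker + "\r\n", " ")

    # Step 3 (regex, per line): tag lines that start with a lowercase letter.
    lines = [re.sub(r"^([a-z])", lambda m: marker + m.group(1), line)
             for line in text.split("\r\n")]
    text = "\r\n".join(lines)
    # Step 4 (extended): line break followed by the marker becomes a space.
    text = text.replace("\r\n" + marker, " ")

    # Extra cleanup (an addition): drop markers left on the first or last line.
    return text.replace(marker, "")


sample = "This sentence, however\r\nis broken.\r\nHe looked at\r\nJames.\r\n"
print(four_pass_unwrap(sample))
```

Running something like this on a copy of the file first gives a quick way to confirm the marker string really is unique before doing the same steps in the editor.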

I hope this helps you save your money.

Last edited by jackie_w; 07-21-2011 at 08:38 AM.