#1 |
Member
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
|
Find/Replace bogus line breaks in Text editor, w/Regular Expression
Hello all,
I'm trying to work within Calibre primarily, but occasionally the formatting fails a bit and I drop back and fix the original text files manually. I'm working off some info from an older post on this forum, so this post is more about pre-Calibre editing.

Here's the thing: a number of the original texts I'm dealing with have hard-coded line breaks/carriage returns that cause broken-up sentences in the final product. I've gathered enough information to create a functioning solution, but I'm using demo software that will expire in a month. In brief, I'm trying to figure out which freeware text editor will let me use the same solution I've worked out in the expensive software that I don't want to purchase. I'm not truly cheap, I'm just certain that I can do this task without needing one particular commercial program.

From this forum post: https://www.mobileread.com/forums/showthread.php?t=47044 I learned that it is possible to do a relatively simple Find/Replace in a text editor to search for a line break followed directly by any lower-case letter, as would usually happen if you place a line break mid-sentence. I was successful using this technique in the recommended text editor (UltraEdit), but of course it costs money. I have a multitude of other free text editors and I believe I should be able to perform the same task in one of them. I have to admit that I only partially understand the syntax of the search parameters, which makes it difficult to translate it directly to another application.

First, what works: open the document in UltraEdit and pull up the Replace window. Select Match Case, turn on Regular Expressions, and choose Perl as the expression engine.
Find What: \r\n([a-z])
Replace With: \1 <---There is a space before the One. (Space - Backslash - One)
This grabs most instances.

For various reasons (capital letters, punctuation) I found that running a second pass using the inverse manages to catch almost all of the other instances, like this:
Find What: ([a-z])\r\n
Replace With: \1 <---There is a space after the One. (Backslash - One - Space)

So, this works like a charm, but the demo expiration on UltraEdit (ver. 17.10.0.1010) will leave me stranded. The author of the information above also recommended a different text editor, TextPad, which I downloaded (ver. 5.4.2). In addition, I have access to Notepad++ (ver. 5.9.2) and Open Office (ver. 3.2.1), along with Windows' WordPad and Notepad. With the possible exception of Open Office and the built-in Windows programs, these are all recent downloads and should be the newest available.

I've tried many different versions of this syntax in the other text editors available to me, with no real success. Part of the problem seems to be the different ways a text editor can interpret search parameters: as normal text, as extended characters, or as a regular expression. Each has its own notation for a line break (^13 or ^p, \r\n, and $), and I'm reading websites that reference all of those and more. None of the other text editors accepts the exact syntax I've outlined above. Each one either erases characters that it shouldn't, pastes in characters that I don't want, or just leaves the extra line breaks intact.

I think I've hit a brick wall and need help from people more experienced than I am, and here I am. Can anybody help me?

Thanks! Ryan
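For anyone who would rather script this than depend on a particular editor, the two passes above can be sketched in Python. This is only an illustration under a few assumptions: the function name `join_broken_lines` is made up for the example, and the pattern uses `\r?\n` rather than `\r\n` so that it also copes with Unix-style line endings, which is a small deviation from the UltraEdit searches.

```python
import re


def join_broken_lines(text: str) -> str:
    """Apply the two Find/Replace passes described in the post.

    Hypothetical helper; the name and the optional \\r are assumptions.
    """
    # Pass 1: a line break followed directly by a lower-case letter
    # is replaced by a space plus that letter (Space - Backslash - One).
    text = re.sub(r"\r?\n([a-z])", r" \1", text)
    # Pass 2: a lower-case letter followed directly by a line break
    # is replaced by that letter plus a space (Backslash - One - Space).
    text = re.sub(r"([a-z])\r?\n", r"\1 ", text)
    return text


if __name__ == "__main__":
    sample = "This sentence, however\r\nis broken in the middle.\r\nNext sentence."
    print(join_broken_lines(sample))
```

Python's `re` matching is case sensitive by default, which corresponds to ticking Match Case. Note that pass 2 joins any line ending in a lower-case letter to whatever follows it, so unpunctuated headings would be merged as well; checking the result on a copy of the file first seems prudent.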
#2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Have you tried enabling heuristics under the conversion options and enabling 'unwrap hard line breaks'? Most of that logic is hard-coded into the heuristic option.
Also, many text files have consistently formatted hard breaks, with an empty line between paragraphs, indents, or some similar convention. For well-formatted text files there are several different text input options to handle those situations, and by default it tries to autodetect the formatting.
#3 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
|
Quote:
Any ideas on how to port this stuff over to another text editor? Sorry for the delay in replying, the Internet out here is kinda flaky.

Thanks! Ryan
|
#4 | |
Well trained by Cats
Posts: 30,889
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Notepad++ is a free one for Windows users. |
|
#5 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
The only line breaks that aren't currently supported by the function are ones where the document abuses <br> tags; heuristics currently ignores those. There are also a couple of minor cases where it's conservative about unwrapping that some users have complained about, but I have a strong preference for false negatives over false positives.

EmEditor is another editor with regex support that I liked when I was in the Windows world. On the Mac, TextWrangler is the way to go.
|
#6 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
|
Quote:
Thanks! Ryan |
|
#7 |
Well trained by Cats
Posts: 30,889
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
|
#8 | ||
Member
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
|
Quote:
Quote:
Thanks! Ryan |
||
#9 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
#10 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
|
Quote:
Regarding the replacement of CR and LF's...not sure how I can do that while leaving intact those line breaks that were intended by the author. The RegEx in my original post does the job I need and leaves the original line breaks intact. Brain turning mushy, time for sleep. Thanks! |
|
#11 | |||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Quote:
Quote:
I'm sure a user of Notepad++ will show up soon and give better advice on how to do this. In UltraEdit, to fix text files, I often replace all double line breaks with the text string "parapara", then replace all remaining single line breaks with a space, then go back and replace all the "parapara" strings with a single break. I was suggesting a variation of that approach for Notepad++.
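The placeholder approach described here does not depend on any particular editor, so a rough Python sketch may help show the order of the three replacements. The function name `unwrap_with_placeholder` is invented for this example, and normalising `\r\n` to `\n` first is an assumption added to keep the sketch short; the post itself only describes the three replace steps.

```python
def unwrap_with_placeholder(text: str, placeholder: str = "parapara") -> str:
    """Unwrap single line breaks while keeping paragraph breaks.

    Hypothetical helper illustrating the three-step replace from the post.
    """
    if placeholder in text:
        # The trick only works if the marker is not already in the text.
        raise ValueError("placeholder already occurs in the text")
    # Assumption: treat Windows and Unix line endings alike.
    text = text.replace("\r\n", "\n")
    # Step 1: protect double line breaks (paragraph boundaries).
    text = text.replace("\n\n", placeholder)
    # Step 2: every remaining single line break becomes a space.
    text = text.replace("\n", " ")
    # Step 3: restore each protected boundary as a single line break.
    return text.replace(placeholder, "\n")


if __name__ == "__main__":
    sample = "First line\nwrapped here.\n\nSecond paragraph\nalso wrapped."
    print(unwrap_with_placeholder(sample))
```

This only suits files where paragraphs are separated by an empty line. A file that uses single line breaks for both wrapping and paragraph ends would be flattened into one long line.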
|||
#12 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2011
Device: Smartphone
|
Quote:
This is not a broken sentence. This sentence, however is broken in the middle. I'd like to fix it if at all possible. "Do you want broken sentences?" Bill asked. Jimmy replied "No, I do not".

This paragraph has only single line breaks, four of them. Three of them are as the author intended, breaking up text onto different lines so it doesn't run together. One of them (after the word "however") is not what the author wanted; it was added during some later editing to fit the borders of some other format.

The Find/Replace searches from my first post fix this by finding lines that either begin or end with a lower-case letter. That seems to work near perfectly. I can't figure out any other way to Find/Replace only the unintended single line breaks. Ideas?

Thanks!
|
#13 | |
Well trained by Cats
Posts: 30,889
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Case-sensitive mode. The first set replaces where a line ends in a-z or a comma and the next starts with a-z (Replace All is fairly safe after testing):
Find: ([a-z,])</p>\s+<p class=.+>([a-z])
Replace: \1 \2

Next pass, I pick up a-z AND closing quotes (fairly safe to Replace All):
Find: ([a-z]\")</p>\s+<p class=.+>([a-z])
Replace: \1 \2

Now it gets iffy. I suggest using Find and then choosing Replace or Skip for each match, rather than Replace All. We are going to repeat the above, BUT with the next part beginning with a capital letter:
... he looked at</p> <p>James and winked...
Find: ([a-z,])</p>\s+<p class=.+>([A-Z])
and now with the quote (be sure to use the type of quote, straight or closing curly, that is used within your book):
Find: ([a-z]\")</p>\s+<p class=.+>([A-Z])

There may be a few odd ones that you will have to deal with by hand, such as a line that ends with an abbreviation or initial:
Mr.</p> <p>Jones
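As an illustration only, the two "fairly safe" passes could be scripted in Python along these lines. The function name `join_split_paragraphs` is hypothetical, and the sketch uses `[^>]*` where the post has `.+`; that swap is mine, made so the match cannot run past the end of the opening `<p ...>` tag when a line contains further tags. The capital-letter passes are left out because the post recommends reviewing those matches one at a time.

```python
import re

# Pass 1: paragraph ends in a-z or a comma, next paragraph starts with a-z.
# Assumption: [^>]* replaces the post's .+ to stay inside the <p ...> tag.
PASS_LOWER = re.compile(r'([a-z,])</p>\s+<p class=[^>]*>([a-z])')
# Pass 2: paragraph ends in a-z plus a straight closing quote.
PASS_QUOTE = re.compile(r'([a-z]")</p>\s+<p class=[^>]*>([a-z])')


def join_split_paragraphs(html: str) -> str:
    """Merge paragraphs that were split mid-sentence (hypothetical helper)."""
    html = PASS_LOWER.sub(r"\1 \2", html)
    html = PASS_QUOTE.sub(r"\1 \2", html)
    return html


if __name__ == "__main__":
    sample = '<p class="c1">He looked at</p>\n<p class="c1">the door.</p>'
    print(join_split_paragraphs(sample))
```

Like the searches in the post, these patterns require a `class` attribute on the second `<p>` tag, so a bare `<p>` would need its own pattern. A book that uses curly quotes would need the quote character in the second pattern changed to match.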
|
#14 | |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|
Quote:
If you do want to force a hard-break in a paragraph then you simply add 2 trailing spaces to a line. Last edited by Agama; 07-21-2011 at 07:39 AM. |
|
#15 | |
Grand Sorcerer
Posts: 6,247
Karma: 16539642
Join Date: Sep 2009
Location: UK
Device: ClaraHD, Forma, Libra2, Clara2E, LibraCol, PBTouchHD3
|
Quote:
If you want a Notepad++ equivalent to your UltraEdit solution, I believe the following will work. It's a multi-pass operation. First, choose a short text string which doesn't occur elsewhere in your text file. I'm using ~~~ in my example.
I believe Notepad++ also has a better macro system these days, so perhaps the above 4 commands can be wrapped up as a single macro. I haven't tried this, though. I hope this helps you save your money.

Last edited by jackie_w; 07-21-2011 at 08:38 AM.
|