reformatting: text with unwanted linebreaks

tscamera · 12-20-2010, 05:20 PM

what i am trying to do:
merging two lines of code where the first line does not end with . ! ? " etc.
in other words, merging lines that were broken in the middle back into one complete sentence.

what i did:
<p class="calibre2">The template line is this</p>
<p class="calibre2">little sentence</p>
using a find/replace with regex:
[a-zA-Z0-9]</p> -will find: s</p>
but the </p> is the only part i need to delete.

request 1:
how can i truncate the search result?
please help with the complete regex formula to find the "</p>" within the primary search result "s</p>"
(grouping, lookahead, lookbehind, atomic group...???)

request 2:
once that is done, how can i get access to the beginning of the second line, <p class="calibre2">,
which also needs to be deleted to join both lines into one?
[a-zA-Z0-9]</p> <p class="calibre2"> doesn't help.
searching for <p class="calibre2"> alone won't help either, because it's not specific enough.

request 3:
so, does anybody know if it's possible to search across two lines of source code?

please help

kiwidude · 12-20-2010, 05:30 PM

I've done this a lot to "repair" the results of PDF conversions.

What you want to do is something like this:
Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1 
In the replace expression, it is \1 followed by a single space.

That will find any paragraph ending with a lowercase a-z, strip the paragraph end/beginning tags, and put back that same last character followed by a space. Putting the () brackets around the expression in the Find puts it into a group which you then access in the replace with \1

You might find in really bad PDF conversions that sometimes a word is split across the paragraph boundary, in which case you don't want the replace expression to have a space or else the word will have a space in it. What I do is manually step through all the matches rather than doing Replace All, and that way you can catch any exceptions.

You may also want to check for other characters like commas and hyphens in that initial ([a-z]). You can also check for paragraphs that start with a lowercase word using similar expressions:
Find: </p>\s+<p class="calibre2">([a-z])
Replace:  \1 (a space followed by \1)
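
To see the first Find/Replace pair in action outside Sigil, here is a minimal sketch in Python. It assumes the paragraphs are wrapped in <p class="calibre2">...</p> tags and uses Python's re module as a stand-in for Sigil's regex engine; the sample text and the variable names are made up for illustration.

```python
import re

# Assumed sample input: one sentence split across two paragraphs,
# followed by a paragraph that ends properly with a full stop.
html = (
    '<p class="calibre2">The template line is this</p>\n'
    '<p class="calibre2">little sentence.</p>\n'
    '<p class="calibre2">A new paragraph.</p>'
)

# A lowercase letter, the closing tag, any whitespace (line breaks included),
# then the next opening tag. Group 1 keeps the letter so it can be put back.
broken_paragraph = re.compile(r'([a-z])</p>\s+<p class="calibre2">')

# Put the captured letter back, followed by a single space.
joined = broken_paragraph.sub(r'\1 ', html)
print(joined)
```

Only the first boundary is joined, because the second paragraph ends with a full stop rather than a lowercase letter.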

tscamera · 12-20-2010, 06:31 PM

i... i am totally fascinated!
after waiting for other replies of other (!) forums... this...
completely competent answer...
and so quick,
i'm stunned!
thanks a lot!

kiwidude · 12-20-2010, 07:45 PM

You are welcome. Regexes can make the otherwise mindless task of tidying up a book conversion more interesting. Ok, not that much, but a little bit

There is a big mental checklist of stuff I go through with every epub I cleanup (not all using regex exclusively of course) including...
- Stripping any "faked" indenting with &nbsp; & replacing it with an indented justified style
- Ensuring all chapters are given a heading style
- Stripping out nested div tags and replacing divs with paragraphs
- Stripping out <span> tags that are unnecessary when the paragraph CSS style is set correctly.
- Recombining paragraphs that contain broken sentences
- Replacing incorrect or inadequate quotes around speech. For instance I don't like speech that is 'Some quote' (or worse, an inconsistent combination of " ` ' etc from a bad OCR conversion) and prefer to see “Some quote”

There are still circumstances you won't catch without manually eyeballing but you can fairly quickly turn a very badly formatted document into one that is considerably more pleasant to read.

You mentioned multi-line paragraphs - hopefully you saw you can cope with those in Sigil with my example above by just using \s+ (one or more whitespace characters, which includes line breaks). You don't have to worry about "newline" characters like \r or \n in Sigil; just use \s+ between the ending/opening tags and that will allow your expression to match across lines.

One final point which is mentioned on a few other threads: you should tick the "Minimal Matching" checkbox on the Find/Replace dialog that is enabled when you choose regular expressions. In fact I haven't needed to uncheck it since finding out its purpose, so it is pretty much set and forget. It is the only way for certain expressions to work. For instance, say your document looks like this, with some pointless span tag pairs to remove:
<p class="calibre2"><span class="none">Blah blah text</span></p>
<p class="calibre2"><span class="none">More text</span></p>

Find: <span class="none">(.*)</span>
Replace: \1

This says: find *any* text within pairs of <span class="none"> and </span> tags and replace it with just the text, thereby removing the outer set of tags. This will only work "correctly" with the "Minimal Matching" checkbox turned on.
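
The difference the checkbox makes can be sketched in Python, on the assumption that "Minimal Matching" corresponds to what most regex engines call non-greedy (lazy) quantifiers. Python has no such checkbox, so the lazy form is written as .*? instead. The sample line with two spans in one paragraph is invented for the demonstration.

```python
import re

# Assumed sample: two pointless spans inside the same paragraph.
line = ('<p class="calibre2"><span class="none">Blah</span> and '
        '<span class="none">more</span></p>')

# Greedy: .* runs from the first opening tag to the LAST closing tag,
# so the inner </span> and <span ...> are left behind in the text.
greedy = re.sub(r'<span class="none">(.*)</span>', r'\1', line)

# Minimal (non-greedy): .*? stops at the first closing tag it meets,
# so each span pair is matched and unwrapped separately.
minimal = re.sub(r'<span class="none">(.*?)</span>', r'\1', line)

print(greedy)
print(minimal)
```

The greedy version leaves a stray closing and opening span tag in the paragraph; the minimal version removes both pairs cleanly.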

ldolse · 12-21-2010, 12:41 PM

Several of those functions are logic I've placed in Calibre's preprocess code so I didn't have to find/rewrite the regexes every time I convert a book. The ones that aren't there yet:

Converting ` ' to '' - haven't tried to come up with a safe function to fix that one yet.
I don't mess with tags too much though unless there isn't any content between them. Empty spans and other empty formatting tags get deleted.
I don't see a lot of use of divs except in LRF content, haven't done that one yet either.
Deleting a lot of Microsoft junk is still on my to-do list; that's partially done though.

kiwidude · 12-21-2010, 01:28 PM

Interesting, could you share the things you do as "preprocess code" and how? While I use Calibre I haven't yet gone through the huge amount of options to figure out what ones might be most useful to me.

tscamera · 12-22-2010, 06:33 AM

hello again.
as you know, i'm a regex novice.
i suppose all of you have struggled with this:

ProtoDionysos
should be
Proto-Dionysos

find:
i tried- ([a-z])+([A-Z])
but the result is- rotoD
replace: ??

maybe i need another hint?

kiwidude · 12-22-2010, 07:26 AM

Try this:
Find: ([a-z]+)([A-Z])
Replace: \1-\2
Match case ticked.
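
Why the original attempt went wrong can be shown with a short Python sketch (Python's re module is used here only as a stand-in for Sigil's engine). In ([a-z])+ the + sits outside the brackets, so the group is repeated and \1 ends up holding only the last letter it matched; moving the + inside the brackets captures the whole run.

```python
import re

word = 'ProtoDionysos'

# The + is outside the group: the whole match is "rotoD",
# but group 1 only remembers its last repetition, the final "o".
attempt = re.search(r'([a-z])+([A-Z])', word)
print(attempt.group(0), attempt.group(1), attempt.group(2))

# Replacing with \1-\2 therefore throws away "rot".
broken = re.sub(r'([a-z])+([A-Z])', r'\1-\2', word)

# With the + inside the group, all lowercase letters are captured.
fixed = re.sub(r'([a-z]+)([A-Z])', r'\1-\2', word)

print(broken)
print(fixed)
```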

Jellby · 12-22-2010, 07:47 AM

You actually don't need the +
Searching for a single lowercase letter (regardless of previous characters) followed by a single uppercase letter is enough.

kiwidude · 12-22-2010, 08:37 AM

Yeah, my initial response didn't have it and then I edited it straight away to include it. Why? (a) to minimise changes to hopefully help them understand the edits I made and (b) just in case the OP had some scenario which is why they put a + there in the first place. Like you I cannot think of a situation where it is needed but some books like fantasy have all sorts of weird and wonderful casing for character names etc so maybe it met a requirement.

EDIT: Nope, still can't think of a scenario for my reason (b) where ([a-z]+)([A-Z]) makes a difference compared to ([a-z])([A-Z]) not that it does any harm either. Should have left my post alone, thanks Jellby.

tscamera · 12-22-2010, 08:44 AM

thanks a lot!
learning by doing.

so, a summary for everyone it may concern:
find: ([a-z])([A-Z]) with match case on
replace: \1-\2

the slim version will work.

whether minimal matching (mentioned by kiwidude) is ticked
doesn't matter in this case.

BUT beware of replace all, in case there is a MacCool or other Macs in the text.
i think there must be a regex solution to exclude special cases, e.g. Mac,

isn't there?

if there is, please let me know

Toxaris · 12-22-2010, 10:46 AM

Hmm, is there a place to store these handy-dandy regex procedures? Perhaps in the wiki or sticky?

Ahmad Samir · 12-22-2010, 11:07 AM

FWIW, sometimes for some changes like these, you have to do them one by one (just to be sure the MacCool... etc won't get hurt in the process... :)).
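
For what it's worth, one possible regex answer to the Mac question is a negative lookbehind. This is only a sketch in Python, not something anyone in the thread tested: the exclusion list (Mac and Mc) is a made-up example, and whether Sigil's Find/Replace accepts lookbehind depends on its regex engine, which is not confirmed here. Stepping through the matches one by one remains the safer route.

```python
import re

# Hypothetical exclusion list: skip a lowercase/uppercase pair when the
# lowercase letter is the last letter of a word-initial "Mac" or "Mc".
missing_hyphen = re.compile(r'([a-z])(?<!\bMac)(?<!\bMc)([A-Z])')

# Assumed sample sentence for the demonstration.
text = 'ProtoDionysos met MacCool and McBride.'

# Same replacement as in the summary: put a hyphen between the two letters.
result = missing_hyphen.sub(r'\1-\2', text)
print(result)
```

Each extra name prefix needs its own lookbehind, so this only scales to a short list of known exceptions.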

ldolse · 12-22-2010, 11:30 AM

Quote:

Originally Posted by kiwidude

Interesting, could you share the things you do as "preprocess code" and how? While I use Calibre I haven't yet gone through the huge amount of options to figure out what ones might be most useful to me.

The option is under structure detection in Calibre (disabled by default), enable it before converting your source to epub. It doesn't work on epub documents primarily because the conversion pipeline assumes that epub should be a well formatted document already. If your source is epub just rename it from .epub to .zip and import it to Calibre as zipped html, which you can then convert to epub with the feature enabled.

The code is here if you wanted to review it - it's mostly regex based, so it's pretty easy to understand if you're familiar with regex. If you've got things which would work across a wide variety of docs that you would like to see added let me know.

There are a bunch of things it does:

- Removes empty span and formatting tags
- Checks to see if the doc is just a giant text file in <pre> tags and marks the individual lines up with tags.
- Searches for faux indents using nbsp and replaces them with a 3% text indent style
- De-hyphenates the source document to get rid of 99-100% of the hyphens that shouldn't be there, but retains the ones that should. (currently only on line breaks, I've been considering doing all hyphenated content in the doc)
- Unwraps hard line breaks
- Searches for numerous kinds of common chapter headings - wraps the heading in an <h2> tag, also searches for common titles following the headings and wraps them in <h3> tags. A lot of logic was put in here to prevent false positives, though they can still creep in (still easier to fix a couple of false positives after the fact than to split and mark up the whole book by hand)
- Centers common soft break markers if they're not centered
- Deletes empty paragraphs in cases where there is an empty paragraph between every other paragraph, but only when a user also enables the 'remove paragraph spacing' option under look and feel (still need to tune this one to detect/retain soft-breaks)
- Probably a couple of other things I don't remember
The only really destructive thing it does is remove all non-breaking spaces - this needs to be done for the regexes to work correctly. non-breaking spaces in empty paragraphs which were used for spacing are replaced at the end of the process, but others are eliminated permanently. This isn't often an issue, but it might screw up a bit of formatting in one book out of fifty.
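
Two of the items above, removing empty spans and replacing faux indents, can be sketched with plain regexes. This is a rough illustration in Python and not Calibre's actual preprocess code: the function names, the patterns and the inline style are all assumptions made for the example.

```python
import re

EMPTY_SPAN = re.compile(r'<span[^>]*>\s*</span>')
# A paragraph opening tag followed by one or more non-breaking spaces,
# written either as the &nbsp; entity or as the raw U+00A0 character.
FAUX_INDENT = re.compile(r'<p\b([^>]*)>(?:&nbsp;|\u00a0)+')

def remove_empty_spans(html):
    """Delete span pairs with no content, repeating so nested ones go too."""
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_SPAN.sub('', html)
    return html

def replace_faux_indent(html):
    """Swap leading non-breaking spaces for a 3% text indent.

    Simplification: a paragraph that already has a style attribute
    would end up with two of them.
    """
    return FAUX_INDENT.sub(r'<p\1 style="text-indent: 3%">', html)

sample = '<p class="calibre2">&nbsp;&nbsp;&nbsp;<span class="x"><span></span></span>Text</p>'
cleaned = replace_faux_indent(remove_empty_spans(sample))
print(cleaned)
```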

Once I do all that I find the work I need to do in Sigil is a lot less. I've been thinking about fixing the chapter markup routine to work a little bit better with Sigil as well, add in the 'not in sigil toc' id so that only the heading or the title gets used by Sigil instead of both.

kiwidude · 12-22-2010, 04:41 PM

Thanks very much ldolse for your detailed response. Certainly looks worthy of tinkering with, have never tried that setting.

I know I am at risk of going O/T with Calibre discussion in a Sigil forum here but this is all related to recommended ways of conversion to make the Sigil editing work easier...

One thing that I have found with Calibre is that, due to the way it stores the conversion metadata, I have to be careful to "unselect" stuff when doing different conversions. i.e. I always want EPUB to be my "master copy" since it converts so easily to other formats. So the first conversion will be from something else to EPUB for tidying up in Sigil. After that I then need to convert to MOBI for use on my Kindle. However I found I need to make sure I deselect any Calibre conversion options before I do the EPUB->MOBI conversion or else some of my careful Sigil work gets undone.

Is this what you would expect or am I doing something wrong? Because of this I don't really set much in the way of "global defaults" for conversions since so many settings are common to all formats but you actually only want them to be applied to the first conversion. The "re-run" factor to other formats becomes an issue when you turn these things on. Maybe I just got unlucky or imagined it...
