Old 12-25-2010, 05:31 PM   #1
Danger
Evangelist
Posts: 490
Karma: 1665031
Join Date: Nov 2010
Location: Vancouver Island, Nanaimo
Device: K2 (retired), Kobo Touch (passed to the wife), KGlo, Galaxy TabPro
Find this NOT that

I'm trying to do a search, but a narrow one. I've converted some PDFs to ePub, and some paragraphs are broken up: one ends with half a sentence and the next paragraph continues that sentence.

I want to search for any 2 characters followed by </p>, but not find .</p>, ."</p>, ?</p>, ?"</p>, !</p> or !"</p>, as those should be proper sentence enders.

Right now I have [^.,^\?,^\!][a-z,A-Z,”,\,, ,+]</p> and it seems to work, but is there a simpler way of doing this?
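
One possibly simpler way to express the same idea, sketched here with Python's `re` module rather than Sigil's search box: instead of listing the characters allowed before `</p>`, two negative lookbehinds reject the proper sentence endings. The function name `find_broken_endings` and the sample text are made up for illustration, and nothing in this thread confirms that Sigil's regex engine accepts lookbehind.

```python
import re

# Match </p> unless it is preceded by . ? ! or by one of those plus a closing quote.
BROKEN_END = re.compile(r'(?<![.?!])(?<![.?!]["”])</p>')

def find_broken_endings(html: str) -> list[str]:
    """Return the paragraphs whose text does not end like a finished sentence."""
    broken = []
    for para in re.findall(r'<p>.*?</p>', html, flags=re.DOTALL):
        if BROKEN_END.search(para):
            broken.append(para)
    return broken

# Hypothetical sample: two broken paragraphs, two properly ended ones.
sample = '<p>He said it was</p>\n<p>too late.</p>\n<p>"Really?"</p>\n<p>Well,</p>'
print(find_broken_endings(sample))
```

With this sample, only the paragraphs ending in "was" and "Well," are reported; the ones ending in `.` and `?"` are left alone.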
Old 12-25-2010, 05:55 PM   #2
huebi
Zealot
Posts: 121
Karma: 5070
Join Date: Dec 2010
Device: none
I'm doing this with:

Code:
s: ([a-zA-Z])</p>\s*<p>
r: \1##
o: minimal matching
The reason for the two #: it makes it easier to find out whether a word was split or the sentence was split between two words.

Now I'm looking for a lowercase letter, followed by ##, followed by an uppercase letter. This is a sure sign of two separate words.

Code:
s: ([a-z])##([A-Z])
r: \1_\2
o: minimal matching, match case

(the underscore represents a blank)
Now I'm going through looking for two separate words:

Code:
s: ([a-zA-Z])##([a-zA-Z])
r: \1_\2
o: minimal matching, match
If the ## is between two words, I press Alt-R (replace); otherwise it's a split word and I skip it with Alt-F (find next).

At the end, only the ## markers splitting a word remain, and I substitute them with

Code:
s: ([a-zA-Z])##([a-zA-Z])
r: \1\2
o: minimal matching
The last step is searching for a comma, question mark or exclamation mark:

Code:
s: ([,?!])</p>\s*<p>
r: \1_
o: minimal matching
More steps than yours, but I'm replacing the text without changing it manually, which can be really annoying.
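
For anyone who wants to try these passes outside Sigil, here is a rough Python sketch of the same sequence. The search patterns for the automatic passes are taken from the steps above (with the punctuation pass moved up, since it does not interact with the ## markers). The manual replace/skip pass is only approximated: `resolve_markers` and its `known_words` set are assumptions made for illustration, and its pattern is widened to whole words so the two halves can be looked up.

```python
import re

def mark_and_join(html: str) -> str:
    """Automatic passes: mark letter-ending seams with ##, then fix the clear cases."""
    # Join paragraphs that end in a letter, leaving ## at the seam.
    html = re.sub(r'([a-zA-Z])</p>\s*<p>', r'\1##', html)
    # lowercase ## Uppercase: two separate words, so insert a blank.
    html = re.sub(r'([a-z])##([A-Z])', r'\1 \2', html)
    # Paragraphs cut after a comma, question mark or exclamation mark.
    html = re.sub(r'([,?!])</p>\s*<p>', r'\1 ', html)
    return html

def resolve_markers(html: str, known_words: set[str]) -> str:
    """Stand-in for the manual replace/skip pass (assumption: a word list decides)."""
    def decide(m: re.Match) -> str:
        left, right = m.group(1), m.group(2)
        if (left + right).lower() in known_words:
            return left + right        # split word: glue the halves together
        return left + ' ' + right      # two words: separate with a blank
    return re.sub(r'([a-zA-Z]+)##([a-zA-Z]+)', decide, html)

# Hypothetical sample with one split word and several split sentences.
sample = ('<p>I had split sen</p>\n<p>tences like this</p>\n'
          '<p>one, and</p>\n<p>Another one,</p>\n<p>too.</p>')
result = resolve_markers(mark_and_join(sample), {'sentences'})
print(result)
```

A word list will misjudge some seams, which is presumably why the decision is made by eye in the steps above.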
Old 12-26-2010, 04:42 PM   #3
mshellberg
Junior Member
Posts: 8
Karma: 10
Join Date: Jun 2010
Device: none
When I'm joining split sentences like that, I search for a lowercase letter after the <p> tags...

search: </p>\s*<p>([a-z])
replace: _\1
(Note the space before \1.)

Hope that helps.
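
The same search and replace can be tried in Python as a quick check. This is a minimal sketch: the function name and sample text are invented here, and the replacement uses a real space where the post shows an underscore.

```python
import re

def join_lowercase_continuations(html: str) -> str:
    # A paragraph that starts with a lowercase letter is assumed to
    # continue the previous one, so the tags between them become a space.
    return re.sub(r'</p>\s*<p>([a-z])', r' \1', html)

# Hypothetical sample: the second paragraph continues the first.
sample = '<p>The sentence was</p>\n<p>cut in half.</p>\n<p>Next one.</p>'
print(join_lowercase_continuations(sample))
```

The third paragraph starts with an uppercase letter, so it is left as its own paragraph.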
Old 12-27-2010, 02:43 AM   #4
huebi
Code:
<p>I had split sen</p>

<p>tences like this one</p>
I need to investigate why the - (maybe a soft hyphen) disappeared...
Old 12-27-2010, 11:57 AM   #5
Danger
Quote:
Originally Posted by mshellberg
When I'm joining split sentences like that, I search for a lowercase letter after the <p> tags...

search: </p>\s*<p>([a-z])
replace: _\1
(Note the space before \1.)

Hope that helps.
It helps, but I think I have to do as huebi suggested and make several sweeps, looking for different things each time, because I found that not all sentence splits end with a lowercase letter. Some ended with quotes but no period, some with a comma, and some with a question or exclamation mark but no closing quote, because the person speaking was still speaking and the speech continued in the next paragraph.

Basically I was trying to catch all of that in one go. Looking at other examples, I think I had too many commas separating things that don't need separating.

[^.^\?^\!][a-zA-Z”\,\?\!+]</p>

Should find ?</p> but not ?”</p>, and z”</p> but not z.”</p>

Just tested this with the following; found or skipped is marked after each line:
<p>x?</p> (found)
<p>x.”</p> (skipped)
<p>x?</p> (found)
<p>X!</p> (found)
<p>x!”</p> (skipped)
<p>x,</p> (found)
<p>x,”</p> (found)
<p>x”</p> (found)

EDIT: The above can be simplified further: [^.?!][a-zA-Z”,?!]</p>
So basically we now have: if any of the 3 characters excluded by [^.?!] appears before one of the characters in [a-zA-Z”,?!] (specifically the ”) and </p>, that find is skipped.
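
The simplified expression can be checked against the test lines with a few lines of Python. This is only a sketch using Python's `re` module, which may differ from Sigil's engine in other respects; the names `PATTERN`, `cases` and `results` are invented for the example.

```python
import re

# The simplified expression from the EDIT above, unchanged.
PATTERN = re.compile(r'[^.?!][a-zA-Z”,?!]</p>')

# The distinct test paragraphs listed in the post.
cases = ['<p>x?</p>', '<p>x.”</p>', '<p>X!</p>', '<p>x!”</p>',
         '<p>x,</p>', '<p>x,”</p>', '<p>x”</p>']

# True means the search finds the paragraph, False means it is skipped.
results = {case: bool(PATTERN.search(case)) for case in cases}
for case, found in results.items():
    print(case, 'found' if found else 'skipped')
```

Only the two lines where a period or exclamation mark sits directly before the ” are skipped. A paragraph that ends in a plain period is also skipped, because the period is not in the second character class.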

Last edited by Danger; 12-27-2010 at 03:57 PM.
Old 12-27-2010, 03:13 PM   #6
Danger
Working on a book this morning, I found that after stripping out all the class, style and useless spans/divs I was once again left with broken-up sentences like this:
<p>this is</p>
<p>part of a</p>
<p>paragraph.</p>

<p>&nbsp;</p>

So I came up with:
FIND: ([a-z,’”.?!-])</p>\n\n\s\s<p>([a-z,A-Z“-])
REPLACE: \1 \2

\n = new line
\s = white space

All the <p>&nbsp;</p> paragraphs are ignored, and I just strip them out once all the paragraphs are back together.
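
As a rough Python sketch of this two-stage cleanup: the FIND/REPLACE pair is copied from above, while the function name, the spacer-removal pattern and the sample text are assumptions made for the example. The sample assumes each paragraph is separated by a blank line plus two spaces, which is what `\n\n\s\s` expects.

```python
import re

def rejoin_lines(html: str) -> str:
    # Stage 1: the FIND/REPLACE pair from the post, joining across
    # a blank line followed by two whitespace characters.
    joined = re.sub(r'([a-z,’”.?!-])</p>\n\n\s\s<p>([a-z,A-Z“-])', r'\1 \2', html)
    # Stage 2: drop the <p>&nbsp;</p> spacers that protected the real breaks.
    return re.sub(r'<p>&nbsp;</p>\s*', '', joined)

# Hypothetical sample: three fragments, a spacer, then a new paragraph.
sample = ('<p>this is</p>\n\n  <p>part of a</p>\n\n  <p>paragraph.</p>\n\n'
          '  <p>&nbsp;</p>\n\n  <p>Next paragraph.</p>')
print(rejoin_lines(sample))
```

Because `re.sub` does not rescan text it has already consumed, a fragment that is only one character long would need a second run of stage 1.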

Interesting info here