#1 |
Enthusiast
Posts: 30
Karma: 10
Join Date: Dec 2010
Device: PRS-650 ... ipad
|
What I am trying to do:
Merge two lines of code where the first line does not end with . ! ? " etc. In other words, merge lines that are broken in the middle so that they form a complete sentence.

What I did:
<p class="calibre2">The template line is this</p>
<p class="calibre2">little sentence</p>

Using find/replace with the regex [a-zA-Z0-9]</p> finds s</p>, but the </p> is the only part I need to delete.

Request 1: How can I truncate the search result? Please help with the complete regex to find the "</p>" within the primary search result "s</p>" (grouping, lookahead, lookbehind, atomic group...?).

Request 2: Once that is done, how can I get at the <p class="calibre2"> at the beginning of the second line, which also needs to be deleted so that both lines are joined into one? [a-zA-Z0-9]</p> <p class="calibre2"> doesn't help. Searching for <p class="calibre2"> alone won't help either, because it's not specific enough.

Request 3: So, does anybody know whether it's possible to search across two lines of source code? Please help.
#2 | |
Calibre Plugins Developer
Posts: 4,718
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
What you want to do is something like this:

Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1 (that is, \1 followed by a single space)

That will find any paragraph ending with a lowercase a-z, strip the paragraph end/beginning, and replace it with that same last character plus a space. Putting the () brackets around the expression in the Find puts it into a group, which you then access in the Replace with \1.

You might find in really bad PDF conversions that sometimes a word is split across the paragraph boundary. In that case you don't want the replace expression to have a space, or else the word will have a space in it. What I do is manually step through all the matches rather than doing Replace All, and that way you can catch any exceptions. You may also want to check for other characters like commas and hyphens in that initial ([a-z]).

You can also check for paragraphs that start with a lowercase word using a similar expression:

Find: </p>\s+<p class="calibre2">([a-z])
Replace:  \1 (a space followed by \1)

Last edited by kiwidude; 12-20-2010 at 05:37 PM.
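For anyone who wants to try the expression outside Sigil, here is a minimal sketch in Python's `re` module. The sample HTML and the names `join_broken` and `join_paragraphs` are made up for illustration, and Python's regex engine is not necessarily the one Sigil uses, so treat it as a way to experiment rather than a description of Sigil's behaviour.

```python
import re

# Hypothetical sample of a paragraph that was broken in mid-sentence.
html = (
    '<p class="calibre2">The template line is this</p>\n'
    '<p class="calibre2">little sentence.</p>'
)

# Group 1 captures the last lowercase letter so it can be put back.
# \s+ covers the line break between the closing and the opening tag.
join_broken = re.compile(r'([a-z])</p>\s+<p class="calibre2">')


def join_paragraphs(text: str) -> str:
    # The replacement is group 1 followed by a single space.
    return join_broken.sub(r'\1 ', text)


joined = join_paragraphs(html)
print(joined)
```

As in Sigil, running this over a whole book is the equivalent of Replace All, so the caveat about words split across the paragraph boundary still applies.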
|
#3 |
Enthusiast
Posts: 30
Karma: 10
Join Date: Dec 2010
Device: PRS-650 ... ipad
|
I... I am totally fascinated!
After waiting for replies on other (!) forums... this... completely competent answer... and so quick. I'm stunned! Thanks a lot!
#4 |
Calibre Plugins Developer
Posts: 4,718
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
You are welcome. Regexes can make the otherwise mindless task of tidying up a book conversion more interesting. Ok, not that much, but a little bit.

There is a big mental checklist of stuff I go through with every epub I clean up (not all using regex exclusively, of course), including:
- Stripping any "faked" indenting and replacing it with an indented justified style
- Ensuring all chapters are given a heading style
- Stripping out nested div tags and replacing divs with paragraphs
- Stripping out <span> tags that are unnecessary when the paragraph CSS style is set correctly
- Recombining paragraphs that contain broken sentences
- Replacing incorrect or inadequate quotes around speech. For instance, I don't like speech that is 'Some quote' (or worse, an inconsistent combination of " ` ' etc. from a bad OCR conversion) and prefer to see “Some quote”

There are still circumstances you won't catch without manually eyeballing, but you can fairly quickly turn a very badly formatted document into one that is considerably more pleasant to read.

You mentioned multi-line paragraphs. Hopefully you saw that you can cope with those in Sigil, as in my example above, by just using \s+ (one or more whitespace characters, which includes line breaks). You don't have to worry about "newline" characters like \r or \n in Sigil; just use \s+ between the ending/opening tags and that will allow your expression to match across lines.

One final point, which is mentioned on a few other threads: you should tick the "Minimal Matching" checkbox on the Find/Replace dialog, which is enabled when you choose regular expressions. In fact I haven't needed to uncheck it since finding out its purpose, so it is pretty much set and forget. It is the only way for certain expressions to work.

For instance, say your document looks like this, with some pointless span tag pairs to remove:

<p class="calibre2"><span class="none">Blah blah text</span></p>
<p class="calibre2"><span class="none">More text</span></p>

Find: <span class="none">(.*)</span>
Replace: \1

This says: find *any* text within pairs of <span class="none"> and </span> tags and replace it with just the text, thereby removing the outer set of tags. This will only work "correctly" with the "Minimal Matching" checkbox turned on. Without it, (.*) is greedy and grabs as much as it can, so a single match can run from the first opening tag to the last closing tag.
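The difference can be sketched in Python, on the assumption that "Minimal Matching" corresponds to what most regex engines call a lazy quantifier (`.*?`). The `re.DOTALL` flag is used so that `.` also crosses line breaks; whether Sigil's engine treats `.` the same way is an assumption here, not something confirmed in this thread.

```python
import re

html = (
    '<p class="calibre2"><span class="none">Blah blah text</span></p>\n'
    '<p class="calibre2"><span class="none">More text</span></p>'
)

# Greedy: (.*) runs from the first opening span to the LAST closing span,
# so the inner tags of the second paragraph survive the replacement.
greedy = re.sub(r'<span class="none">(.*)</span>', r'\1', html, flags=re.DOTALL)

# Lazy (assumed equivalent of "Minimal Matching"): (.*?) stops at the
# FIRST closing span, so each pair of tags is removed separately.
lazy = re.sub(r'<span class="none">(.*?)</span>', r'\1', html, flags=re.DOTALL)

print(greedy)
print(lazy)
```

The greedy result still contains stray span tags, while the lazy result contains only the two clean paragraphs.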
#5 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
Last edited by ldolse; 12-21-2010 at 12:45 PM. |
|
#6 | |
Calibre Plugins Developer
Posts: 4,718
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
|
|
#7 |
Enthusiast
Posts: 30
Karma: 10
Join Date: Dec 2010
Device: PRS-650 ... ipad
|
Hello again.
As you know, I'm a regex novice. I suppose all of you have struggled with this: ProtoDionysos should be Proto-Dionysos.

Find: I tried ([a-z])+([A-Z]) but the match is rotoD
Replace: ??

Maybe I need another hint?
#8 | |
Calibre Plugins Developer
Posts: 4,718
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
Find: ([a-z]+)([A-Z])
Replace: \1-\2
Match case ticked.
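A short Python sketch of why the position of the + matters. With the + outside the brackets, the group is repeated and only keeps the last letter it matched, so \1 no longer holds the whole lowercase run. Python's `re` is case sensitive by default, which stands in for "Match case" here; Sigil's engine may differ in other details.

```python
import re

word = "ProtoDionysos"

# + outside the group: the match is "rotoD", but group 1 is only the
# last repetition ("o"), so the letters "rot" are lost in the replacement.
attempt = re.sub(r'([a-z])+([A-Z])', r'\1-\2', word)

# + inside the group: group 1 holds the whole lowercase run "roto".
fixed = re.sub(r'([a-z]+)([A-Z])', r'\1-\2', word)

print(attempt)  # Po-Dionysos
print(fixed)    # Proto-Dionysos
```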
|
#9 |
frumious Bandersnatch
Posts: 7,543
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
You actually don't need the +
Searching for a single lowercase letter (regardless of previous characters) followed by a single uppercase letter is enough. |
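A quick way to check this claim is to run both expressions over a few strings and compare the results. This is a sketch in Python; the sample strings are invented for illustration.

```python
import re

# Hypothetical test strings, including one that should not change.
samples = ["ProtoDionysos", "aB", "fooBarBaz", "already-Fine"]

with_plus = re.compile(r'([a-z]+)([A-Z])')
without_plus = re.compile(r'([a-z])([A-Z])')

# Pair up the output of both expressions for each sample.
results = [
    (with_plus.sub(r'\1-\2', s), without_plus.sub(r'\1-\2', s))
    for s in samples
]

for with_result, without_result in results:
    print(with_result, without_result, with_result == without_result)
```

Both versions insert the hyphen at the same places, because only the lowercase letter directly before the capital decides where a match ends.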
#10 | |
Calibre Plugins Developer
Posts: 4,718
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
EDIT: Nope, still can't think of a scenario for my reason (b) where ([a-z]+)([A-Z]) makes a difference compared to ([a-z])([A-Z]) not that it does any harm either. Should have left my post alone, thanks Jellby. Last edited by kiwidude; 12-22-2010 at 08:46 AM. |
|
#11 |
Enthusiast
Posts: 30
Karma: 10
Join Date: Dec 2010
Device: PRS-650 ... ipad
|
Thanks a lot!
Learning by doing. So, a summary for anyone it may concern:

Find: ([a-z])([A-Z]) with Match case on
Replace: \1-\2

This slim version will work. Whether Minimal Matching (mentioned by kiwidude) is ticked doesn't matter in this case. BUT beware of Replace All in case there is a MacCool or other Macs in the text. I think there is a regex solution to exclude special cases such as Mac, isn't there? If there is, please let me know.

Last edited by tscamera; 12-22-2010 at 10:30 AM. Reason: i'm nosy
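One possible way to skip names like MacCool is a negative lookbehind, sketched below in Python. Nobody in this thread proposed it, and whether Sigil's find/replace supports lookbehind at all is an assumption; the names `camel_split` and `hyphenate` are made up for the example.

```python
import re

# After the lowercase letter is matched, (?<!\bMac) rejects the match if
# the three characters just consumed are "Mac" at the start of a word.
camel_split = re.compile(r'([a-z])(?<!\bMac)([A-Z])')


def hyphenate(text: str) -> str:
    return camel_split.sub(r'\1-\2', text)


print(hyphenate("ProtoDionysos met MacCool"))
```

Other prefixes would each need their own exclusion, so stepping through the matches by hand remains the safer route for anything unusual.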
#12 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Hmm, is there a place to store these handy-dandy regex procedures? Perhaps in the wiki or sticky?
|
#13 | |
Zealot
Posts: 114
Karma: 5246
Join Date: Jul 2010
Device: none
|
Quote:
|
|
#14 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
The code is here if you wanted to review it - it's mostly regex based, so it's pretty easy to understand if you're familiar with regex. If you've got things which would work across a wide variety of docs that you would like to see added let me know. There are a bunch of things it does:
Once I do all that I find the work I need to do in Sigil is a lot less. I've been thinking about fixing the chapter markup routine to work a little bit better with Sigil as well, add in the 'not in sigil toc' id so that only the heading or the title gets used by Sigil instead of both. Last edited by ldolse; 12-22-2010 at 11:40 AM. |
|
#15 |
Calibre Plugins Developer
Posts: 4,718
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Thanks very much, ldolse, for your detailed response. It certainly looks worth tinkering with; I have never tried that setting.

I know I am at risk of going O/T with Calibre discussion in a Sigil forum here, but this is all related to recommended ways of converting that make the Sigil editing work easier...

One thing I have found with Calibre is that, because of the way it stores the conversion metadata, I have to be careful to "unselect" stuff when doing different conversions. I always want EPUB to be my "master copy", since it converts so easily to other formats. So the first conversion will be from something else to EPUB for tidying up in Sigil. After that I need to convert to MOBI for use on my Kindle. However, I found I need to make sure I deselect any Calibre conversion options before I do the EPUB->MOBI conversion, or else some of my careful Sigil work gets undone. Is this what you would expect, or am I doing something wrong?

Because of this I don't really set much in the way of "global defaults" for conversions: so many settings are common to all formats, but you only want them applied to the first conversion. The "re-run" factor for other formats becomes an issue when you turn these things on. Maybe I just got unlucky or imagined it...