Old 07-01-2014, 12:38 PM   #1
BookJunkieLI
Regex help please

I'm working on a project in which I'm harvesting story links from downloaded webpages to use with the FFDL plugin for Calibre to import the actual story into Calibre. The result I'm trying to get is:

Code:
/works/1064569
I'm working with Notepad++ and OpenOffice because they let me search with Regular Expressions *and* do Find All. Using the following search string in Notepad++:
Code:
/works/.+\d
this is the result:

Code:
	Line 72895: href="/works/1064569"</a><o:p></o:p></span></p>
	Line 72904: href="/works/1064569?show_comments=true&amp;view_full_work=true#comments">270</a><o:p></o:p></span></p>
	Line 72911: href="/works/1064569?view_full_work=true#comments">229</a><o:p></o:p></span></p>
	Line 72917: "Times New Roman";mso-ansi-language:EN'><a href="/works/1064569/bookmarks">21</a><o:p></o:p></span></p>
The problem with Notepad++ is it returns the entire line of code with the pertinent result highlighted. So I copy and paste the entire thing into OpenOffice. I don't start out in OpenOffice because it won't do Find All in HTML Source View. Anyway, I then run this code:
Code:
/works/.+[0-9]
with these results:

Code:
	/works/1064569
	/works/1064569?show_comments=true&amp;view_full_work=true#comments">270
	/works/1064569?view_full_work=true#comments">229
	/works/1064569/bookmarks">21
I originally tried using \d instead of [0-9], only to find that OpenOffice didn't return a result unless there was a letter 'd' somewhere in the line. The problem I have is that if there is a number in the line past the portion I want, it returns everything from /works/ to that last number. I can't figure out how to limit it to the first string of numbers in the result. I've tried adding a '?' '( )' '{ }' and a number of other symbols that I didn't think to keep track of.

The underlying problem is that while I'm okay at frankensteining together bits of regex strings that I've found to do what I want, I don't actually understand what any of it truly does. I've read the tutorial in Calibre's manual, I've read through Penguinaka's thread on Regex: File Renaming Pre-Import, and a thread called Structure Detection, which is where I got the bare bones of the search string I'm using. I've also looked at the Regular Expression Syntax portion of the Python documentation. I've been playing around with The Regex Coach program, but it seems to only tell me whether a string is wrong or right, without any suggestions on how to fix it.

Is there a way to get the results to end at the first string of numbers? I don't care if I get this as a result:

Code:
	/works/1064569
	/works/1064569
	/works/1064569
	/works/1064569
I can drop the whole list into Excel or OpenOffice Calc and filter for unique records. That takes like 10 seconds. I'm trying to avoid having to do a whole bunch of potentially damaging search and replace runs or going line by line and deleting the results I don't need/want.

Notepad++ and OpenOffice are the latest versions as of 6/30; the OS is either Win 7 Pro or Win XP Pro, depending on what computer I'm on at the moment. I tried jEdit and EditPad Lite, but neither seems to let you do Find All searches.

Thanks
Old 07-01-2014, 02:15 PM   #2
DiapDealer
I think you may be a little misguided about what you can actually accomplish with text editors and their Search & Replace features (regex or otherwise).

You can find a pattern in a document (just like Notepad++ is dutifully doing for you), and step through the matches one at a time. Once found, you can replace a match with something else (sometimes using complicated captures from the original pattern-match). Then you can move on to the next occurrence of the pattern and replace IT with something (or skip it and move on to the next). Or you can replace all occurrences of the matched pattern in one fell swoop.

What you cannot do is extract all occurrences of a particular pattern--effectively getting rid of everything else. For that you would probably need to script a solution to extract the info (quite possibly using that scripting language's regex capabilities) that you want.
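
For what it's worth, a scripted extraction along those lines can be quite short. Here is a minimal sketch in Python using its `re` module. The function name `extract_work_links` is made up for illustration, the sample text is trimmed from the search results quoted in the first post, and it assumes the story IDs are purely numeric:

```python
import re

# Matches "/works/" followed by one or more digits
# (assumption: story IDs are purely numeric, as in the sample lines).
WORK_LINK = re.compile(r"/works/[0-9]+")


def extract_work_links(text):
    """Return every /works/<number> match in text, in order of appearance."""
    return WORK_LINK.findall(text)


# Sample lines trimmed from the search results quoted in the first post.
sample = '''href="/works/1064569"</a><o:p></o:p></span></p>
href="/works/1064569?show_comments=true&amp;view_full_work=true#comments">270</a>
<a href="/works/1064569/bookmarks">21</a>'''

for link in extract_work_links(sample):
    print(link)
```

`re.findall` returns only the matched text, so everything else on each line is left behind without any search-and-replace passes over the original file.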

As far as your original regular expression to find the pieces you want: (judging by the limited amount of data I can see) I would think something like
Code:
/works/\d+
would do the trick.

I would change your second expression (the one you were using in OpenOffice) to:
Code:
/works/[0-9]+
(which for all practical purposes is the exact same expression as the first one: I don't know why the '\d' shortcut wouldn't work for you in OpenOffice--I don't really use it myself)

You were getting bit by regex's greediness in your expressions: '.+' grabs as many characters as it can, so each match ran all the way out to the last digit on the line. Not sure how (or if) you can control Notepad++'s or OpenOffice's greedy/non-greedy behavior. I tend to build expressions (wherever possible) that don't rely on manipulating the greedy/non-greedy behavior.
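
To make the greediness concrete, here is a small sketch comparing three patterns on one of the lines from the first post. It uses Python's `re` module as a stand-in; Notepad++ and OpenOffice use their own regex engines, so treat it as an illustration of the general behavior rather than of those tools specifically. The variable names are just for the example:

```python
import re

line = '/works/1064569?view_full_work=true#comments">229</a>'

# Greedy: ".+" takes as much as it can, so the match ends at the LAST digit.
greedy = re.search(r"/works/.+[0-9]", line).group()

# Lazy: ".+?" takes as little as it can, so the match stops at the first
# digit that lets the pattern succeed, which is too early.
lazy = re.search(r"/works/.+?[0-9]", line).group()

# Digits only: "[0-9]+" cannot run past the end of the number.
digits = re.search(r"/works/[0-9]+", line).group()

print(greedy)  # /works/1064569?view_full_work=true#comments">229
print(lazy)    # /works/10
print(digits)  # /works/1064569
```

Note that switching to the lazy form does not rescue the original pattern here: it just stops too early instead of too late. Restricting the repeated part to digits sidesteps the greedy/non-greedy question altogether.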

Last edited by DiapDealer; 07-01-2014 at 02:40 PM.
Old 07-01-2014, 03:07 PM   #3
BookJunkieLI
Thank you! That gave me exactly what I was looking for.

Re-reading what I wrote, I realized that I wasn't specific about the whole process I use. Notepad++ drops the results of the search into a separate window that I can then copy, so I only have the lines with the links I'm looking for and not all the miscellaneous coding that's in the file. I paste those results into an OpenOffice document, where I run the slightly modified search string using Find All and it highlights the results. With everything still highlighted, I can then copy and paste just that data into another OpenOffice document. And, voila, I have the list of links that I was looking for. It would be awesome if OpenOffice had a Select Inverse like I use in Photoshop, but this gets the job done.
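
For anyone who would rather not juggle two programs, the same find, copy, paste, and filter-for-unique steps could probably be collapsed into one small script. This is only a sketch in Python, not something tested against the real pages: the function names `unique_work_links` and `harvest` are made up, the file paths passed to `harvest` are whatever you choose, and the second ID in the sample (2000001) is invented to show the de-duplication:

```python
import re
from pathlib import Path

# "/works/" followed by one or more digits (the pattern from post #2).
WORK_LINK = re.compile(r"/works/[0-9]+")


def unique_work_links(text):
    """Return each distinct /works/<number> link once, in first-seen order."""
    return list(dict.fromkeys(WORK_LINK.findall(text)))


def harvest(html_path, out_path):
    """Read a saved page, collect its unique links, write one per line."""
    # errors="replace" is an assumption: saved pages may not be clean UTF-8.
    text = Path(html_path).read_text(encoding="utf-8", errors="replace")
    links = unique_work_links(text)
    Path(out_path).write_text("\n".join(links) + "\n", encoding="utf-8")
    return links


# Small in-memory demonstration; the second ID is a made-up example.
sample = ('<a href="/works/1064569">Title</a> '
          '<a href="/works/1064569/bookmarks">21</a> '
          '<a href="/works/2000001">Other</a>')
print(unique_work_links(sample))
```

`dict.fromkeys` keeps the first occurrence of each link and preserves the original order, which stands in for the filter-for-unique-records step in Excel or Calc.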
Old 07-01-2014, 03:18 PM   #4
DiapDealer
Glad to help.
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.