Old 03-10-2006, 08:50 AM   #1
Yeuclid
Junior Member
Posts: 2
Karma: 10
Join Date: Mar 2006
Device: Palm TX
Sunrise XP link filtering tutorial?

I've been using Sunrise XP for a few weeks now, originally on my M130, and now on my recently acquired TX. It's a superb program, and I've taken a look at the Showcase examples as I would like to use filtering to sift out some of the junk from several web pages that I frequently download.

However, I'm not sure where to start as there doesn't seem to be any documentation around on the filtering language.

If I've missed it could anyone point me in the right direction?

Thanks
Old 03-10-2006, 12:58 PM   #2
DTM
Intentionally Left Blank
Posts: 172
Karma: 300106
Join Date: Feb 2006
Location: Royal Oak, MI, USA
Device: Nook STR
I can give you an example that should help, and will include a couple of extra tricks as well.

Let's say you want to grab the columns by Chuck Colson from the Christianity Today website. You'll find them at:

http://www.christianitytoday.com/ctmag/features/columns/colson.html

Notice that this page, and each of the article pages, are loaded with ads, unwanted links, etc.

On the right, click the link for the printer version. When the printer-friendly box opens, right-click on it and select Properties in Internet Explorer or View Page Info in Firefox. You will find that the URL for the printer-friendly page is:

http://www.christianitytoday.com/global/printer.html?/ctmag/features/columns/colson.html

This is the URL to use in the URL/File field on the Main tab when you create the Sunrise XP document. You'll directly load the printer-friendly main page, eliminating the junk.

Now click on the link for a specific column and you get something like this:

http://www.christianitytoday.com/ct/2006/002/19.144.html

The exact URL depends on the article you clicked. Again open the printer version, right-click and select Properties or View Page Info. The printer-friendly URL is:

http://www.christianitytoday.com/global/printer.html?/ct/2006/002/19.144.html

Now create your Sunrise XP document and create a link filter. Select "Regular Expression" for Match, "Filter all links" for Links, and "Rewrite links matching this pattern" for Filter.

Now, how do you turn the article link into the printable link?

Notice that they are identical up to ".com", then the printable link has some extra stuff (/global/printer.html?), then they end identically. If you check several articles, you'll see that the ending part is different for each article. You need to tell Sunrise to stick the extra text in ahead of the article-specific stuff no matter what it is. You start by specifying the part that is identical for all articles, then replace the rest with "(.*)", which essentially says, "match everything here no matter what it is". The result is:

http://www.christianitytoday.com(.*)

But the "." is a special character in Perl-style regular expressions, so you must put a backslash in front of it when you want it to be taken literally. Now you have:

http://www\.christianitytoday\.com(.*)

That's what goes in the Pattern field for the link filter. Not only will that match the link for any article, but the (.*) part will also grab the rest of the link text and save it. Later, you can refer to it as "$1".

Now to rewrite the link, you want the part up to ".com", plus the extra stuff you need to insert, followed by the stuff saved as $1. You can write this as:

http://www.christianitytoday.com/global/printer.html?$1

Again, you must put the backslash ahead of all "." characters, and also the "?", because it too is a special character.

The result is:

http://www\.christianitytoday\.com/global/printer\.html\?$1

This is what you put in the Rewrite field.
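
If you want to try the same idea outside Sunrise XP, here is a minimal sketch in Python. It only assumes that Python's re module treats this simple pattern the way the post describes; it is not how Sunrise XP itself is implemented. The helper name to_printer_url is made up for illustration. Note that Python writes the saved part as \1 rather than $1, and its replacement text is written without the backslashes.

```python
import re

# Pattern from the post: the literal host, then capture the rest of the link.
PATTERN = re.compile(r"http://www\.christianitytoday\.com(.*)")

# Python spells the saved group \1 where Sunrise XP uses $1. The replacement
# text is left unescaped here; that is how Python's re.sub expects it and
# says nothing about what Sunrise XP requires.
REWRITE = r"http://www.christianitytoday.com/global/printer.html?\1"


def to_printer_url(link: str) -> str:
    """Hypothetical helper: turn an article link into its printer-friendly link."""
    if PATTERN.fullmatch(link) is None:
        return link  # links that do not match the pattern are left alone
    return PATTERN.sub(REWRITE, link, count=1)


article = "http://www.christianitytoday.com/ct/2006/002/19.144.html"
print(to_printer_url(article))
```

Run on the article URL from the post, this prints the printer-friendly URL shown earlier.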

In more complex cases, you may need to use more than one "(.*)". In such a case, when you do the rewrite, the first becomes $1, the second $2 and so on.
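
As a rough illustration of two saved parts, here is a small Python sketch. The site layout (www.example.com with a section name and a file name) is invented purely for the example, and Python again uses \1 and \2 where Sunrise XP uses $1 and $2.

```python
import re

# Hypothetical URL layout, invented for illustration only:
#   http://www.example.com/<section>/articles/<file>
pattern = re.compile(r"http://www\.example\.com/(.*)/articles/(.*)")

# \1 is the first saved part and \2 the second ($1 and $2 in Sunrise XP).
rewrite = r"http://www.example.com/print/\1/\2"

link = "http://www.example.com/news/articles/story42.html"
printable = pattern.sub(rewrite, link, count=1)
print(printable)  # http://www.example.com/print/news/story42.html
```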

Hope this made some sense!

Dan
Old 03-10-2006, 02:19 PM   #3
Laurens
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Great post. Do note that, strictly speaking, the "." has to be escaped in regular expressions.

For example: http://www\.christianitytoday\.com(.*)

Since the dot means "any character" it will probably work as intended anyway.
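
To see the difference an unescaped dot makes, here is a short Python sketch. It assumes Python's re module is a fair stand-in for this one point; the "lookalike" string is made up and is not a real address.

```python
import re

escaped = re.compile(r"http://www\.christianitytoday\.com(.*)")
unescaped = re.compile(r"http://www.christianitytoday.com(.*)")

real = "http://www.christianitytoday.com/ct/2006/002/19.144.html"
# Made-up string with "x" where the dots should be.
lookalike = "http://wwwxchristianitytodayxcom/ct/2006/002/19.144.html"

# Both patterns accept the real link.
real_ok = (escaped.fullmatch(real) is not None,
           unescaped.fullmatch(real) is not None)

# Only the unescaped pattern accepts the lookalike, because "." matches any character.
lookalike_ok = (escaped.fullmatch(lookalike) is not None,
                unescaped.fullmatch(lookalike) is not None)

print(real_ok)       # (True, True)
print(lookalike_ok)  # (False, True)
```

In practice a page is unlikely to contain such a lookalike link, which is why the unescaped form usually works anyway.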
Old 03-24-2006, 08:38 PM   #4
efra
Junior Member
Posts: 4
Karma: 10
Join Date: Mar 2006
Device: Palm TX
What DTM is explaining is very interesting, if a bit confusing, but my scenario is different. I am trying to create a Plucker file with all the daily Mass readings for the entire year. The URL that covers the whole year is http://www.aciprensa.com/calendario/ and when you select one particular day the URL changes to: http://www.aciprensa.com/calendario/...mes=3&ano=2006
Notice that the date is at the end of the URL (dia=25&mes=3&ano=2006). If I then select the print version from that page, the URL changes to http://www.aciprensa.com/utiles/myprint/print.php
But that is the same URL for every day, no matter the date!
Is there a way to ask Sunrise XP to create a Plucker document with the print version of the readings for the entire year?
If that cannot be done, that's OK; I will settle for instructions to get just the calendar part of the website. What I am doing now is providing the URL http://www.aciprensa.com/calendario/
with link depth 1 and "restrict to directory", but Sunrise is getting the entire home page of the website down to one level. Could you please help me out?
Thanks. Efra
Old 03-24-2006, 11:05 PM   #5
DTM
Interesting problem! The way they handle printing is like nothing I've seen before, and the normal type of procedure clearly won't work. I have come up with something that's awfully close, however.

For your Source URL, use

http://www.aciprensa.com/calendario/

Set link depth to 1 with no restriction.

Create a filter with the Pattern field set to:

http://www\.aciprensa\.com/calendario/calendario\.php(.*)

This will match all the calendar links and only the calendar links.

Select "Regular expression", "Filter all links" and "Include only links matching this pattern".

This does not get you the printer pages, but it does exclude all of the links from the left column, the ones that give you the whole website. Each calendar link works, although you do get a bunch of dead links at the top. On the whole, it's somewhat ugly, but it does get you everything in one document.

It's pretty huge. With pictures excluded, it comes out to just over a megabyte, but you'll only be updating once a year, so that's not too bad.
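
For readers who want to check which links an "include only" filter like this would keep, here is a minimal Python sketch. It is an approximation under stated assumptions, not Sunrise XP's actual behaviour: the function keep_link is a made-up name, and the sample links are invented to fit the pattern, since the real day links are truncated in the thread.

```python
import re

# Pattern from the post above.
CALENDAR = re.compile(
    r"http://www\.aciprensa\.com/calendario/calendario\.php(.*)"
)


def keep_link(link: str) -> bool:
    """Hypothetical helper: True if an include-only filter would keep this link."""
    return CALENDAR.fullmatch(link) is not None


# Sample links, invented for illustration; only the first fits the pattern.
links = [
    "http://www.aciprensa.com/calendario/calendario.php?dia=25&mes=3&ano=2006",
    "http://www.aciprensa.com/noticias/",
    "http://www.aciprensa.com/calendario/",
]

kept = [link for link in links if keep_link(link)]
print(kept)
```

Only the calendario.php link survives, which matches the description of dropping the left-column links to the rest of the site.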
Old 03-25-2006, 03:27 AM   #6
efra
Dear DTM,

It worked like a charm. Like you said, the calendar is not pretty, but I got my daily Mass readings for the entire year!
Thank you very much!!
Efra
Old 03-25-2006, 08:22 AM   #7
DTM
You're welcome! Glad I could help.

Just a clarification for those who are trying to understand the syntax of these things. In the above post I have Sunrise XP match the pattern:

http://www\.aciprensa\.com/calendario/calendario\.php(.*)

This works, but strictly speaking I should have written:

http://www\.aciprensa\.com/calendario/calendario\.php.*

The first expression matches anything at the end and saves the (.*) part so it can be referred to as $1 in the rewrite expression. The second expression, without the parentheses, would also match anything at the end, but does not save the ending part. Since I'm not rewriting the link, the result is the same whether I save the ending or not so there is no need to specify that it should be saved. This is why you will see some expressions with (.*) and others with just .*
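
The difference between the two forms can be sketched in Python as follows. This assumes Python's re module as a stand-in for the regular expression engine, and the sample link is made up for illustration.

```python
import re

with_group = re.compile(
    r"http://www\.aciprensa\.com/calendario/calendario\.php(.*)"
)
without_group = re.compile(
    r"http://www\.aciprensa\.com/calendario/calendario\.php.*"
)

# Made-up sample link that fits both patterns.
link = "http://www.aciprensa.com/calendario/calendario.php?x=1"

m1 = with_group.fullmatch(link)
m2 = without_group.fullmatch(link)

# Both expressions match the same link.
both_match = m1 is not None and m2 is not None

# Only the parenthesized form saves the ending for later use.
saved = m1.group(1)
group_counts = (with_group.groups, without_group.groups)

print(both_match)    # True
print(saved)         # ?x=1
print(group_counts)  # (1, 0)
```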