03-29-2007, 06:02 PM   #151
adinb

BTW, has anyone else noticed particularly strange behavior once you get more than 128 documents on a memory card? I haven't tried it on a Memory Stick, because I'm using a 2GB SD card with my reader.

Web2Book fails to auto-transfer files to my SD card whenever I hit the 128-file boundary (though it looks more like a bug in the Sony driver to me). I was wondering if anyone else was experiencing this before I post it to the general areas of the forums (possibly to add to the FAQs), and before I report it to Sony.

03-29-2007, 06:30 PM   #152
adinb

Quote:
Originally Posted by shawn
Can someone please give me some advice on formatting a particular page?
This is the page:
http://www.econlib.org/library/Mises/msStoc.html
Actually, after taking a look at the page, you might want to break up the book so that you have a query per section; web2book doesn't currently allow content extraction patterns to apply to followed links (AFAIK; geekraver, please correct me if I'm wrong).

All the chapters that belong to a particular section are on the same page anyway, so setting a content extraction pattern for the TOC and following links to a depth of 2 would result in a lot of duplicated content.

The prefaces/introduction are all on one page, each part/section is on one page, the conclusion is on one page, and all the appendices are on one page, so you'll end up with 11 entries, with a link depth of 1 and a fairly simple regex. This worked for part 1 and will probably work for the other chapters:
Code:
(<h2>.*<!--endofchap-->)
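
If you want to sanity-check a pattern like this outside web2book, here's a rough Python sketch (my own helper, not part of web2book; it assumes the pattern is applied with "." matching newlines, which is what you want for content that spans lines -- web2book's exact matching flags may differ, and the page encoding is guessed):

Code:
import re
import urllib.request

URL = "http://www.econlib.org/library/Mises/msStoc.html"  # the page from shawn's post
PATTERN = r"(<h2>.*<!--endofchap-->)"  # greedy: first <h2> through the last marker

html = urllib.request.urlopen(URL).read().decode("latin-1", errors="replace")
m = re.search(PATTERN, html, re.DOTALL)  # DOTALL so "." spans line breaks
print(m.group(1)[:500] if m else "no match -- check the raw page source")
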
I tried publishing this for you to just subscribe to, but publishing doesn't seem to be working ATM.

03-30-2007, 04:48 PM   #153
shawn

adinb, thank you very much for your help; it formatted the page very nicely.

03-31-2007, 05:10 AM   #154
adinb

No worries. Just glad to be of some help.

And if anyone else needs some regexp assistance, feel free to PM/email me.

I've been working with geekraver, and it appears that the bug I was seeing was really a much smaller bug in what gets put into the {0} field when the supplied link extraction regexp doesn't match anything (making it appear that the app is failing when the user's regex is really the problem).

I sent a report to Sony on the 128-file transfer bug. We'll see if that results in any Connect software changes, but until then, be careful that you don't have more than 127 feeds auto-updating on your memory card (and that you don't try to transfer more than 128 files at a time to a memory card in the Connect Reader software).

Publishing feeds is working great at the moment; everyone using web2book should start seeing a *wide* variety of feeds to subscribe to.

If anyone has a particular site that they'd like me to work on getting into the directory, please feel free to pm/email me.

-adin

04-01-2007, 03:41 PM   #155
fritz_the_blank

If someone could help me with this please, I would be much obliged:

Link: http://feeds.newsweek.com/Newsweek/CoverStory
Link Element: guid
Extractor Pattern: http://www.msnbc.msn.com/id/(\d+)/site/newsweek/?from=rss
Link Reformatter: {0}&displaymode=1098

I always get 0 articles regardless of the value set for days.

Thank you very much,

FtB

04-03-2007, 08:16 AM   #156
nmackay

Help on web2book for the less Geeky?

Like many others, I was surprised at how poor the Sony Connect software is for such a good unit, and delighted when I found web2book. I use it for several RSS feeds I watch. Now, I have used computers for probably longer than many of the contributors to this forum (as a Capetonian, Geekraver might like to know that at UCT in the early '70s I used to work with the Psychology Department main frame; and yes, the units were literally mounted on a frame). However, I do not have the knowledge to customize my feed/web information to pick out particular sub-feeds or threads (e.g. this MobileRead one here), or to manage one that needs a password. Is there any chance that someone might write a basic set of instructions for those like me? I expect that there are others who want this but feel too awed by the high geek quotient of the forum contributors to ask.

04-04-2007, 03:19 AM   #157
adinb

Quote:
Originally Posted by fritz_the_blank
If someone could help me with this please, I would be much obliged:

Link: http://feeds.newsweek.com/Newsweek/CoverStory
Are you wanting *just* the week's cover story? Here's what I came up with for this entry (please pardon any typos, since the Parallels clipboard isn't wanting to work tonight... but I did publish this particular feed, so a known-working version is available):

Code:
Link: http://feeds.newsweek.com/CoverStory 
Link Element: origLink
Link Extractor Pattern: id/(\d+)/site
Link Reformatter: http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/
Content Extraction Pattern: (<div class="caption">.*)
The process I go through to get all this stuff (I may break this into a few messages):

- I enter the RSS feed link (I try to get RSS 2.0 links, since some Atom date formats aren't completely supported by web2book), set the days to "0", and select Test. If the full content of the articles is in the feed and everything is good, you don't have to do anything other than select the number of days you want, name the entry, and check the "enabled" box. If you are just getting a small snippet and want additional content, you need to fill in the "Link Element" so that web2book knows which link to follow.

- Since you have to find the right link for web2book to follow, view the source of the feed. I do this by typing the URL of the feed into Firefox, right-clicking on the loaded page, and selecting "View Source". I then look for the tag in the page source that holds the "real" link to the story (not a link that goes through FeedBurner or some in-between website). In this case the source was really funky and tough to read, but the origLink tag had the real link... and presto, that's the "Link Element".
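
If you'd rather not eyeball the raw XML, here's a minimal Python sketch (my own helper, not part of web2book) that prints every element of the first item in a feed, so the tag holding the real link stands out:

Code:
import urllib.request
import xml.etree.ElementTree as ET

FEED = "http://feeds.newsweek.com/CoverStory"

root = ET.fromstring(urllib.request.urlopen(FEED).read())
item = root.find("./channel/item")  # first story in an RSS 2.0 feed
for child in item:
    # Namespaced tags print as {namespace-uri}localname, so a
    # FeedBurner origLink element shows up clearly here.
    print(child.tag, "=>", (child.text or "").strip()[:80])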

- The next step is to run the test again. The output will probably be weird, but if you have the correct link element, the log should show web2book following the link and then converting the raw HTML it fetched into a PDF.

- Assuming that web2book grabbed the article page that you wanted, you just have to figure out the "content extraction pattern" that will pull out the content without all the ads. Finding the correct regular expression is a bit of an art. I recommend using the regular expression helper in web2book's Tools menu to test and experiment until you find the right content extraction pattern. Copy the page source of the page that web2book grabbed the HTML from in the earlier steps into the Input field, type your regular expression into the RegExp field, and click Test. The "Group" field will show the HTML that would be sent on to be turned into a PDF. A good guide that I refer to for building regular expressions is http://www.regular-expressions.info/tutorial.html . This is *definitely* an art form, and you might want to search the net for other, more complete tools to assist in building regular expressions. I know that I put in about a full week's worth of time to spin myself back up on complex regexes.

***Tip: Test your regular expressions before even trying them in web2book. web2book just takes the regular expressions and applies them to the HTML, so even if you *think* you have it right (which I did many, many times when I didn't), you are probably missing a backslash or a parenthesis somewhere.

***Tip: If web2book doesn't actually generate a PDF during a test, take a look at the log. If the extracted link and the link reformatter both look good, then there is an error in your "content extraction pattern" regular expression. If you don't see a correct extracted or reformatted link, then there is an error in your "link element", your "link extractor pattern" regular expression, or your "link reformatter".

- If there is a "print me" link on the page and you want to use that page as your content source instead of the page at the destination of the "link element", then things get a little more complicated. You will have to find out whether you can jump to the print page by grabbing the article ID from the "link element" URL, or whether you have to look at the destination of the "link element" for the URL of the print page. In this example we can grab the article ID directly out of the link element URL using another regular expression ("id/(\d+)/site") and paste it into the middle of a fairly static URL for printing ("http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/").
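
For the curious, here's that munge redone as a small Python sketch; the article ID is a made-up example, and web2book itself uses the .NET regex engine, but the grouping works the same way:

Code:
import re

orig_link = "http://www.msnbc.msn.com/id/17853373/site/newsweek/"  # example ID only

extractor = r"id/(\d+)/site"  # Link Extractor Pattern
reformatter = "http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/"

m = re.search(extractor, orig_link)
if m:
    # The captured digits are what web2book drops into field {0}.
    print(reformatter.format(m.group(1)))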

If Newsweek weren't so nice and were instead complicated like Time, you would tick the box "Apply extractor to linked content instead of link text" and would have to write *another* regular expression, to be applied to the *content* of the *destination* of the "Link Element", to find the link to the printable version of the page. Take a look at the published Time feeds for a good example of having to go all the way down the rabbit hole to get to the printable version of a page.

Some sites just plain won't let an automated "scraper" program like web2book grab the printable versions of their pages. They may "lie", telling you they're going to the printable version of the page and not actually going there. It's tough to debug and requires a bit of intuition.

- Once you have the URL for the printable page, you still need a "Content Extraction Pattern" to apply to it; make sure that you exclude the "<title>" tag, or else you will have a funky title in the finished PDF.


So that's it for the moment (time for bed tonight), but hopefully this helps a little in getting a good page. I've published a lot of examples, so subscribe to a few feeds using the File | Subscribe command and take a look.

Good luck, and good hunting!

04-04-2007, 02:17 PM   #158
fritz_the_blank

Thank you for your detailed response.

Thank you also to GeekRaver for his/her work on this project.

As it turns out, I had the wrong URL for the cover story. The correct URL is http://feeds.newsweek.com/CoverStory and now things are working. However, I get the following error when testing:


See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.ComponentModel.Win32Exception: No application is associated with the specified file for this operation
at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start()
at web2book.Utils.RunExternalCommand(String cmd, String args, String workdir, Boolean useShell, Int32 timeout, String& output)
at web2book.MainForm.Test(ContentSourceList sourceClass, ContentSource source)
at web2book.MainForm.testButton_Click(Object sender, EventArgs e)
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)


************** Loaded Assemblies **************
mscorlib
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
Web2Book
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/Web2Book.exe
----------------------------------------
Utils
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/Utils.DLL
----------------------------------------
System.Windows.Forms
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Drawing
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
IHtmlConverter
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/IHtmlConverter.DLL
----------------------------------------
ISyncDevice
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ISyncDevice.DLL
----------------------------------------
ISource
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ISource.DLL
----------------------------------------
System.Configuration
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Configuration/2.0.0.0__b03f5f7f11d50a3a/System.Configuration.dll
----------------------------------------
System.Xml
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml/2.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------
ReadLit
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadLit.dll
----------------------------------------
ReadWeb
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadWeb.dll
----------------------------------------
ReadXWord
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadXWord.dll
----------------------------------------
ReadWeb
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadWeb.DLL
----------------------------------------
Accessibility
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/Accessibility/2.0.0.0__b03f5f7f11d50a3a/Accessibility.dll
----------------------------------------
writeHtmlDoc
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writeHtmlDoc.dll
----------------------------------------
WriteLRF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/WriteLRF.dll
----------------------------------------
writePDF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writePDF.dll
----------------------------------------
ITextSharpConverter
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ITextSharpConverter.DLL
----------------------------------------
itextsharp
Assembly Version: 3.1.8.0
Win32 Version: 3.1.8.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/itextsharp.DLL
----------------------------------------
writeRTF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writeRTF.dll
----------------------------------------
SyncPRS500
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/SyncPRS500.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.

For example:

<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.

04-04-2007, 02:29 PM   #159
fritz_the_blank

As an addendum to my last post, I am using slightly different settings than the ones that you posted for me. For comparison:

Mine:

LE: guid
LEP: http://www.msnbc.msn.com/id/(\d+)/site/newsweek/?from=rss
LR: {0}&displaymode=1098


Yours:

LE: origLink
LEP: id/(\d+)/site
LR: http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/
CEP: (<div class="caption">.*)

Testing mine returns one article; yours returns none at the moment (most likely, I am doing something wrong).

Thank you once again for all of your help.

04-04-2007, 05:25 PM   #160
adinb

There shouldn't be a problem with using the guid; in this case the guid and origLink are the same, though the guid has the isPermaLink="false" attribute. That usually doesn't matter, but I try not to use the guid when it carries that directive. It comes down to personal taste: tomahtoe, tomaytoe.

Your LEP regular expression should put only the ID itself into field {0}, so your LR should probably include the link up to the ID. If it's filling the entire HTML link into field {0}, the regex engine is being nice to you.

My regex only grabs what's directly around the digits, just because I try to leave as much room as possible for site changes; if the link format changes at all, your regex won't match. Mine isn't much more flexible, but either works; it's more a matter of taste.
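
To see what ends up in {0}: only the captured group (the digits) does, with either LEP. One regex gotcha worth flagging here: an unescaped "?" in a pattern is a quantifier ("preceding character optional"), not a literal question mark, so a literal query string needs "\?". A quick Python sketch (example link made up; the .NET engine groups the same way):

Code:
import re

# Example link only; a real origLink will carry the current article ID.
link = "http://www.msnbc.msn.com/id/17853373/site/newsweek/?from=rss"

m = re.search(r"id/(\d+)/site", link)
print(m.group(1))  # -> 17853373: only the captured group lands in {0}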

Your LR leaves a bunch of gunk at the bottom of the entry ("More from Newsweek Health"), so just make sure to adjust your CEP regular expression to account for the extra gunk. I left mine open-ended so that it'd be a little more flexible if the source HTML changes, but there's nothing wrong with putting something solid on the trailing part of your CEP regular expression.
You do need to include a CEP: when there's a <title> tag in the HTML that's sent to htmldoc, it'll make a title (even though the command line specifies "no title"). The title overrides the filename in the PRS-500's display, so in your "book" listing the entry will show as the contents of the title tag instead of "rss-Newsweek Cover".


There's only one article in the feed at a time, so one article is valid. I'll attribute any errors in my message to it being late; the entry that I published last night to Geekraver's server should be correct. I'm testing a date-format fix at the moment, so my copy may be parsing dates that V23 isn't.

04-04-2007, 08:58 PM   #161
fritz_the_blank

I just tried your settings again and they found an article, and the output from yours is soooo much cleaner!

I should be able to apply those settings to the remainder of the Newsweek feeds.

Thank you once again for your help.

FtB

04-05-2007, 05:27 PM   #162
geekraver

Quote:
Originally Posted by nmackay
[...] Is there any chance that someone might write a basic set of instructions for those like me?
He he - well, I do remember the old Sperry 1100, writing Fortran programs on punched cards.

By this stage Adin is probably more of an expert than I am. He gave a pretty detailed description of his approach (which I haven't yet read in detail). I'll add mine as it may be slightly different and have some value.

1. First you need the URL for the RSS feed of the site you care about. Enter it into your browser and look at the results. If the feed has the content you want, then all you really need to do is add the URL to web2book; you shouldn't even need to bother with the settings under 'Customize'.

2. Assuming the feed doesn't have the content you want (e.g. it has an excerpt that ends with "Read More" or something like that), you will need to customize. Typically I will at this point do two things:

i) right-click in the browser and select 'View Source', and look at the RSS XML to make sure that the permalink or other link uses an XML tag that web2book expects; you can see which ones web2book expects by going to Customize and clicking on Help. If the feed for some reason has an unusual XML element tag, you'll need to enter its name in the Link Element field.

ii) in the original page in the browser, click on the title link of the first story to have the browser load the referenced page. We now want to deal with this page, which we'll do in step 3.

3. If the page has a "Printable version" or "Print" link at the top or bottom, we probably want to use that version of the page, as it will have less fluff like ads that needs to be stripped out (if there is no such link, go to step 4). So we have to figure out how to get at the link for that version. I'll typically hover over the "Print" or "Printable Version" button/link and see in the status bar of the browser what the URL is for that version. We want to either munge the original article link into this new print one (which we might be able to do just with the link extraction pattern and link reformatter), or we may have to suck the link out of the page we are now viewing (which requires checking the checkbox that says "Apply extractor to linked content instead of link text"). In the latter case we have to look at the web page source, find the part that has the HREF for the printable version, and figure out a regexp pattern to get at it. Regexp patterns and reformatting are a whole separate topic that I will discuss later. Once the link extractor and link reformatter are done, we should have a URL that refers to the low-fluff version of the content. Load that content in your browser.

4. Now we want to remove ads, etc., from the page. You have to 'View Source' in your browser and look for the start and end of the content you care about. Then comes the tricky part, which is trying to find some unique delimiters that bracket this content. Once you've found these (and sometimes it isn't possible) you can create a content extraction pattern, and perhaps a content reformatter (if necessary) for getting the content out. A content reformatter is usually only useful if you need to rebalance some HTML tags in the extracted content, or in cases where the content extraction pattern is complex and extracts the content in multiple pieces ("groups") that must be reassembled, as in the sketch below.
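
As a toy illustration of a multi-group pattern plus a reformatter (invented HTML; web2book's {0}/{1} field syntax is assumed from the link reformatter examples above):

Code:
import re

# Invented page snippet: two content blocks with an ad between them.
html = ('<div class="story">Part one.</div>'
        '<div class="ad">buy stuff</div>'
        '<div class="story">Part two.</div>')

# Two groups keep the content; the ad between them is dropped.
pattern = r'(<div class="story">.*?</div>).*?(<div class="story">.*?</div>)'
reformatter = "{0}{1}"  # the content reformatter stitches the groups back together

m = re.search(pattern, html, re.DOTALL)
if m:
    print(reformatter.format(m.group(1), m.group(2)))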

The regular expression helper in the Tools menu is very useful for testing your regular expressions. You can do a 'View Source' in your browser, paste the full HTML content of a page into the Input box, enter your regular expression in the RegExp box, and click the Test button to see which parts of the page your pattern will extract. You must use grouping (done with parentheses) to specify the content you want to keep, and if you use more than one group you will need a reformatter to specify how the groups get put back into a single piece of text. When learning regexps, also pay attention to "greedy" (match as much text as possible) versus "non-greedy" (match as little text as possible) matching, as sometimes you need one style and sometimes the other; see the example below.
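
A two-line Python illustration of that greedy/non-greedy difference (the same rules apply to the patterns you type into web2book):

Code:
import re

html = "<p>first</p><p>second</p>"

print(re.search(r"<p>(.*)</p>", html).group(1))   # greedy: first</p><p>second
print(re.search(r"<p>(.*?)</p>", html).group(1))  # non-greedy: first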

If you're lucky you might find DIV HTML tags with "class" attributes that bracket the content you want; this is fairly common. Comment blocks are also commonly used to mark where the article content starts and ends. An excellent way to master this stuff is to look at the existing published feeds and work through them yourself, trying to understand how the existing settings make them tick. Do test them, though: websites change and some of the published entries may break, and you might go nuts trying to understand how something works when in fact it doesn't work any more!


04-05-2007, 09:13 PM   #163
nmackay

Thank you, Geekraver and AdinB, for those details and your work. I am already helped by your replies, and I hope others are too. I will spend some of the Easter weekend finding out how to extract more material from various pages.
NM

04-06-2007, 03:13 AM   #164
adinb


Well, I doubt that I'm all *that* much of an "expert" (I shudder at the word), but I'm glad that I've been able to help out.

And Geekraver has done an excellent job with web2book, and his post on his approach has *excellent* tips on what to look for. I know that it took me a while to realise that the "View Source Chart" Firefox extension was changing quote types in spans and divs. I still highly recommend the extension for quickly making sense of page source (at https://addons.mozilla.org/en-US/firefox/addon/655 ); just know that if your regex isn't working, check the raw page source.

And if there's anything I can do for anyone (help debug regexes, point to tutorials and tools for regexes on Windows or OS X), feel free to PM or email me.

-adin

04-06-2007, 07:13 PM   #165
InspectorGadget

If you'll pardon the remedial question, I can't even get the "Subscribe" function to work. When I click on "File | Subscribe", it just freezes up for 20 seconds and then comes up with an empty list in a "Subscribe to Feed" window (if I'm on the "Feed" tab). It does the same with any of the tabs in the main window. It's behaved this way consistently over the last few days, at home and at work.

I downloaded Web2Book from GeekRaver's original post in this thread, but everyone else says "rss2book". Do I have the correct program?

I downloaded the accessory DLLs, but it turned out I already had them. I installed the non-beta .NET Framework 2.0 fresh. I also downloaded and installed HtmlDoc, but I haven't gotten that far yet.

Any ideas to get me going?