03-29-2007, 06:02 PM   #151
adinb

BTW, has anyone else noticed particularly strange behavior once you get more than 128 documents on a memory card? I haven't tried it on a Memory Stick, because I'm using a 2GB SD card with my reader.

Web2Book fails to auto-transfer files to my SD card whenever I hit the 128-file boundary (though it looks more like a bug in the Sony driver to me). I was wondering if anyone else was experiencing this before I post it to the general areas of the forums (possibly to add to the FAQs), and before I report it to Sony.

03-29-2007, 06:30 PM   #152
adinb

Quote:
Originally Posted by shawn
Can someone please give me some advice on formatting a particular page?
This is the page:
http://www.econlib.org/library/Mises/msStoc.html
Actually, after taking a look at the page, you might want to break up the book so that you have a query per section; web2book doesn't currently allow content extraction patterns to apply to followed links (AFAIK; geekraver, please correct me if I'm wrong).

All the chapters that belong to a particular section are on the same page anyway, so setting a content extraction pattern for the TOC and following links to a depth of 2 would result in a lot of duplicated content.

The prefaces/introduction are all on one page, each part/section is on one page, the conclusion is on one page, and all the appendices are on one page, so you'll end up with 11 entries, with a link depth of 1 and a fairly simple regex. This worked for part 1 and will probably work for the other chapters:
Code:
(<h2>.*<!--endofchap-->)
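
If you want to sanity-check a pattern like this outside web2book, here's a rough Python sketch (my own helper, not part of web2book; it assumes the pattern is applied with "." matching newlines, which is what you want for content that spans lines -- web2book's exact matching flags may differ, and the page encoding is guessed):

Code:
import re
import urllib.request

URL = "http://www.econlib.org/library/Mises/msStoc.html"  # the page from shawn's post
PATTERN = r"(<h2>.*<!--endofchap-->)"  # greedy: first <h2> through the last marker

html = urllib.request.urlopen(URL).read().decode("latin-1", errors="replace")
m = re.search(PATTERN, html, re.DOTALL)  # DOTALL so "." spans line breaks
print(m.group(1)[:500] if m else "no match -- check the raw page source")
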
I tried publishing this for you to just subscribe to, but publishing doesn't seem to be working ATM.

03-30-2007, 04:48 PM   #153
shawn

adinb, thank you very much for your help; it formatted the page very nicely.

03-31-2007, 05:10 AM   #154
adinb

No worries. Just glad to be of some help.

And if anyone else needs some regexp assistance, feel free to PM/email me.

I've been working with geekraver, and it appears that the bug I was seeing was really a much smaller bug in what gets put into the {0} field when the supplied link extraction regexp doesn't match anything (making it appear that the app is failing when the user's regex is really the problem).

I sent a report to Sony on the 128-file transfer bug. We'll see if that results in any Connect software changes, but until then, be careful that you don't have more than 127 feeds auto-updating on your memory card (and that you don't try to transfer more than 128 files at a time to a memory card in the Connect Reader software).

Publishing feeds is working great at the moment; everyone using web2book should start seeing a *wide* variety of feeds to subscribe to.

If anyone has a particular site that they'd like me to work on getting into the directory, please feel free to pm/email me.

-adin

04-01-2007, 03:41 PM   #155
fritz_the_blank

If someone could help me with this please, I would be much obliged:

Link: http://feeds.newsweek.com/Newsweek/CoverStory
Link Element: guid
Extractor Pattern: http://www.msnbc.msn.com/id/(\d+)/site/newsweek/?from=rss
Link Reformatter: {0}&displaymode=1098

I always get 0 articles regardless of the value set for days.

Thank you very much,

FtB

04-03-2007, 08:16 AM   #156
nmackay

Help on web2book for the less Geeky?

Like many others, I was surprised at how poor the Sony Connect software is for such a good unit, and delighted when I found web2book. I use it for several RSS feeds I watch. Now, I have used computers for probably longer than many of the contributors to this forum (as a Capetonian, Geekraver might like to know that at UCT in the early '70s I used to work with the Psychology Department main frame; and yes, the units were literally mounted on a frame). However, I do not have the knowledge to customize my feed/web information to pick out particular sub-feeds or threads (e.g. this MobileRead one here), or to manage one that needs a password. Is there any chance that someone might write a basic set of instructions for those like me? I expect that there are others who want this but feel too awed by the high geek quotient of the forum contributors to ask.

04-04-2007, 03:19 AM   #157
adinb

Quote:
Originally Posted by fritz_the_blank
If someone could help me with this please, I would be much obliged:

Link: http://feeds.newsweek.com/Newsweek/CoverStory
Are you wanting *just* the week's cover story? Here's what I came up with for this entry (please pardon any typos, since the Parallels clipboard isn't wanting to work tonight... but I did publish this particular feed, so a known-working version is available):

Code:
Link: http://feeds.newsweek.com/CoverStory 
Link Element: origLink
Link Extractor Pattern: id/(\d+)/site
Link Reformatter: http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/
Content Extraction Pattern: (<div class="caption">.*)
The process I go through to get all this stuff (I may break this into a few messages):

- I enter the RSS feed link (I try to get RSS 2.0 links, since some Atom date formats aren't completely supported by web2book), set the days to "0", and select Test. If the full content of the articles is in the feed and everything is good, you don't have to do anything other than select the number of days you want, name the entry, and check the "enabled" box. If you are just getting a small snippet and want additional content, you need to fill in the "Link Element" so that web2book knows which link to follow.

- Since you have to find the right link for web2book to follow, view the source of the feed. I do this by typing the URL of the feed into Firefox, right-clicking on the loaded page, and selecting "View Source". I then look for the tag in the page source that holds the "real" link to the story (not a link that goes through FeedBurner or some in-between website). In this case the source was really funky and tough to read, but the origLink tag had the real link... and presto, that's the "Link Element".
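
If you'd rather not eyeball the raw XML, here's a minimal Python sketch (my own helper, not part of web2book) that prints every element of the first item in a feed, so the tag holding the real link stands out:

Code:
import urllib.request
import xml.etree.ElementTree as ET

FEED = "http://feeds.newsweek.com/CoverStory"

root = ET.fromstring(urllib.request.urlopen(FEED).read())
item = root.find("./channel/item")  # first story in an RSS 2.0 feed
for child in item:
    # Namespaced tags print as {namespace-uri}localname, so a
    # FeedBurner origLink element shows up clearly here.
    print(child.tag, "=>", (child.text or "").strip()[:80])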

- The next step is to run the test again. The output will probably be weird, but if you have the correct link element, the log should show web2book following the link and then converting the raw HTML it fetched into a PDF.

- Assuming that web2book grabbed the article page that you wanted, you just have to figure out the "content extraction pattern" that will pull out the content without all the ads. Finding the correct regular expression is a bit of an art. I recommend using the regular expression helper in web2book's Tools menu to test and experiment until you find the right content extraction pattern. Copy the page source of the page that web2book grabbed the HTML from in the earlier steps into the Input field, type your regular expression into the RegExp field, and click Test. The "Group" field will show the HTML that would be sent on to be turned into a PDF. A good guide that I refer to for building regular expressions is http://www.regular-expressions.info/tutorial.html . This is *definitely* an art form, and you might want to search the net for other, more complete tools to assist in building regular expressions. I know that I put in about a full week's worth of time to spin myself back up on complex regexes.

***Tip: Test your regular expressions before even trying them in web2book. web2book just takes the regular expressions and applies them to the HTML, so even if you *think* you have it right (which I did many, many times when I didn't), you are probably missing a backslash or a parenthesis somewhere.

***Tip: If web2book doesn't actually generate a PDF during a test, take a look at the log. If the extracted link and the link reformatter both look good, then there is an error in your "content extraction pattern" regular expression. If you don't see a correct extracted or reformatted link, then there is an error in your "link element", your "link extractor pattern" regular expression, or your "link reformatter".

- If there is a "print me" link on the page and you want to use that page as your content source instead of the page at the destination of the "link element", then things get a little more complicated. You will have to find out whether you can jump to the print page by grabbing the article ID from the "link element" URL, or whether you have to look at the destination of the "link element" for the URL of the print page. In this example we can grab the article ID directly out of the link element URL using another regular expression ("id/(\d+)/site") and paste it into the middle of a fairly static URL for printing ("http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/").
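
For the curious, here's that munge redone as a small Python sketch; the article ID is a made-up example, and web2book itself uses the .NET regex engine, but the grouping works the same way:

Code:
import re

orig_link = "http://www.msnbc.msn.com/id/17853373/site/newsweek/"  # example ID only

extractor = r"id/(\d+)/site"  # Link Extractor Pattern
reformatter = "http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/"

m = re.search(extractor, orig_link)
if m:
    # The captured digits are what web2book drops into field {0}.
    print(reformatter.format(m.group(1)))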

If Newsweek weren't so nice and were instead complicated like Time, you would tick the box "Apply extractor to linked content instead of link text" and would have to write *another* regular expression, to be applied to the *content* of the *destination* of the "Link Element", to find the link to the printable version of the page. Take a look at the published Time feeds for a good example of having to go all the way down the rabbit hole to get to the printable version of a page.

Some sites just plain won't let an automated "scraper" program like web2book grab the printable versions of their pages. They may "lie", telling you they're going to the printable version of the page and not actually going there. It's tough to debug and requires a bit of intuition.

- Once you have the URL for the printable page, you still need a "Content Extraction Pattern" to apply to it; make sure that you exclude the "<title>" tag, or else you will have a funky title in the finished PDF.


So that's it for the moment (time for bed tonight), but hopefully this helps a little in getting a good page. I've published a lot of examples, so subscribe to a few feeds using the File | Subscribe command and take a look.

Good luck, and good hunting!

04-04-2007, 02:17 PM   #158
fritz_the_blank

Thank you for your detailed response.

Thank you also to GeekRaver for his/her work on this project.

As it turns out, I had the wrong URL for the cover story. The correct URL is http://feeds.newsweek.com/CoverStory and now things are working. However, I get the following error when testing:


See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.ComponentModel.Win32Exception: No application is associated with the specified file for this operation
at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start()
at web2book.Utils.RunExternalCommand(String cmd, String args, String workdir, Boolean useShell, Int32 timeout, String& output)
at web2book.MainForm.Test(ContentSourceList sourceClass, ContentSource source)
at web2book.MainForm.testButton_Click(Object sender, EventArgs e)
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)


************** Loaded Assemblies **************
mscorlib
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
Web2Book
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/Web2Book.exe
----------------------------------------
Utils
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/Utils.DLL
----------------------------------------
System.Windows.Forms
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Drawing
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
IHtmlConverter
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/IHtmlConverter.DLL
----------------------------------------
ISyncDevice
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ISyncDevice.DLL
----------------------------------------
ISource
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ISource.DLL
----------------------------------------
System.Configuration
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Configuration/2.0.0.0__b03f5f7f11d50a3a/System.Configuration.dll
----------------------------------------
System.Xml
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml/2.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------
ReadLit
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadLit.dll
----------------------------------------
ReadWeb
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadWeb.dll
----------------------------------------
ReadXWord
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadXWord.dll
----------------------------------------
ReadWeb
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadWeb.DLL
----------------------------------------
Accessibility
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/Accessibility/2.0.0.0__b03f5f7f11d50a3a/Accessibility.dll
----------------------------------------
writeHtmlDoc
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writeHtmlDoc.dll
----------------------------------------
WriteLRF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/WriteLRF.dll
----------------------------------------
writePDF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writePDF.dll
----------------------------------------
ITextSharpConverter
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ITextSharpConverter.DLL
----------------------------------------
itextsharp
Assembly Version: 3.1.8.0
Win32 Version: 3.1.8.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/itextsharp.DLL
----------------------------------------
writeRTF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writeRTF.dll
----------------------------------------
SyncPRS500
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/SyncPRS500.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.

For example:

<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.

04-04-2007, 02:29 PM   #159
fritz_the_blank

As an addendum to my last post, I am using slightly different settings than the ones that you posted for me. For comparison:

Mine:

LE: guid
LEP: http://www.msnbc.msn.com/id/(\d+)/site/newsweek/?from=rss
LR: {0}&displaymode=1098


Yours:

LE: origLink
LEP: id/(\d+)/site
LR: http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/
CEP: (<div class="caption">.*)

Testing mine returns one article; yours returns none at the moment (most likely, I am doing something wrong).

Thank you once again for all of your help.

04-04-2007, 05:25 PM   #160
adinb

There shouldn't be a problem with using the guid; in this case the guid and origLink are the same, though the guid has the isPermaLink="false" attribute. That usually doesn't matter, but I try not to use the guid when it carries that directive. It comes down to personal taste: tomahtoe, tomaytoe.

Your LEP regular expression should put only the ID itself into field {0}, so your LR should probably include the link up to the ID. If it's filling the entire HTML link into field {0}, the regex engine is being nice to you.

My regex only grabs what's directly around the digits, just because I try to leave as much room as possible for site changes; if the link format changes at all, your regex won't match. Mine isn't much more flexible, but either works; it's more a matter of taste.
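
To see what ends up in {0}: only the captured group (the digits) does, with either LEP. One regex gotcha worth flagging here: an unescaped "?" in a pattern is a quantifier ("preceding character optional"), not a literal question mark, so a literal query string needs "\?". A quick Python sketch (example link made up; the .NET engine groups the same way):

Code:
import re

# Example link only; a real origLink will carry the current article ID.
link = "http://www.msnbc.msn.com/id/17853373/site/newsweek/?from=rss"

m = re.search(r"id/(\d+)/site", link)
print(m.group(1))  # -> 17853373: only the captured group lands in {0}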

Your LR leaves a bunch of gunk at the bottom of the entry ("More from Newsweek Health"), so just make sure to adjust your CEP regular expression to account for the extra gunk. I left mine open-ended so that it'd be a little more flexible if the source HTML changes, but there's nothing wrong with putting something solid on the trailing part of your CEP regular expression.
You do need to include a CEP: when there's a <title> tag in the HTML that's sent to htmldoc, it'll make a title (even though the command line specifies "no title"). The title overrides the filename in the PRS-500's display, so in your "book" listing the entry will show as the contents of the title tag instead of "rss-Newsweek Cover".


There's only one article in the feed at a time, so one article is valid. I'll attribute any errors in my message to it being late; the entry that I published last night to Geekraver's server should be correct. I'm testing a date-format fix at the moment, so my copy may be parsing dates that V23 isn't.

04-04-2007, 08:58 PM   #161
fritz_the_blank

I just tried your settings again and they found an article, and the output from yours is soooo much cleaner!

I should be able to apply those settings to the remainder of the Newsweek feeds.

Thank you once again for your help.

FtB

04-05-2007, 05:27 PM   #162
geekraver

Quote:
Originally Posted by nmackay
[...] Is there any chance that someone might write a basic set of instructions for those like me?
He he - well, I do remember the old Sperry 1100, writing Fortran programs on punched cards.

By this stage Adin is probably more of an expert than I am. He gave a pretty detailed description of his approach (which I haven't yet read in detail). I'll add mine as it may be slightly different and have some value.

1. First you need the URL for the RSS feed of the site you care about. Enter it into your browser and look at the results. If the feed has the content you want, then all you really need to do is add the URL to web2book; you shouldn't even need to bother with the settings under 'Customize'.

2. Assuming the feed doesn't have the content you want (e.g. it has an excerpt that ends with "Read More" or something like that), you will need to customize. Typically I will at this point do two things:

i) right-click in the browser and select 'View Source', and look at the RSS XML to make sure that the permalink or other link uses an XML tag that web2book expects; you can see which ones web2book expects by going to Customize and clicking on Help. If the feed for some reason has an unusual XML element tag, you'll need to enter its name in the Link Element field.

ii) in the original page in the browser, click on the title link of the first story to have the browser load the referenced page. We now want to deal with this page, which we'll do in step 3.

3. If the page has a "Printable version" or "Print" link at the top or bottom, we probably want to use that version of the page, as it will have less fluff like ads that needs to be stripped out (if there is no such link, go to step 4). So we have to figure out how to get at the link for that version. I'll typically hover over the "Print" or "Printable Version" button/link and see in the status bar of the browser what the URL is for that version. We want to either munge the original article link into this new print one (which we might be able to do just with the link extraction pattern and link reformatter), or we may have to suck the link out of the page we are now viewing (which requires checking the checkbox that says "Apply extractor to linked content instead of link text"). In the latter case we have to look at the web page source, find the part that has the HREF for the printable version, and figure out a regexp pattern to get at it. Regexp patterns and reformatting are a whole separate topic that I will discuss later. Once the link extractor and link reformatter are done, we should have a URL that refers to the low-fluff version of the content. Load that content in your browser.

4. Now we want to remove ads, etc., from the page. You have to 'View Source' in your browser and look for the start and end of the content you care about. Then comes the tricky part, which is trying to find some unique delimiters that bracket this content. Once you've found these (and sometimes it isn't possible) you can create a content extraction pattern, and perhaps a content reformatter (if necessary) for getting the content out. A content reformatter is usually only useful if you need to rebalance some HTML tags in the extracted content, or in cases where the content extraction pattern is complex and extracts the content in multiple pieces ("groups") that must be reassembled, as in the sketch below.
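
As a toy illustration of a multi-group pattern plus a reformatter (invented HTML; web2book's {0}/{1} field syntax is assumed from the link reformatter examples above):

Code:
import re

# Invented page snippet: two content blocks with an ad between them.
html = ('<div class="story">Part one.</div>'
        '<div class="ad">buy stuff</div>'
        '<div class="story">Part two.</div>')

# Two groups keep the content; the ad between them is dropped.
pattern = r'(<div class="story">.*?</div>).*?(<div class="story">.*?</div>)'
reformatter = "{0}{1}"  # the content reformatter stitches the groups back together

m = re.search(pattern, html, re.DOTALL)
if m:
    print(reformatter.format(m.group(1), m.group(2)))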

The regular expression helper in the Tools menu is very useful for testing your regular expressions. You can do a 'View Source' in your browser, paste the full HTML content of a page into the Input box, enter your regular expression in the RegExp box, and click the Test button to see which parts of the page your pattern will extract. You must use grouping (done with parentheses) to specify the content you want to keep, and if you use more than one group you will need a reformatter to specify how the groups get put back into a single piece of text. When learning regexps, also pay attention to "greedy" (match as much text as possible) versus "non-greedy" (match as little text as possible) matching, as sometimes you need one style and sometimes the other; see the example below.
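
A two-line Python illustration of that greedy/non-greedy difference (the same rules apply to the patterns you type into web2book):

Code:
import re

html = "<p>first</p><p>second</p>"

print(re.search(r"<p>(.*)</p>", html).group(1))   # greedy: first</p><p>second
print(re.search(r"<p>(.*?)</p>", html).group(1))  # non-greedy: first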

If you're lucky you might find DIV HTML tags with "class" attributes that bracket the content you want; this is fairly common. Comment blocks are also commonly used to mark where the article content starts and ends. An excellent way to master this stuff is to look at the existing published feeds and work through them yourself, trying to understand how the existing settings make them tick. Do test them, though: websites change and some of the published entries may break, and you might go nuts trying to understand how something works when in fact it doesn't work any more!


04-05-2007, 09:13 PM   #163
nmackay

Thank you, Geekraver and AdinB, for those details and your work. I am already helped by your replies, and I hope others are too. I will spend some of the Easter weekend finding out how to extract more material from various pages.
NM

04-06-2007, 03:13 AM   #164
adinb


Well, I doubt that I'm all *that* much of an "expert" (I shudder at the word), but I'm glad that I've been able to help out.

And Geekraver has done an excellent job with web2book, and his post on his approach has *excellent* tips on what to look for. I know that it took me a while to realise that the "View Source Chart" Firefox extension was changing quote types in spans and divs. I still highly recommend the extension for quickly making sense of page source (at https://addons.mozilla.org/en-US/firefox/addon/655 ); just know that if your regex isn't working, check the raw page source.

And if there's anything I can do for anyone (help debug regexes, point to tutorials and tools for regexes on Windows or OS X), feel free to PM or email me.

-adin

04-06-2007, 07:13 PM   #165
InspectorGadget

If you'll pardon the remedial question, I can't even get the "Subscribe" function to work. When I click on "File | Subscribe", it just freezes up for 20 seconds and then comes up with an empty list in a "Subscribe to Feed" window (if I'm on the "Feed" tab). It does the same with any of the tabs in the main window. It's behaved this way consistently over the last few days, at home and at work.

I downloaded Web2Book from GeekRaver's original post in this thread, but everyone else says "rss2book". Do I have the correct program?

I downloaded the accessory DLLs, but it turned out I already had them. I installed the non-beta .NET Framework 2.0 fresh. I also downloaded and installed HtmlDoc, but I haven't gotten that far yet.

Any ideas to get me going?