View Full Version : Rss2Book


geekraver
10-08-2006, 04:34 AM
Hi all

Here's a program (http://www.download.com/3000-20-10649163.html?part=undefined&subj=dl&tag=button) to make HTML, RTF, LRF or PDF files (the latter supports rich formatting if you have htmldoc installed) from RSS feeds and other websites. You need .NET Framework 2.0 or later installed to run it. PDF output uses the iso-8859-15 character set, so some European languages are supported.

The program can write the output files on your PC or sync them directly to the Sony Reader over USB.

Just go to Tools-Options and make sure the options are set the way you want them, add a bunch of RSS feeds to the datagrid on the main window, and hit Go!

If you want to use feeds that others have already set up, open the File menu, and select Subscribe. You'll be shown the set of available published feeds. Click the checkboxes next to the ones you want and click the Subscribe button, and they'll be added to your setup. Note that you can subscribe separately to webpage entries from the webpages tab.

Attached are three screenshots; if all you want to do is look at RSS content then the last screenshot covers most of what you'll deal with (once you've checked the options in the first screenshot). The complex-looking dialog in the middle is for extracting full HTML from RSS feeds that only include summaries; with some tweaking you can get the app to fetch full content with ads and other noise stripped.

Non-geeks can stop reading here and should just try the app out using the Subscribe facility in the File menu. Hardcore geeks who understand regular expressions, read on for details of how to add new feeds that no-one has published yet.

rss2book supports a fairly powerful extension mechanism. Selecting a feed entry and clicking the Customize button brings up the advanced settings. Once in this property view you can also use the Test button to test your configuration for that feed; if all is well it will eventually open your PDF reader with the output for that site. A fairly detailed log is also generated to help troubleshooting. Once you are satisfied with the results for the entry you created, you can share it with others by clicking the Publish button.

The properties are mostly to support getting full versions of articles, possibly via modified links that point to lower noise printable versions, and extracting a subset of the article HTML (to skip ads, etc).

The various properties for Feeds are:

Url - pretty obvious; this is the RSS feed URL.

Enabled - whether to include this feed when you click on Go! from the main view.

Days - how many days back to go when using RSS entries.

Content Element - in most cases you can leave this blank; if specified (and if the Link Element field described below is blank) then the body of the element with this name will be used for the article text. If blank then rss2book will look for any of 'description', 'summary' or 'content'.

Link Element - the element in the RSS feed that specifies the link to the full article. Don't specify anything here unless you actually want the full article; if you do, this will typically be either 'link' or 'guid' for most RSS feeds.

Link Extractor Pattern - this is an optional regular expression that will be applied to the link element to parse it into a collection of one or more substrings. You need to use unnamed groups (i.e. bits of regular expression pattern enclosed in parentheses) to identify the various substrings. If you leave this blank the original link will be used to create a single-element collection. Two simple examples:

(\d+) - will extract the first sequence of numbers found in the link element

http://(.*) - will strip off the leading http:// from the link element

Apply extractor to linked content instead of link text - if this is checked, the extractor pattern above is not applied to the link; instead, the link is followed, the web page at that link is retrieved, and the extractor pattern is applied to the contents of that page. This is useful, for example, for extracting 'printable version' URLs from article pages when there is no simple textual mapping from an article URL to the corresponding 'printable version' URL, but the 'printable version' URL is contained in the article page (tip: for web pages that have printable versions, the printable version is preferable).

Link Formatter - this is a format string that gets used to create a new link from the collection created above by the link extractor. It consists of a string with parameters {0}, {1}, {2}, etc, which are expanded to the various substrings in the collection. If you leave it blank that is equivalent to "{0}" - i.e. just use the first substring.

Content Extraction Pattern - this is a regular expression that is applied to the article content HTML from the previous step. It should have a single unnamed group; the text that matches that group is used as the final article content HTML. If left blank then the full article content from the link processing step is used.

Content Reformatter - this is similar to the Link Formatter. It can be used to wrap or insert some additional HTML around the content extracted by the pattern in the last step. If left blank it has no effect. Once again positional parameters {0}, ... identify the matched groups from the content extraction step.
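For those who want to see how these pieces fit together in code, here is a minimal C# sketch, assuming .NET-style Regex groups and String.Format-style placeholders (the names and sample values are illustrative only, not rss2book's actual internals):

using System;
using System.Text.RegularExpressions;

class FeedPipelineSketch
{
    static void Main()
    {
        // Hypothetical values based on the Slate example further down.
        string link = "http://www.slate.com/id/2152452/fr/rss/";

        // Link Extractor Pattern: each unnamed group becomes one substring in the collection.
        Match m = Regex.Match(link, @"(\d+)");
        string[] parts = new string[m.Groups.Count - 1];
        for (int i = 1; i < m.Groups.Count; i++)
            parts[i - 1] = m.Groups[i].Value;

        // Link Formatter: {0}, {1}, ... expand to the extracted substrings.
        string articleUrl = string.Format(
            "http://www.slate.com/toolbar.aspx?action=read&id={0}", parts);
        Console.WriteLine(articleUrl); // -> ...toolbar.aspx?action=read&id=2152452

        // Content Extraction Pattern: group 1 of the match against the fetched page
        // becomes the article body; the Content Reformatter then wraps it using {0}.
        string html = "<html><font>story text</font>Article URL: ...</html>";
        Match body = Regex.Match(html, "(<font.*)Article URL");
        if (body.Success)
            Console.WriteLine(string.Format("<div>{0}</div>", body.Groups[1].Value));
    }
}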

The Tools menu has a regular expression tester that you may find helpful when doing advanced feed setups.

Okay, this probably sounds more complicated than it is, so here are some examples:

Name: BBC News
URL: http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml
Link Element: guid
Link Extractor Pattern: http://(.*)
Link Reformatter: http://newsvote.bbc.co.uk/mpapps/pagetools/print/{0}
Content Extraction Pattern:

i.e. get the RSS feed from the URL, pull out the links in the 'guid' elements, strip off the 'http://' part, prepend 'http://newsvote.bbc.co.uk/mpapps/pagetools/print/', then get the HTML at that link.

Name: Slate
URL: http://www.slate.com/rss/
Link Element: link
Link Extractor Pattern: (\d+)
Link Reformatter: http://www.slate.com/toolbar.aspx?action=read&id={0}
Content Extraction Pattern: (\<font.*)Article URL

I.e. get the RSS from http://www.slate.com/rss/, pull out each 'link' element, extract the sequence of digits from such an element and append it to 'http://www.slate.com/toolbar.aspx?action=read&id=', fetch the HTML at that URL, then extract everything starting from the first '<font>' tag up to but not including the text 'Article URL'.

Name: Reuters Top News
URL: http://feeds.reuters.com/reuters/topNews/
Link Element: guid
Link Reformatter: http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=topNews&storyID={0}
Content Extraction Pattern: (<span class=\"artTitle.*)</td>

i.e. get the RSS at http://feeds.reuters.com/reuters/topNews/, pull out each guid element, append the guid to 'http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=topNews&storyID=', get the HTML at that URL, then keep everything from the '<span class="artTitle' tag up to but not including the final '</td>'.

If you want to put Wikipedia articles on your reader, use something like:

URL: http://en.wikipedia.org/wiki/Nikola_Tesla
HTML: checked
Content Extraction Pattern: <!-- start content -->(.*)<!-- end content -->

Website entries support some metacharacters in the URL for dates, namely @yyyy, @yy, @mm and @dd. These are expanded to the year, month or day (either 4 or 2 digits for the year; two digits for the others). If you specify a Number Of Days entry, then the URL will be expanded for each day in the range and the contents for each day will be concatenated, starting with the oldest and ending with the current day. For example, the following will get one week of Dilbert comic strips (a rough code sketch of the expansion follows the example):

Url: http://www.unitedmedia.com/comics/dilbert/archive/dilbert-@yyyy@mm@dd.html
Number Of Days: 6
Content Extractor Pattern: (<IMG SRC="/comics/dilbert/archive/images/dilbert[^>]*>)
Content Reformatter: {0}<br>
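For anyone curious how that date expansion might be done, here is a rough C# sketch of the @yyyy/@mm/@dd substitution over a day range (an illustration only, not the actual rss2book implementation):

using System;

class DateTokenSketch
{
    static void Main()
    {
        string template = "http://www.unitedmedia.com/comics/dilbert/archive/dilbert-@yyyy@mm@dd.html";
        int numberOfDays = 6;

        // Expand the tokens for each day in range, oldest first, ending with today.
        for (int back = numberOfDays; back >= 0; back--)
        {
            DateTime day = DateTime.Today.AddDays(-back);
            string url = template
                .Replace("@yyyy", day.ToString("yyyy"))  // 4-digit year (do this before @yy)
                .Replace("@yy", day.ToString("yy"))      // 2-digit year
                .Replace("@mm", day.ToString("MM"))      // 2-digit month
                .Replace("@dd", day.ToString("dd"));     // 2-digit day
            Console.WriteLine(url);
        }
    }
}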

geekraver
10-10-2006, 02:36 AM
Bump, as editing the posting and attachments didn't bump the thread.

heavyB
10-27-2006, 03:11 PM
I'm surprised by the lack of comment on this app... This is most likely the coolest side function for the Reader yet! The RSS included with Sony's CONNECT software is basically broken compared to this.

Thanks Geekraver! You have a fan...

Bob Russell
10-27-2006, 04:41 PM
I subscribed to this thread in anticipation of trying it. I agree that RSS solutions are a big deal. Many like me are probably just waiting a bit to see what the best and easiest solutions turn out to be.

So I guess I have the same question... where are all the early software adopters with reports on how great this is? ;-)

Let me clarify.. sometimes things don't always come out the way you mean them... I'm not in any way trying to say that if it was good we'd see early adopter reports. I am trying to say I'm surprised that there aren't more people eager to try this because it sounds so good!

It makes you really cringe when you realize something you meant in a positive way can sound so negative! Sorry for the ambiguity. :)

heavyB
10-27-2006, 07:50 PM
I could definitely add more info on the use of this app; good pointer, Bob.

It did take a .NET update on my XP machine, and I had to track down the open-source version of HTMLdoc. I do admit to being a nerd for a living too, so this might be a bit more daunting for others.

After getting it all together, I tried the app and didn't think it worked, but on a reboot, it did work. Flawlessly.

I'll see if I can't backtrack my steps and provide some links if others are having a problem getting this going. (pipe up if you are)

The real strength of this app is the table of contents it builds. It separates each RSS site and gives a sub table of contents for that selection. I wasn't even aware the Reader would do this until I used RSS2Book. Then, once viewing, each article in the feed is well formatted, with what appears to be a solid effort to justify each article when possible. Too cool.

geekraver
10-28-2006, 02:46 AM
Glad you like it! I just wish I could figure out why my C# code to interface with the Reader is busted (I've pored over Igor's Python code and can't see what I'm doing wrong, but I get weird errors). If I get past that hurdle then I will finally get the app to sync straight to the Reader.

I'm also interested in how folks think the app should deal with successive updates. I tend to run it about once every three days and replace the old rss2book.pdf on my reader with a new one, but I have to remember when I last ran it. I could generate a separate PDF for each day. Any other ideas? Unfortunately it doesn't seem like you can delete files on the reader itself without attaching it to the PC, or I would definitely go the single-day-at-a-time route and just keep dumping files whenever the reader is attached, relying on them being deleted on the Reader manually once they are read.

One other problem I've noticed is that with a couple of sites I get weird characters when viewing on the reader. I suspect this is a character set issue. Perhaps it could be worked around by using embedded fonts. Anyone else notice this?

Bob Russell
10-28-2006, 02:50 AM
This may be way off base, but would it make sense to keep the last n files or n days? Then a daily load with "keep 7 days" would give you anything that was generated within the last week. If the date was part of the filename, even better. And still better yet if only posts that haven't been downloaded are included in successive files.

geekraver
10-28-2006, 05:25 AM
Well, I found my bugs in prsutils (silly me wasn't packing the structs; my C has really gotten rusty). Should finally be able to get the synching done straight from rss2book within a few days if I can snatch some time from work/family/other projects.

Sam
10-28-2006, 10:00 AM
I'm a new adopter to this Sony Reader thing, and to RSS as well. This looks like a great app for me, to maximize my usage of the Reader- the idea of just loading a completed PDF from the previous day of RSS feeds in the morning on the way to work sounds great. I'm already sold on the thing since I do a lot of traveling and not having to carry around an extensive library is great, but if I can also throw my news/blog/internet reads on there, that's even better.

Is there any way someone could write a more detailed explanation of how to install and use these programs? I found a version of HTMLdoc, but can't figure out how to use or install it. And I don't know how to put the RSS feeds into your program, much less make it produce a beautiful PDF like you made. I was getting confused by the three lines in the Options menu: the middle one is the formatting details, but I wasn't quite sure how to set up the first and third. And like I said, I don't know if it even works if you haven't installed HTMLdoc, which I'm trying to look over at the moment.

I agree with HeavyB that this app really makes the reader very cool for users like myself. Appreciate any help you can offer for less literate users like myself.

Sam
10-28-2006, 01:39 PM
Think the problem I'm having is I don't know how to install HTMLdoc. Usually I just look for the .exe, but can't seem to find it.

Laurens
10-28-2006, 01:50 PM
I think people would be greatly helped if an HTMLDoc binary was bundled with the distribution.

huari
10-28-2006, 02:51 PM
I haven't tried these yet but if I get one to work, I'll post.
Looks like htmldoc needs some libraries, but it is a bit Martian to me.

http://mamboxchange.com/projects/htmldoc/
http://www.paehl.com/open_source/?HTMLDOC_1.8.x_OpenSource_Version
http://tecfa.unige.ch/guides/utils/htmldoc/1-compiling.html
http://www.htmldoc.org/documentation.php/CompilingunderUNIXLinux.html

Best,
tony

neilm2
10-28-2006, 03:11 PM
Geekraver's rss2book app is really great! I can't recommend it enough. It took me about 15 minutes to get my PC set up to use it.

First I downloaded and installed .NET framework 2.0 here...

http://msdn2.microsoft.com/en-us/netframework/aa731542.aspx

Then I downloaded and installed the Open Source version of HTMLDOC (and the two required DLLs) from here...

http://www.paehl.com/open_source/?HTMLDOC_1.8.x_OpenSource_Version

Then I downloaded and installed rss2book from the top post of this forum thread.

Now, I'm reading all my favorite RSS feeds as PDF ebooks on my Sony Reader. I agree with HeavyB above that the virtuoso touch is getting a table of contents of your RSS feeds in the PDF. Try it. You'll like it!

Sam
10-28-2006, 04:14 PM
Um... how do you use the dlls? And what should the output path be? Arghh....

geekraver
10-28-2006, 05:42 PM
There are places you can download it, such as here (http://fresh.t-systems-sfr.com/pc/src/www/) . However, if you install a prebuilt version it may be the commercial version which has some restrictions I believe. I don't know whether the free version is available precompiled or not.

Sam
10-28-2006, 06:07 PM
Looks like I have HTMLdoc now in place. But the third line in options, what do you put in the output path? Is that where the formed PDF will appear- on the desktop? Or something else?

geekraver
10-28-2006, 06:37 PM
Yes, the output path is wherever you want the generated file to go (your desktop, for example).

Sam
10-28-2006, 06:59 PM
I think it's working now, just want to do some more tests. It came up as an HTML, now I'm testing it as a PDF.

Sam
10-28-2006, 07:06 PM
I'm not doing something right. HTML worked fine - landed on the desktop. But the PDF won't come. The program seems to freeze for a few seconds, like it's working hard, but then nothing appears.

No wonder! The HTMLdoc wants a license number. Back to square one, finding an open source HTMLdoc that I can figure out how to assemble...

heavyB
10-28-2006, 09:26 PM
It wasn't easy finding those pre-compiled open-source binaries for HTMLdoc. I found where I got mine again. As far as I can tell it works the same as the commercial version; only the graphical interface says "open source" instead of "commercial version".

http://users.tpg.com.au/naffall/htmldoc.html

Click the "HTMLDoc1.8.24.zip" link. I think you'll need to execute (double click) the "HtmlDoc.reg" file to regeister the app with windows. You may also need to edit that .reg file for the path you unzipped this download to. Follow the instructions on the website above.

neilm2
10-29-2006, 12:03 AM
I had a similar issue with html working but not the pdfs. Then I installed the required dlls from here...

http://www.paehl.com/open_source/?HTMLDOC_1.8.x_OpenSource_Version

(you just download them from the above url and drag the dlls into your WINDOWS folder)

geekraver
10-29-2006, 04:04 AM
This may be way off base, but would it make sense to keep the last n files or n days? Then a daily load with "keep 7 days" would give you anything that was generated within the last week. If the date was part of the filename, even better. And still better yet if only posts that haven't been downloaded are included in successive files.

I think that's fine. I've removed the date on the main form, and added two options, one for the N days, and one to specify whether to combine the feeds into a single file or split them out. I'm just working on getting the interop with the USB layer working now and then I will change the Go! button to Sync!

lordvetinari2
10-29-2006, 09:22 AM
Finally, I have been able to get a working version of HTMLDOC (paehl.com did it, thanks!), so now I can test your nearly-fantastic tool, geekraver.

It's perfect for English text. OWLRSS messes up RTF (posts as single paragraphs, no formatting, no images, no anything) and PDF (RTF problems + less reading space). Fortunately, rss2book keeps all original RSS format, generates a very useful TOC thanks to HTMLDOC, AND is optimized for the Reader. What more could I possibly ask? Read on...

Non-English text looks corrupt, though. See El País (http://www.elpais.es) and El Mundo (http://www.elmundo.es), the two most important newspapers in Spain. Both websites and RSS feeds are in ISO-8859-15. (Please find attached the resulting HTML file from those two feeds).

Rss2book seems to corrupt perfectly good ISO-8859-15 characters, and the resulting HTML can only be correctly viewed in Firefox if UTF-8 is chosen. Specifying UTF-8 in the HTML metadata fixes the HTML, BUT HTMLDOC is not compatible with any kind of Unicode, so I'm screwed.

For now, I will try to clean UTF-8 back into ISO (with some automatic tool, I hope), and see what happens.

Sam
10-29-2006, 10:41 AM
I'm sorry, can you give me more explicit instructions on where to place the DLL files? I assume you mean Program Files -> HTMLdoc -> there in the main folder? If anywhere more obscure than that, please inform. Thanks!

lordvetinari2
10-29-2006, 10:46 AM
I'm sorry, can you give me more explicit instructions on where to place the DLL files? I assume you mean Program Files -> HTMLdoc -> there in the main folder? If anywhere more obscure than that, please inform. Thanks!

In the Windows folder.

huari
10-29-2006, 03:20 PM
Thank you, thank you!
I got Rss2Book working and htmldoc working too!

Here's a recap:
The binary I used for htmldoc 1.8.25 was here: http://www.paehl.com/open_source/?HTMLDOC_1.8.x_OpenSource_Version

Download, unzip and double-click setup to install.
Then I got the two required DLLs as compressed files: libssl and msvcrtd.
When uncompressed you get three files:
ssleay32.dll
libeay32.dll
msvcrtd.dll
Unzip and copy or move them (three DLL files altogether) to the C:\WINDOWS folder.

Open up Rss2Book/options and set your htmldoc location. Mine was C:\Program Files\HTMLDOC
I set my output path to the desktop and accepted the default format settings.

Add feeds and click go!
After you get your rssPDF, drag it over the Connect software to upload to the Reader.

My firewall (ZoneAlarm) alerted me that htmldoc needed to access the internet; I allowed that and the program converted the RSS to PDF, complete with big fonts and a TOC.

I also tried just converting to HTML and using a PDF driver I have called PDFFactory, which creates PDFs with TOCs. That worked too, but it did not have the big fonts that rss2book's PDFs have, and I would have to edit the HTML source, which is a cumbersome step. Another thing I noticed is that depending on how you set the rss2book date you might get no content or too much, so be aware of the update frequency of your feeds.

Thank you again GeekRaver and all the other posters for your efforts and help.

It is good to be able to 'read' on the Reader and not be locked in waiting for Sony.

geekraver
10-30-2006, 04:24 AM
New version posted (see first entry) that supports synching to the Reader. There are a bunch more options that should be checked under Tools - Options. Don't use this while you have the Sony Connect software running (it will fail to load the Sony libraries in that case).

lordvetinari2, I'm not sure I can do too much about the fonts, due to htmldoc restrictions. I am stripping down to 7-bit clean fonts, which will help some English sites look better, but doesn't help you.

Main other feature I'd still like to add would be RDF support.

Edit: I am now encoding the html in iso-8859-15, so things will look quite a bit better when using Spanish and other European languages.

igorsk
10-31-2006, 12:35 PM
Maybe you should try generating native LRF files instead of PDF?
Here's a Java project which converts HTML to LRF: http://monalipse.sourceforge.jp/tmp/lrf/
Er nvm, I confused your project with RssOwl which is in Java...
Anyway, I will be releasing one thing soon which might help with LRF generation :)

lordvetinari2
10-31-2006, 02:13 PM
New version posted (see first entry)

Sweet ISO! Thanks a bunch, this is 100% perfect for me now.
Maybe Russian and Chinese readers will not feel the same about this change, but since HTMLDOC does not support Unicode anyway...

geekraver
10-31-2006, 04:51 PM
Maybe you should try generating native LRF files instead of PDF?
Here's a Java project which converts HTML to LRF: http://monalipse.sourceforge.jp/tmp/lrf/
Er nvm, I confused your project with RssOwl which is in Java...
Anyway, I will be releasing one thing soon which might help with LRF generation :)

I'm not sure I want to go down the path of a generalized HTML-to-LRF converter when using PDFs works and supports richer output.

Kaitou Ace
11-02-2006, 11:38 AM
Any way I can use this now to get Slate?
http://www.slate.com/rss/ has every link in the form of

http://www.slate.com/id/2152452/fr/rss/
and
http://www.slate.com/toolbar.aspx?action=read&id=2152452
is the print version of each of them. Or maybe the program could also parse HTML files and just get every link to a particular link depth, with matching URLs? That'd be an amazing feature also.

It is impressive software, and I am looking forward to what the next version will offer :)

heavyB
11-03-2006, 07:07 PM
Absolutely fantastic, Geekraver! At the rate you're pumping out revisions, I'm unsure whether I should start posting my feed profiles or wait for the XML imports :)

geekraver
11-03-2006, 07:13 PM
I should have the XML version done tonight, so you may as well wait.

heavyB
11-03-2006, 07:37 PM
I should have the XML version done tonight, so you may as well wait.

Wow! I was half joking :)

For those who are confused about regular expressions, check out:
http://www.amk.ca/python/howto/regex/

You only need to concern yourself with the first couple of pages of this tutorial to pick up what you need to use geekraver's powerful app. It's easy and actually pretty fun.

neilm2
11-04-2006, 01:20 PM
O.K., now I'm falling behind on my books because my Reader is becoming an all-purpose book-blog-newspaper-magazine thing.

vranghel
11-04-2006, 03:30 PM
I have a question for geekraver.
I want to make a PDF of the articles from www.damninteresting.com.
I put 'link' in the 'link element' field, and it works, but I get EVERYTHING on that page: links (menu), article, comments. Is there a way to set the program to only get the article?

I read all your instructions but I didn't understand all of it. So if you can, please enlighten me :huh:

I attached an example pdf to better understand what i mean.

geekraver
11-04-2006, 03:59 PM
You need to filter the article content, which is done by the 'Content Extraction Pattern'. This will work:

(<div id="post.*)<div class="postMetaData">

Alternatively, import the attached xml file.

The stuff that gets included is the stuff in parentheses, so this pattern says include everything starting from the first occurrence of '<div id="post' up to but not including the last occurrence of '<div class="postMetaData">'.

The .* matches any text of zero or more characters. The match is 'greedy'; i.e. as much text as possible gets matched, which is why we start with the FIRST occurrence of '<div id="post' and end with the LAST occurrence of '<div class="postMetaData"'. There's probably only one occurrence of each anyway, but it's worth mentioning the greedy aspect as it can cause confusion.

When experimenting with the patterns, use the RegExp Helper under the Tools menu. You can paste the web page HTML source into the Input box, then enter different patterns in the RegExp textbox. Click on Test and you will be shown the text that matches the whole pattern and the text that matches the parenthesized part of the pattern (i.e. the ultimately important stuff).
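If you prefer to experiment in code rather than in the RegExp Helper, here is a tiny C# sketch of the greedy behaviour described above (it uses RegexOptions.Singleline so that '.' spans line breaks; whether rss2book does exactly the same internally is not something this example claims):

using System;
using System.Text.RegularExpressions;

class GreedySketch
{
    static void Main()
    {
        // Toy page with two end markers; the greedy .* runs to the LAST one.
        string html = "<div id=\"post\">story</div><div class=\"postMetaData\">a</div>"
                    + "<div class=\"postMetaData\">b</div>";

        string pattern = "(<div id=\"post.*)<div class=\"postMetaData\">";
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);

        // Group 1 is the parenthesized part: everything from the first '<div id="post'
        // up to, but not including, the last '<div class="postMetaData">'.
        Console.WriteLine(m.Groups[1].Value);
        // -> <div id="post">story</div><div class="postMetaData">a</div>
    }
}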

geekraver
11-04-2006, 04:09 PM
O.K., now I'm falling behind on my books because my Reader is becoming an all-purpose book-blog-newspaper-magazine thing.

:) That's largely why I bought a reader; I rarely get to read actual books and there aren't many that interest me at the Connect store yet anyway.

vranghel
11-04-2006, 06:10 PM
Wow! Works great geekraver! Thanks a lot!

One other question: how far back can your program pull articles from?
I tried getting articles as far back as 100 days but i only got 30 days worth of articles.
Is there a 1 month limit?

geekraver
11-04-2006, 06:42 PM
Wow! Works great geekraver! Thanks a lot!

One other question: how far back can your program pull articles from?
I tried getting articles as far back as 100 days but i only got 30 days worth of articles.
Is there a 1 month limit?

It depends on the site. I only pull things that are in the RSS feed at the specified URL that fall within the range specified. If the site only includes the last 30 days in the RSS, then that's what you'll be limited to.

vranghel
11-04-2006, 07:05 PM
Oh! I understand... so it's not the program, it's the feed that's the limiting factor.

geekraver
11-04-2006, 07:49 PM
Okay, so I ended up making another release already. I wanted to fix a couple of issues (web exceptions on a single entry preventing a whole feed from being handled, for example). I also found that I often ended up running in the debugger to understand why a feed didn't work; to make it easier for others I added a detailed test log window. If you click on Test, this window will pop up when the program is done testing the feed, with lots of info about what happened.

vranghel
11-04-2006, 09:54 PM
I wish iRex would release software at least 10 times slower than you do.

Keep up the great work!

heavyB
11-04-2006, 11:00 PM
I sync with rss2book every morning, pour a cup of coffee, and read in natural sunlight the articles your app makes possible. Thanks for taking the time to not only make this, but improve on it so quickly. No small effort!

I think we should start a thread for nothing but XML imports. I'll get all mine together and, if it hasn't happened yet, I'll start one.

geekraver
11-05-2006, 01:41 AM
I've been thinking about what the best approach is for collecting them. There are various options:

- I could collect them and put them on my website
- I could keep adding them as attachments in the initial post; that may become unwieldy
- I could keep adding them to a single big Xml file that is kept with the initial post
- we could use the wiki
- we could just keep them on a thread

The main drawback to the last approach seems to be the haphazard organization that would result. Right now it seems like the wiki might be the best approach, and I can roll up the submissions on occasion into a single file and attach that to the first post.

So I've started a page at http://wiki.mobileread.com/wiki/Xml_feed_files

vranghel
11-05-2006, 02:20 AM
The wiki is a good idea. I tried accessing it, but I only got text. Seems I have to save the file and then rename it to .xml.

Now we just need to add some more feeds.

Some ideas:
www.damninteresting.com (great articles)
www.lifehacker.com
www.boingboing.net
www.engadget.com
www.slashdot.com
www.wired.com
www.techcrunch.com

etc....

geekraver
11-05-2006, 03:12 AM
Yeah, the wiki doesn't allow uploads of non-media files, so for now you will have to save the XML to a file and import it. I'll add the rest of my feeds tonight.

Alexander Turcic
11-06-2006, 12:15 PM
Sorry about that... if you're logged in to the Wiki, you can now also upload XML files and others should be able to directly download them.

geekraver, what do you think if we slightly reorganize the Wiki entries... I think it's easier to maintain a single page with links to all XML files to download. Suggestions?

geekraver
11-06-2006, 01:09 PM
Sorry about that... if you're logged in to the Wiki, you can now also upload XML files and others should be able to directly download them.

geekraver, what do you think if we slightly reorganize the Wiki entries... I think it's easier to maintain a single page with links to all XML files to download. Suggestions?

Yes, that would be better.

geekraver
11-07-2006, 02:42 AM
It would be good to make a decision one way or the other and stick with it; that way I could have rss2book import directly from the wiki (and maybe even export, too).

aoni
11-07-2006, 04:03 PM
Hi, could someone help me with the correct configuration for cnn.com? Thanks for this great program!

Allen

geekraver
11-07-2006, 07:15 PM
Sorry about that... if you're logged in to the Wiki, you can now also upload XML files and others should be able to directly download them.


BTW I tried this last night and I still got errors.

geekraver
11-07-2006, 07:43 PM
Hi could someone help me with the correct configuration for cnn.com. Thanks for this great program!

Allen

It's on the wiki

geekraver
11-08-2006, 08:05 PM
Sorry about that... if you're logged in to the Wiki, you can now also upload XML files and others should be able to directly download them.

geekraver, what do you think if we slightly reorganize the Wiki entries... I think it's easier to maintain a single page with links to all XML files to download. Suggestions?

I've decided to build this in to the app, and to make it easier for me to maintain I'm just going to use my own server. I'll add a 'Publish' button that can be used to publish a feed using WebDAV to my server, and a Subscribe menu entry to pull down any new feeds from the server.

neilm2
11-12-2006, 11:28 AM
The Wikipedia sample (started in release 12) is a lot of fun! I subscribed to the Wikipedia sample, then went into rss2book's "customize" screen and typed random words and phrases at the end of the URL and hit the "test" button. Boom, a PDF would open with the Wikipedia entry, photos and all... The sample's URL is http://en.wikipedia.org/wiki/nikola_tesla - replace "nikola_tesla" with "the_clash" or "sony_reader" or "robocop" and hit the "test" button each time. It's like surfing Wikipedia in PDF. Of course, you'll want to save your favorites to the Reader.

kahm
11-28-2006, 09:39 PM
Okay, I'm having a heck of a time getting this to work. I've got HTML doc working, etc, and when I get an actual working feed it seems to work perfectly.

Problem is, most of the feeds aren't working. Seems to be a problem with the Content Element field.

Take Slashdot, for example. Their feed is: http://rss.slashdot.org/Slashdot/slashdot

The XML in the wiki is very basic. Just the feed name and get $link. When I do that and test it, all I get in the log window is:

Processing feed Slashdot
2006-11-29T00:21:00+00:00 is in range

Putting description in the Content element field nets me the same thing. If I put link in the Content element field then it "works", showing me the list of article titles and their links:

Easy Throw-Away Email Addresses
http://rss.slashdot.org/~r/Slashdot/slashdot/~3/55239644/article.pl
Novell Dumps the Hula Project
http://rss.slashdot.org/~r/Slashdot/slashdot/~3/55222711/article.pl

If I put title in the Content Element field, it doesn't work. The log shows me something else happening:

Processing feed Slashdot
2006-11-29T00:21:00+00:00 is in range

Final content:
null
No content for Barney Surrenders To the EFF
2006-11-28T22:32:00+00:00 is in range

Final content:
null
No content for Easy Throw-Away Email Addresses
2006-11-28T21:51:00+00:00 is in range

What the heck is going on here? :(

I can get Slashdot working by using the link element and the content extractor, but it's slow and kludgy, and doesn't format well.

I do have one basic feed that does work - digg. Also, the dilbert one seems to work as well.

Can anyone help me?

geekraver
11-29-2006, 02:44 AM
I found the problem; I introduced a bug at some point that caused the app to silently stop processing if the description element in a simple feed contained http:// links. Fixed in release 14.

BTW don't use the Wiki - use the Subscribe option in the File menu to add feeds.

Alexander Turcic
11-29-2006, 11:58 AM
hey geekraver, I am about to clean up the Wiki a bit. Any suggestions what I should do with the xml snippets?

charkins
11-29-2006, 01:55 PM
geekraver:

You mentioned in another thread (Java rss2book in the developer's corner) about possibly releasing your C# code to serve as a basis for Java apps. While that would certainly be helpful, I think it would be possible to get your rss2book app to run under Linux using mono. I've tried already, but mono dies on some unimplemented features of Windows.Forms. It might be possible to rework aspects to avoid using Windows.Forms features that are not implemented in mono, or share the backend and write another front end using Gtk#. Never used .NET myself, but I'd definitely spend some time trying to get it working if the code was available!

geekraver
11-30-2006, 02:28 AM
hey geekraver, I am about to clean up the Wiki a bit. Any suggestions what I should do with the xml snippets?

May as well remove them, thanks!

kahm
12-01-2006, 12:17 PM
I found the problem; I introduced a bug at some point that caused the app to silently stop processing if the description element in a simple feed contained http:// links. Fixed in release 14.


Thanks! Have you released v14 yet? I can't find a link to it anywhere in the content forum now...


BTW don't use the Wiki - use the Subscribe option in the File menu to add feeds.

I went to the wiki after importing didn't work. After the wiki I started experimenting - ended up spending a couple of hours messing around to try and find out what was up...

Looking forward to a working copy... :)

Alexander Turcic
12-01-2006, 03:33 PM
May as well remove them, thanks!
Will do! ;)

Hadrien
12-01-2006, 03:55 PM
Instead of focusing on grabbing the content from the pages, it could be interesting to use something like this: http://www.dappit.com
You can then generate the RSS with all the information you need and just create a PDF out of it.

ThomWill
12-01-2006, 11:37 PM
Just took the time to really read this and get it working. This is a marvelous tool. If I had a suggestion, it would be to disable the Publish button or add a confirmation prompt. I keep hitting it by mistake, and it would seem (from the subscription list) that I am not alone.

Thanks GeekRaver !!!!

Am I missing it, or did the link to the app go away from this thread in the last couple of days?

BettyE
12-02-2006, 12:20 AM
I would love to try this, but is the file missing from the first message?

Betty

noads
12-03-2006, 08:24 AM
How can I download rss2book to give it a try? I've been looking all over this thread and web site and couldn't find a download link... please help.

neilm2
12-03-2006, 02:35 PM
BettyE and Noads speak the truth... the rss2book installer is -- gulp -- missing! I'm sending Geekraver a note, now.

geekraver
12-05-2006, 03:54 AM
Oops - my bad for finishing up r14 in the wee hours and not checking what I was doing when I edited the initial post - it is back now.

geekraver
12-05-2006, 03:57 AM
geekraver:

You mentioned in another thread (Java rss2book in the developer's corner) about possibly releasing your C# code to serve as a basis for Java apps. While that would certainly be helpful, I think it would be possible to get your rss2book app to run under Linux using mono. I've tried already, but mono dies on some unimplemented features of Windows.Forms. It might be possible to rework aspects to avoid using Windows.Forms features that are not implemented in mono, or share the backend and write another front end using Gtk#. Never used .NET myself, but I'd definitely spend some time trying to get it working if the code was available!

When I have time to figure out how to give read-only access to my Subversion server I'll make the code available again.

BettyE
12-06-2006, 02:06 AM
Thanks, Geekraver!

Betty

Fugubot
12-11-2006, 03:26 PM
Geekraver,

What an amazing program! Thanks.

A lot of members have been looking for the files necessary to install RSS2BOOK. There are links in this extended thread - some are dead and others don't seem to work (for example, I tried installing the program with HTMLDOC version 1.8.28 and RSS2BOOK did not seem to like it; version 1.8.25 worked fine).

Any reason we can't aggregate all the necessary files and make them available from one consistent download site?

(If we have the appropriate permission, I have all the files zipped together and would be glad to upload it here or elsewhere)

After all the work that Geekraver has done, more people should have access to it. It makes the reader all that more powerful.

geekraver
12-12-2006, 01:47 AM
I'm using 1.8.27; I haven't tried 1.8.28. 1.8.25 is more stable than 1.8.27, IIRC - 1.8.27 often crashes when generating a TOC that is too large.

I'm not sure whether there are licensing issues with redistributing htmldoc; I assumed there were which is why I never bundled it.

<advertisement>
BTW I want to comment on my .sig - the Windows Live Search client for Windows Mobile is what I work on in my day job (fortunately assisted by another fantastic developer). Anyone who has a Windows Mobile phone should try it; it is great (and free). We have a cool J2ME version too for other phones (unfortunately not Blackberry yet, but that will come).
</advertisement>

geekraver
12-12-2006, 03:12 AM
I've just uploaded release 15. No real changes except that when you publish feeds a check will be done to see if the feed already exists. I encourage people to use this to help avoid the duplication of published feeds.

vitualis
12-14-2006, 08:31 PM
What a great project!

With regards to content, just as a lateral thought, would it be possible, e.g., to somehow connect to AvantGo? They have a plethora of good quality content that is already formatted for "small screen" devices...

Keep up the good work!

geekraver
12-14-2006, 10:27 PM
I do plan to set up a bug report/feature request site soon, and also add automatic updating. Not sure how reliable the electricity supply is going to be in the next few days in the Seattle area with the mother of all storms moving in, but hopefully it will be done real soon.

neilm2
12-16-2006, 12:15 AM
I've had some good luck lately finding full-text feeds. The attached xml file has about 50 full feeds, including about 20 that I haven't yet published on the rss2book server.

(I'm using version 13 of rss2book, and the "publish" feature isn't working for me right now.)

Anyway, here are the feeds...

geekraver
12-18-2006, 03:07 PM
Just want to apologize for publish/subscribe not working; I was in an area badly hit by the windstorm last Thursday and my house is still without electricity and may be for several more days. We had an outage last Wednesday too, so I'm rapidly approaching one week of server downtime. My UPS can only handle about an hour, unfortunately!

I'll post a message once I have power again and you can publish/subscribe away again.

I used to joke when I moved to the USA from South Africa five years ago that I had moved from the first world to the third, but it feels more and more like that. In Cape Town winds like what Seattle had last week are pretty normal, and the power never goes out. Back in the bad old days of apartheid we'd have rare outages if the ANC blew up a generator somewhere, but the power would usually be back up in an hour or two. You'd think that Americans would learn from experience and remove large trees from power lines - these outages happen to us several times a year, although not this bad - but I guess this is yet another of a number of areas where they don't. Or maybe it's because they privatize the power system but have area-based monopolies, so there is little incentive to invest in fixing things instead of just repeatedly patching. Whatever the case, it makes me yearn to move back to civilization.

geekraver
12-30-2006, 02:17 AM
Release 19 is done. It uses the iTextSharp library (which is why the zipfile is now way bigger), which allows it to generate PDF files without requiring htmldoc. The HTML-to-PDF conversion is fairly basic, so you can still elect to use htmldoc if you want more sophisticated conversion (including a TOC).

Release 19 can also generate RTF files with images.

I was hoping to have Gutenberg integration by now but haven't done so; nonetheless I published a sample 'feed' which is an example of how you can use rss2book to format Project Gutenberg books for your e-Book device.

neilm2
01-01-2007, 01:24 PM
Thanks, Geekraver. Quick question: What is the advantage of having images with RTF if they don't display on the Sony Reader? Am I missing a trick on how to take advantage of that new feature?

geekraver
01-03-2007, 02:45 AM
Thanks, Geekraver. Quick question: What is the advantage of having images with RTF if they don't display on the Sony Reader? Am I missing a trick on how to take advantage of that new feature?

No real advantage except that I want rss2book to be useful beyond the Sony Reader (and perhaps in the future the Sony reader will add image support).

col
01-08-2007, 12:33 PM
Great program, but I can't get it to combine PDFs; it will only work if I uncheck the combine box. Any ideas?

geekraver
01-09-2007, 04:28 AM
Great program, but I can't get it to combine PDFs; it will only work if I uncheck the combine box. Any ideas?

Are you using htmldoc or the built-in PDF converter? The built-in converter in release 19 doesn't like improperly nested tags, and that can happen much more easily if you combine the books. This is fixed in rel 20 (not public yet).

fritz_the_blank
01-14-2007, 03:39 PM
@geekraver--

First, thank you so very much for working on this application. The idea of being able to read the NY Times, BBC World News, Newsweek, etc. on my reader makes me as happy as a clam.

I am certain that I am doing something wrong and am hoping that you can point me in the right direction. When I grab an XML feed, I only get the first level of content, i.e., any embedded links are not available. Is there a way to have this application scrape 2 or 3 levels deep? One other issue has to do with cookies - one of the feeds that I would like (Newsweek.com) returns a document with the following text:

Cookies not enabled Cookies Required


I would appreciate any advice and thank you once again for working on this application.

geekraver
01-16-2007, 03:53 PM
I'll look into the cookie issue. Can you give me an example of a site with which you have the first problem (levels)?

fritz_the_blank
01-17-2007, 03:19 AM
@GeekRaver--

Thank you for replying. As for the first issue, try:

http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml
Thanks again,


PS--I have written some code that scrapes all of the .xml files from a given page. Here is the code in case anyone should find it helpful:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>GetFeeds</title>

</head>
<%
' Fetch the HTML of a page; returns "Error" if the request fails.
Function GetHTML(strURL)
    Dim objXMLHTTP, strReturn
    On Error Resume Next
    Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
    objXMLHTTP.Open "GET", strURL, False
    objXMLHTTP.Send
    If Err <> 0 Then
        strReturn = "Error"
    Else
        strReturn = objXMLHTTP.responseText
    End If
    On Error GoTo 0
    Set objXMLHTTP = Nothing
    GetHTML = strReturn
End Function

' Tidy up a matched href: strip the attribute syntax and make relative links absolute.
Function CleanURL(strURLText, strURL)
    strStringTemp = Replace(strURLText, "href", "", 1, -1, 1)
    strStringTemp = Replace(strStringTemp, "=", "", 1, -1, 1)
    strStringTemp = Replace(strStringTemp, ">", "", 1, -1, 1)
    If InStr(1, strStringTemp, "http:", 1) < 1 Then
        strStringTemp = strURL & "/" & strStringTemp
    End If
    strStringTemp = Replace(strStringTemp, " ", "", 1, -1, 1)
    strStringTemp = Replace(strStringTemp, """", "", 1, -1, 1)
    strStringTemp = Left(strStringTemp, 8) & Replace(Right(strStringTemp, Len(strStringTemp) - 8), "//", "/")
    CleanURL = strStringTemp
End Function

' Find every <a href="..."> in the page and print the ones that point at .xml files.
Sub findLinks(strPageToParse)
    Set objRegExp = New RegExp
    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    objRegExp.Pattern = "<a[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?[^>]*?>"
    Set colMatches = objRegExp.Execute(strPageToParse)

    Dim intCounter
    intCounter = 0
    For Each itmMatch In colMatches
        If InStr(1, itmMatch.value, ".xml", 1) > 1 Then
            Response.Write(CleanURL(itmMatch.value, strURL) & "<br />")
            intCounter = intCounter + 1
            If intCounter > 999 Then
                Exit For
            End If
        End If
    Next
    Set objRegExp = Nothing
End Sub

strURL = "http://www.nytimes.com/services/xml/rss/index.html"
strPageToParse = GetHTML(strURL)
Call findLinks(strPageToParse)
%>
<body>

</body>
</html>

FtB

noads
01-18-2007, 12:37 AM
I subscribed to Economist.com's online version, and it uses cookies to determine whether I am a subscriber (am logged in) or not. If rss2book can pull the online version (i.e., enable cookies), it can save me some bucks and the planet some trees.

Have you noticed your reading habits changing with the Sony Reader + RSS2Book? I for one am reading much more online stuff (vs. paper stuff) thanks to this combo. Thanks Geekraver!

AndyQ
01-19-2007, 09:51 AM
Geekraver, do you plan on releasing the source for rss2book at all?

geekraver
01-21-2007, 01:13 AM
I originally did include source, but later decided not to. Mostly because I'm now taking donations (all $25 so far) and I figured if I released source someone would probably rip off what I've done and try to make money from it, and I don't feel like dealing with that. There may come a time when I don't care anymore and will again release the source, but I have invested a lot of time in the code adding features that I didn't need to (like the WebDAV publish/subscribe), and it would be nice to see some return on that investment. My plans for now (once I finish up with a separate project I'm working on, which is why I'm not that active right now) are to try to generalize the code to the point where it can be extended via plug-ins. I'm already quite close to that on the back end (i.e. turning the HTML into PDF, RTF, etc), and want to do some more on the front end to extend the UI and range of sources.

geekraver
01-21-2007, 02:00 AM
The New York Times is also constrained by cookies. Setting up the actual rss2book entry to pull down full articles is easy, but without login and cookie support you get a registration page. So I'll have to add the ability to do an HTTP POST to such sites with appropriate login info first, so as to get the cookies. I'll work on it in the next week or so.
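For anyone wondering what that would involve, the general shape of a login POST that captures cookies in .NET 2.0 looks something like the sketch below. The URL and form field names are placeholders, not the real NYT details, and this is not rss2book's actual code:

using System;
using System.IO;
using System.Net;
using System.Text;

class CookieLoginSketch
{
    static void Main()
    {
        // One cookie container shared by the login POST and later article requests.
        CookieContainer cookies = new CookieContainer();

        // Hypothetical login endpoint and form fields.
        HttpWebRequest login = (HttpWebRequest)WebRequest.Create("http://example.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;

        byte[] body = Encoding.ASCII.GetBytes("username=me&password=secret");
        using (Stream s = login.GetRequestStream())
            s.Write(body, 0, body.Length);
        using (login.GetResponse()) { }  // the response sets the session cookies

        // Subsequent article fetches send the stored cookies back automatically.
        HttpWebRequest article = (HttpWebRequest)WebRequest.Create("http://example.com/article/123");
        article.CookieContainer = cookies;
        using (WebResponse resp = article.GetResponse())
        using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd().Length);
    }
}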

AndyQ
01-21-2007, 06:03 AM
I originally did include source, but later decided not to. Mostly because I'm now taking donations (all $25 so far) and I figured if I released source someone would probably rip off what I've done and try to make money from it, and I don't feel like dealing with that. There may come a time when I don't care anymore and will again release the source, but I have invested a lot of time in the code adding features that I didn't need to (like the WebDAV publish/subscribe), and it would be nice to see some return on that investment. My plans for now (once I finish up with a separate project I'm working on, which is why I'm not that active right now) are to try to generalize the code to the point where it can be extended via plug-ins. I'm already quite close to that on the back end (i.e. turning the HTML into PDF, RTF, etc), and want to do some more on the front end to extend the UI and range of sources.


That's a shame, as I'm looking at an RSS import for BBeBinder and your HTML output converts quite nicely to LRF. What I was thinking of was wrapping your output engine as a library and using that.


If you do change your mind at any point then let me know. (Note BBeBinder is totally open source.)

geekraver
01-23-2007, 02:01 AM
Given what I'm planning to do it should be possible to write a plug-in for BBeBinder. It's not cast in stone yet but the current interface for back ends is pretty simple.

geekraver
01-23-2007, 02:28 AM
It turns out with NYT that if you use the guid elements to get the story you need to log in, but if you use the link elements you do not. I have published a working feed for NYT that you can subscribe to.

geekraver
01-23-2007, 02:55 AM
Newsweek is done and working too....

fritz_the_blank
01-24-2007, 06:07 PM
Dear GeekRaver--

Thank you once again for all of your fine work. The New York Times works like a charm. Newsweek is still giving me problems--the .rtf output looks like this:

7 Simple Ideas That Can Save the World
7 Simple Ideas That Can Save the World - Newsweek: International Editions - MSNBC.com .updateTime{font:10px Arial;color:#000000;} .credit{font:10px Arial;color:#999999;} .head{font:bold 18px Verdana;color:#CC0000;} .abstract{font:14px Verdana;color:#000000;} .title{font:bold 12px Verdana;color:#000000;padding:3px 0px 3px 0px;} .source{font:bold 11px Verdana;color:#CC0000;} .footerLink{font:bold 10px Verdana;color:#000000;} .caption{font:10px Verdana;color:#000000;} .textBodyBlack, .copyright{font:12px Verdana;color:#000000;} .copyright{font-style:italic;} var section_name='intl/davos';

I am probably doing something wrong, so if someone can point me in the right direction, I would be much obliged.

pclewis
01-24-2007, 06:24 PM
Dear GeekRaver:

I had this feed working fairly well to a file (not direct to the eReader). All of a sudden I am getting an error and program closure in HTMLDOC v1.8.27.

I reinstalled HTMLDOC and the DLLs. Rebooted. Still the same error. Can you guide me?

Thanks,

Phil

pclewis
01-24-2007, 06:51 PM
GeekRaver:

When converting to PDF it crashes HTMLDOC. However, when I convert to PDF without checking the HTMLDOC option it seems to work. But of course I get no TOC this way.

I am going to try it on my laptop again. Any ideas why HTMLDOC is crashing?

Cancel that. When I take some of the feeds out it works. I think it is one of the feeds that is causing the problem.

Phil

geekraver
01-25-2007, 12:19 AM
Phil:

htmldoc has a bug with TOCs. I think if the TOC is too big it crashes. You can try reducing the number of TOC levels (specified in the htmldoc options).

Fritz:

Are you using htmldoc or the built-in PDF generator? The built-in one does not strip <script> tags, which may be the problem. This is fixed in release 20, which I will probably release earlier than planned with a few features cut; watch this space.

Fugubot
01-25-2007, 03:19 AM
Geekraver,

If you don't mind, another future feature request: Select all/Deselect all checkbox on the main screen.

Just a thought.

geekraver
01-25-2007, 05:12 AM
Geekraver,

If you don't mind, another future feature request: Select all/Deselect all checkbox on the main screen.

Just a thought.

You asked just in time; I'm putting release 20 up now. You can enable/disable all from the Tools menu.

fritz_the_blank
01-25-2007, 10:33 AM
Dear geekraver--

Thank you once again for your help. I changed the options to use HTMLdoc and to output the format as PDF. Everything looks as it should. I added a number of feeds from the NY Times, and it is truly fantastic. The Newsweek setting that you created works wonderfully as well. I tried duplicating it for the other sections with the same settings but with differing results. I wonder if the link extraction pattern needs to be different. I am going to look at the feeds to see if I can figure this out. In any event, I am thrilled with the direction this software is going.

geekraver
01-25-2007, 12:57 PM
That's a shame, as I'm looking at an RSS import for BBeBinder and your HTML output converts quite nicely to LRF. What I was thinking of was wrapping your output engine as a library and using that.


If you do change your mind at any point then let me know. (Note BBeBinder is totally open source.)

Andy, right now with release 20 you could write a plug-in for rss2book to generate LRF. You need to implement IHtmlConverter (add a reference to IHtmlConverter.dll), which is pretty straightforward:

[Flags]
public enum FontStyle
{
    Normal = 0,
    Italic = 1,
    Bold = 2,
    Underline = 4
}

public enum TypeFace
{
    Courier,
    Helvetica,
    TimesRoman
}

public interface IHtmlConverter
{
    string Name        // e.g. "PDF"
    {
        get;
    }
    string Extension   // e.g. ".pdf"
    {
        get;
    }
    void Initialize(int leftMargin,
                    int rightMargin,
                    int topMargin,
                    int bottomMargin,
                    int pageWidth,
                    int pageHeight,
                    TypeFace font,   // for normal text
                    int fontSize     // for normal text
    );

    void HandleText(string text, TypeFace face, FontStyle style);
    void FlushParagraph();         // called at <p> elements
    void LineBreak();              // called at <br> elements
    void EnterHeader(int level);   // called at <h1>, <h2>, etc
    void ExitHeader();             // called at </h1>, ...
    void StartUnorderedList();     // called at <ul>
    void StartOrderedList();       // called at <ol>
    void FlushListItem();          // called at end of <li>
    void EndList();                // called at </ul> or </ol>
    void AddImage(string fname);   // called for <img>
    byte[] GetBytes();             // called at end to get the converted output
}

The DLL for your plugin needs to have a name starting with "write", and go in the same directory as rss2book.exe.

As you can see most entry points are for simple demarcating elements. HandleText is called for the text between elements. Right now the typeface can't change so HandleText will always pass the default typeface through, but the style can change (to combinations of bold and italic).

There is a chance I'll make front-ends plug-ins too but that is a much more ambitious project as it requires custom UI for each (I'm thinking things like rss, html, wikipedia, Gutenberg, ...). So writing a back-end converter plug-in is the easiest approach for now.
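To make the shape of a back-end plug-in concrete, here is an untested sketch of a converter that just flattens everything to plain text. It relies only on the interface above (plus a reference to IHtmlConverter.dll for the enums); the class and assembly names are my own invention:

using System.Text;

// A minimal IHtmlConverter sketch that flattens the document to plain text.
// Compile into an assembly whose name starts with "write" (e.g. writePlainText.dll)
// and drop it next to rss2book.exe.
public class PlainTextConverter : IHtmlConverter
{
    private StringBuilder sb = new StringBuilder();

    public string Name { get { return "Plain text"; } }
    public string Extension { get { return ".txt"; } }

    public void Initialize(int leftMargin, int rightMargin, int topMargin, int bottomMargin,
                           int pageWidth, int pageHeight, TypeFace font, int fontSize)
    {
        // Margins, page size and fonts are irrelevant for plain text output.
    }

    public void HandleText(string text, TypeFace face, FontStyle style)
    {
        sb.Append(text);
    }

    public void FlushParagraph()       { sb.AppendLine(); sb.AppendLine(); }
    public void LineBreak()            { sb.AppendLine(); }
    public void EnterHeader(int level) { sb.AppendLine(); }
    public void ExitHeader()           { sb.AppendLine(); }
    public void StartUnorderedList()   { sb.AppendLine(); }
    public void StartOrderedList()     { sb.AppendLine(); }
    public void FlushListItem()        { sb.AppendLine(); }
    public void EndList()              { sb.AppendLine(); }
    public void AddImage(string fname) { /* plain text: skip images */ }

    public byte[] GetBytes()
    {
        return Encoding.UTF8.GetBytes(sb.ToString());
    }
}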

pclewis
01-25-2007, 01:15 PM
Geekraver

You are "His Geekness". It was a TOC problem with one of the feeds. Cutting the levels allows the files to be combined and downloaded.

I'll move to 2.0 and send you some $$$.

Keep up the good work.

Phil

gr8drd
01-25-2007, 09:46 PM
Ok. So where are the binaries? I don't see them attached to any message on the thread. Am I missing something?

geekraver
01-26-2007, 01:15 AM
Sorry, I did upload them, or so I thought, but you're right, they weren't there. They are now.

bugsbunny14
01-26-2007, 06:32 PM
Hello all,

geekraver, good work :)

Is it possible to create a PDF file of the Engadget RSS feeds for the past 10 days (http://www.engadget.com/rss.xml) using this software?

geekraver
01-26-2007, 08:12 PM
Hello all,

geekraver, good work :)

Is it possible to create a PDF file of the Engadget RSS feeds for the past 10 days (http://www.engadget.com/rss.xml) using this software?

I published an entry you can subscribe to, but you could have done it yourself; adding new feeds doesn't get much easier than full-text RSS feeds like Engadget's. All you had to do was enter a name and the URL.

bugsbunny14
01-27-2007, 08:17 AM
geekraver, you are a genius.

1 - Now, is it possible to make the PDF font bolder?

2 - Why can't I sync directly to my Reader? It says "Failed to sync C:\rss2book.pdf".

1 - the Reader was attached
2 - the Sony Connect software was not running
3 - and in Options I have set it to sync directly to my Reader's memory, and I am sure that I've set the Sony Connect directory correctly.

geekraver
01-27-2007, 12:38 PM
geekraver, you are a genius.

1 - Now, is it possible to make the PDF font bolder?

2 - Why can't I sync directly to my Reader? It says "Failed to sync C:\rss2book.pdf".

1 - the Reader was attached
2 - the Sony Connect software was not running
3 - and in Options I have set it to sync directly to my Reader's memory, and I am sure that I've set the Sony Connect directory correctly.

1. I'm surprised you need it bolder than it is; could you send me the details of the feed you think is inadequate? What are your font settings on the page layout tab? I find Helvetica 14 very readable.

2. I'm not sure. What did you select on the Sync Files To option? For main memory I use /Data/media/books IIRC, for SD I use b:/, and for Memory Stick I use a:/, but I guessed at the latter as I don't have a memory stick. Is the Sony CONNECT Software Directory option set correctly? (on my system this is c:\Program Files\Sony\CONNECT Reader\Data\bin; note the \Data\bin part; you want to point to the directory that has the Sony DLLs for synching).

If none of the above tips help I can only suggest using a debug version of usbshim.dll; I'll attach it to this message. This will write some diagnostic info to the file c:\usbshim.log that may shed light on things.

bugsbunny14
01-27-2007, 09:15 PM
1- what do u mean by b:/ , a:/ , /Data/media/books IIRC ? i've set my pc output directory to c: and sync files to sd card .

2-geekraver on my pc where do i have to put UsbShim.dll ?

thanks .

geekraver
01-28-2007, 02:27 AM
1. I guess what I was asking is are you using a memory stick, as I just guessed at what the underlying setting was for that case. But as you're using SD that shouldn't be an issue; I use SD myself.

2. In the same directory as rss2book.exe (you'll already have one; you need to temporarily replace that with the debug version, re-run rss2book, try sync, and then look at the debug output in c:\usbshim.log).

bugsbunny14
01-28-2007, 09:56 AM
geekraver, now it's working, thanks a lot.

loneBoat
01-29-2007, 05:05 PM
Whenever I have a large number of feeds selected, HTMLdoc crashes. Does anyone know anything about the limitations of HTMLdoc?

P.S. This is my first post ever at mobileread. Yahoo! :)

lhilden
01-29-2007, 10:30 PM
GeekRaver,

This tool rocks! I sat down to write something similar this weekend and it turns out you already did it. When I first picked this thing up I was really disappointed. The hardware is great, but the content bites. 8 1/2 x 11 PDFs are totally unreadable, the books on Sony Connect suck, and the RSS feeds are a joke.

Anyway, thanks so much for writing this; I will be PayPal'n something. I was so impressed I contributed a bunch of RSS feeds for SF Bay Area content this weekend.

A couple of issues I ran into with release 20:

1. Consider adding a check on the feed Name column to make sure there are no illegal filename characters. I got an error when I tried to put a colon in the Name, since you use that when you create the output file.

2. The 'Days' field on the Customize feed dialog doesn't take '0' as a valid value like the "Number of Days" field does in the Options-->Sync tab. It would be nice if it did. To work around this I set up my feeds with a high days number like 30.

Thanks again,
Lou

firebird2k
01-29-2007, 11:30 PM
Love the program. I may have gone a little overboard on RSS feeds. My PDF file came out to about 500 pages. It seems to take forever to load up. Any ideas on how to speed it up?

Also, with Release 20, I set the Number of Days to 3 days, but it seems to be pulling the entire feed. Any way to control that when I click on the Go button and limit the number of days?

Thanks!

geekraver
01-30-2007, 02:17 AM
loneBoat, firebird2k - I recommend using the option to split each feed up into a separate file. It makes things much more manageable once you have a large number of feeds and will make htmldoc crashes less of an issue (also, you can try using the built-in PDF converter instead of htmldoc; it is nowhere near as complete but it works okay for me).


Lou, thanks for the feedback. Number 2 was already on my TODO list; I'll look into escaping colons to work around issue 1.

bugsbunny14
01-30-2007, 09:44 AM
hi,

i entered the rss feeds :

http://www.hdbeat.com/rss.xml
http://www.ps3fanboy.com/rss.xml

and set the days to 3 and the output file format to PDF.

The problem: the generated PDF file for each RSS feed only contains the last 3 posts when it should contain about 20 posts. Why is this happening?

geekraver
01-30-2007, 01:03 PM
hi,

i entered the rss feeds :

http://www.hdbeat.com/rss.xml
http://www.ps3fanboy.com/rss.xml

and set the days to 3 and the output file format to PDF.

The problem: the generated PDF file for each RSS feed only contains the last 3 posts when it should contain about 20 posts. Why is this happening?

Did you set the Days value in the Options menu to zero? If it is not zero it overrides the per-feed values.

bugsbunny14
01-30-2007, 05:55 PM
Did you set the Days value in the Options menu to zero? If it is not zero it overrides the per-feed values.

Yes, it is set to 0.

I tried it again and now it generates a PDF file with only 7 posts; it changes from time to time (yesterday it was 15 posts). But in the log file it always says

Processing feed ps3fanboy
Added 20 articles

geekraver
02-01-2007, 03:25 AM
Thanks - I've found and fixed this bug. I'll release an update shortly once I've added a couple of other feature requests.

bugsbunny14
02-01-2007, 04:11 AM
thanks geekraver .

geekraver
02-09-2007, 03:29 PM
Just an update on what I've been doing, as I did say I was going to release v21 about a week back. I've been really busy with what has almost turned into a rewrite of the app, as I wanted to support pluggable sources. I expect to be done by the end of this weekend. Currently I have plugins for RSS, web pages (including recursive fetches), Wikipedia entries (also with recursive fetches), and FictionBook.ru books (for those that want to go there). Hopefully by the time I release I'll also have Gutenberg books, and sites that have X-Word.com crosswords. And of course it seems I may have to test against the updated Sony software. Watch this space.

Bob Russell
02-09-2007, 03:40 PM
Wow!

What else can be said, but "wow!"!???
Talk about adding nice functionality to the Sony Reader... this sure does it!

If possible, don't forget a map and/or driving directions plugin!
Heck, this could probably even lead to Outlook task list or address book plugins, recipe site plugins, and all kinds of neat stuff!

Wow!

Bob Russell
02-09-2007, 03:43 PM
One more thought that maybe our Treo readers can help us with... there's a program for the Treo that grabs info from a lot of different web sites in convenient smartphone form, and it's based on user-creatable plugins.

Even though that program is probably a bit more interactive in nature, I bet it could be a really good source of additional ideas for potential plugin development if it catches on.

Now if I could only remember the program's name.

geekraver
02-10-2007, 12:31 AM
Are you thinking of AvantGo?

Also, those are good ideas for sources. I had been thinking in a web-centric way in which each source had an URL, but clearly for an Outlook address book that wouldn't apply - so I should be even more general.

Bob Russell
02-10-2007, 02:10 AM
Nope, not AvantGo, but that's an interesting thought for ideas also.

Here's the program I was thinking of - Genius!
http://www.hobbyistsoftware.com/genius-plugins.php

bugsbunny14
02-13-2007, 05:46 AM
hi geekraver ,

when will v21 be released ?

geekraver
02-14-2007, 01:10 AM
I'm very busy with it - it has taken much longer than expected but the code is now much cleaner and I think it will be worth the wait. Right now I'm mostly just bug fixing and testing, and finishing up the new plugins for books and web pages. I'd say it should be at most one more week, hopefully less.

bugsbunny14
02-14-2007, 02:00 AM
I'm very busy with it - it has taken much longer than expected but the code is now much cleaner and I think it will be worth the wait. Right now I'm mostly just bug fixing and testing, and finishing up the new plugins for books and web pages. I'd say it should be at most one more week, hopefully less.

thanks .

lhilden
02-14-2007, 12:23 PM
GeekRaver,

Any idea why I'm getting these strange "Sun, 11 Feb 2007 01:24:48 EDT is out of range" errors for the attached feed? I had this working and now it's busted :(.

Thanks,
Lou

Fubrite
02-23-2007, 03:29 PM
Hi,

I'm using ver 21 of this software and am having a few problems - either the log shows 'processing....' indefinitely and nothing seems to happen (Google News and Dilbert), or I get the following error:

System.ComponentModel.Win32Exception: The system cannot find the file specified
at System.Diagnostics.Process.StartWithCreateProcess(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start()
at web2book.Utils.RunExternalCommand(String cmd, String args, String workdir, Boolean useShell, Int32 timeout, String& output)
at web2book.MainForm.ConvertHtml(StringBuilder htm, String fname, ContentSource cs)
at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source)

The html files are produced and I can run them manually through HTMLDOC and they come out perfectly, but rss2book doesn't seem to get that far. (This happens with BBC News, NYT and Engadget.)

The other thing is, I noticed that the name of the install file at the top of this thread is the same as that in the Alpha version thread, and also the installed program is called Web2book rather than rss2book. If I'm right in thinking I'm running the Alpha version, can you tell me where to get a copy of an older version of rss2book to try?

Thanks

geekraver
02-23-2007, 09:49 PM
Fubrite, the "alpha" version is no longer alpha, it is the most stable version.

The Dilbert problem is simple; RSS feeds are now separate from web pages. Delete the entry you have for Dilbert, then go to the Web Pages tab, and do a "File - Subscribe" and add the new Dilbert entry.

Something similar goes for the Google News page, although that is hitting a different bug in the app due to the page having empty HREFs. That bugfix will be in the next update in a day or two (hopefully by that point I'll also have auto-updating in place).

The exceptions you are getting are weird. HTMLDOC output works fine for me. The app does check the path to htmldoc.exe and should report an error rather than throwing an exception if you haven't configured it right. But I would suggest checking the HTMLDoc path setting in your options anyway to make sure it is right.

Fubrite
02-24-2007, 05:42 AM
Thanks for your quick reply, Geekraver!

I did check the path to HTMLDOC, thinking it might have been that, but it's fine....

Oh well, I'll just have to use basic PDF's for now. The other question I have is where would I get the Librie .dll's to generate an .lrf file?

Cheers

Edit: Just discovered that the exception happens for basic PDF and HTML too...

geekraver
02-24-2007, 02:47 PM
Do you by any chance have something specified in the HTMLTidy path option? Try removing that. It should be set to the full path of the directory containing tidyhtml.exe, or blank. Anything else will cause this error (I need to make this more robust, I admit). You may have an installation where the .exe is named "tidy.exe"; in this case rename it or copy it to "tidyhtml.exe". I will change the code so the next version lets you specify the path of the executable rather than the directory.

The Librie DLLs are bundled in the installer. Be warned though that they seem to be very fragile; in many cases they just blow up when you feed them HTML. They work for simple stuff only. In the future I'll probably run the HTML through the internal PDF/RTF converter but just to simplify the HTML first, so it contains just basic tags. That should help.

adinb
02-25-2007, 03:02 AM
I know you just got V21 out, and that there are a *lot* more options to a standard entry than is contained in an OPML, but an OPML import with defaults filled in would be *really* nice.

Fubrite
02-26-2007, 03:58 AM
Geekraver, I did have something in the HTMLTIDY box, that sorted it, thank you!

However, I do have another issue (tell me to go away if you like!) The HTMLDOC PDF's get generated fine - but without the table of contents. When I output in HTML and run it through HTMLDOC manually, I do get a TOC... It may be that (knowing me!) I've changed/deleted something in the HTMLDOC options box - what's the default supposed to be?

Many thanks for your help

Fubrite

geekraver
02-28-2007, 04:43 AM
Sorry, this is a known bug that will be fixed in the next release. For the time being you can replace the writehtmldoc.dll with the one attached to this message.

Fubrite
02-28-2007, 01:35 PM
Geekraver,

Thanks for the response. I tried that (I renamed the old file to writeHtmlDoc.dll.old and copied the new file into the folder), and the option for HtmlDoc PDF disappeared from the options list completely.....

geekraver
03-01-2007, 01:12 PM
Aargh - sorry, you're right, I made a small change to the plugin interface which broke compatibility.

Anyway, I have a proper fix in; I will release an update tonight with this fix and the new XWord/Crossword Compiler plugin.

Fubrite
03-01-2007, 01:26 PM
Thanks for your efforts, geekraver!

I'm looking forward to the new release later on tonight!

geekraver
03-02-2007, 03:42 AM
Okay, rel 22 is up (if all is well, your rel 21 should prompt you for an update automatically when you start it up).

I have added the new XWord plugin. However, it doesn't work with LRF output (not that much does with the existing Sony DLLs), nor does it seem to work with htmldoc. As far as I can tell this is a bug with htmldoc (at least the version I have); htmldoc seems to break on images that have local file:// URLs (and in fact if I invoke htmldoc with the --no-localfiles option, which is supposed to make it reject such URLs, I just get a usage error, so htmldoc seems to have multiple issues here). The plugin will work with the built-in PDF and RTF converters, although you lose the two-column layout of clues (until such time as I finally implement table support in these plugins).

Note that the URLs for sites that have crosswords generated by Crossword Compiler must have Grid.class applet elements or they won't work (confirm this by viewing the web page source from your browser). I've published one source so far that you can look at as an example; the URL is http://www.sundaytimes.co.za//Entertainment/Funstuff/crossword/archives/crossword.html

LEE YONG HOON
03-12-2007, 05:00 AM
Can I use Rss2Book with Korean or Japanese? T-T (i.e. support for Extended Unix Coding (EUC), the 8-bit character encoding used primarily for Korean or Japanese...)
Please develop Rss2Book support for Korean or Japanese.

drgnbear
03-15-2007, 07:47 PM
This has to be one of the coolest tools I have found. Are there any writers out there writing ebook serials? It seems like it would be fun to set up a community based site doing just that. Publish it to RSS or something.

Hadrien
03-15-2007, 08:07 PM
This has to be one of the coolest tools I have found. Are there any writers out there writing ebook serials? It seems like it would be fun to set up a community based site doing just that. Publish it to RSS or something.

Ebook serials would be nice, yes. Maybe I could do something like this for Feedbooks? We already provide an easy way for authors to publish their works; I could add an RSS feed for each author too.
It could work quite well, writing the text directly inside the RSS feed and then generating the whole thing with tools such as rss2book/web2book, or what we're currently working on in our news section.
The text version of a podcast... although podcasts work with files embedded in the RSS feed instead of providing the content inside the RSS itself.

adinb
03-16-2007, 03:20 AM
Has anyone else had problems getting the current version of web2book (v23, i believe) to "Apply extractor to linked content instead of link text"?

Here's the deets:

URL: http://www.abqtrib.com/feeds/headlines/
Link Element: link
(apply extractor to linked content)
Link Reformatter: {0}?printer=1/

So, I'm just appending "?printer=1/" to the original link found in the link element to try and make it go to the printer friendly page. Even though the log shows the link formatter coming up with the correct "printer friendly" links, the pdf output is the linked page. (example: the content of http://abqtrib.com/news/2007/mar/15/man-arrested-28th-dwi-charge/ is ending up in the pdf instead of http://abqtrib.com/news/2007/mar/15/man-arrested-28th-dwi-charge/?printer=1/)

This is all using the test function, so I haven't *absolutely* verified what will be put on my reader. But this really looks like it's not following the reformatted link. If there's a different preferred way of doing this (maybe something with the link extractor pattern?) I'd love to hear it. (I can probably extract text using the content reformatter, but then I miss small graphics accompanying the stories in print mode)

Log output (abbreviated):

Processing Albuquerque Tribune Today
Got link from RSS: http://abqtrib.com/news/2007/mar/15/man-arrested-28th-dwi-charge/
Thu, 15 Mar 2007 22:05:00 -0000 is in range

Done link extraction{0} = http://abqtrib.com/news/2007/mar/15/man-arrested-28th-dwi-charge/
Reformatted link is http://abqtrib.com/news/2007/mar/15/man-arrested-28th-dwi-charge/?printer=1/

HTML of "normal" page follows (vice printer friendly page)

EDIT: Same problem reproduced multiple times, like on the The Reg, etc.

Also, there seems to be some problem using the "link" element on RSS .91 and ATOM feeds. :(

EDIT2: There also seems to be something funky going on evaluating regexps with logical "OR"s in the ( this | that ) form.

jezlyn
03-27-2007, 01:47 PM
Hi, All. I can't seem to download the Web2book application from Geekraver's site. It's timing out. Is there a mirror for the application, or can it be uploaded directly to the forum? I'd love to try it out, considering I'm such an RSS feed junkie. :)

Thanks in advance for any help with this.

jezlyn
03-27-2007, 07:21 PM
Anybody? I'd at least like to know if other people have been able to download the Web2Book app in the last couple days, so I know that the problem might just be with my network connection somehow. I've tried downloading the program at home and work and so far haven't been able to get a proper copy.

geekraver
03-28-2007, 04:26 AM
Sorry, I've been upgrading my systems at home to Vista, and downloading lots of updated versions of software, etc, so the network has been up and down, and when it's been up it's been busy.

Anyway, you can get it from CNet now:

http://www.download.com/Web2book/3000-2017_4-10649164.html?tag=lst-0-1

adinb
03-29-2007, 04:57 AM
Now that I'm getting better with more complex .Net regex's, I can also articulate potential bug #2 a little more clearly to you:

-when the "apply extractor pattern to linked content" the Link Refomatter field is still using the groupings from the link element (i.e. guid, link, etc) and not the link extractor pattern.

I'll use "The Raw Story" as an example. It's a pretty basic RSS feed with the link element = 'link'. There's a printable version of each story, but you have to follow the link element and use the link extractor pattern on the followed link. (For this example I'll say that we grabbed 'http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html')

On the followed link, I'll apply the regex "action='(http://rawstory.com/printstory.php\?story=\d+)'>" to snag the proper url for the printable version. With this regex, I should be able to make the link reformatter just {0} since I was able to pull the entire link. (yeah, I could optimize the regex, but I like 'em a little more readable, vice using backreferences, etc)

Looking in the log, the reformatted link ends up as "http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html" instead of "http://rawstory.com/printstory.php?story=5513".

Doing a little more testing, if I move around the parens to make the regex "action='http://rawstory.com/printstory.php\?story=(\d+)'>" (which makes {0}=5513) and set the link reformatter field to "http://rawstory.com/printstory.php?story={0}" (which should again result in "http://rawstory.com/printstory.php?story=5513"), I get the following reformatted link (copied from the log):
"http://rawstory.com/printstory.php?story=http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html"

Which is why it initially looks like the extractor isn't being applied to linked content.

If there's just some sort of undocumented selector to force the link reformatter field to use the link extractor pattern when following the link element, I'd ***love*** to see it. ;)
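For anyone trying to follow what the "apply extractor to linked content" path is supposed to do, here is a rough sketch of that pipeline in plain .NET terms, using the Raw Story values from the post above. This is an illustration only, not web2book's actual code; the WebClient fetch, class name, and fallback message are my own assumptions.

using System;
using System.Net;
using System.Text.RegularExpressions;

class LinkedContentExtractionDemo
{
    static void Main()
    {
        // The link pulled out of the RSS link element (example from the post above).
        string rssLink = "http://rawstory.com/news/2007/Colbert_invites_Rom_Emanuel_on_show_0327.html";

        // "Apply extractor to linked content": first fetch the page the RSS link points at...
        string pageHtml;
        using (WebClient client = new WebClient())
        {
            pageHtml = client.DownloadString(rssLink);
        }

        // ...then run the link extractor pattern over that HTML rather than over the link text.
        Match m = Regex.Match(pageHtml,
            @"action='http://rawstory\.com/printstory\.php\?story=(\d+)'>");

        if (m.Success)
        {
            // Group 1 of the match is what the {0} in the link reformatter should receive.
            string reformatted = string.Format(
                "http://rawstory.com/printstory.php?story={0}", m.Groups[1].Value);
            Console.WriteLine(reformatted);   // e.g. http://rawstory.com/printstory.php?story=5513
        }
        else
        {
            // If the pattern does not match, silently falling back to the original link
            // (which is what the log above appears to show) just hides the real problem.
            Console.WriteLine("Link extractor pattern did not match the linked content.");
        }
    }
}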

shawn
03-29-2007, 05:13 AM
Can someone please give me some advice on formatting a particular page?
This is the page:
http://www.econlib.org/library/Mises/msStoc.html

I don't know how to use the regex filter to properly format it; I think it's getting stuck on one link that creates a javascript popup. If I could tell it to just ignore those links I think it'll be fine.

This is the log text

Processing

System.ApplicationException: Getting web page http://www.econlib.org/library/Mises/javascript:shownotepad('/notepad.html#top');notepadwindow.focus(); returned error Got web exception The remote server returned an error: (404) Not Found.

at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.Utils.ExtractContent(String contentExtractor, String contentFormatter, String url, String html, String linkProcessor, Int32 depth, StringBuilder log)
at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.Utils.GetHtml(String url, Int32 numberOfDays, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.WebPage.GetHtml(ISource mySourceGroup, Int32 displayWidth, Int32 displayHeight, Int32 displayDepth, StringBuilder log)
at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source, Boolean isAutoUpdate)

adinb
03-29-2007, 06:58 PM
@Shawn:
Are you trying to capture this as a web page and are you trying to follow all the links in the TOC?

The error that you're getting usually indicates an invalid regex. Give me a few more details and I'll pop out a regex for you.

adinb
03-29-2007, 07:02 PM
BTW, has anyone else noticed particularly strange behavior when you get more than 128 documents on a memory card? I haven't tried it on a memory stick, b/c I'm using a 2GB SD card with my reader.

Web2Book is failing to auto-transfer files to my SD card whenever I hit the 128 file boundary (though it looks more like a bug in the sony driver to me)--I was wondering if anyone else was experiencing this before I post it to the general areas of the forums (possibly to add to the FAQ's) and before I report it to sony.

adinb
03-29-2007, 07:30 PM
Can someone please give me some advice on formatting a particular page?
This is the page:
http://www.econlib.org/library/Mises/msStoc.html


Actually, after taking a look at the page, you might want to break up the book so that you have a query per section--web2book doesn't currently allow content extraction patterns to apply to followed links (AFAIK, geekraver, please correct me if I'm wrong).

All the chapters that belong to a particular section are on the same page anyways--so setting a content extraction pattern for the TOC and following links to depth of 2 would result in a lot of duplicated content.

The prefaces/introduction are all on one page, each part/section is on one page, the conclusion is on one page, and all the appendices are on one page--so you'll end up with 11 entries, with a link depth of 1 and a fairly simple regex..this worked for part 1 and will probably work for the other chapters:
(<h2>.*<!--endofchap-->)

I tried publishing this for you to just subscribe to, but publishing doesn't seem to be working ATM.

shawn
03-30-2007, 05:48 PM
adinb, thank you very much for your help, it formatted the page very nicely :)

adinb
03-31-2007, 06:10 AM
no worries. :) Just glad to be of some help.

And if anyone else needs some regexp assistance, feel free to PM/email me.

I've been working with geekraver and it appears that the bug I was seeing was really a much smaller bug with what's used in the {0} field when the supplied link extraction regexp doesn't match anything. (making it appear that the app is failing when the user's regex is really the problem)

I sent in a report to Sony on the 128 file transfer bug; we'll see if that results in any Connect software changes, but until then, be careful that you don't have more than 127 feeds that you're auto-updating on your memory card (or that you try to transfer more than 128 files at a time to a memory card in the Connect Reader software).

Publishing feeds is working great ATM; everyone using web2book should start seeing a *wide* variety of feeds to subscribe to.

If anyone has a particular site that they'd like me to work on getting into the directory, please feel free to pm/email me.

-adin

fritz_the_blank
04-01-2007, 04:41 PM
If someone could help me with this please, I would be much obliged:

Link: http://feeds.newsweek.com/Newsweek/CoverStory
Link Element: guid
Extractor Pattern: http://www.msnbc.msn.com/id/(\d+)/site/newsweek/?from=rss
Link Reformatter: {0}&displaymode=1098

I always get 0 articles regardless of the value set for days.

Thank you very much,

FtB

nmackay
04-03-2007, 09:16 AM
Like many others I was surprised at how poor the Sony Connect software is for such a good unit, and delighted when I found web2book. I use it for several RSS feeds I watch. Now, I have used computers for probably longer than many of the contributors to this forum (As a Capetonian, Geekraver might like to know that at UCT in the early 70's I used to work with the Psychology Department main frame - & yes, the units were literally mounted on a frame); however, I do not have the knowledge to customize my feed/web information to pick out particular sub feeds, or threads (eg this Mobileread one here) or to manage one that needs a password. Is there any chance that someone might write a basic set of instructions for those like me? I expect that there are others who want this but feel too awed by the high geek quotient of the forum contributors to ask.

adinb
04-04-2007, 04:19 AM
If someone could help me with this please, I would be much obliged:

Link: http://feeds.newsweek.com/Newsweek/CoverStory


Are you wanting *just* the week's coverstory? Here's what I came up with for this entry (please pardon any typos since parallels clipboard isn't wanting to work tonight...but I did publish this particular feed for a known working version):

Link: http://feeds.newsweek.com/CoverStory
Link Element: origLink
Link Extractor Pattern: id/(\d+)/site
Link Reformatter: http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/
Content Extraction Pattern: (<div class="caption">.*)

The process I go through to get all this stuff: (may break this into a few messages)

- I enter the RSS feed link (I try to get RSS 2.0 links, since some Atom date formats aren't completely supported by web2book). I set the days to "0" and I select Test. If the full content of the articles is in the feed and everything is good, you don't have to do anything other than select the number of days you want, name the entry, and select the "enabled" box. If you are just getting a small snippet and want additional content, you need to fill in the "Link Element" so that web2book knows what link to follow.

- Since you have to find the right link for web2book to follow, view the source of the feed. I do this by typing the URL of the feed into Firefox, right clicking on the loaded page and selecting "view source". I then look for which tag in the page source holds the "real" link to the story (not a link that goes through FeedBurner or some in-between website). In this case the source was really funky and tough to read, but the origLink tag had the real link... and presto, that's the "Link Element".

-The next step is to run the test again. The output will probably be weird, but if you have the correct link element, the log should show web2book following the link and then converting the raw html it got into pdf.

-Assuming that web2book grabbed the article page that you wanted, you just have to figure out the "content extraction pattern" that will pull out the content without all the ads. Finding the correct regular expression is a bit of an art. I would recommend using the regular expression helper in web2book's Tools menu to test/experiment to find the right content extraction pattern. Copy the page source of the page that web2book grabbed the html from in the earlier steps into the Input field, type your regular expression into the RegExp field, and click Test. The "Group" field will be the html that would be sent on to be turned into PDF. A good guide that I refer to for building regular expressions is http://www.regular-expressions.info/tutorial.html . This is *definitely* an art form, and you might want to search the net for other, more complete tools to assist in building regular expressions. I know that I put in about a full week's worth of time to spin myself back up on complex regexes.

***Tip: Test your regular expressions before even trying them in web2book. web2book just takes the regular expressions and applies them to the html, so even if you *think* you have it right (which I did many, many times when I didn't have it right) you probably are missing a backslash or a parenthesis somewhere.

***Tip: If web2book doesn't actually generate a pdf during a test, take a look at the log. If the extracted link, and link reformatter both look good, then there is an error in your "content extraction pattern" regular expression. If you don't see a correct extracted or reformatted link, then there is an error in your "link element", "link extractor pattern" regular expression, or in your "link reformatter".

-If there is a "print me" link on the page and you want to use that page as your content source instead of the page at the destination of the "link element", then things get a little more complicated. You will have to find whether you can jump to the print page by grabbing the article ID from the "link element" URL or if you have to look on the destination of the "link element" for the URL of the print page. In this example we can grab the article ID directly out of the link element URL using another regular expression ("id/(\d+)/site") and pasting it into the middle of a fairly static URL for printing ("http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/").

If Newsweek weren't being nice and was instead complicated like Time, you would tick the box "Apply extractor to linked content instead of link text" and you would have to write *another* regular expression to be applied to the *content* of the *destination* of the "Link Element" to find the link to the printable version of the page. Take a look at the published Time feeds for a good example of having to go all the way down the rabbit hole to get to the printable versions of the page.

Some sites just plain won't let an automated "scraper" program like web2book grab the printable versions of their pages. They may "lie" and tell you they're going to the printable version of the page and not actually go there. It's tough to debug and will require a bit of intuition.

-Once you get the URL for the printable page, you still need a "Content Extraction Pattern" to be applied to the printable page; make sure that you exclude the "<title>" tag, or else you will have a funky title in the finished PDF.


So, that's it for the moment, time for bed tonight, but hopefully this helps a little in getting a good page. I've published a lot of examples, so subscribe to a few feeds using the File|Subscribe command and take a look.

Good luck, and good hunting!
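A tiny sketch of just the link-extraction step described above, using the Newsweek pattern and reformatter from this post. The program scaffolding and the example article ID are hypothetical; only the regular expression and the print URL template come from the post.

using System;
using System.Text.RegularExpressions;

class NewsweekPrintLinkDemo
{
    static void Main()
    {
        // Example value only; a real run would take this from the feed's origLink element.
        string linkElement = "http://www.msnbc.msn.com/id/17945949/site/newsweek/";

        // Link Extractor Pattern from the post: the captured digits become {0}.
        Match m = Regex.Match(linkElement, @"id/(\d+)/site");

        if (m.Success)
        {
            // Link Reformatter from the post, with {0} filled in by the captured group.
            string printUrl = string.Format(
                "http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/",
                m.Groups[1].Value);
            Console.WriteLine(printUrl);
        }
    }
}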

fritz_the_blank
04-04-2007, 03:17 PM
Thank you for your detailed response.

Thank you also to GeekRaver for his/her work on this project.

As it turns out, I had the wrong URL for the cover story. The correct URL should be http://feeds.newsweek.com/CoverStory and now things are working. However, I get the following error when testing:


See the end of this message for details on invoking
just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.ComponentModel.Win32Exception: No application is associated with the specified file for this operation
at System.Diagnostics.Process.StartWithShellExecuteEx(ProcessStartInfo startInfo)
at System.Diagnostics.Process.Start()
at web2book.Utils.RunExternalCommand(String cmd, String args, String workdir, Boolean useShell, Int32 timeout, String& output)
at web2book.MainForm.Test(ContentSourceList sourceClass, ContentSource source)
at web2book.MainForm.testButton_Click(Object sender, EventArgs e)
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)


************** Loaded Assemblies **************
mscorlib
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
Web2Book
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/Web2Book.exe
----------------------------------------
Utils
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/Utils.DLL
----------------------------------------
System.Windows.Forms
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Drawing
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
IHtmlConverter
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/IHtmlConverter.DLL
----------------------------------------
ISyncDevice
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ISyncDevice.DLL
----------------------------------------
ISource
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ISource.DLL
----------------------------------------
System.Configuration
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Configuration/2.0.0.0__b03f5f7f11d50a3a/System.Configuration.dll
----------------------------------------
System.Xml
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml/2.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------
ReadLit
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadLit.dll
----------------------------------------
ReadWeb
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadWeb.dll
----------------------------------------
ReadXWord
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadXWord.dll
----------------------------------------
ReadWeb
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ReadWeb.DLL
----------------------------------------
Accessibility
Assembly Version: 2.0.0.0
Win32 Version: 2.0.50727.42 (RTM.050727-4200)
CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/Accessibility/2.0.0.0__b03f5f7f11d50a3a/Accessibility.dll
----------------------------------------
writeHtmlDoc
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writeHtmlDoc.dll
----------------------------------------
WriteLRF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/WriteLRF.dll
----------------------------------------
writePDF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writePDF.dll
----------------------------------------
ITextSharpConverter
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/ITextSharpConverter.DLL
----------------------------------------
itextsharp
Assembly Version: 3.1.8.0
Win32 Version: 3.1.8.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/itextsharp.DLL
----------------------------------------
writeRTF
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/writeRTF.dll
----------------------------------------
SyncPRS500
Assembly Version: 1.0.0.0
Win32 Version: 1.0.0.0
CodeBase: file:///C:/Program%20Files/GeekRaver/Web2Book/SyncPRS500.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the .config file for this
application or computer (machine.config) must have the
jitDebugging value set in the system.windows.forms section.
The application must also be compiled with debugging
enabled.

For example:

<configuration>
<system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception
will be sent to the JIT debugger registered on the computer
rather than be handled by this dialog box.

fritz_the_blank
04-04-2007, 03:29 PM
As an addendum to my last post, I am using slightly different settings than the ones that you posted for me. For comparison:

Mine:

LE: guid
LEP: http://www.msnbc.msn.com/id/(\d+)/site/newsweek/?from=rss
LR: {0}&displaymode=1098


Yours:

LE: origLink
LEP: id/(\d+)/site
LR: http://www.msnbc.msn.com/id/{0}/site/newsweek/print/1/displaymode/1098/
CEP: (<div class="caption">.*)

Testing mine returns one article, yours returns none at the moment (most likely, I am doing something wrong).

Thank you once again for all of your help.

adinb
04-04-2007, 06:25 PM
There shouldn't be a problem with using the guid, in this case the guid and origLink are the same, though the guid has the "permalink=false" attribute, which usually doesn't matter, but I try to not use the guid when it uses that directive. But, it comes down to personal taste, tomahtoe, tomaytoe. ;)

Your LEP regular expression should put only the ID itself into field {0}, so your LR should probably include the link up to the ID; if it's filling the entire html link into field {0}, the regex engine is being nice to ya.

My regex only grabs what's directly around the digits just because I try to leave as much room as possible for site changes--if the link changes at all your regex won't match; mine isn't much more flexible, but either works--it's more a matter of taste.

Your LR leaves a bunch of gunk at the bottom of the entry ("More from Newsweek Health"), so just make sure to adjust your CEP regular expression to account for the extra gunk. I left mine open-ended so that it'd be a little more flexible in case the source html changed at all--but there's nothing wrong with putting something solid on the trailing part of your CEP regular expression.
You do need to include a CEP: when there's a <title> tag in the html that's sent to htmldoc, it'll make a title (even though the cmdline specifies "no title"). The title overrides the filename in the PRS-500's display, so in your "book" listing, it'll show as the contents of the title tag instead of "rss-Newsweek Cover".


There's only one article in the feed at a time, so one article is valid, though I'm going to attribute any errors in my message to it being late--the entry that I published last night to Geekraver's server should be correct. I'm testing a date format fix ATM, so my copy may be parsing dates that V23 isn't.

fritz_the_blank
04-04-2007, 09:58 PM
I just tried your setting again and it found an article, and the output from yours is soooo much cleaner!

I should be able to apply that setting to the remainder of the Newsweek feeds.

Thank you once again for your help.

FtB

geekraver
04-05-2007, 06:27 PM
Like many others I was surprised at how poor the Sony Connect software is for such a good unit, and delighted when I found web2book. I use it for several RSS feeds I watch. Now, I have used computers for probably longer than many of the contributors to this forum (As a Capetonian, Geekraver might like to know that at UCT in the early 70's I used to work with the Psychology Department main frame - & yes, the units were literally mounted on a frame); however, I do not have the knowledge to customize my feed/web information to pick out particular sub feeds, or threads (eg this Mobileread one here) or to manage one that needs a password. Is there any chance that someone might write a basic set of instructions for those like me? I expect that there are others who want this but feel too awed by the high geek quotient of the forum contributors to ask.

He he - well, I do remember the old Sperry 1100 well, writing Fortran progs in punched cards.

By this stage Adin is probably more of an expert than I am. He gave a pretty detailed description of his approach (which I haven't yet read in detail). I'll add mine as it may be slightly different and have some value.

1. First you need to get the URL for the RSS feed of the site you care about. Enter it into your browser and look at the results. If they have the content you want, then all you really need to do is add the URL to web2book; you shouldn't even need to bother with the settings under 'Customize'

2. Assuming they don't have the content you want (e.g. they have an excerpt and end with "Read More" or something like that), then you will need to customize them. Typically I will at this point do two things:

i) right click in the browser and select 'View Source', and look at the RSS XML, to make sure that the permalink or other link has an XML tag that web2book expects; you can see which one web2book expects by going to Customize and clicking on Help. If this feed for some reason has an unusual XML element tag, then you'll need to enter its name in the Link Element field.

ii) in the original page in the browser, click on the title link of the first story to have the browser load up the referenced page. We now want to deal with this page, which we'll do in step 3.

3. If the page has a "Printable version" or "Print" link at the top or bottom, we probably want to use that version of the page, as it will have less fluff like ads that needs to be stripped out (if there is no such link go to step 4). So we have to figure out how to get at the link for that. I'll typically hover over the "Print" or "Printable Version" button/link, and see in the status bar of the browser what the URL is for that version. We want to either munge the original article link into this new print one (which we might be able to do just with the link extraction patter and link reformatter), or we may have to suck the link out of the page we are now viewing (which requires checking the checkbox which says "Apply extractor to linked content instead of link text). In the latter case we have to look at the web page source and find the part that has the HREF for the printable version and figure out a regexp pattern to get at that. Regexp patterns and reformatting are a whole separate topic that I will discuss later. Once the link extractor and link reformatter are done, we should have an URL that refers to the low-fluff version of the content. Load up that content in your browser.

4. Now we want to remove ads, etc, from the page. You have to 'View Source' in your browser, and look for the start and end of the content you care about. Then comes the tricky part, which is trying to find some unique delimiters that bracket this content. Once you've found these (and sometimes it isn't possible) you can create a content extraction pattern, and perhaps a content reformatter (if necessary) for getting the content out. A content reformatter is usually just useful if you need to rebalance some HTML tags in the extracted content, or in cases where the content extraction pattern is complex and extracts the content in multiple pieces ("groups") that must be reassembled.

The regular expression helper in the tools menu is very useful for testing your regular expressions. You can do a "View Source" in your browser and paste the full HTML content of a page in the Input box, enter your regular expression in the RegExp box, and click the Test button, to see what parts of a page your pattern will extract. You must use grouping (which is done with parentheses) to specify the content you want to keep, and if you use more than one group you will need a reformatter to specify how those groups get put back into a single piece of text. When learning to use regexps also pay attention to "greedy" (match as much text as possible) versus non-greedy (match as little text as possible) matching, as sometimes you need one style and sometimes the other; there is a short illustration of the difference after this post.

If you're lucky you might find DIV html tags with "class" attributes that bracket the content you want. This is fairly common. Comment blocks are also commonly used to identify the article content start and end. An excellent way to master this stuff is to look at the existing published feeds, and work through them yourself, trying to understand how the existing settings make them tick. Do test them though, as websites change and some of the published entries may break, and you might go nuts trying to understand how something works when in fact it doesn't work any more!
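The greedy versus non-greedy distinction mentioned above is the thing that trips most people up when writing content extraction patterns, so here is a small standalone illustration using .NET regular expressions. The HTML snippet, class names, and div markup are made up purely for the example and are not from any real feed.

using System;
using System.Text.RegularExpressions;

class GreedyVsNonGreedyDemo
{
    static void Main()
    {
        // Made-up HTML purely for illustration.
        string html = "<div class=\"story\">First article</div> ...ads... <div class=\"story\">Second article</div>";

        // Greedy: .* runs to the LAST </div>, swallowing the ads and the second story.
        Match greedy = Regex.Match(html, "<div class=\"story\">(.*)</div>", RegexOptions.Singleline);
        Console.WriteLine(greedy.Groups[1].Value);
        // -> First article</div> ...ads... <div class="story">Second article

        // Non-greedy: .*? stops at the FIRST </div>, which is usually what a content
        // extraction pattern wants.
        Match lazy = Regex.Match(html, "<div class=\"story\">(.*?)</div>", RegexOptions.Singleline);
        Console.WriteLine(lazy.Groups[1].Value);
        // -> First article
    }
}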

nmackay
04-05-2007, 10:13 PM
Thank you, Geekraver and AdinB, for those details and your work. I am already helped by your replies. I hope others are too. & I will spend some of the Easter weekend finding how to extract more material from various pages.
NM

adinb
04-06-2007, 04:13 AM
Well, I doubt that I'm all *that* much of an "expert" (I shudder at the word ;) ), but I'm glad that I've been able to help out.

And Geekraver has done an excellent job with web2book, and his post on his approach has *excellent* tips on what to look for. I know that it took me a while to realise that the "view source chart" Firefox extension was changing quote types (in spans and divs) -- I still highly recommend the extension to quickly make sense of the page source (at https://addons.mozilla.org/en-US/firefox/addon/655 ); just know that if your regex isn't working, check the raw page source.

And if there's anything I can do for anyone, (help debug regex's, point to tutorials and tools for regex's on windows or osx) feel free to PM or email me.

-adin

InspectorGadget
04-06-2007, 08:13 PM
If you'll pardon the remedial question, I can't even get the "Subscribe" function to work. When I click on "File | Subscribe", it just freezes up for 20 seconds and then comes up with an empty list in a "Subscribe to Feed" window (if I'm on the "Feed" tab). It does the same with any of the tabs in the main window. It's behaved this way consistently over the last few days, at home and at work.

I downloaded Web2Book from GeekRaver's original post on this thread, but everyone else is saying, "rss2book". Do I have the correct program??

I downloaded the accessory DLLs but it turned out I already had them. I installed the non-beta .NET Framework 2.0 fresh. I also downloaded and installed HtmlDoc but I haven't gotten that far yet.

Any ideas to get me going?

geekraver
04-06-2007, 08:45 PM
The problem here really is just my dear ISP (Verizon) behaving badly. I can't access my server either from outside right now. I think I am connected to a flaky router on Verizon's end. I will disconnect tonight for a while and then re-attach and hopefully I'll get a different DHCP address and a better router (unfortunately I don't have a static IP so there are a few moving parts in keeping my server visible and accessible).

If anyone else wants to volunteer a more reliable WebDAV server for the publish/subscribe facility feel free to contact me.

InspectorGadget
04-06-2007, 11:32 PM
Oh, it's Verizon.

They're probably too busy doing "you know what" to actually provide a product or service. Thanks for letting me know.

BTW, I'm really into this eBook idea and the extra dimensions that rss gives it. I also have a very reliable FreeBSD server on a permanent link. Email me and let's talk about it.

Echoloc8
04-07-2007, 12:15 PM
Greetings all, just wanted to mention that, probably from the same outage keeping everyone from geekraver's feed server, the MSI is unavailable for the Rss2Book app itself.

Sigh, and I just got HTMLDoc installed. :rolleyes5

-Rich

fritz_the_blank
04-07-2007, 03:13 PM
Dear GeekRaver--

I have a site that has 99.9% uptime. I can't install components on it (it is a web hosting server and I don't have RDP access) but I am happy to provide space if that helps.

FtB

geekraver
04-08-2007, 02:50 AM
The basic requirement of a publish/subscribe server is WebDAV filesystem support; I believe IIS has this and Apache certainly does via the DAV module. If anyone has a reliable server with that then I can make the switch.

geekraver
04-08-2007, 02:57 AM
Greetings all, just wanted to mention that, probably from the same outage keeping everyone from geekraver's feed server, the MSI is unavailable for the Rss2Book app itself.

Sigh, and I just got HTMLDoc installed. :rolleyes5

-Rich

Don't forget you can get it from download.com too. Not the latest version (I think they have rel 22 at present while I have rel 24), but it's a start.

adinb
04-08-2007, 06:07 PM
I also have some hosting space, and I'd love to make ASmallOrange earn their pennies since I don't use my account all that much.

Echoloc8
04-09-2007, 01:17 AM
Don't forget you can get it from download.com too. Not the latest version (I think they have rel 22 at present while I have rel 24), but it's a start.

When I search on C|net's download.com, "rss2book" and "geekraver" both give me no results in regular or advanced search. Am I just being dense? :-)

-Rich

Echoloc8
04-09-2007, 01:19 AM
When I search on C|net's download.com, "rss2book" and "geekraver" both give me no results in regular or advanced search. Am I just being dense? :-)
Whoops, yes I was. It's "web2book".

Thanks!

-Rich

mantici
04-09-2007, 01:30 PM
i'm trying. . .and i'm a developer. . .so you'd think that this regex stuff would be simple. . . but it's just not workin. . . anyone???

here's the details:

im trying to pull:
http://www.newstimeslive.com/rss/local_news.xml

the place for content is:
http://www.newstimeslive.com/storyprint.php?id=1043582

XML link data is:
<link>http://www.newstimeslive.com/news/story.php?id=1043582</link>

so you say . . .OH that's easy!!!!
LE = link
LEP = (\d+)
LR = http://www.newstimeslive.com/storyprint.php?id={0}

and boom i'm done. . . NOT. :blink:


i get:
Processing News Times Local
Got link from RSS: http://www.newstimeslive.com/news/story.php?id=1043639
Mon,09 Apr 2007 11:04:20 -0400 is in range

Done link extraction{0} = 3
Reformatted link is http://www.newstimeslive.com/storyprint.php?id=3

where on EARTH does 3 come from???? the LAST character. . . but that's not the regex i entered


so i try LEP = id=(\d+)

i get:

Processing News Times Local
Got link from RSS: http://www.newstimeslive.com/news/story.php?id=1043639
Mon,09 Apr 2007 11:04:20 -0400 is in range

Done link extraction{0} = 949
Reformatted link is http://www.newstimeslive.com/storyprint.php?id=949
Final content:

HUH??? :blink: 949??? that's not even in the string!!!!
this doesn't make any sense. . . when i put this into REGEX testers online, it seems to work.
Any thoughts to what i'm doing wrong???


an example of an XML RSS item is:

<item>
<link>http://www.newstimeslive.com/news/story.php?id=1043583</link>
<description xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005" cf:type="html">An Evening of Brahms. Join the Ridgefield Symphony Orchestra Saturday at 8 p.m. for a &amp;quot;Brahms Blockbuster.&amp;quot; The concert will be at Ridgefield High School's auditorium, 700 North Salem Road (Route 116). Pianist Rui Shi will perform with the orchestra. Tickets are $25, $40 and $50 for adults and $15, $25 and $30 for youths 18 and under. For reservations call the RSO office at (203) 438-3889. </description>
<pubDate>Mon, 09 Apr 2007 12:04:03 GMT</pubDate>
<atom:published xmlns:atom="http://www.w3.org/2005/Atom">2007-04-09T12:04:03Z</atom:published>
<atom:updated xmlns:atom="http://www.w3.org/2005/Atom">2007-04-09T12:04:03Z</atom:updated>
<cfi:id>988</cfi:id><cfi:read>true</cfi:read>
<cfi:downloadurl>http://www.newstimeslive.com/rss/local_news.xml</cfi:downloadurl>
<cfi:lastdownloadtime>2007-04-09T12:15:12.647Z</cfi:lastdownloadtime>
</item>

<item>

geekraver
04-10-2007, 04:17 AM
You don't have the "Apply extractor to linked content..." checkbox checked, do you?

pclewis
04-10-2007, 10:18 AM
Hi Geekraver:

I upgraded to 2.4 from 2.3 on a Vista machine. When I try to do an RSS to LRF conversion I get a crash of Web2Book. I turned off direct sync to the reader, and a file is generated in HTM even though I set LRF. This file reads fine in the browser. I assume it must generate the .htm and then convert it to LRF, and I assume this is where the crash is. I went back and tested 2.3 and it also makes an HTM when set to LRF; however, it does not crash.

When I use RTF, all is well. So, what do you think? Thought you might want to know.

Also, I cannot get RSS Subscriptions to load. Does this come from your server that is down?

Phil

mantici
04-10-2007, 10:49 AM
ahhh sweet Jesus. . . why yes it is geekraver. . . . i suppose i should have heeded my own advice. . . RTFM.

Geekraver, you da man! AMAZING software.. . expect a donation.

geekraver
04-10-2007, 12:18 PM
Hi Geekraver:

I upgraded to 2.4 from 2.3 on a Vista machine. When I try to do an RSS to LRF conversion I get a crash of Web2Book. I turned off direct sync to the reader, and a file is generated in HTM even though I set LRF. This file reads fine in the browser. I assume it must generate the .htm and then convert it to LRF, and I assume this is where the crash is. I went back and tested 2.3 and it also makes an HTM when set to LRF; however, it does not crash.

When I use RTF, all is well. So, what do you think? Thought you might want to know.

Also, I cannot get RSS Subscriptions to load. Does this come from your server that is down?

Phil

Once the HTML is generated there is not much left to do other than run it through the Librie DLLs that do the conversion. You may find a diagnostic log in your root directory in the C: drive for this process. Let me look into this tonight; I have to run out now and am in training all day.

As for subscribing, yes, my Internet connectivity is flaky. Verizon has some bad routers and it can be difficult to get disassociated from them (and only once in 5 years have I ever got hold of a tech support person there who wasn't near braindead and reading from a script, telling me to reboot my Windows PCs when my whole network, including Macs and a Linux box, was equally affected). Inspector Gadget has offered to help and I'll make a rel 25 soon that will fall back to his site.

Fugubot
04-10-2007, 09:32 PM
Geekraver,

I heard Leo Laporte saying on net@nite that he bought a Sony Reader and I saw him discussing it on Jaiku. I posted to let him know about my enthusiasm for web2book and he said he'd try it. I'm waiting to see if he mentions it on any of his podcasts.

fritz_the_blank
04-11-2007, 12:19 AM
The basic requirement of a publish/subscribe server is WebDAV filesystem support; I believe IIS has this and Apache certainly does via the DAV module. If anyone has a reliable server with that then I can make the switch.

My server runs Windows with IIS. If you would like to try using that, and it works, you are more than welcome to use me as a mirror or as your primary hosting. It is the least that I can do after all of your work.

FtB
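For anyone wondering what "WebDAV filesystem support" means in practice: publishing a feed definition boils down to an HTTP PUT of an XML file to the server, and subscribing to fetching it back. A minimal C# sketch of the PUT half, with a made-up server URL and file name (this is not web2book's actual publish code):

using System;
using System.IO;
using System.Net;
using System.Text;

class PublishSketch
{
    static void Main()
    {
        // Hypothetical WebDAV-enabled URL; the real server path will differ.
        string url = "http://dav.example.com/web2book/feeds/myfeeds.xml";
        byte[] body = Encoding.UTF8.GetBytes("<feeds>...</feeds>");

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "PUT";                  // WebDAV writes files with plain HTTP verbs
        request.ContentType = "text/xml";
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            Console.WriteLine("Publish returned " + (int)response.StatusCode);
    }
}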

geekraver
04-11-2007, 02:52 AM
It seems that the problem may have been with the router I use (WRT54G) and the way it was configured (long idle timeouts and a small connection table). I have updated to the latest DD-WRT firmware and reconfigured it, so I'd be interested to hear whether people find the connectivity issues improve.

geekraver
06-12-2007, 02:40 PM
I have just finished migrating my server at home from Linux to FreeBSD. I switched from about 12 years of using FreeBSD to (Gentoo) Linux about 4 years ago, and the server worked well, until a couple of months back when I did my 6-monthly system update. Those of you who use web2book have probably found the publish/subscribe feature almost totally unusable in this period. I never managed to figure out why the machine had become so unreliable, and decided to flatten it, and also go back to FreeBSD, as I have many years experience with the latter and think it will cause me less work. I'd be interested to hear if people find the server more reliable now (if not then I have router trouble I guess).

Hadrien
06-12-2007, 08:30 PM
I have just finished migrating my server at home from Linux to FreeBSD. I switched from about 12 years of using FreeBSD to (Gentoo) Linux about 4 years ago, and the server worked well, until a couple of months back when I did my 6-monthly system update. Those of you who use web2book have probably found the publish/subscribe feature almost totally unusable in this period. I never managed to figure out why the machine had become so unreliable, and decided to flatten it, and also go back to FreeBSD, as I have many years experience with the latter and think it will cause me less work. I'd be interested to hear if people find the server more reliable now (if not then I have router trouble I guess).

FreeBSD is a kickass OS. We're running Feedbooks on FreeBSD+Lighttpd. Better and faster than the Linux+Apache combo.

squeezebag
06-26-2007, 02:54 PM
All:

My wife and I decided to sell the house, the car, and move to Mexico. One of the things that I would really miss is my subscription to the New Yorker (the surface mail there is useless). Someone told me that I could get a Sony Reader and pick up the weekly subscription via RSS. Imagine my disappointment when I brought it home and discovered the out-of-the-box RSS support - weak.

Then I stumbled upon GeekRaver's excellent software - and thought that my problems were solved. I'm trying to work my way through the examples given in the post but without a lot of luck. I'm wondering if someone could show me what the feed settings should be to achieve my goal.

What I'd like to retrieve, on a weekly basis, is the full printable versions of the content at this URL: http://www.newyorker.com/services/rss/summary.

If I could end up with a weekly, Sony-Reader-friendly version of the New Yorker, I'd be a happy, happy camper. Any assistance with the settings would be greatly appreciated.

geekraver
06-29-2007, 04:43 AM
I published an entry for the New Yorker; you should be able to use Subscribe to add it.

It doesn't include the leading pictures; if you want those modify the content extraction pattern to say "start article rail" instead of "start article body".

BTW it doesn't work well with the built-in PDF and RTF converters (due to unprocessed 16-bit characters); seems to work fine with LRF though. I haven't tried it with the htmldoc PDF conversion option.

geekraver
06-29-2007, 05:02 PM
BTW you may have trouble hitting the server. I'm now convinced the issue is with my router (Linksys WRT54G with DD-WRT firmware), and not the server. I can access the server just fine from behind the router, and from outside I have no problem hitting the server on IMAP and SMTP ports, but for some reason HTTP isn't being forwarded even though it is configured just the same.

I think I'll try different firmware on the router tonight.

adinb
07-01-2007, 02:41 AM
I published a feed for the New Yorker as well.

The issue seems to be that the New Yorker won't allow the robot fetch to go directly to the printable version of the page/article. (Cookies, spoofing referring pages, and/or spoofing the user agent string might fix that... I hope to see something like that in future versions of web2book.)

One thing I can't seem to remember how to do: how to get the link reformatter to reference the original link element *and* the regex string fetched via the link extractor pattern when the "apply extractor to linked content instead of link text" option is selected. From my dim memory, I don't remember being able to really use the link reformatter if the follow option is checked, but I could be *totally* wrong.

Oh, BTW, publish appeared to work for me tonight.

-adin

squeezebag
07-02-2007, 01:31 AM
Regarding the NewYorker feeds,

Thanks a ton. I'm now able to pick up the full articles from the print links (including the pictures and captions). I used the following settings:

URL: http://feeds.newyorker.com/services/rss/feeds/everything.xml
Link Element: Link
Apply extractor to linked content: (checked)
Link Reformatter: {0}?printable=true
Content Extraction pattern: <!-- start article rail -->(.*) <!-- end article body -->

Converts to LRF perfectly. I have two remaining questions.

-I've been able to filter out most of the garbage with the Content Extraction Pattern, but I'm still picking up a "keywords" section that I'd like to exclude. Does the content extraction allow me to extract from A to B, and then from C to D? In other words, there is stuff at the beginning and stuff at the end that I'd like to exclude, and also a block of stuff in the middle that I'd like to filter out. What's the format for this?

-Also, is there any way to build a table of contents? I can pick up the section summaries from: http://feeds.newyorker.com/services/rss/feeds/everything.xml but is there any way that I can prepend the full extraction with this file? A perfect world would allow me to link from the TOC to the full articles but I'll live with whatever I can get.

Thanks again for your help.

Also, the subscribe function works flawlessly now!

geekraver
07-02-2007, 09:44 PM
For TOC, you have a couple of options: using htmldoc for PDF, or writing your own output plugin that pre-massages the HTML. I may add this as a feature later.

For content extraction, in the regular expression pattern you need to group the various parts you want in parentheses; you then use {0}, {1}, {2}, etc in the formatter to represent the matched blocks. So you might use a pattern like:

<!-- start article rail -->(.*)<foo>.*<bar>(.*)<!-- end article body -->

assuming <foo> started the tag section you wanted to skip and <bar> ended it (".*" represents any sequence of characters, in case you didn't know that already).
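To make that concrete, here is a small standalone C# sketch (not web2book's internal code) that applies a two-group pattern like the one above and then glues the captured blocks back together, which is what the content formatter placeholders let you express:

using System;
using System.Text.RegularExpressions;

class ExtractSketch
{
    static void Main()
    {
        string html = "junk<!-- start article rail -->first block<foo>skip this<bar>second block<!-- end article body -->junk";

        // Two unnamed groups, as in the pattern above; Singleline lets '.' cross line breaks in real pages.
        Regex pattern = new Regex(
            "<!-- start article rail -->(.*)<foo>.*<bar>(.*)<!-- end article body -->",
            RegexOptions.Singleline);

        Match m = pattern.Match(html);
        if (m.Success)
        {
            // Keep the two captured blocks and drop everything else.
            string kept = m.Groups[1].Value + " " + m.Groups[2].Value;
            Console.WriteLine(kept);   // prints: first block second block
        }
    }
}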

ddavtian
07-08-2007, 01:40 AM
Asked a question that was answered right in front of my post.
Sorry for stupid post.

Waiting for GeekRaver to add a ToC feature.

_underzcore_
07-20-2007, 03:33 PM
geekraver, your app is tantalizingly good . . . but I'm getting killed on the feed I'm trying to save (The Economist print edition). There seem to be two big hurdles the way they have the feed set up:

1) It's hosted by a second party, so there's an intermediary link through pheedo.com that then points the browser back to the article at economist.com.

2) The articles at economist.com are stored in different directories (e.g. ".../opinion/," ".../world/la/"). This seems to be fouling up my efforts to change a "displaystory.cfm?story_id=" into a "PrinterFriendly.cfm?story_id=" with just one set of regular expressions. And it's hard to tell if I'm properly sidestepping the pheedo.com blind alley.

Help?

flamaest
07-27-2007, 03:53 PM
Does someone have a list of RSS feeds which are FULL article feeds..?

from any news sources, I don't care...

Most of the ones I find are intro-snippet only..

POST YOUR FULL RSS URLs.. please..?

Help??
F.

geekraver
07-27-2007, 04:15 PM
Does someone have a list of RSS feeds which are FULL article feeds..?

from any news sources, I don't care...

Most of the ones I find are intro-snippet only..

POST YOUR FULL RSS URLs.. please..?

Help??
F.

Much of the usefulness of web2book is that it turns partial feeds into full feeds. Just try the Subscribe feature (on the File menu). My server might be a bit slow as it is doing an online backup, but be patient and you should get a list.

geekraver
07-27-2007, 04:17 PM
I'm heading out on vacation; I'll respond when I get back.

geekraver, your app is tantalizingly good . . . but I'm getting killed on the feed I'm trying to save (The Economist print edition). There seem to be two big hurdles the way they have the feed set up:

1) It's hosted by a second party, so there's an intermediary link through pheedo.com that then points the browser back to the article at economist.com.

2) The articles at economist.com are stored in different directories (e.g. ".../opinion/," ".../world/la/"). This seems to be fouling up my efforts to change a "displaystory.cfm?story_id=" into a "PrinterFriendly.cfm?story_id=" with just one set of regular expressions. And it's hard to tell if I'm properly sidestepping the pheedo.com blind alley.

Help?

flamaest
07-27-2007, 06:24 PM
Much of the usefulness of web2book is that it turns partial feeds into full feeds. Just try the Subscribe feature (on the File menu). My server might be a bit slow as it is doing an online backup, but be patient and you should get a list.

cool, can't wait to try your software!!!

Question: I have read this whole thread, and I can't seem to determine if your software comes with a bunch of RSS URLs built in by default.

Thanks,
F.

JSWolf
07-27-2007, 08:01 PM
Have you tried web2lrf which is part of the libprs500 package?

See the thread over at http://www.mobileread.com/forums/showthread.php?t=12149

flamaest
07-30-2007, 02:24 AM
Much of the usefulness of web2book is that it turns partial feeds into full feeds. Just try the Subscribe feature (on the File menu). My server might be a bit slow as it is doing an online backup, but be patient and you should get a list.

Got the app and all the components working great.. I downloaded someone's full.xml file from this thread and have a bunch of feeds I can start with..

I tried the Subscribe feature in the app, but the app just sits there and nothing seems to happen.. is this normal?

Thanks,
Fabian.

flamaest
07-31-2007, 12:55 AM
I can't seem to get these feeds working quite right.. can someone with more experience help me out..?




Need help with:

http://www.hot-deals.org/rss/xml/
http://feeds.feedburner.com/PSP-Spot
http://www.leftlanenews.com/wp-rss2.php
http://feeds.gawker.com/consumerist/full
http://feeds.feedburner.com/pocketables/PpUx
http://www.the-gadgeteer.com/feed/rss.xml
http://www.ps3-hacks.com/
http://www.dvorak.org/blog/?feed=rss2
http://feeds.feedburner.com/SonyPs3Modding-HomebrewUpgradesModsAndHacks
http://feeds.feedburner.com/RealityWired
http://digg.com/rss/index.xml
http://simplefeed.informationweek.com/rss/?f=6173d5d0-01dc-11dc-3f66-00304887398a
http://feeds.feedburner.com/grouchygeek
http://feeds.feedburner.com/fortuneaskannieblog
http://rss.cnn.com/rss/money_retirement.rss
http://feeds.computerworld.com/Computerworld/News
http://feeds.feedburner.com/typepad/munjal/recognizing_deven
http://kiplinger.com/rss/s.php/k46aa2d7b019e1/headlines.rss


Thanks!
Fabian.

flamaest
07-31-2007, 02:13 AM
Here are my sources - the enabled ones are those that seem to work best for me. I started from the original poster who shared his XML sources and grew the list from there.

attached.

frank10
08-03-2007, 02:03 AM
I'm trying to process this RSS feed and it's only picking up the headlines:

http://blog.cleveland.com/sports/atom.xml

flamaest
08-03-2007, 12:49 PM
I have this same problem with several of the feeds I listed above; I guess I'll have to wait for the web2book experts.. :)

F.

guardianx
08-04-2007, 12:28 PM
I can't get the program to work; it keeps crashing left and right.
It crashes when I go to Subscribe.
It crashes when I click Go..

Help?
I use WinXP Service Pack 2

glitchu1
08-05-2007, 09:45 AM
I downloaded the .NET 2 Framework thing but I can't get it to work... I subscribed to a couple from the list, and when I hit 'test' it brings up the little box and then says '0 articles'.

Do you know what I'm doing wrong?

flamaest
08-05-2007, 07:38 PM
The feedbooks.com sync program looks very interesting.. this might be more of what I am looking for..

F.

geekraver
08-09-2007, 08:26 PM
I can't get the program to work; it keeps crashing left and right.
It crashes when I go to Subscribe.
It crashes when I click Go..

Help?
I use WinXP Service Pack 2

Can you be more specific? What are the crashes?

geekraver
08-09-2007, 08:28 PM
Got the app and all the components working great.. I downloaded someone's full.xml file from this thread and have a bunch of feeds I can start with..

I tried the Subscribe feature in the app, but the app just sits there and nothing seems to happen.. is this normal?

Thanks,
Fabian.

Sorry, while I was on vacation last week I tried using Mozy to do an online backup of my system, which chewed up all my bandwidth. It should work better now (I gave up on Mozy in the end; it killed my bandwidth and was still so slow that I'd rather just buy more hard drives).

dietric
08-19-2007, 02:40 PM
I don't want to alarm anyone unnecessarily, but McAfee VirusScan reports that the temporary files created through the RSS conversion process are infected with the Exploit-ObscureHtml trojan. This might well be VirusScan being overzealous about the HTML content, but you should know nevertheless (since it also prevents the program from working correctly).

guardianx
08-20-2007, 10:08 PM
I've been thinking about what the best approach is for collecting them. There are various options:

- I could collect them and put them on my website
- I could keep adding them as attachments in the initial post; that may become unwieldy
- I could keep adding them to a single big Xml file that is kept with the initial post
- we could use the wiki
- we could just keep them on a thread

The main drawbacks to the last approach seem to be the haphazard organization that would result. Right now it seems like the wiki might be the best approach, and I can roll up the submissions on occasion into a single file and attach that to the first post.

So I've started a page at http://wiki.mobileread.com/wiki/Xml_feed_files

What do I do with this info? Sorry, I'm new.

toomanybarts
08-28-2007, 08:13 PM
Is anyone else getting this program to actually sync with their Sony Reader?
I have checked the boxes to allow syncing to the reader, and the log shows that final content has been received, but nothing appears on the reader at all.

geekraver
08-29-2007, 08:34 PM
What do I do with this info? Sorry, I'm new.

You use the Subscribe option on the File menu.

dietric
09-02-2007, 06:42 PM
I don't want to alarm anyone unnecessarily, but McAfee VirusScan reports that the temporary files created through the RSS conversion process are infected with the Exploit-ObscureHtml trojan. This might well be VirusScan being overzealous about the HTML content, but you should know nevertheless (since it also prevents the program from working correctly).

Would the developer be inclined to look into this problem? I would really love to use this software, but the mentioned problem prevents me from doing so.

Best
-ds

JSWolf
09-02-2007, 08:00 PM
Would the developer be inclined to look into this problem? I would really love to use this software, but the mentioned problem prevents me from doing so.

Best
-ds
McAfee is giving you a false positive. Either update McAfee or find a virus scanner that actually works. Or you could always turn it off, get your RSS feed sorted, and then turn it back on.

guardianx
09-05-2007, 02:14 PM
You use the Subscribe option on the File menu.

When I do that, the program crashes. Everything I do, the program crashes, wtf.
I followed the download instructions, and I'm not that much of a newbie when it comes to computers. I guess I'm out of luck then; oh well, I will stick with book design. When all of the bugs are fixed I will give this program another shot.

squeezebag
09-06-2007, 06:30 PM
Anyone else having problems with the subscribe or publish functions? I'm using version 2.4 and it hangs every time I invoke either.

flamaest
09-06-2007, 07:30 PM
This tool definitely has its merits and I used it for a long time.

Honestly, after feedbooks.com showed up with their newspaper feature and their synchronization tool for my Sony Reader, I can now dock my reader and load up all my RSS feeds from feedbooks in seconds.

I still do appreciate RSS2book for introducing me to properly formatted PDF RSS feeds and for those stubborn websites with limited RSS feeds.

F.

geekraver
09-07-2007, 07:35 PM
Anyone else having problems with the subscribe or publish functions? I'm using version 2.4 and it hangs every time I invoke either.

The problem is my DSL speed. Verizon cannot upgrade me as I am on frame relay and there is no ATM or FIOS available in my area. I will look into other solutions for hosting this.

angrytrousers
09-09-2007, 06:48 AM
Hi! Great program.

Is it possible to generate SEPARATE PDFs for each story in a feed?
I'm trying to create an archive of stories from a particular site, and I'd rather have separate PDFs than one giant one with a month's worth of stories.

HTMLDoc doesn't seem to have this feature natively either. Maybe I'd have to run your program repeatedly, once for each link?

Thanks!

toomanybarts
09-11-2007, 06:23 PM
If someone can help me understand how I would pull content from the following website (using the "Web Page" tab of rss2book), it will go a long way toward my understanding of not only how this program works, but also the regex expressions required to get at the content (and only the content) we are all using this program for:
"http://www.timesonline.co.uk/tol/comment/columnists/jeremy_clarkson/"

There are a number of links on the page that reference the various blog entries I want to pull, but when I change the rss2book "follow links" setting to depth 2 (or more) I get this error:
"Processing clarkson

System.UriFormatException: Invalid URI: The URI scheme is not valid.
at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
at System.Uri..ctor(String uriString)
at web2book.Utils.ExtractContent(String contentExtractor, String contentFormatter, String url, String html, String linkProcessor, Int32 depth, StringBuilder log)
at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.Utils.GetHtml(String url, Int32 numberOfDays, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.WebPage.GetHtml(ISource mySourceGroup, Int32 displayWidth, Int32 displayHeight, Int32 displayDepth, StringBuilder log)
at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source, Boolean isAutoUpdate)"

If I leave it set at 1 I get:
"Processing clarkson

Final content:
===================

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><meta name="ROBOTS" content="NOARCHIVE" /><script type="text/javascript">
// Variables required for DART. MUST BE IN THE HEAD.
var time = new Date();
randnum = (time.getTime());
</script><!-- Code to display title of the HTML page --><title> Jeremy Clarkson Columns & Comment | Times Online </title><meta name="Description" content="The UKs favourite motoring journalist comments on British society and culture in his weekly columns on Times Online"><link rel="shortcut icon" type="image/x-icon" href="/tol//img/favicon.ico" type="image/x-icon" /><link rel="stylesheet" type="text/css" href="/tol/css/alternate.css" title="Alternate Style Sheet" /><link rel="stylesheet" type="text/css" href="/tol/css/tol.css"/>
<link rel="stylesheet" type="text/css" href="/tol/css/ie.css"/><link rel="stylesheet" type="text/css" href="/tol/css/typography.css"/><script language="javascript" type="text/javascript" src="/tol/js/tol.js"></script></head><body><div id="top"/><div id="shell"><div id="page"><!-- START REVENUE SCIENCE PIXELLING CODE --><script language="javascript" type="text/javascript" src="/tol/js/DM_client.js"></script><script language="javascript" type="text/javascript">
DM_addToLoc("Network",escape("Times"));
DM_addToLoc("SiteName",escape("Times Online"));
</script><script language="javascript" type="text/javascript">
// Index page for Revenue sciences"


...there's loads more; this is just part of the content. The point is, I thought that changing the "Follow links to Depth" setting to 2 would grab not only the page referred to in the URL, but also follow the links from that URL's page?
I would then need to work on what REGEX would be needed to tidy up the resulting mass of content. (That would be problem / lesson 2, but one thing at a time!)

Am I missing something?
(I realise there is an RSS feed page where I can pull the current top 4 or 5 blog entries, and adinb has helped me clean this up to be readable; what I want to understand is how to manipulate webpages.)

(Thank you again to adinb, who has been helping me with this problem using the RSS feed and the "Feed" tab of rss2book via PM; it's people like him that keep these types of forums useful. I thought it may be useful for others to understand how it all works, and to lighten the load on adinb!)

Thank you all in advance.
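One common cause of a UriFormatException when following links is that a page's hrefs are relative (no http:// scheme), so they cannot be turned into a Uri on their own; whether that is what is happening here I can't say, but resolving each link against the page's own URL is the usual cure. A tiny C# sketch of that step (the article file name is made up for illustration):

using System;

class ResolveSketch
{
    static void Main()
    {
        Uri page = new Uri("http://www.timesonline.co.uk/tol/comment/columnists/jeremy_clarkson/");
        string href = "/tol/comment/columnists/jeremy_clarkson/article123.ece";  // hypothetical link found on the page

        // new Uri(href) alone throws UriFormatException because there is no scheme;
        // the two-argument constructor resolves the link against the page it came from.
        Uri absolute = new Uri(page, href);
        Console.WriteLine(absolute);
        // http://www.timesonline.co.uk/tol/comment/columnists/jeremy_clarkson/article123.ece
    }
}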

Liviu_5
09-13-2007, 12:44 AM
Hi,

I tried to use Rss2book to pull down some newspaper feeds. One worked nicely after I figured out a good regex to get just the text, but for the other, whatever I try, I get the following message repeated as many times as the number of feeds, with the time/date matching when I try (the feed is US Eastern +7 hrs): I tried at 11:31 pm US Eastern and got exactly the following; if I try seven minutes later I get the message with 06:38, and so on:

Processing Evenimentul
Thu, 13 Sep 2007 06:31:55 EEST is out of range
Thu, 13 Sep 2007 06:31:55 EEST is out of range
....

Is there anything I can do about it?

The feed link is not in English, but the same was true for the other newspaper that works just fine:

http://www.evz.ro/rss.php/evz.xml
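For what it's worth, "EEST" is not a timezone token that .NET's date parsing recognises, so a pubDate like the one above can fail to parse or come out wrong, which might be why it ends up flagged as out of range. A rough workaround sketch, assuming RFC-822-style dates (an illustration only, not a change that exists in web2book):

using System;
using System.Globalization;

class FeedDateSketch
{
    static void Main()
    {
        string pubDate = "Thu, 13 Sep 2007 06:31:55 EEST";

        // EEST (Eastern European Summer Time) is UTC+3; .NET does not parse the abbreviation,
        // so substitute a numeric offset before parsing.
        string normalized = pubDate.Replace("EEST", "+03:00");

        DateTimeOffset when = DateTimeOffset.ParseExact(
            normalized,
            "ddd, dd MMM yyyy HH:mm:ss zzz",
            CultureInfo.InvariantCulture);

        Console.WriteLine(when.UtcDateTime);  // 2007-09-13 03:31:55 UTC
    }
}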

squeezebag
09-13-2007, 04:07 PM
Okay. I must be losing my mind.

I've been able to extract the The New Yorker with the following setup:

URL: http://feeds.newyorker.com/services/...everything.xml
Link Element: Link
Apply extractor to linked content is checked
Link Reformatter: {0}?printable=true
Content Extraction pattern: <!-- start article rail -->(.*) <!-- end article body -->

Then I changed computers, installed the latest .net updates, downloaded Web2Book, and duplicated the settings and it's not working. I only get the article headings - it doesn't seem to be following the link.

Any ideas? What's changed?

thanks,
Andy

rkellmer
09-19-2007, 12:01 AM
I just bought a Sony Reader last week. It is great.

Here is my problem: I have about 3,000 webpages that are on my local computer. Each one is a conversion of a single book. I have tried to convert them to PDF by opening them in Internet Explorer, and using the local address as the URL in Web2book. Web2book gives me the following message:
--------------------------------------------------------------------------
System.UriFormatException: Invalid URI: A port was expected because of there is a colon (':') present but the port could not be parsed.
at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
at System.Uri..ctor(String uriString)
at web2book.Utils.GetUrlResponse(String url, String& error, String postData, ICredentials creds, String contentType)
at web2book.Utils.GetWebResponse(String url, String& error, String postData, ICredentials creds, String contentType)
at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.Utils.GetHtml(String url, Int32 numberOfDays, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.WebPage.GetHtml(ISource mySourceGroup, Int32 displayWidth, Int32 displayHeight, Int32 displayDepth, StringBuilder log)
at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source, Boolean isAutoUpdate)
--------------------------------------------------------------------------
I can get around this by posting each webpage on my Geocities site, but that is a lot of extra work. Any idea how I can convert the local html file without doing all that?

Thanks!! :D

dietric
09-23-2007, 03:39 PM
I'm trying to create a Web2Book feed for
http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml

I would like to rewrite the links to link to the printable version, but the pattern to replace the link is somewhat complex:
The link in the feed looks like this:
http://www.spiegel.de/politik/ausland/0,1518,506744,00.html
The printable version like this:
http://www.spiegel.de/politik/ausland/0,1518,druck-506744,00.html

From what I can see by examining other links the constants are:
- http://www.spiegel.de/ (obviously)
- one or more folder names
- the actual file name consists of three numbers separated by comma
- in the printable version, the string "druck-" is added before the third number
- the extension is .html

I'm not so good with RegEx, help would be appreciated.

adinb
09-24-2007, 06:00 AM
I'm trying to create a Web2Book feed for
http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml

I would like to rewrite the links to link to the printable version, but the pattern to replace the link is somewhat complex:
The link in the feed looks like this:
http://www.spiegel.de/politik/ausland/0,1518,506744,00.html
The printable version like this:
http://www.spiegel.de/politik/ausland/0,1518,druck-506744,00.html

From what I can see by examining other links the constants are:
- http://www.spiegel.de/ (obviously)
- one or more folder names
- the actual file name consists of three numbers separated by comma
- in the printable version, the string "druck-" is added before the third number
- the extension is .html

I'm not so good with RegEx, help would be appreciated.

how about (http://www.spiegel.de.*/\d,\d{4},)(\d+,\d\d\.html)
then in the link constructor you could use {1}druck-{2}

I'm all ears for a regex that is more efficient.

-adin
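To see the rewrite in action outside web2book, here is a quick C# check of that pattern against one of the links above (plain System.Text.RegularExpressions; the group numbering here is .NET's, where Groups[1] and Groups[2] are the two captures):

using System;
using System.Text.RegularExpressions;

class SpiegelLinkSketch
{
    static void Main()
    {
        string link = "http://www.spiegel.de/politik/ausland/0,1518,506744,00.html";

        Regex extractor = new Regex(@"(http://www.spiegel.de.*/\d,\d{4},)(\d+,\d\d\.html)");
        Match m = extractor.Match(link);

        // Insert "druck-" between the two captured pieces to get the printable URL.
        string printable = m.Groups[1].Value + "druck-" + m.Groups[2].Value;
        Console.WriteLine(printable);
        // http://www.spiegel.de/politik/ausland/0,1518,druck-506744,00.html
    }
}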

dietric
09-24-2007, 09:10 PM
how about (http://www.spiegel.de.*/\d,\d{4},)(\d+,\d\d\.html)
then in the link constructor you could use {1}druck-{2}

I'm all ears for a regex that is more efficient.

-adin
That worked out great, thanks. I have tested and published the feed.

toomanybarts
09-26-2007, 04:11 PM
adinb is The Man! He is definitely THE regex expert on here.

geekraver
09-27-2007, 02:20 PM
I just bought a Sony Reader last week. It is great.

Here is my problem: I have about 3,000 webpages that are on my local computer. Each one is a conversion of a single book. I have tried to convert them to PDF by opening them in Internet Explorer, and using the local address as the URL in Web2book. Web2book gives me the following message:
--------------------------------------------------------------------------
System.UriFormatException: Invalid URI: A port was expected because of there is a colon (':') present but the port could not be parsed.
at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
at System.Uri..ctor(String uriString)
at web2book.Utils.GetUrlResponse(String url, String& error, String postData, ICredentials creds, String contentType)
at web2book.Utils.GetWebResponse(String url, String& error, String postData, ICredentials creds, String contentType)
at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.Utils.GetHtml(String url, Int32 numberOfDays, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
at web2book.WebPage.GetHtml(ISource mySourceGroup, Int32 displayWidth, Int32 displayHeight, Int32 displayDepth, StringBuilder log)
at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source, Boolean isAutoUpdate)
--------------------------------------------------------------------------
I can get around this by posting each webpage on my Geocities site, but that is a lot of extra work. Any idea how I can convert the local html file without doing all that?

Thanks!! :D

Can you give an example of the type of URL you are using?

geekraver
09-27-2007, 02:25 PM
As I'm sure folks have noticed, I haven't been very active on these forums. This is a reflection of the fact that I barely use my reader anymore (I want higher resolution and color; I'm mostly interested in comics and tech books and the reader is not great for either, notwithstanding some of the excellent software some people have produced). I thought I'd use the reader for RSS but it's hard to beat the wonderful pRSSReader on my Pocket PC with a data plan.

My plan at this point is to open-source web2book; I need a bit of time to get that done but I hope that that will allow others to build on the foundation. I'll post an update once the source is available (most likely on Codeplex).

rkellmer
09-30-2007, 02:16 AM
Can you give an example of the type of URL you are using?
Yes, here is one of the URL's:

C:\books\Angels\Angels\1.html

Thanks!!

geekraver
10-02-2007, 10:10 PM
Yes, here is one of the URL's:

C:\books\Angels\Angels\1.html

Thanks!!


Try file://c:/books/Angels/Angels/1.html
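For reference, the .NET Uri class produces that form from a plain Windows path, which could be handy if you ever script the conversion of all 3,000 files. A tiny standalone sketch:

using System;

class LocalFileSketch
{
    static void Main()
    {
        // An absolute Windows path becomes a file:// URI automatically.
        Uri fileUri = new Uri(@"C:\books\Angels\Angels\1.html");
        Console.WriteLine(fileUri.AbsoluteUri);  // file:///C:/books/Angels/Angels/1.html
    }
}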

MissLissa
10-11-2007, 12:18 AM
Geekraver's rss2book app is really great! I can't recommend it enough. It took me about 15 minutes to get my PC set up to use it.

First I downloaded and installed .NET framework 2.0 here...

http://msdn2.microsoft.com/en-us/netframework/aa731542.aspx




I so want to try this - but I don't know anything about .NET! When I click the link I see all sorts of things I could download. How do I know which one of the Framework files to download?

I am running Windows XP Home Edition on a Gateway laptop. Can you tell me which .NET Framework I should install?

Thanks for helping out a newbie :pray:

geekraver
10-12-2007, 02:40 AM
I so want to try this - but I don't know anything about .NET! When I click the link I see all sorts of things I could download. How do I know which one of the Framework files to download?

I am running Windows XP Home Edition on a Gateway laptop. Can you tell me which .NET Framework I should install?

Thanks for helping out a newbie :pray:

The easiest way to add .Net 2.0 if you are a n00b is to do it through Windows Update; it should show up as an optional update.

geekraver
03-03-2008, 02:42 AM
Once again sorry for the delay; the project is finally up: http://www.codeplex.com/web2book

moz
03-03-2008, 04:07 AM
The page is good, but there's only source code to download. Can you add a link to where I can download the executable version? Presumably http://www.download.com/3000-20-10649163.html ? I'm about to have a play now :)

Hmm, www.publicaddress.net doesn't work as RSS or as a web site: System.UriFormatException as a web page, just headlines as RSS. http://blog.greens.org.nz/index.php/feed/ just doesn't work - no output at all. mozbike.blogspot.com causes it to hang, and http://mozbike.blogspot.com/feeds/posts/default just produces no output. The log window gets a bit annoying after a while - do you absolutely have to show it every time? I think you might need to write a wizard to set up feeds, or make the inputs more tolerant. But at least Process Explorer can kill it when it hangs, so that bit works.

Cool, http://smh.com.au/text works. Is there any way to tell it "ignore links that don't start with http://smh.com.au/text"? Also, converting this to RTF takes a very long time (I think; after a minute or two I killed it). It looks as though you just cat the HTML of all the links together - perhaps it would be better if you at least removed all the extra HTML and BODY tags? I suspect that stripping the non-text content would help, as the HTML page currently produced has all sorts of images and formatting as well as embedded scripts and styles. Using OpenOffice Writer to import the HTML is slow to the point where I killed that too. Using a text editor to remove the start and end blocks of "stuff" plus all the img and href tags makes it possible to load the HTML. For now I need to use MS Office to convert HTML to RTF because I don't (yet) have a better solution.

Clicking Help-Report Bug takes me to a page that doesn't work.

I will download the source and have a play sometime, I hope.
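On the "strip the non-text content" suggestion above, here is a rough sketch of what that pre-cleaning could look like with regular expressions; it is crude (regexes and HTML only get along so far) and is not something web2book does today:

using System;
using System.Text.RegularExpressions;

class StripSketch
{
    static void Main()
    {
        string html = "<html><head><script>var x = 1;</script><style>p { }</style></head>" +
                      "<body><p>Keep this <img src=\"a.png\"/> text.</p></body></html>";

        // Drop script and style blocks entirely, then drop img tags; crude, but it can shrink
        // the page enough for a word processor to import it.
        string cleaned = Regex.Replace(html, @"<(script|style)[^>]*>.*?</\1>", "",
                                       RegexOptions.Singleline | RegexOptions.IgnoreCase);
        cleaned = Regex.Replace(cleaned, @"<img[^>]*>", "", RegexOptions.IgnoreCase);

        Console.WriteLine(cleaned);
        // <html><head></head><body><p>Keep this  text.</p></body></html>
    }
}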

geekraver
03-03-2008, 12:35 PM
You can get the executable version from http://www.download.com/Web2book/3000-2017_4-10649163.html?tag=lst-1

I haven't used the app or my Sony Reader in months - pRSSReader on my Tilt is just much more convenient - so I'm not actively fixing or enhancing it. Nonetheless it represents a fair amount of development effort, so that's why I'm putting it on Codeplex - so others can contribute if they want.

I'll look into the bug report issue; publish/subscribe and bug reporting should still work. If I can move bug reporting over to Codeplex that would be better.

Rick C
03-18-2008, 07:07 PM
I just tried downloading and installing web2book from the download.com site - now when I try to run the Sony e-reader software the left side is all squashed up and the program is pretty much broken.
I uninstalled web2book and the e-reader software, then re-installed the e-reader software, all to no avail. I am running WinXP SP2 with .NET Framework 3 SP1.
Any suggestions for fixing this?

Edit - Found the 'rough' fix in another post - thanks, CurtW!
http://www.mobileread.com/forums/showthread.php?t=16796

Originally Posted by Sony


To help correct your issue with the eBooks software please follow the
below steps:

Go to Start>Control Panel and open Add/Remove programs. Find and
uninstall/remove "eBook Library by Sony".

Open My Computer.
Go to Tools>Folder Options.
Click View and select “Show hidden files and folders” and click ok

Please delete the following folders as listed below:

For XP
C:\Documents and Settings\All Users\Application Data\kinoma
C:\Documents and Settings\All Users\Application Data\Marlin
C:\Documents and Settings\%username%( i.e. you windows log in name)
\Local Settings\Application Data\kinoma
C:\Documents and Settings\your user name\Local Settings\Application
Data\Sony Corporation
C:\Program Files\Sony\eBook Library

For Vista

C:\ProgramData\kinoma
C:\ProgramData\Marlin
C:\Program Files\Sony\eBook Library
C:\Users\%username%\AppData\Local\Sony Corporation
C:\Users\%usernam%\AppData\Local\kinoma

After doing so, reinstall your Connect Reader software from:
http://ebooks.connect.com/downloadclient.html

You might also want to experiment with connecting the reader in
different USB ports.

2nd edit - I backed up the above files to a safe folder (I have lost one device permission for CONNECT) and reinstalled web2book. Not sure what happened, but it all seems okay now. Perhaps it was all the various editors/installers/hacks over the last two weeks, and this was just the straw that broke the camel's back?

slex
11-30-2008, 01:43 PM
Hi, geekraver!

It's a great program you've made! I know you don't develop it anymore and it's at Codeplex, but I have a question regarding multilanguage support and I would appreciate it if you found the time to respond (if there is a quick fix, so it's not a burden for you).

I tried to use the program for Cyrillic websites but it didn't work. I also read a lot of German websites. And here comes the embarrassment.

Special characters for German (umlauts) work just fine in the title but not below that. Do you have any idea why that might be? In principle the program supports such characters if they appear in one place, so why not in the other?

I attach a screenshot to see what I mean.

Hansgeorg
01-21-2009, 05:19 PM
@slex: I have the same problem! I read that the main problem is the old version of htmldoc (a converter, I think) which is used by web2book. Htmldoc won't be updated anymore and doesn't support UTF-8, which is used by many newer websites. web2book supports iso-8859-15, which includes German umlauts, but I don't know if there is a way to transform UTF-8 into iso.

If anybody can help, that would be very welcome!!
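Not a complete answer, but a conversion is possible in principle: the page's UTF-8 bytes can be transcoded to iso-8859-15 before they reach htmldoc, with anything the target charset cannot represent degraded to '?'. A minimal C# sketch of the idea (an illustration only, not a patch that exists in web2book):

using System;
using System.Text;

class TranscodeSketch
{
    static void Main()
    {
        // Pretend these bytes came from a UTF-8 web page.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("Grüße aus München und Привет");

        Encoding latin9 = Encoding.GetEncoding("iso-8859-15");
        // Characters with no iso-8859-15 mapping (the Cyrillic ones here) fall back to '?'.
        byte[] latin9Bytes = Encoding.Convert(Encoding.UTF8, latin9, utf8Bytes);

        Console.WriteLine(latin9.GetString(latin9Bytes));  // Grüße aus München und ??????
    }
}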

slex
01-26-2009, 04:03 PM
@slex: I have the same problem! I read that the main problem is the old version of htmldoc (a converter, I think) which is used by web2book. Htmldoc won't be updated anymore and doesn't support UTF-8, which is used by many newer websites. web2book supports iso-8859-15, which includes German umlauts, but I don't know if there is a way to transform UTF-8 into iso.

If anybody can help, that would be very welcome!!

Actually, they might implement it if you believe this post here:

http://www.htmldoc.org/str.php?L162

Hansgeorg
01-28-2009, 07:28 AM
@slex: thanks for that hint; it would be very useful if they implemented the UTF-8 charset!!

grimborg
11-02-2009, 09:35 AM
In GNU/Linux with Mono it just appears to wait for a while and then I get the following error:


~/.wine/drive_c/Program Files/GeekRaver/Web2Book % mono Web2Book.exe

Unhandled Exception: System.IndexOutOfRangeException: Array index is out of range.
at web2book.MainForm..ctor () [0x00000]
at (wrapper remoting-invoke-with-check) web2book.MainForm:.ctor ()
at web2book.Program.Main () [0x00000]