Old 03-13-2008, 10:24 PM   #5
soilwork
useR!
Quote:
Originally Posted by kovidgoyal
--match-regexp printthread\S+page=\d+
Hi, Kovid,

Thanks for the tip. However, with the given regular expression, I got the following error message.

==============
D:\My_Documents\Download - Files\00>web2lrf -u "https://www.mobileread.com/forums/printthread.php?t=19142&pp=40" default -r 1 -t "Reading" -a "Mobileread" --link-levels=1 --ignore-tables --match-regexp printthread\S+page=\d+
Downloading
.
https://www.mobileread.com/forums/pri...?t=19142&pp=40 saved to
Traceback (most recent call last):
File "convert_from.py", line 194, in <module>
File "convert_from.py", line 188, in main
File "convert_from.py", line 165, in process_profile
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: ''
===================

I am not familiar with HTML, but the problem seems to occur because one of the links on the original page is identical to the original URL itself.
Using the example above, let me call the original URL A and the links found on that page B through F.

A. original URL: https://www.mobileread.com/forums/pri...?t=19142&pp=40
B. href="printthread.php?t=19142&amp;pp=40&amp;page=2 "
C. href="printthread.php?t=19142&amp;pp=40&amp;page=3 "
D. href="printthread.php?t=19142&amp;pp=40&amp;page=4 "
E. href="printthread.php?t=19142&amp;pp=40&amp;page=5 "
F. href="printthread.php?t=19142&amp;pp=40"

The problem is that F is identical to A. The regular expression seems to exclude both A and F, since neither contains "page=", and that appears to lead to the error message.

For now, I decided to use the following command.
Code:
web2lrf -u "https://www.mobileread.com/forums/printthread.php?t=19142&pp=40" default -r 1 -t "Reading" -a "Mobileread" --link-levels=1 --ignore-tables --match-regexp="printthread"
It gives me A-B-C-D-E-F(=A) rather than A-B-C-D-E, but I can read up to E and stop there. Since the file is an electronic one, there is no wasted paper anyway.
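
To check which links each of the two patterns selects, here is a small Python sketch. It assumes the pattern is applied as an unanchored search against the full link URL, which may not be how web2lrf actually applies --match-regexp; the names links and matching_labels are made up for illustration. It only shows which links each pattern picks, not why the WindowsError occurs.

```python
import re

# Base of the forum URLs, taken from the command line above.
BASE = "https://www.mobileread.com/forums/"

# Links B-F as listed in the post, with &amp; decoded to & and the
# trailing space inside the quotes stripped.
links = {
    "B": "printthread.php?t=19142&pp=40&page=2",
    "C": "printthread.php?t=19142&pp=40&page=3",
    "D": "printthread.php?t=19142&pp=40&page=4",
    "E": "printthread.php?t=19142&pp=40&page=5",
    "F": "printthread.php?t=19142&pp=40",  # same as the original URL A
}


def matching_labels(pattern, links):
    """Return the labels of links whose full URL contains a match.

    Assumption: an unanchored search (re.search) on the absolute URL.
    """
    rx = re.compile(pattern)
    return [label for label, href in links.items() if rx.search(BASE + href)]


# Pattern suggested in the quoted post: requires "page=<digits>".
strict = matching_labels(r"printthread\S+page=\d+", links)

# Looser pattern used in the workaround command.
loose = matching_labels(r"printthread", links)

print(strict)  # ['B', 'C', 'D', 'E']
print(loose)   # ['B', 'C', 'D', 'E', 'F']
```

Under that assumption, the stricter pattern skips F (and so A) because it has no "page=" part, while the looser one lets F through, which matches the A-B-C-D-E-F output I get.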

Again, thanks for your help and for providing such a wonderful program to users.