Originally Posted by kovidgoyal
Thanks for the tip. However, with the given regular expression, I got the following error message.
D:\My_Documents\Download - Files\00>web2lrf -u "https://www.mobileread.com/forums/printthread.php?t=19142&pp=40" default -r 1 -t "Reading" -a "Mobileread" --link-levels=1 --ignore-tables --match-regexp printthread\S+page=\d+
Traceback (most recent call last):
File "convert_from.py", line 194, in <module>
File "convert_from.py", line 188, in main
File "convert_from.py", line 165, in process_profile
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: ''
I am not familiar with HTML, but the problem seems to occur because one of the links on the original page is identical to the URL of the page itself.
Using the example above, let me denote the original URL as A and the links found on that page as B through F.
A. original URL: https://www.mobileread.com/forums/pri...?t=19142&pp=40
B. href="printthread.php?t=19142&pp=40&page=2 "
C. href="printthread.php?t=19142&pp=40&page=3 "
D. href="printthread.php?t=19142&pp=40&page=4 "
E. href="printthread.php?t=19142&pp=40&page=5 "
The problem is that F, the remaining link, is identical to A. The regular expression seems to exclude both A and F, which leads to the error message.
For now, I decided to use the following command.
web2lrf -u "https://www.mobileread.com/forums/printthread.php?t=19142&pp=40" default -r 1 -t "Reading" -a "Mobileread" --link-levels=1 --ignore-tables --match-regexp="printthread"
It gives me A-B-C-D-E-F(=A) rather than A-B-C-D-E, but I can read up to E and stop there. Since the file is electronic, no paper is wasted anyway.
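To see why the two patterns behave differently, here is a minimal Python sketch that only checks which of the URLs above each regular expression matches. It is not web2lrf's code: I am assuming the pattern is applied with a search anywhere in the URL, and the href for F is my guess based on "F is identical to A", since I did not list it above. The `links` dictionary and the `matching` helper are names made up for this illustration.

```python
import re

# A is the start URL from the command line above; B-E are the hrefs listed
# in the post. F is NOT shown in the post, so its form here is an assumption.
links = {
    "A": "https://www.mobileread.com/forums/printthread.php?t=19142&pp=40",
    "B": "printthread.php?t=19142&pp=40&page=2",
    "C": "printthread.php?t=19142&pp=40&page=3",
    "D": "printthread.php?t=19142&pp=40&page=4",
    "E": "printthread.php?t=19142&pp=40&page=5",
    "F": "printthread.php?t=19142&pp=40",  # assumed: same target as A
}


def matching(pattern, candidates=links):
    """Return the labels whose URL contains a match for `pattern`.

    Uses re.search (match anywhere in the string); whether web2lrf does
    the same internally is an assumption.
    """
    rx = re.compile(pattern)
    return [label for label, url in candidates.items() if rx.search(url)]


# Pattern from the failing command: requires "page=<digits>" in the URL.
strict = matching(r"printthread\S+page=\d+")
# Pattern from the working command: any URL containing "printthread".
loose = matching(r"printthread")

print("strict:", strict)
print("loose: ", loose)
```

Under these assumptions the stricter pattern matches only B through E, because neither A nor F contains "page=", while the plain "printthread" pattern matches everything including F. That agrees with what I saw, although it does not by itself explain the WindowsError.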
Again, thanks for your help and for providing such a wonderful program to users.