web2lrf - Page 4

kovidgoyal · 11-19-2007, 03:22 PM

https://libprs500.kovidgoyal.net/wiki/UserProfiles

DaleDe · 11-19-2007, 06:53 PM

Quote:

Originally Posted by kovidgoyal

https://libprs500.kovidgoyal.net/wiki/UserProfiles

By the way, your wiki reference reminded me that I put a short article about libprs500 in the MobileRead wiki. You may want to flesh it out with more data.

Dale

kovidgoyal · 11-19-2007, 07:00 PM

Code:

pydoc str

Look for rpartition
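In case the pydoc text is too terse, here is a minimal sketch of what `str.rpartition` does. The sample string is made up for illustration.

```python
# str.rpartition(sep) splits at the LAST occurrence of sep and returns
# a 3-tuple: (text before sep, sep itself, text after sep).
path = 'a/b/c.html'  # made-up sample value
head, sep, tail = path.rpartition('/')
print(head)  # a/b
print(sep)   # /
print(tail)  # c.html

# If sep does not occur at all, the whole string lands in the last slot.
print('nosep'.rpartition('/'))  # ('', '', 'nosep')
```

So indexing the result with `[0]` gives everything before the last separator.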

Silvayn · 11-20-2007, 10:48 AM

If I understand it correctly, rpartition divides a string into a 3-member array. This doesn't really help me that much, as I don't "speak" Python and it's different from the languages that I know. So... if I could ask some Python-knowledgeable person to give me the exact command for the string conversion... I assume it would cost you about 5 secs of your life.

Thank you in advance... in return I offer (rusty) Pascal & VBScript support.

I need
http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html

to become
http://www.sme.sk/clanok_tlac.asp?cl=3592953

replace('/c/', '/clanok_tlac.asp?cl=') is step one... but after that I'm stuck.

kovidgoyal · 11-20-2007, 12:17 PM

Ah well here you go

Code:

url = 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'.rpartition('/')[0].replace('c/', 'clanok_tlac.asp?cl=')
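The same conversion, spelled out step by step, might look roughly like this. The function name `to_print_url` is hypothetical (it is not part of web2lrf), and the sketch assumes every article URL has the `/c/<id>/<title>.html` shape shown in the question.

```python
def to_print_url(article_url):
    """Map an article URL to its printer-friendly form (hypothetical helper)."""
    # Keep everything before the last '/', which drops the '/Title.html' part.
    base = article_url.rpartition('/')[0]
    # Swap the '/c/' path segment for the print script and its query parameter.
    # Matching '/c/' with both slashes, and replacing only the first hit,
    # avoids touching any other 'c/' that might appear in the URL.
    return base.replace('/c/', '/clanok_tlac.asp?cl=', 1)


print(to_print_url('http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'))
# http://www.sme.sk/clanok_tlac.asp?cl=3592953
```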

JTravers · 11-21-2007, 04:45 AM

Kovid,
I noticed that web2lrf entirely drops words that have underlying links. This makes some articles a little hard to understand, since key words are sometimes left out.

As an example, in the following article the names "David Beckham," "Adidas," and "Pepsi" are all deleted/ignored when it is converted to an lrf.
http://www.nytimes.com/2007/11/17/bu...gewanted=print

I noticed the same thing happens when downloading the html file and running it through html2lrf. I've attached the lrf I generated as an example.

Is there something about linked text that makes it difficult to parse? Or is this simply a bug that needs to be eliminated?

Thanks a lot for your help.

BTW, still trying to get some profiles made. Not knowing Python is proving to be a rather large stumbling block, however.

kovidgoyal · 11-21-2007, 10:31 AM

That's a bug, actually a regression I introduced a few versions back. It will be fixed in the next release.

kovidgoyal · 11-21-2007, 04:41 PM

Quote:

Originally Posted by JTravers

BTW, still trying to get some profiles made. Not knowing Python is proving to be a rather large stumbling block, however.

Here's a link to a python tutorial that may be of some help

http://docs.python.org/tut/tut.html

JTravers · 11-21-2007, 07:01 PM

Quote:

Originally Posted by kovidgoyal

Here's a link to a python tutorial that may be of some help

http://docs.python.org/tut/tut.html

Thanks for the link

I'm really looking forward to getting some more interesting web content onto my 505.

BTW, does web2lrf only accept RSS feeds as input, or can one give it a regular webpage to process?

kovidgoyal · 11-22-2007, 12:58 PM

web2lrf --url http://mypage default

will process a website.

tompe · 11-22-2007, 03:47 PM

Can you stop the processing after the html has been cleaned up but before the html file tree is removed? (Or how do you get web2html?)

kovidgoyal · 11-22-2007, 05:27 PM

web2disk

tompe · 11-22-2007, 06:32 PM

Does web2disk really do the cleanup of the HTML code? If I only want the files, I suppose wget will work too. Or does web2disk do something that wget does not?

kovidgoyal · 11-22-2007, 07:43 PM

It's optimized for downloading websites for conversion to ebooks. It has link filters, recursion level control, and a bunch of other features.

Code:

web2disk --help

Cleanup is done by regexps. I don't remember whether the regexps are passed to web2disk or html2lrf. I think it is web2disk, but there may not be a command-line interface to it.
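To give an idea of what regexp-based cleanup means in general, here is a rough sketch. The rules and the `preprocess` function are invented for illustration; they are not the actual regexps or API used by web2disk or html2lrf.

```python
import re

# Hypothetical cleanup rules: (compiled pattern, replacement) pairs, applied in order.
CLEANUP_RULES = [
    # Remove <script> elements together with their contents.
    (re.compile(r'<script\b.*?</script>', re.IGNORECASE | re.DOTALL), ''),
    # Remove HTML comments.
    (re.compile(r'<!--.*?-->', re.DOTALL), ''),
]


def preprocess(html):
    """Apply every cleanup rule to the HTML text and return the result."""
    for pattern, replacement in CLEANUP_RULES:
        html = pattern.sub(replacement, html)
    return html


sample = ('<p>Hello</p><script type="text/javascript">track();</script>'
          '<!-- ad --><p>World</p>')
print(preprocess(sample))  # <p>Hello</p><p>World</p>
```

Regexps like these are fine for stripping known junk from a specific site, but they do not understand nesting, so anything structural still has to be left to the HTML parser.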

tompe · 11-22-2007, 08:19 PM

But if you run web2lrf it seems like the cleanup is done just before the conversion to another format. With --debug it says:

[INFO] convert_from.py:330: Processing 7108374.stm
[INFO] convert_from.py:283: Parsing HTML...
[INFO] convert_from.py:318: Written preprocessed HTML to /tmp/html2lrf-verbose.html
[INFO] convert_from.py:333: Converting to BBeB...

But since "web2disk bbc" is not implemented, I have not been able to get the result after the preprocessing, so I have not been able to check how it looks.
