11-19-2007, 03:22 PM | #46 |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
11-19-2007, 06:53 PM | #47 | |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Quote:
Dale |
|
11-19-2007, 07:00 PM | #48 |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
pydoc str |
11-20-2007, 10:48 AM | #49 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2007
Location: Slovakia
Device: HTC Touch Diamond, Sony Reader 505
|
If I understand it correctly, rpartition divides a string into a 3-member array. This doesn't really help me that much, as I don't "speak" Python and it's different from the languages that I know. So... if I could ask some Python-knowledgeable person to give me the exact command for the string conversion... I assume it would cost you about 5 secs of your life.
Thank you in advance... in return I offer (rusty) Pascal & VBScript support.
I need
http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html
to become
http://www.sme.sk/clanok_tlac.asp?cl=3592953
replace('/c/', '/clanok_tlac.asp?cl=') is step one... but after that I'm stuck. |
11-20-2007, 12:17 PM | #50 |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah well here you go
Code:
url = 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'.rpartition('/')[0].replace('c/', 'clanok_tlac.asp?cl=') |
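For readers who, like the poster above, do not "speak" Python, the one-liner can be unpacked into separate steps. This is only a sketch in the same spirit as the reply: the function name print_url is made up for illustration, and matching on '/c/' with a replacement count of 1 (instead of 'c/') is an assumption added here so that a stray 'c/' elsewhere in a URL is not rewritten.

```python
def print_url(article_url):
    """Turn an sme.sk article URL into its printer-friendly form.

    print_url is a hypothetical helper name, not part of libprs500.
    """
    # rpartition('/') splits at the LAST slash and returns a 3-tuple:
    # (everything before it, the separator itself, everything after it)
    head, sep, tail = article_url.rpartition('/')
    # For the example URL, head is 'http://www.sme.sk/c/3592953';
    # tail is the title part, which the print URL does not need.
    # Replace only the first '/c/' (assumption: it marks the article path).
    return head.replace('/c/', '/clanok_tlac.asp?cl=', 1)


print(print_url('http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'))
# prints http://www.sme.sk/clanok_tlac.asp?cl=3592953
```

The [0] in the original one-liner does the same job as head here: it picks the first member of the tuple that rpartition returns.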
11-21-2007, 04:45 AM | #51 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Text links being dropped
Kovid,
I noticed that web2lrf ignores/deletes words entirely that have underlying links. This makes some articles a little hard to understand since key words are sometimes left out. As an example, in the following article the names "David Beckham," "Adidas," and "Pepsi" are all deleted/ignored when it is converted to an lrf.
http://www.nytimes.com/2007/11/17/bu...gewanted=print
I noticed the same thing happens when downloading the html file and running it through html2lrf. I've attached the lrf I generated as an example. Is there something about linked text that makes it difficult to parse? Or is this simply a bug that needs to be eliminated?
Thanks a lot for your help. BTW, still trying to get some profiles made. Not knowing Python is proving to be a rather large stumbling block, however.
Last edited by JTravers; 11-21-2007 at 04:55 AM. Reason: added lrf attachment |
11-21-2007, 10:31 AM | #52 |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's a bug, actually a regression I introduced a few versions back. It will be fixed in the next release.
|
11-21-2007, 04:41 PM | #53 | |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Quote:
http://docs.python.org/tut/tut.html |
|
11-21-2007, 07:01 PM | #54 | |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Quote:
I'm really looking forward to getting some more interesting web content onto my 505. BTW, does web2lrf only accept RSS feeds as input, or can one give it a regular webpage to process? |
|
11-22-2007, 12:58 PM | #55 |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
11-22-2007, 03:47 PM | #56 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
Can you stop the processing after the html has been cleaned up but before the html file tree is removed? (Or how do you get web2html?)
|
11-22-2007, 05:27 PM | #57 |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
web2disk
|
11-22-2007, 06:32 PM | #58 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
Does web2disk really do the cleanup of the html code? If I only want the files, I suppose wget will work also. Or does web2disk do something that wget does not do?
|
11-22-2007, 07:43 PM | #59 |
creator of calibre
Posts: 44,482
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's optimized for downloading websites for conversion to ebooks. It has link filters, recursion level control, and a bunch of other features.
Code:
web2disk --help |
11-22-2007, 08:19 PM | #60 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköping, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
But if you run web2lrf it seems like the cleanup is done just before the conversion to another format. With --debug it says:
[INFO] convert_from.py:330: Processing 7108374.stm
[INFO] convert_from.py:283: Parsing HTML...
[INFO] convert_from.py:318: Written preprocessed HTML to /tmp/html2lrf-verbose.html
[INFO] convert_from.py:333: Converting to BBeB...
But since "web2disk bbc" is not implemented I have not been able to get the result after the preprocessing so I have not been able to check how it looks. |