|
|
#46 |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
|
|
|
|
|
#47 | |
|
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Quote:
Dale |
|
|
|
|
|
|
|
|
#48 |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
pydoc str |
|
|
|
|
|
#49 |
|
Member
Posts: 10
Karma: 10
Join Date: Jun 2007
Location: Slovakia
Device: HTC Touch Diamond, Sony Reader 505
|
If I understand it correctly, rpartition divides a string into a 3-member array. This doesn't really help me that much, as I don't "speak" Python and it's different from the languages that I know. So... if I could ask some Python-knowledgeable person to give me the exact command for the string conversion... I assume it would cost you about 5 secs of your life.
Thank you in advance... in return I offer (rusty) Pascal & VBScript support.
I need http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html to become http://www.sme.sk/clanok_tlac.asp?cl=3592953
replace('/c/', '/clanok_tlac.asp?cl=') is step one... but after that I'm stuck. |
|
|
|
|
|
#50 |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah well here you go
Code:
url = 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'.rpartition('/')[0].replace('c/', 'clanok_tlac.asp?cl=')
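For anyone who doesn't read Python, here is a minimal sketch of what that one-liner does, broken into steps. The function name to_print_url is made up for illustration and is not part of libprs500. The sketch replaces '/c/' (with both slashes) rather than 'c/', which is an assumption on my part to avoid accidentally matching other text in the URL.

```python
def to_print_url(article_url):
    """Turn an sme.sk article URL into its print-friendly form (illustrative helper)."""
    # rpartition('/') splits at the LAST '/' and returns three pieces:
    # (everything before it, the '/' itself, everything after it)
    head, sep, tail = article_url.rpartition('/')
    # head is now 'http://www.sme.sk/c/3592953'; tail (the article slug) is dropped.
    # Swap the '/c/' path segment for the print page and its 'cl' query parameter.
    return head.replace('/c/', '/clanok_tlac.asp?cl=', 1)


url = 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'
print(url.rpartition('/'))
print(to_print_url(url))
```

Indexing the result of rpartition with [0], as in the one-liner, picks out the same "head" piece that the sketch unpacks by name.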
|
|
|
|
|
|
|
|
#51 |
|
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Text links being dropped
Kovid,
I noticed that web2lrf ignores/deletes words entirely that have underlying links. This makes some articles a little hard to understand since key words are sometimes left out. As an example, in the following article the names "David Beckham," "Adidas," and "Pepsi" are all deleted/ignored when it is converted to an lrf.
http://www.nytimes.com/2007/11/17/bu...gewanted=print
I noticed the same thing happens when downloading the html file and running it through html2lrf. I've attached the lrf I generated as an example. Is there something about linked text that makes it difficult to parse? Or is this simply a bug that needs to be eliminated?
Thanks a lot for your help.
BTW, still trying to get some profiles made. Not knowing Python is proving to be a rather large stumbling block, however.
Last edited by JTravers; 11-21-2007 at 05:55 AM. Reason: added lrf attachment |
|
|
|
|
|
#52 |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's a bug, actually a regression I introduced a few versions back. It will be fixed in the next release.
|
|
|
|
|
|
#53 | |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Quote:
http://docs.python.org/tut/tut.html |
|
|
|
|
|
|
#54 | |
|
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Quote:
I'm really looking forward to getting some more interesting web content onto my 505. BTW, does web2lrf only accept RSS feeds as input, or can one give it a regular webpage to process? |
|
|
|
|
|
|
#55 |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
|
|
|
|
|
#56 |
|
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
Can you stop the processing after the html has been cleaned up but before the html file tree is removed? (Or how do you get web2html?)
|
|
|
|
|
|
#57 |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
web2disk
|
|
|
|
|
|
#58 |
|
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
Does web2disk really do the cleanup of the HTML code? If I only want the files, I suppose wget will work too. Or does web2disk do something that wget does not?
|
|
|
|
|
|
#59 |
|
creator of calibre
Posts: 45,626
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's optimized for downloading websites for conversion to ebooks. It has link filters, recursion level control, and a bunch of other features.
Code:
web2disk --help |
|
|
|
|
|
#60 |
|
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
But if you run web2lrf it seems like the cleanup is done just before the conversion to another format. With --debug it says:
[INFO] convert_from.py:330: Processing 7108374.stm
[INFO] convert_from.py:283: Parsing HTML...
[INFO] convert_from.py:318: Written preprocessed HTML to /tmp/html2lrf-verbose.html
[INFO] convert_from.py:333: Converting to BBeB...
But since "web2disk bbc" is not implemented, I have not been able to get the result after the preprocessing, so I have not been able to check how it looks. |
|
|
|
| Tags |
| libprs500, web2lrf |
|