#2521 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
That's OK. I looked at your site and ran your recipe (thank you for using code tags - you may also want to add spoiler tags to reduce the length).
I now understand your problem. The site has bad HTML. BeautifulSoup sees the page you are trying to parse for feeds as one giant NavigableString inside a single tag; as far as BeautifulSoup is concerned, there are no other tags within it. I don't know exactly why, but I suspect it isn't solely because the page uses the " />" form to close div tags immediately and then closes them again with the normal </div>, so each has two closings (another bit of bad HTML). Whatever is going on confuses BeautifulSoup to the extent that it can't find anything except the first surrounding tag.
It should still be possible to extract feeds, but it will take much trickier programming to get links out of the giant string, which is soup.contents[0].string. You will need to treat it as a string and extract from it, instead of trying to find tags within it (although you may be able to use BS to convert it into a tag-based structure with some trickery). It's an interesting problem, and I regret that I don't have time to attack it now. If you solve it, post your solution.
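As a rough illustration of "treat it as a string", here is a minimal sketch that pulls links out of raw HTML text with a regular expression instead of tag searches. The function name extract_links and the sample markup are made up for the example, and the pattern is deliberately loose rather than a real HTML parser. In a recipe, the input would be the giant string from soup.contents[0].string.

```python
import re

# Matches <a ... href="..."> ... </a> and captures the URL and the inner text.
# Deliberately loose: this is not a full HTML parser.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)

def extract_links(raw_html):
    """Return (title, url) pairs found in a raw HTML string."""
    links = []
    for url, inner in LINK_RE.findall(raw_html):
        title = re.sub(r'<[^>]+>', '', inner).strip()  # drop any nested tags
        links.append((title, url))
    return links

# Hypothetical sample with the same kind of doubly closed div.
sample = ('<div id="list" /><a href="/a.html">First</a></div>'
          '<a href=\'/b.html\'><b>Second</b></a>')
print(extract_links(sample))
```

The resulting pairs could then be used to build the article list by hand, since the tag-based feed parsing has nothing to work with.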
#2522 |
Junior Member
Posts: 4
Karma: 10
Join Date: Aug 2010
Device: Kindle
|
The Financial Times recipe that's currently posted doesn't cover the complete print edition. I'm a subscriber and am trying to put something together that will get me the day's print edition (http://www.ft.com/us-edition). Unfortunately, there is no RSS feed for this.
Can anyone help, either by putting a recipe together or by directing me to a template I could use to give it a shot myself? (Note that I am a complete novice at this.) Thanks.
#2523 |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Starson17, I thought I was getting the knack of this, but apparently not. If it is an indent problem, then hit me. I don't think it is, though, and I can't understand why the print is not being appended to the URL. I looked at the New Yorker recipe as an example, and for the most part I use the same print_version().
My code: Spoiler:
I am also using the test command you gave me:
ebook-convert tmz.recipe output_dir --test -vv > myrecipe.txt
I don't really see any errors in the output, and I also don't see the "print_url is: ..." line either. Sorry to bug you. I will get it one of these days.
#2524 | |
Junior Member
Posts: 7
Karma: 10
Join Date: Jun 2010
Device: none
|
Quote:
The only thing I can think of is to introduce some "remove_tags"-type code to simplify the HTML so it can be converted. This could take some time (I'm not that familiar with HTML or Python code). Any suggestions as to what I can and can't remove? Thanks.
|
#2525 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
#2526 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
print_url is: http://www.tmz.com/2010/08/18/tmz-seinfeld-uncle-leo-crank-call-police-burbank-kim-kardashian-facebook/print
print_url is: http://www.tmz.com/2010/08/25/mel-gibson-extortion-case-oksana-grigorieva-domestic-violence-sheriff-district-attorney-investigation/print
Last edited by Starson17; 08-26-2010 at 11:57 AM.
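For reference, output like the lines above can come from a print_version that simply appends print to the article URL. This is only a minimal sketch, not the posted recipe: it is written as a plain function so it can be run on its own (in a recipe it would be a method taking self and url), and the sample URL is a made-up placeholder.

```python
def print_version(url):
    """Build the print-page URL by appending 'print' to the article URL."""
    # Article URLs may or may not end in '/', so normalize before appending.
    return url.rstrip('/') + '/print'

# Hypothetical article URL, for illustration only.
print(print_version('http://www.tmz.com/2010/08/18/example-story/'))
```

Stripping the trailing slash first avoids ending up with a double slash before print.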
|
#2527 | |
Member
Posts: 17
Karma: 10
Join Date: Aug 2010
Device: Kindle DX
|
Quote:
|
|
#2528 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
OK, your problem was so interesting that I couldn't resist looking at it further. The cause is the bad HTML in your source page, http://www.51voa.com/. Each opening tag ends with ' />' instead of ' >', so each tag gets closed twice. Beautiful Soup is confused and sees the entire page as a single [document] element holding a single NavigableString of text, not as multiple tags, so none of the tag-based search or manipulation commands will work. There are no tags for BeautifulSoup to find or work with.
To fix this, you first grab that page (as you have already done in your code):
Code:
soup = self.index_to_soup('http://www.51voa.com/')
Then take the giant string and replace each ' />' with ' >':
Code:
rawc = soup.contents[0].string.replace(' />', ' >')
Then parse the corrected string again:
Code:
soup = BeautifulSoup(rawc, fromEncoding=self.encoding)
For that last step, you need this import:
Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup
That's it! Put the two extra lines above after your first index_to_soup line. Be aware that any legitimate single-element tags, such as <img>, <br>, etc., will get mangled by the simple search and replace above. You may have to special-case any tags that are allowed to have a closing slash inside the opening tag so they don't get mangled.
Edit: I forgot, you also need this line: encoding = 'utf-8' or else the final step will fail.
Last edited by Starson17; 08-26-2010 at 04:08 PM.
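One possible way to special-case those tags is a regular-expression substitution that skips them, instead of a blanket replace. This is a hedged sketch only: the function name unclose_tags and the VOID_TAGS set are illustrative choices, not part of calibre or BeautifulSoup, and the pattern has not been tried against the actual site.

```python
import re

# Tags that may legitimately self-close; these are left untouched.
# (Illustrative list, extend as needed.)
VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}

# An opening tag that ends in "/>": captures the tag name and its attributes.
SELF_CLOSED = re.compile(r'<([A-Za-z][A-Za-z0-9]*)\b([^<>]*?)\s*/>')

def unclose_tags(raw_html):
    """Turn '<div ... />' into '<div ...>' but keep '<br />' and friends."""
    def fix(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_TAGS:
            return match.group(0)  # legitimate self-closing tag, keep as is
        return '<%s%s>' % (name, attrs)
    return SELF_CLOSED.sub(fix, raw_html)

print(unclose_tags('<div id="a" /><img src="x.png" />text</div>'))
# prints: <div id="a"><img src="x.png" />text</div>
```

The result would then go to BeautifulSoup in place of the string produced by the plain replace.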
#2529 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Change the file extension from .epub to .zip and unpack it into a directory. You will find the CSS file there and will be able to open the unpacked book with your browser.
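The same unpacking can be done from a script, since an EPUB is an ordinary zip archive and Python's zipfile module will open it without renaming. A minimal sketch, with unpack_epub and the paths in the comment as made-up names for illustration:

```python
import zipfile

def unpack_epub(epub_path, out_dir):
    """Extract an EPUB (a zip archive) and return the CSS files inside it."""
    with zipfile.ZipFile(epub_path) as book:
        book.extractall(out_dir)
        return [name for name in book.namelist()
                if name.lower().endswith('.css')]

# Example call (hypothetical paths):
# css_files = unpack_epub('mybook.epub', 'mybook_unpacked')
```

The returned names are paths relative to the output directory, so the stylesheet can be opened from there.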
|
#2530 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
keep_only_tags = [dict(name='div', attrs={'id':['ds-headline','viewarticle']})]
Last edited by Starson17; 08-26-2010 at 11:57 AM.
|
#2531 | |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Quote:
The reason I suspected that it was not hitting the print_url is that when I navigate to the page manually, I do not see any of those tags at all (that is, when I look at the original HTML of the /print version). However, I do see the tags if I do not append /print. Any thoughts?
Notice that when you go to http://www.tmz.com/2010/08/24/michae...r-pistol-whip/ you get the div tag of <breadcrumbs> that I wish to remove. But when you go to http://www.tmz.com/2010/08/24/michae...tol-whip/print you do not get the div tag of <breadcrumbs>.
Maybe I've got an indentation problem going on here that just isn't obvious to me, but for sure something is going on.
I'm a dumb_____. For whatever reason, the recipe that I posted to you is the correct version. Geany hadn't saved it, yet I thought it had. I opened it back up after rebooting my computer today and noticed I still had the old version, haha. Live and learn. Sorry for the trouble.
Last edited by TonytheBookworm; 08-26-2010 at 02:19 PM. Reason: added more info
|
#2532 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
#2533 |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Working copy of TMZ
Here is the tmz.
|
#2534 |
Junior Member
Posts: 9
Karma: 18
Join Date: Sep 2008
Device: UTstarcom
|
New England Journal of Medicine Recipe
The New England Journal of Medicine changed its format, and the recipe in the current Calibre version (0.7.15) does not work.
Ebookerr
#2535 |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Starson, I know I asked before, but I'm still not clear on how to do it.
If I have, for instance, the following link (example only):
http://www.testpage.com/doi/abs/10.1...495?ai=rv&af=R
but I need it to be the following:
http://www.testpage.com/doi/full/10....viewType=Print
how would I do this? I figure it would be some form of search and replace like I've seen in Beautiful Soup, but again I'm kind of clueless (any examples of this in action?). The abs needs to be replaced with full, and then the ?view part simply appended, but I'm not sure how this is done.
Maybe something like this? Spoiler:
The above is just a non-working guess.
Last edited by TonytheBookworm; 08-26-2010 at 09:48 PM.
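A rewrite of this kind does not need Beautiful Soup at all, since the link is just a string. Here is a minimal sketch using the standard library's URL helpers. It assumes the original query string (?ai=rv&af=R) should be dropped and replaced with viewType=Print, which is a guess because the target URL in the post is truncated. The function name full_print_url and the DOI in the sample are placeholders, and urllib.parse is the Python 3 module name.

```python
from urllib.parse import urlsplit, urlunsplit

def full_print_url(url):
    """Swap /doi/abs/ for /doi/full/ and replace the query with viewType=Print."""
    parts = urlsplit(url)
    # Replace only the first occurrence, in the path portion of the URL.
    path = parts.path.replace('/doi/abs/', '/doi/full/', 1)
    # Assumption: the old query is discarded, not merged with the new one.
    return urlunsplit((parts.scheme, parts.netloc, path, 'viewType=Print', ''))

# Placeholder DOI, for illustration only.
print(full_print_url('http://www.testpage.com/doi/abs/10.1000/example495?ai=rv&af=R'))
# prints: http://www.testpage.com/doi/full/10.1000/example495?viewType=Print
```

In a recipe, logic like this would typically live in print_version, which receives the article URL and returns the one to download.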
Thread | Thread Starter | Forum | Replies | Last Post |
Custom column read ? | pchrist7 | Calibre | 2 | 10-04-2010 02:52 AM |
Archive for custom screensavers | sleeplessdave | Amazon Kindle | 1 | 07-07-2010 12:33 PM |
How to back up preferences and custom recipes? | greenapple | Calibre | 3 | 03-29-2010 05:08 AM |
Donations for Custom Recipes | ddavtian | Calibre | 5 | 01-23-2010 04:54 PM |
Help understanding custom recipes | andersent | Calibre | 0 | 12-17-2009 02:37 PM |