#1 |
Enthusiast
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
|
Multipage questions (Sueddeutsche Magazin)
Hello,
Besides the Sueddeutsche Zeitung (newspaper) there is also a magazine. It sometimes has articles that span multiple pages; a print version is available for some articles, but very inconsistently. I looked at the "Adventure Gamers" multipage example and adapted the code. I had the advantage that all subsequent pages are linked on the first page, so I skipped the recursion and implemented it the simpler, iterative way. Here is my current code. Spoiler:
Things look pretty good right now, except that I have one issue with HTML comments in preprocess_html.
Code:
<!-- ad tag -->
becomes
Code:
<!--<!-- ad tag -->-->
Code:
comments = next_article.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
Why are the HTML comments commented out a second time? |
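The iterative approach described in the first post (every follow-up page is linked from page one, so no recursion is needed) could look roughly like the sketch below. It does not reproduce the recipe's actual code: the URLs, the `FAKE_SITE` dictionary, the class names in the markup, and the `merge_pages` helper are all made up for illustration, and plain regexes stand in for the soup handling a real recipe would do after self.index_to_soup(url).

```python
import re

# Hypothetical stand-in for the site: URL -> HTML. A real recipe would
# fetch each page with self.index_to_soup(url) instead.
FAKE_SITE = {
    '/article/1': '<div class="text">Page one.</div>'
                  '<a class="page" href="/article/2">2</a>'
                  '<a class="page" href="/article/3">3</a>',
    '/article/2': '<div class="text">Page two.</div>',
    '/article/3': '<div class="text">Page three.</div>',
}

# Assumed markup: pager links and article body use these class names.
PAGE_LINK = re.compile(r'<a class="page" href="([^"]+)">')
BODY = re.compile(r'<div class="text">(.*?)</div>', re.DOTALL)


def merge_pages(first_url, fetch=FAKE_SITE.__getitem__):
    """Collect the article text of the first page and of every page it links to."""
    first_html = fetch(first_url)
    parts = BODY.findall(first_html)
    # All follow-up pages are linked from page one, so one loop is enough.
    for url in PAGE_LINK.findall(first_html):
        parts.extend(BODY.findall(fetch(url)))
    return '\n'.join(parts)


merged = merge_pages('/article/1')
print(merged)
```

The loop only works because the first page lists every follow-up page; a site that links only to the "next" page would need the recursive (or a while-loop) variant instead.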
#2 |
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
|
Really nice work for "Süddeutsche Magazin"!
Though I cannot give any hints on the question itself, let me suggest the following improvements:
+Karma! |
#3 |
Enthusiast
|
Quote:
I added the conversion options, the publisher, and UTF-8 strings for the title etc. (with umlauts). I also took another look at the comments in preprocess_html. When I logged them there, they were still correct, so apparently they are modified (incorrectly?) after preprocess_html. After removing the banner ad, the only comment left was google_ads. Removing the comments as described in the BeautifulSoup documentation did not work; the comments were not found. I did find and remove them with this code:
Code:
comments = next_article.findAll(text=re.compile('google_ad'))
[comment.extract() for comment in comments]
Spoiler:
The following could still be done
|
|
#4 |
Enthusiast
|
[obsolete]
Last edited by aerodynamik; 04-25-2011 at 12:43 PM. |
#5 |
Enthusiast
|
Süddeutsche Zeitung Magazin
Okay, I think this should work now. There is still room for improvement (images, line breaks, additional blogs).
The issue regarding HTML comments is still unclear to me. In addition, I understand that remove_tags is applied after preprocess_html. Is there a smart way to re-implement remove_tags? Is there a way to process the subsequent pages the same way as any other downloaded page? But for now, have fun with this, and let me know if it works for you as well. Spoiler:
|
#6 |
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The method you are looking for is called postprocess_html |
#7 |
Enthusiast
|
The obvious choice
Can you shed some light on the HTML comments issue that I ran into and worked around, i.e. <!-- xyz --> becomes <!--<!-- xyz -->-->? Thanks in advance. |
#8 |
creator of calibre
|
That can happen in various ways when you are manipulating the HTML. To avoid it, I typically just strip all comments with a regexp in preprocess_regexps.
|
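A minimal sketch of that suggestion, assuming the usual shape of preprocess_regexps (a list of pairs, each a compiled pattern plus a callable that receives the match and returns the replacement string). The `apply_regexps` helper and the sample HTML are hypothetical and exist only so the rule can be tried outside calibre; in a recipe only the `preprocess_regexps` list itself would be needed as a class attribute.

```python
import re

# Same shape as the preprocess_regexps recipe attribute:
# a list of (compiled pattern, replacement callable) pairs.
preprocess_regexps = [
    # Non-greedy match with DOTALL so that multi-line comments are removed too.
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda match: ''),
]


def apply_regexps(raw_html, rules):
    """Hypothetical helper: apply each rule in order to the raw HTML string."""
    for pattern, replacement in rules:
        raw_html = pattern.sub(replacement, raw_html)
    return raw_html


sample = '<p>before</p><!-- ad tag --><p>after</p>'
cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
```

Because the rule runs on the raw downloaded HTML, the comments are gone before any soup manipulation can double them. Note that this pattern targets well-formed comments: applied to an already doubled comment such as <!--<!-- xyz -->-->, it would leave a stray --> behind.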
#9 |
Enthusiast
|
Quote:
I gave postprocess_html a quick try. As expected, it receives the processed page that was downloaded because it was added to feeds. However, the additional pages that I download within this method are not processed with remove_tags. I am not sure I understood your original comment correctly. Did you mean that I should implement "remove all tags listed in remove_tags" in postprocess_html, since the pages I download in preprocess_html would then also be processed in postprocess_html? |
|
#10 |
creator of calibre
|
All pages are processed by postprocess_html and all pages have remove_tags applied to them.
|
#11 |
Enthusiast
|
Quote:
For additional pages that I download in preprocess_html or postprocess_html with self.index_to_soup(url), remove_tags is not applied. (In my case, a certain div is not removed.)
If I log the soup given to preprocess_html, remove_tags has already been applied. (In my case, that certain div is already removed.)
If I download additional pages with self.index_to_soup(url) in preprocess_html and add them to the original first page with "insert", the combined page is then processed by postprocess_html, but remove_tags is not re-applied to it. (In my case, that certain div is then not removed from the combined page.)
I'm not complaining here, this sounds more than logical.
Cheers, - aero |
|
#12 |
creator of calibre
|
Look at the function get_soup in fetch/simple.py. This function is called by process_links to create soup for every link that is followed. And it explicitly applies remove_tags and co.
|