02-08-2013, 04:07 PM | #1 |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Two XMLs cannot be parsed by Calibre
[Sorry for replacing the previous thread with this one.]
My problem is that some XML files cannot be handled by calibre's built-in parser. Please see the two problematic examples in the attachment. For both, only the first item was exported. I would appreciate your advice on what I can do. Thanks! |
02-09-2013, 01:21 AM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.
|
02-09-2013, 02:39 AM | #3 |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
|
02-09-2013, 02:40 AM | #4 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
02-09-2013, 02:41 AM | #5 |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
|
02-09-2013, 02:55 AM | #6 | |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Quote:
Attached are two XML files, both generated by the Google Reader API; one can be parsed, but the other cannot. The two files should be basically the same in structure. Could you take a quick look? Many thanks! |
|
02-09-2013, 03:35 AM | #7 |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Here is another XML file containing 3 items, but calibre fetches only the last two.
|
02-09-2013, 03:48 AM | #8 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I don't have time to investigate parsing issues in feedparser, if it is indeed a parsing issue, but make sure you have set the oldest_article setting in your recipe correctly.
|
02-09-2013, 05:06 AM | #9 | |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Quote:
Thanks for the pointer. The problem in #6 was due to oldest_article. After I changed the settings to the values below, all 3 items in the XML from #6 are fetched correctly:
oldest_article = 365
max_articles_per_feed = 400
But for the other XML, still only the first item is fetched. In any case, thanks a lot!
Last edited by surf; 02-09-2013 at 05:10 AM. |
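For anyone following along, here is a rough sketch of the kind of age cutoff that oldest_article implies. It is only an illustration, not calibre's actual code: the function name filter_by_age and the sample dates are made up, and it assumes the setting is measured in days relative to the time of the download.

```python
from datetime import datetime, timedelta


def filter_by_age(items, oldest_article_days, now):
    """Keep (title, published) pairs no older than oldest_article_days.

    Hypothetical helper for illustration; calibre's own filtering may differ.
    """
    cutoff = now - timedelta(days=oldest_article_days)
    return [(title, published) for title, published in items
            if published >= cutoff]


# Made-up publication dates, chosen to mimic "only the last two items fetched".
now = datetime(2013, 2, 9)
items = [
    ('first', datetime(2013, 2, 1)),
    ('second', datetime(2013, 2, 7)),
    ('third', datetime(2013, 2, 8)),
]

recent = filter_by_age(items, 7, now)         # a one-week window drops 'first'
everything = filter_by_age(items, 365, now)   # a one-year window keeps all three
```

With a short window the oldest entry silently disappears, which looks like a parsing failure even though the feed itself parsed fine.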
|
02-09-2013, 05:21 AM | #10 | ||
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Quote:
Quote:
I'm trying to narrow down the issue. For the attached XML (very simple, containing 3 items), calibre fetches only the first item. My recipe and the log are below. Code:
# Import needed when the recipe is run as a standalone file
from calibre.web.feeds.news import BasicNewsRecipe

class GoogleReader(BasicNewsRecipe):
    title = 'z - GR-pipe-WSJ'
    description = ''
    __author__ = 'Surf'
    oldest_article = 365
    max_articles_per_feed = 400
    use_embedded_content = True
    auto_cleanup = False
    feeds = [(u'GR-pipe-WSJ', 'file:///D:/GR-pipe-WSJ.xml')]

Resolved conversion options
calibre version: 0.8.68
{'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 0, 'book_producer': None, 'change_justification': 'original', 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None, 'cover': None, 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True, 'disable_font_rescaling': False, 'dont_download_recipe': False, 'dont_split_on_page_breaks': True, 'duplicate_links_in_toc': False, 'enable_heuristics': False, 'epub_flatten': False, 'extra_css': None, 'extract_to': None, 'filter_css': None, 'fix_indents': True, 'flow_size': 260, 'font_size_mapping': None, 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None, 'input_profile': <calibre.customize.profiles.InputProfile object at 0x03C73810>, 'insert_blank_line': False, 'insert_blank_line_size': 0.5, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True, 'keep_ligatures': False, 'language': None, 'level1_toc': None, 'level2_toc': None, 'level3_toc': None, 'line_height': 0, 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0, 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True, 'max_toc_links': 50, 'minimum_line_height': 120.0, 'no_chapters_in_toc': False, 'no_default_epub_cover': False, 'no_inline_navbars': False, 'no_svg_cover': False, 'output_profile': <calibre.customize.profiles.GenericEink object at 0x03C73A10>, 'page_breaks_before': None, 'prefer_metadata_cover': False, 'preserve_cover_aspect_ratio': False, 'pretty_print': True, 'pubdate': None, 'publisher': None, 'rating': None, 'read_metadata_from_opf': None, 'remove_fake_margins': True, 'remove_first_image': False, 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5, 'renumber_headings': True, 'replace_scene_breaks': '', 'search_replace': None, 'series': None, 'series_index': None, 'smarten_punctuation': False, 'sr1_replace': '', 'sr1_search': '', 'sr2_replace': '', 'sr2_search': '', 'sr3_replace': '', 'sr3_search': '', 'start_reading_at': None, 'tags': None, 'test': False, 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6, 'unsmarten_punctuation': False, 'unwrap_lines': True, 'use_auto_toc': False, 'verbose': 2}
InputFormatPlugin: Recipe Input running
Synthesizing mastheadImage
Downloading
Fetching file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html
WARNING: Encoding detection confidence 99%
Processing images...
Fetching http://g1.cn.nytimes.com/images/2010...icleInline.jpg
Recursion limit reached. Skipping links in file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html
file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html saved to C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\jcjhdy_plumber\feed_0\article_0\xj7edb_feeds2disk.xhtml
Downloaded article: “虚拟中产阶级”的崛起 from http://cn.nytimes.com/tools/r.html?f...iedman%2F&cid=
Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/index.html as HTML
Parsing feed_0/article_0/index.html ...
Forcing feed_0/article_0/index.html into XHTML namespace
Referenced file u'feed_1/index.html' not found
Reading TOC from NCX...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 6 items of level: div_1
Found 2 items of level: div_2
Found 16 items of level: p_2
Found 1 items of level: div_4
Ignoring level p_2
Ignoring level div_4
div_1 left margin stats: Counter({u'': 1})
div_1 right margin stats: Counter({u'': 1})
div_2 left margin stats: Counter()
div_2 right margin stats: Counter()
Cleaning up manifest...
Trimming unused files from manifest...
Creating EPUB Output...
Found non-unique filenames, renaming to support broken EPUB readers like FBReader, Aldiko and Stanza...
{u'feed_0/article_0/index.html': u'feed_0/article_0/index_u2.html', u'feed_0/index.html': u'feed_0/index_u1.html'}
Rescaling image from 590x750 to 566x720 cover.jpg
Rescaling image from 600x60 to 566x56 mastheadImage.jpg
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in feed_0/article_0/index_u2.html...
No large trees found
Looking for large trees in index.html...
No large trees found
Looking for large trees in feed_0/index_u1.html...
No large trees found
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\j7m479_recipe_out.epub |
||
02-09-2013, 07:18 AM | #11 |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Hi, Kovid,
I finally tracked down the problem. In Atom generated by the Google Reader API, the value of each entry's "gr:original-id" is in some cases blank (''), and then calibre parses only the FIRST entry and ignores all the others. I'm trying to find a solution to this. Last edited by surf; 02-09-2013 at 07:28 AM. |
02-10-2013, 06:19 AM | #12 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Simply add the original-id attribute yourself before parsing using the uuid module.
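A minimal sketch of that idea, under several assumptions that are not confirmed in this thread: that gr:original-id is an attribute on each entry's Atom id element, that the gr prefix maps to the namespace URI shown below, and that the feed has been saved locally so it can be rewritten before calibre reads it. The function name fill_blank_original_ids is made up for illustration, and the code is written for Python 3 with only the standard library.

```python
import uuid
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'
# Assumed namespace URI for the "gr" prefix; check it against your own feed.
GR_NS = 'http://www.google.com/schemas/reader/atom/'


def fill_blank_original_ids(xml_text):
    """Return the feed XML with a fresh UUID in every blank gr:original-id.

    Hypothetical helper: assumes the attribute lives on <entry>/<id>.
    """
    # Keep readable prefixes when the tree is serialized again.
    ET.register_namespace('', ATOM_NS)
    ET.register_namespace('gr', GR_NS)

    root = ET.fromstring(xml_text)
    attr = '{%s}original-id' % GR_NS
    for entry in root.iter('{%s}entry' % ATOM_NS):
        id_el = entry.find('{%s}id' % ATOM_NS)
        if id_el is None:
            continue
        # Missing or whitespace-only values both count as blank.
        if not id_el.get(attr, '').strip():
            id_el.set(attr, uuid.uuid4().urn)
    return ET.tostring(root, encoding='unicode')


sample = (
    '<feed xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:gr="http://www.google.com/schemas/reader/atom/">'
    '<entry><id gr:original-id="">a</id></entry>'
    '<entry><id gr:original-id="">b</id></entry>'
    '<entry><id gr:original-id="keep">c</id></entry>'
    '</feed>'
)
fixed = fill_blank_original_ids(sample)
```

Values that are already non-blank are left untouched, so only the entries that would otherwise share the same empty id are changed. The rewritten text could then be saved to a file and used as the feed source in the recipe.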
|
02-10-2013, 09:29 AM | #13 | |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Quote:
It seems that I need to write a new parse_feeds to override the built-in one. I know how to handle the soup in parse_index, but after reading the source code of parse_feeds, I'm confused about how to preprocess the XML to add an attribute. Could you give me some clues? Thanks! Last edited by surf; 02-10-2013 at 09:36 AM. |
|
02-10-2013, 10:53 AM | #14 | |
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
|
Quote:
but it does not work: a blank ePub is still produced when no new articles have been published. Could you have a look? Thanks! Code:
def parse_feeds(self):
    feeds = BasicNewsRecipe.parse_feeds(self)
    # Drop feeds that contain no articles
    remove = [f for f in feeds if len(f) == 0 and self.remove_empty_feeds]
    for f in remove:
        feeds.remove(f)
    # Abort when nothing is left to convert
    if len(feeds) == 0:
        self.abort_recipe_processing('')
    return feeds
|
|
|