Two XMLs cannot be parsed by Calibre

surf · 02-08-2013, 04:07 PM

[Sorry for reusing this thread to replace the previous one.]

My problem is that some XML files cannot be handled by calibre's built-in parser.
Please see the two problematic examples in the attachment.
For both, only the first item was exported.

I hope you can advise me on what to do.

Thanks!

kovidgoyal · 02-09-2013, 01:21 AM

calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.
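
As a first check outside calibre, it can help to confirm that the file is well-formed XML and to see how many entries it really holds. A minimal sketch using only Python's standard library (not feedparser itself, so it says nothing about feedparser-specific quirks); the function name count_atom_entries and the sample feed are made up for illustration:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"  # standard Atom 1.0 namespace


def count_atom_entries(xml_text):
    """Return the number of top-level <entry> elements.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    return len(root.findall("{%s}entry" % ATOM_NS))


# Hypothetical three-entry feed, standing in for the attached files.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>demo</title>
  <entry><title>one</title></entry>
  <entry><title>two</title></entry>
  <entry><title>three</title></entry>
</feed>"""

print(count_atom_entries(sample))  # prints 3
```

If this raises ParseError, the XML is malformed. If the count is higher than what calibre fetched, the entries are probably being dropped somewhere after parsing.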

surf · 02-09-2013, 02:39 AM

Quote:

Originally Posted by kovidgoyal

calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.

It seems to be a difficult job.

Could you give me some clues?

Or do I need to learn something about feedparser?

Thanks!

kovidgoyal · 02-09-2013, 02:40 AM

http://pythonhosted.org/feedparser/index.html

surf · 02-09-2013, 02:41 AM

Quote:

Originally Posted by kovidgoyal

calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.

Hi, Kovid

Could you advise which calibre API I should use for fixing the XML?

Or any example recipe?

Thanks!

surf · 02-09-2013, 02:55 AM

Quote:

Originally Posted by kovidgoyal

http://pythonhosted.org/feedparser/index.html

Hi, Kovid

Attached are two XML files, both generated by the Google Reader API, but
one can be parsed and the other cannot.
The two files should be basically the same in structure.

Could you help with a quick look?

Thanks a ton!

surf · 02-09-2013, 03:35 AM

Quote:

Originally Posted by kovidgoyal

calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.

Here is another XML containing 3 items, but only the last two items were fetched by calibre.

kovidgoyal · 02-09-2013, 03:48 AM

I don't have time to investigate parsing issues in feedparser, if it is indeed a parsing issue, but make sure you have set the oldest_article setting in your recipe correctly.

surf · 02-09-2013, 05:06 AM

Quote:

Originally Posted by kovidgoyal

I don't have time to investigate parsing issues in feedparser, if it is indeed a parsing issue, but make sure you have set the oldest_article setting in your recipe correctly.

Dear Sir,

Thanks for the pointer.
The problem in #6 was due to oldest_article.
After I changed the values as below, all 3 items in the XML from #6 were fetched correctly:
oldest_article = 365
max_articles_per_feed = 400

But for the other XML, still only the first item was fetched.

Anyway, thanks a lot!
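
For anyone hitting the same thing: oldest_article is an age limit in days, so entries older than that are skipped. The following is only a simplified model of that cutoff in plain Python, not calibre's actual code; the function name is_recent_enough and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def is_recent_enough(published, oldest_article_days, now):
    """True if an entry published at `published` is within the age limit."""
    return now - published <= timedelta(days=oldest_article_days)


now = datetime(2013, 2, 9)
published_dates = [
    datetime(2013, 2, 8),   # 1 day old
    datetime(2013, 1, 20),  # 20 days old
    datetime(2012, 12, 1),  # 70 days old
]

# Compare a short limit of 7 days with the 365 used in the recipe.
for limit in (7, 365):
    kept = [d for d in published_dates if is_recent_enough(d, limit, now)]
    print(limit, len(kept))
```

With the 7-day limit only the newest entry survives, while 365 keeps all three.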

surf · 02-09-2013, 05:21 AM

Quote:

Originally Posted by kovidgoyal

I don't have time to investigate parsing issues in feedparser, if it is indeed a parsing issue, but make sure you have set the oldest_article setting in your recipe correctly.

Hi, Kovid, sorry for troubling you again.

I'm trying to narrow down the issue.

For the attached XML (very simple, containing 3 items), calibre fetches only the first item.

My recipe and the log are below.

Code:

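from calibre.web.feeds.news import BasicNewsRecipe


class GoogleReader(BasicNewsRecipe):

    title = 'z - GR-pipe-WSJ'
    description = ''
    __author__ = 'Surf'

    # Keep articles up to a year old, and up to 400 per feed
    oldest_article = 365
    max_articles_per_feed = 400

    # Use the content embedded in the feed instead of downloading each article
    use_embedded_content = True
    auto_cleanup = False

    # Local Atom file generated by the Google Reader API
    feeds = [(u'GR-pipe-WSJ', 'file:///D:/GR-pipe-WSJ.xml')]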

Log:

Fetch news from z - GR-NYT
Resolved conversion options
calibre version: 0.8.68
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_download_recipe': False,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'enable_heuristics': False,
'epub_flatten': False,
'extra_css': None,
'extract_to': None,
'filter_css': None,
'fix_indents': True,
'flow_size': 260,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x03C73810>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'no_chapters_in_toc': False,
'no_default_epub_cover': False,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.GenericEink object at 0x03C73A10>,
'page_breaks_before': None,
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'tags': None,
'test': False,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: Recipe Input running
Synthesizing mastheadImage
Downloading
Fetching file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html
WARNING: Encoding detection confidence 99%
Processing images...
Fetching http://g1.cn.nytimes.com/images/2010...icleInline.jpg

Recursion limit reached. Skipping links in file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html
file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html saved to C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\jcjhdy_plumber\feed_0\article_0\xj7edb_feeds2disk.xhtml
Downloaded article: “虚拟中产阶级”的崛起 from http://cn.nytimes.com/tools/r.html?f...iedman%2F&cid=
Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/index.html as HTML

Parsing feed_0/article_0/index.html ...
Forcing feed_0/article_0/index.html into XHTML namespace
Referenced file u'feed_1/index.html' not found
Reading TOC from NCX...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 6 items of level: div_1
Found 2 items of level: div_2
Found 16 items of level: p_2
Found 1 items of level: div_4
Ignoring level p_2
Ignoring level div_4
div_1 left margin stats: Counter({u'': 1})
div_1 right margin stats: Counter({u'': 1})
div_2 left margin stats: Counter()
div_2 right margin stats: Counter()
Cleaning up manifest...
Trimming unused files from manifest...
Creating EPUB Output...
Found non-unique filenames, renaming to support broken EPUB readers like FBReader, Aldiko and Stanza...
{u'feed_0/article_0/index.html': u'feed_0/article_0/index_u2.html',
u'feed_0/index.html': u'feed_0/index_u1.html'}
Rescaling image from 590x750 to 566x720 cover.jpg
Rescaling image from 600x60 to 566x56 mastheadImage.jpg
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in feed_0/article_0/index_u2.html...
No large trees found
Looking for large trees in index.html...
No large trees found
Looking for large trees in feed_0/index_u1.html...
No large trees found
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\j7m479_recipe_out.epub

surf · 02-09-2013, 07:18 AM

Hi, Kovid,

I finally tracked down the problem.

For Atom generated by the Google Reader API, in some cases the value of each entry's "gr:original-id" is blank (''), and then calibre parses only the FIRST entry and ignores all the others.

I'm trying to find a solution to this.
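
To check whether a given file is affected, something like the sketch below could list the entries whose gr:original-id is blank. It uses only the standard library. The namespace URI bound to the gr: prefix is an assumption and should be checked against the xmlns:gr declaration in the real file; the function name and the sample feed are made up.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumed namespace for the gr: prefix; verify against xmlns:gr in your feed.
GR_NS = "http://www.google.com/schemas/reader/atom/"
ORIGINAL_ID = "{%s}original-id" % GR_NS


def entries_with_blank_original_id(xml_text):
    """Return the titles of entries that carry an empty gr:original-id."""
    root = ET.fromstring(xml_text)
    blank = []
    for entry in root.iter("{%s}entry" % ATOM_NS):
        # The attribute may sit on a child such as <id>, so scan the whole entry.
        for el in entry.iter():
            if ORIGINAL_ID in el.attrib and not el.attrib[ORIGINAL_ID].strip():
                blank.append(entry.findtext("{%s}title" % ATOM_NS, default=""))
                break
    return blank


# Hypothetical feed: two entries with a blank id, one with a real one.
sample = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <entry><id gr:original-id="">tag:example,2013:1</id><title>first</title></entry>
  <entry><id gr:original-id="">tag:example,2013:2</id><title>second</title></entry>
  <entry><id gr:original-id="http://example.com/3">tag:example,2013:3</id><title>third</title></entry>
</feed>"""

print(entries_with_blank_original_id(sample))  # prints ['first', 'second']
```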

kovidgoyal · 02-10-2013, 06:19 AM

Simply add the original-id attribute yourself before parsing using the uuid module.
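
A rough sketch of that idea, using only the standard library and run on the raw XML before calibre sees it. The gr: namespace URI and the function name fill_blank_original_ids are assumptions for illustration, not part of calibre or feedparser.

```python
import uuid
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumed namespace for the gr: prefix; verify against xmlns:gr in your feed.
GR_NS = "http://www.google.com/schemas/reader/atom/"
ORIGINAL_ID = "{%s}original-id" % GR_NS


def fill_blank_original_ids(xml_bytes):
    """Replace every blank gr:original-id with a unique urn:uuid value."""
    # Keep the usual prefixes in the output instead of ns0/ns1.
    ET.register_namespace("", ATOM_NS)
    ET.register_namespace("gr", GR_NS)
    root = ET.fromstring(xml_bytes)
    for el in root.iter():
        if ORIGINAL_ID in el.attrib and not el.attrib[ORIGINAL_ID].strip():
            el.set(ORIGINAL_ID, uuid.uuid4().urn)
    return ET.tostring(root, encoding="utf-8")


# Hypothetical feed: one blank id to fill, one existing id to keep.
sample = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <entry><id gr:original-id="">tag:example,2013:1</id><title>first</title></entry>
  <entry><id gr:original-id="http://example.com/2">tag:example,2013:2</id><title>second</title></entry>
</feed>"""

fixed = fill_blank_original_ids(sample)
ids = [el.get(ORIGINAL_ID) for el in ET.fromstring(fixed).iter("{%s}id" % ATOM_NS)]
print(ids)
```

Existing ids are left alone; only the blank ones get a fresh random value, so the generated ids change on every run.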

surf · 02-10-2013, 09:29 AM

Quote:

Originally Posted by kovidgoyal

Simply add the original-id attribute yourself before parsing using the uuid module.

Hi, Kovid, thanks for the tip!

It seems that I need to write a new parse_feeds to override the built-in one.
I know how to handle soup in parse_index.
But after reading the source code of parse_feeds, I'm confused about how to preprocess the XML to add an attribute.

Could you give me some clues?

Thanks!
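
One way to sidestep rewriting parse_feeds, assuming the feed is a local file as in the recipe above, might be to patch the XML into a copy first and point the recipe at that copy. This is only a sketch: write_patched_copy and the transform argument are hypothetical names, and the transform is whatever function adds the missing ids.

```python
import tempfile
from pathlib import Path


def write_patched_copy(src_path, transform, out_dir=None):
    """Apply `transform` (bytes -> bytes) to a feed file.

    Writes the result next to nothing else in `out_dir` (a fresh temp
    directory by default) and returns a file:// URL for the patched copy.
    """
    src = Path(src_path)
    out_dir = Path(out_dir) if out_dir else Path(tempfile.mkdtemp())
    dst = out_dir / (src.stem + "-patched" + src.suffix)
    dst.write_bytes(transform(src.read_bytes()))
    return dst.resolve().as_uri()


# Demo with a throwaway file and a trivial transform (both hypothetical).
demo_dir = Path(tempfile.mkdtemp())
demo_src = demo_dir / "feed.xml"
demo_src.write_bytes(b'<feed><entry id=""/></feed>')
url = write_patched_copy(demo_src, lambda raw: raw.replace(b'""', b'"demo-1"'), demo_dir)
print(url)
```

The returned URL could then be used in the recipe's feeds list in place of the original file path.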

surf · 02-10-2013, 10:53 AM

Quote:

Originally Posted by kovidgoyal

Simply add the original-id attribute yourself before parsing using the uuid module.

Hi Kovid, I included the following code to avoid exporting/sending a blank ePub,
but it does not work: a blank ePub is still produced when no new feed items have been published.
Could you have a look? Thanks!

Code:

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)

        # Drop feeds that contain no articles when remove_empty_feeds is set
        if self.remove_empty_feeds:
            feeds = [f for f in feeds if len(f) > 0]

        # Abort instead of producing an empty ePub when nothing is left
        if not feeds:
            self.abort_recipe_processing('')

        return feeds
