Old 02-08-2013, 04:07 PM   #1
surf
Member
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
Two XMLs cannot be parsed by Calibre

[Sorry for replacing my previous thread with this one.]

My problem is that some XML files cannot be handled by calibre's built-in parser.
Please see the two problematic examples in the attachments.
For both, only the first item was exported.

I would appreciate your advice on what I can do.

Thanks!
Attached Files
File Type: xml pipe.xml (10.1 KB, 67 views)
File Type: xml NYT.xml (24.1 KB, 69 views)
Old 02-09-2013, 01:21 AM   #2
kovidgoyal
creator of calibre
Posts: 25,262
Karma: 4961457
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.
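Before digging into feedparser itself, it can help to confirm that the file is well-formed XML and to count how many entries it really contains, so the number can be compared with what calibre fetched. The following is a minimal sketch that uses only Python's standard library (it does not call feedparser or calibre); the function name `count_feed_items` and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def count_feed_items(xml_text):
    """Return the number of Atom <entry> or RSS <item> elements.

    Raises xml.etree.ElementTree.ParseError if the text is not
    well-formed XML.
    """
    root = ET.fromstring(xml_text)
    count = 0
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip anything that is not a regular element
        # Tags look like '{namespace}entry'; keep only the local name.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name in ('entry', 'item'):
            count += 1
    return count


# Hypothetical two-entry Atom feed, for illustration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>demo</title>
  <entry><id>a</id><title>first</title></entry>
  <entry><id>b</id><title>second</title></entry>
</feed>"""

if __name__ == '__main__':
    print(count_feed_items(SAMPLE))
```

If the count matches the number of items you expect but calibre still fetches fewer, the file is probably well-formed and the cause lies elsewhere (recipe settings or the content of individual entries).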
Old 02-09-2013, 02:39 AM   #3
surf
Quote:
Originally Posted by kovidgoyal View Post
calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.
It seems to be a difficult job.

Could you give me some clues?

Or do I need to learn something about feedparser?

Thanks!
Old 02-09-2013, 02:40 AM   #4
kovidgoyal
http://pythonhosted.org/feedparser/index.html
Old 02-09-2013, 02:41 AM   #5
surf
Quote:
Originally Posted by kovidgoyal View Post
calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.
Hi, Kovid

Could you advise which calibre API I can use to fix the XML?

Or is there an example recipe?

Thanks!
Old 02-09-2013, 02:55 AM   #6
surf
Quote:
Originally Posted by kovidgoyal View Post
Hi, Kovid

Attached are two XML files, both generated by the Google Reader API, but
one can be parsed and the other cannot.
The two files should have basically the same structure.

Could you take a quick look?

Many thanks!
Attached Files
File Type: xml GR-DF (all items fetched by calibre).xml (16.6 KB, 39 views)
File Type: xml GR-NYT (only 1st item fetched by calibre).xml (24.1 KB, 41 views)
Old 02-09-2013, 03:35 AM   #7
surf
Quote:
Originally Posted by kovidgoyal View Post
calibre uses feedparser to parse feeds. You will need to fix whatever is in the XML that is preventing them from being parsed.
Here is another XML file containing 3 items, but only the last two items were fetched by calibre.
Attached Files
File Type: xml GR-DF (only last two items fetched by calibre).xml (15.1 KB, 40 views)
Old 02-09-2013, 03:48 AM   #8
kovidgoyal
I don't have time to investigate parsing issues in feedparser, if it is indeed a parsing issue, but make sure you have set the oldest_article setting in your recipe correctly.
Old 02-09-2013, 05:06 AM   #9
surf
Quote:
Originally Posted by kovidgoyal View Post
I don't have time to investigate parsing issues in feedparser, if it is indeed a parsing issue, but make sure you have set the oldest_article setting in your recipe correctly.
Dear Sir,

Thanks for the pointer.
The problem in #6 was due to oldest_article.
After I changed to the values below, all 3 items in the XML from #6 were fetched:
oldest_article = 365
max_articles_per_feed = 400

But for the other XML files, still only the first item is fetched.

In any case, thanks a lot!
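
To make the effect of the two settings concrete, here is a rough sketch of an age cutoff plus a per-feed cap. It only illustrates the idea and is not calibre's own code; the function `filter_recent`, the sample articles and the fixed "now" date are all hypothetical.

```python
from datetime import datetime, timedelta


def filter_recent(articles, oldest_article_days, max_articles, now=None):
    """Keep articles newer than the cutoff, up to max_articles.

    `articles` is a list of (title, published_datetime) tuples.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=oldest_article_days)
    recent = [a for a in articles if a[1] >= cutoff]
    return recent[:max_articles]


# Hypothetical data: three articles of different ages.
NOW = datetime(2013, 2, 9)
ARTICLES = [
    ('two days old', datetime(2013, 2, 7)),
    ('three weeks old', datetime(2013, 1, 19)),
    ('five months old', datetime(2012, 9, 9)),
]

if __name__ == '__main__':
    # A 7-day window keeps only the newest article.
    print([t for t, _ in filter_recent(ARTICLES, 7, 100, now=NOW)])
    # A 365-day window keeps all three.
    print([t for t, _ in filter_recent(ARTICLES, 365, 400, now=NOW)])
```

With a short window, older items in a feed are silently skipped, which would look the same as a parsing failure from the outside.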

Last edited by surf; 02-09-2013 at 05:10 AM.
Old 02-09-2013, 05:21 AM   #10
surf
Quote:
Originally Posted by kovidgoyal View Post
I don't have time to investigate parsing issues in feedparser, if it is indeed a parsing issue, but make sure you have set the oldest_article setting in your recipe correctly.
Hi Kovid, sorry to trouble you again.

I'm trying to narrow down the issue.

For the attached XML (very simple, containing 3 items), calibre fetches only the first item.

My recipe and the log are below.

Code:
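from calibre.web.feeds.news import BasicNewsRecipe


class GoogleReader(BasicNewsRecipe):

    title = 'z - GR-pipe-WSJ'
    description = ''
    __author__ = 'Surf'

    # Accept articles up to a year old, and up to 400 per feed.
    oldest_article = 365
    max_articles_per_feed = 400

    # Use the article content embedded in the feed itself.
    use_embedded_content = True
    auto_cleanup = False

    # Local copy of the feed XML.
    feeds = [(u'GR-pipe-WSJ', 'file:///D:/GR-pipe-WSJ.xml')]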
Fetch news from z - GR-NYT
Resolved conversion options
calibre version: 0.8.68
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_download_recipe': False,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'enable_heuristics': False,
'epub_flatten': False,
'extra_css': None,
'extract_to': None,
'filter_css': None,
'fix_indents': True,
'flow_size': 260,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x03C73810>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'no_chapters_in_toc': False,
'no_default_epub_cover': False,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.GenericEink object at 0x03C73A10>,
'page_breaks_before': None,
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'tags': None,
'test': False,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: Recipe Input running
Synthesizing mastheadImage
Downloading
Fetching file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html
WARNING: Encoding detection confidence 99%
Processing images...
Fetching http://g1.cn.nytimes.com/images/2010...icleInline.jpg

Recursion limit reached. Skipping links in file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html
file:C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\xj7edb_feeds2disk.html saved to C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\jcjhdy_plumber\feed_0\article_0\xj7edb_feeds2disk.xhtml
Downloaded article: “虚拟中产阶级”的崛起 from http://cn.nytimes.com/tools/r.html?f...iedman%2F&cid=
Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/index.html as HTML

Parsing feed_0/article_0/index.html ...
Forcing feed_0/article_0/index.html into XHTML namespace
Referenced file u'feed_1/index.html' not found
Reading TOC from NCX...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 6 items of level: div_1
Found 2 items of level: div_2
Found 16 items of level: p_2
Found 1 items of level: div_4
Ignoring level p_2
Ignoring level div_4
div_1 left margin stats: Counter({u'': 1})
div_1 right margin stats: Counter({u'': 1})
div_2 left margin stats: Counter()
div_2 right margin stats: Counter()
Cleaning up manifest...
Trimming unused files from manifest...
Creating EPUB Output...
Found non-unique filenames, renaming to support broken EPUB readers like FBReader, Aldiko and Stanza...
{u'feed_0/article_0/index.html': u'feed_0/article_0/index_u2.html',
u'feed_0/index.html': u'feed_0/index_u1.html'}
Rescaling image from 590x750 to 566x720 cover.jpg
Rescaling image from 600x60 to 566x56 mastheadImage.jpg
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in feed_0/article_0/index_u2.html...
No large trees found
Looking for large trees in index.html...
No large trees found
Looking for large trees in feed_0/index_u1.html...
No large trees found
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to C:\Users\LMH\AppData\Local\Temp\calibre_0.8.68_tmp_nrhgty\j7m479_recipe_out.epub
Attached Files
File Type: xml GR-NYT.xml (24.1 KB, 33 views)
File Type: epub GR-NYT (output).epub (137.6 KB, 28 views)
Old 02-09-2013, 07:18 AM   #11
surf
Member
surf began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Feb 2013
Device: kindle
Hi, Kovid,

I finally tracked down the problem.

For Atom feeds generated by the Google Reader API, in some cases the value of each entry's "gr:original-id" is blank (''), and then calibre parses only the FIRST entry and ignores all the others.

I'm trying to find a solution to this.
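
A quick way to check whether a given file is affected is to count how many "original-id" attributes it has and how many of them are blank. This is a minimal sketch using only the standard library; the function name `blank_original_ids` is made up, and the sample feed (including the namespace URI bound to `gr` and the placement of the attribute on `<id>`) is an assumption for illustration, not copied from the attached files.

```python
import xml.etree.ElementTree as ET


def blank_original_ids(xml_text):
    """Return (total, blank) counts of 'original-id' attributes.

    The attribute is matched by local name, so the exact namespace
    URI bound to the 'gr' prefix does not matter here.
    """
    total = blank = 0
    for elem in ET.fromstring(xml_text).iter():
        for name, value in elem.attrib.items():
            if name.rsplit('}', 1)[-1] == 'original-id':
                total += 1
                if not value.strip():
                    blank += 1
    return total, blank


# Hypothetical feed: one entry with an id, two with blank ids.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <entry><id gr:original-id="http://example.com/a">tag:1</id></entry>
  <entry><id gr:original-id="">tag:2</id></entry>
  <entry><id gr:original-id="">tag:3</id></entry>
</feed>"""

if __name__ == '__main__':
    print(blank_original_ids(SAMPLE))
```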

Last edited by surf; 02-09-2013 at 07:28 AM.
Old 02-10-2013, 06:19 AM   #12
kovidgoyal
Simply add the original-id attribute yourself before parsing using the uuid module.
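
One way this suggestion could look in practice is sketched below: parse the raw feed text, give every blank "original-id" a freshly generated UUID, and serialize it back before handing it to the feed parser. It uses only the standard library; the function name `fill_blank_original_ids` and the sample feed are hypothetical, and the sketch has not been tested against calibre or feedparser.

```python
import uuid
import xml.etree.ElementTree as ET


def fill_blank_original_ids(xml_text):
    """Return the feed XML with every blank 'original-id' attribute
    replaced by a freshly generated UUID (as a urn:uuid: string)."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Copy the items so the attribute dict can be updated safely.
        for name, value in list(elem.attrib.items()):
            is_original_id = name.rsplit('}', 1)[-1] == 'original-id'
            if is_original_id and not value.strip():
                elem.set(name, uuid.uuid4().urn)
    return ET.tostring(root, encoding='unicode')


# Hypothetical feed; the namespace URI for 'gr' is an assumption.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <entry><id gr:original-id="http://example.com/a">tag:1</id></entry>
  <entry><id gr:original-id="">tag:2</id></entry>
  <entry><id gr:original-id="">tag:3</id></entry>
</feed>"""

if __name__ == '__main__':
    print(fill_blank_original_ids(SAMPLE))
```

Note that ElementTree rewrites namespace prefixes on output (for example `ns0:`, `ns1:`). The result is still valid XML, but whether that matters to the parser downstream is not verified here. How to feed the fixed text back into a recipe depends on the calibre version and is left out.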
Old 02-10-2013, 09:29 AM   #13
surf
Quote:
Originally Posted by kovidgoyal View Post
Simply add the original-id attribute yourself before parsing using the uuid module.
Hi Kovid, thanks for the pointer!

It seems that I need to write a new parse_feeds to override the built-in one.
I know how to handle soup in parse_index.
After reading the source code of parse_feeds, I'm confused about how to preprocess the XML to add an attribute.

Could you give me some clues?

Thanks!

Last edited by surf; 02-10-2013 at 09:36 AM.
Old 02-10-2013, 10:53 AM   #14
surf
Quote:
Originally Posted by kovidgoyal View Post
Simply add the original-id attribute yourself before parsing using the uuid module.
Hi Kovid, I included the following code to avoid exporting/sending a blank ePub,
but it does not work: a blank ePub is still produced when no new feed items have been published.
Could you have a look? Thanks!

Code:
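    def parse_feeds(self):
        # Let calibre build the feed list first.
        feeds = BasicNewsRecipe.parse_feeds(self)

        # Drop feeds with no articles when remove_empty_feeds is set.
        if self.remove_empty_feeds:
            feeds = [f for f in feeds if len(f) > 0]

        # Abort instead of producing an empty e-book.
        if len(feeds) == 0:
            self.abort_recipe_processing('No new articles found')

        return feeds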
All times are GMT -4.