During my Bible-creation saga, I have done the following:
1. Cleaned up as much as possible from the files (I use the full power of Linux and Perl regexes, plus this great website to test and learn regex:
http://gskinner.com/RegExr/ ). This way I removed:
- all the CSS I knew of (style)
- fonts, colors, JS
It's now simple HTML, nothing more. I don't think there is anything more I can clean (besides the text itself).
2. Merged all the files into one (this way I ended up with a ~20 MB HTML file).
3. Created the TOC at the beginning of the file (so that the TOC can be created when I set breadth-first (bf) instead of depth-first).
4. Imported it into calibre (with calibredb, as the GUI crashes), and now it's a zip.
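To give an idea of the kind of cleanup in step 1, here is a minimal sketch in Python (the actual work was done with Perl one-liners, which are not reproduced here). The patterns and the `clean_html` name are illustrative assumptions, not the exact expressions I ran:

```python
import re

# Illustrative patterns only; the exact expressions used on the real
# files are not listed in this post.
_PATTERNS = [
    # whole <script> ... </script> blocks (JS)
    re.compile(r"<script\b.*?</script\s*>", re.I | re.S),
    # whole <style> ... </style> blocks (CSS)
    re.compile(r"<style\b.*?</style\s*>", re.I | re.S),
    # opening and closing <font> tags, keeping the text inside them
    re.compile(r"</?font\b[^>]*>", re.I),
    # inline presentation attributes such as style="..." or color=red
    re.compile(
        r"""\s+(?:style|color|face|bgcolor)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        re.I,
    ),
]


def clean_html(html):
    """Return html with scripts, styles, font tags and colour attributes removed."""
    for pattern in _PATTERNS:
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    sample = (
        '<p style="margin:0" class="v">'
        '<font color="red" face="Arial">In the beginning</font></p>'
    )
    print(clean_html(sample))
```

Regexes like these do not understand HTML, so the attribute pattern would also hit matching text outside tags; on a real file the result should be checked by eye.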
Now I tried:
5a. Exporting it to EPUB (with split on) -> after a few steps in the split process, it gives a MemoryError (see the log in my previous post).
5b. Exporting it to MOBI -> gives an error (see the log in my previous post).
5c. Exporting it to EPUB without splitting (I set the split threshold above the size of the HTML, e.g. 30 MB); it still tries to split for some reason, and I again get a MemoryError in the split (right at the beginning of the split) -> see the log here:
calibre, version 0.8.27
ERROR: Conversion Error: <b>Failed</b>: Convert book 1 of 1 (Biblia Ortodoxa sau Sfânta Scriptură adnotata Bartolomeu Anania)
Convert book 1 of 1 (Biblia Ortodoxa sau Sfânta Scriptură adnotata Bartolomeu Anania)
Processing archive...
Resolved conversion options
calibre version: 0.8.27
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0.0,
'book_producer': None,
'breadth_first': False,
'change_justification': u'original',
'chapter': u'/',
'chapter_mark': u'none',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_package': False,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'enable_heuristics': False,
'epub_flatten': False,
'extra_css': None,
'extract_to': None,
'filter_css': u'',
'fix_indents': True,
'flow_size': 30000,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x03F631B0>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': u'//h:h1',
'level2_toc': u'//h:h2',
'level3_toc': u'//h:h3',
'line_height': 0.0,
'linearize_tables': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_levels': 5,
'max_toc_links': 100,
'minimum_line_height': 120.0,
'no_chapters_in_toc': False,
'no_default_epub_cover': True,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.GenericEink object at 0x03F633B0>,
'page_breaks_before': u'/',
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': 'c:\\temp\\calibre_0.8.27_tmp_b5wlxt\\e3wgx7.opf',
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': u'',
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': None,
'sr1_search': None,
'sr2_replace': None,
'sr2_search': None,
'sr3_replace': None,
'sr3_search': None,
'tags': None,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: HTML Input running
on c:\temp\calibre_0.8.27_tmp_b5wlxt\zffyot_plumber_archive\content.opf
Parsing all content...
Manifest item 'toc.ncx' not found
Parsing _allhtm.htm ...
Parsing index.htm ...
Generating default TOC from spine...
Merging user specified metadata...
Detecting structure...
Auto generated TOC with 93 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Parsing stylesheet.css ...
Found 541 items of level: p_10
Found 103 items of level: p_11
Found 8 items of level: div_1
Found 4222 items of level: div_3
Found 27 items of level: div_7
Found 42505 items of level: div_6
Found 1 items of level: div_10
Found 20 items of level: p_8
Found 14 items of level: p_9
Found 65 items of level: p_6
Found 21 items of level: p_7
Found 1071 items of level: p_3
Found 1336 items of level: p_1
Ignoring level div_10
Ignoring level p_7
Ignoring level p_8
Ignoring level p_9
p_10 left margin stats: Counter({u'0': 541})
p_10 right margin stats: Counter({u'0': 541})
p_11 left margin stats: Counter({u'0': 103})
p_11 right margin stats: Counter({u'0': 103})
div_1 left margin stats: Counter()
div_1 right margin stats: Counter()
div_3 left margin stats: Counter({u'': 4222})
div_3 right margin stats: Counter({u'': 4222})
div_7 left margin stats: Counter({u'': 27})
div_7 right margin stats: Counter({u'': 27})
div_6 left margin stats: Counter({u'': 42505})
div_6 right margin stats: Counter({u'': 42505})
p_6 left margin stats: Counter({u'0': 65})
p_6 right margin stats: Counter({u'0': 65})
p_3 left margin stats: Counter({u'0': 1071})
p_3 right margin stats: Counter({u'0': 1071})
p_1 left margin stats: Counter({u'0': 1336})
p_1 right margin stats: Counter({u'0': 1336})
Cleaning up manifest...
Trimming unused files from manifest...
Creating EPUB Output...
Rescaling image from 861x1159 to 558x751 06-palestina-vechiului-testament.jpg
Rescaling image from 945x613 to 566x367 01-vechiul-orient.jpg
Rescaling image from 1722x958 to 566x315 05-calatoria-captivitatii-apostolului-pavel.jpg
Rescaling image from 1732x2376 to 547x751 03-ierusalimul-noului-testament.jpg
Rescaling image from 1704x1278 to 566x425 04-calatoriile-misionare-ale-apostolului-pavel.jpg
Rescaling image from 1749x2370 to 554x751 07-palestina-noului-testament.jpg
Looking for large trees in _allhtm.htm...
Found large tree #0
Splitting...
Split point: {http://www.w3.org/1999/xhtml}div /*/*[2]/*[720]
Python function terminated unexpectedly
(Error Code: 1)
Traceback (most recent call last):
File "site.py", line 132, in main
File "site.py", line 109, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 187, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 31, in gui_convert_override
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 1087, in run
File "site-packages\calibre\ebooks\epub\output.py", line 169, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 57, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 67, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 205, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 406, in split_to_size
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 27, in tostring
File "lxml.etree.pyx", line 2860, in lxml.etree.tostring (src/lxml/lxml.etree.c:53681)
File "serializer.pxi", line 139, in lxml.etree._tostring (src/lxml/lxml.etree.c:87439)
MemoryError
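Reading the traceback, the failure happens inside `split_to_size`, at the point where `tostring` serializes the tree. The sketch below is not calibre's code; it uses the standard library's `xml.etree.ElementTree` instead of lxml, and the `build_body` and `serialized_size` names are made up for illustration. It only shows that finding a tree's size by serializing it means building the whole output in memory at once:

```python
import xml.etree.ElementTree as ET


def build_body(n_divs, text="verse text"):
    """Build a flat <body> holding n_divs small <div> children."""
    body = ET.Element("body")
    for i in range(n_divs):
        div = ET.SubElement(body, "div")
        div.text = "%s %d" % (text, i)
    return body


def serialized_size(element):
    """Return the size in bytes of the element once serialized.

    The complete bytes object is created just to take its length, so the
    memory needed grows with the size of the tree.
    """
    return len(ET.tostring(element))


if __name__ == "__main__":
    for n in (10, 1000, 40000):
        print(n, serialized_size(build_body(n)))
```

With tens of thousands of `div` elements under one parent, as in the log above, every such measurement has to allocate a buffer on the order of the whole document.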
I am completely out of ideas... I think I have found the book best suited to making calibre crash.
Here is the book (it's in Romanian, but I don't think that matters, in case you want to see how clean the HTML is):
a) The book before I import it into calibre:
HERE - but it will take a few hours to import
b) The book as it appears in the calibre repository:
HERE (this is the one I tried to export to various formats: EPUB, MOBI, EPUB without split).
Note: there are some places where the characters are non-ASCII (in around 20 words across the 20 MB), but they never caused any issue.
If anyone can offer help or ideas on what I'm doing wrong or what else I should try, or can review the HTML/zip above, please let me know.
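For anyone who wants to reproduce attempt 5c outside the GUI: as far as I can tell, the `flow_size` and `dont_split_on_page_breaks` values in the log correspond to the `--flow-size` (in KB) and `--dont-split-on-page-breaks` options of `ebook-convert`. Here is a minimal sketch in Python that only builds and prints such a command. The file names `bible.zip` and `bible.epub` are placeholders, and I have not verified that this reproduces exactly what the GUI does:

```python
import shlex


def build_convert_command(src, dst, flow_size_kb=30000):
    """Build an ebook-convert argument list mirroring the options in the log.

    src and dst are placeholder paths; flow_size_kb=30000 matches the
    'flow_size': 30000 entry shown in the conversion options.
    """
    return [
        "ebook-convert", src, dst,
        "--flow-size", str(flow_size_kb),   # split threshold, in KB
        "--dont-split-on-page-breaks",      # 'dont_split_on_page_breaks': True
        "-v", "-v",                         # 'verbose': 2
    ]


if __name__ == "__main__":
    cmd = build_convert_command("bible.zip", "bible.epub")
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run it, the list could be passed to `subprocess.call(cmd)` on a machine where calibre's command-line tools are on the PATH.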