Old 02-06-2019, 02:15 PM   #1
wyoung
Junior Member
Posts: 7
Karma: 10
Join Date: Feb 2019
Device: Voice Dream Reader
Paragraph breaks being eaten in Markdown to EPUB conversion

I've got a collection of books in Markdown format, which suits them well, since these books need little formatting: chapter heads, maybe a little boldface and italic markup, maybe a horizontal rule to break up a section... that's enough.

However, I'd also like to have EPUB versions of these because that adds a few features I want such as remembering the last-read position, binding the author name into the book, etc.

The problem is that I end up with a wall-of-text no matter how I format the source Markdown files or what adjustments I make to the various paragraph styling options in the dialog Calibre presents when you click its Convert Books button.

I won't claim to have exhaustively tested all possible combinations, but I've got to be up to about 20 combinations so far. No matter what, all the paragraphs get run together until the next header break, horizontal rule, etc.

Is this a bug, or is there some secret to getting proper paragraph breaks?

In case it matters, my ideal Markdown flavor is UTF-8, LF-only line endings, and no hard line breaks in the source text. That is, each paragraph is on a single line, soft-wrapped to the window width in my Markdown editor, with at least two LFs separating each paragraph. (Sometimes I put in extra vertical space between major sections.)

I want the resulting EPUB to show a blank line between paragraphs, with no first-line indent, so that it looks approximately like the source Markdown.

You can correctly infer from my double LFs that I'm working on a POSIX type platform, macOS 10.14, specifically.

These Markdown files render just fine in all the other Markdown renderers I've tried, so I'm confident that they're well-formed. I've even run them through "od -c" to make sure there aren't some odd hidden characters causing problems, but no, they're pretty much plain ASCII with the occasional UTF-8 character. (em dashes and curly quotes, mainly.) I've got the text input option set to utf-8 in the Calibre conversion dialog.
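A rough equivalent of those byte-level checks, as a minimal Python sketch. The function name `check_markdown_bytes` and the particular set of checks are assumptions made for illustration, not anything Calibre provides: it reports CR characters, invalid UTF-8, non-ASCII characters, and lines with trailing whitespace.

```python
from collections import Counter


def check_markdown_bytes(data: bytes) -> dict:
    """Report properties of raw Markdown bytes that can confuse a converter."""
    report = {
        "has_cr": b"\r" in data,  # a CR means CRLF or CR-only line endings
        "valid_utf8": True,
        "non_ascii": Counter(),
        "trailing_space_lines": [],
    }
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        report["valid_utf8"] = False
        return report

    # Count each non-ASCII character (em dashes, curly quotes, and so on).
    report["non_ascii"] = Counter(ch for ch in text if ord(ch) > 127)

    # Record 1-based numbers of lines that end in a space or a tab.
    for number, line in enumerate(text.split("\n"), start=1):
        if line.endswith((" ", "\t")):
            report["trailing_space_lines"].append(number)
    return report


if __name__ == "__main__":
    sample = "First paragraph \u2014 with a dash.\n\nSecond paragraph.  \n".encode("utf-8")
    print(check_markdown_bytes(sample))
```

To check a real file, read it in binary mode and pass the bytes in, so that no newline translation happens before the check.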

I've also gone through these files to strip trailing spaces from lines, except in those rare cases where I put in 2 spaces at the end to force a line break within a paragraph.

I've tried hard-wrapping these paragraphs to 72 columns, but that doesn't help, and I don't want to format these docs that way anyway.

After fighting with Calibre conversion settings, I've gone and reset all the settings to the defaults to make sure it isn't some kind of configuration error on my part, and it still gives me wall-of-text EPUB output.
Old 02-06-2019, 05:31 PM   #2
PeterT
Grand Sorcerer
Posts: 10,479
Karma: 60960973
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
Why not attach an example file so we can easily see what's going on?

Also what version of calibre is being used?

Old 02-06-2019, 06:28 PM   #3
wyoung
I didn't attach a sample because I'm not doing anything fancy here. The symptom is not particular to the input text.

As presented on this forum, both of our messages are valid Markdown, so you can cut and paste either message text, save it to a plain text file called x.md, drag that to Calibre, and click the "Convert books" button in the toolbar. It'll have the "Input format" set to "MD" due to the file name extension; set the "Output format" to EPUB if necessary.

When the conversion completes, the EPUB will take precedence over the MD, so just double-click the book entry in Calibre to view the EPUB version. Here's what I see in the Calibre E-Book Reader for the first page of my own rendered text:



So, wall-of-text, as I said.

EDIT: As for the Calibre version, the above was produced with the latest, 3.39.1, but it's not a new symptom in that version. I'm only posting now because I've given up trying to hack my way around it from the end user side.

Last edited by wyoung; 02-06-2019 at 06:31 PM.
Old 02-06-2019, 06:50 PM   #4
jackie_w
Wizard
Posts: 4,760
Karma: 11363005
Join Date: Sep 2009
Location: UK
Device: PRS-350, Kobo: Aura6", H2O, GloHD, KA1, ClaraHD, Forma
@wyoung,

What calibre conversion settings have you used? Specifically those in TXT Input:
- 'Paragraph style'
- 'Formatting style'

I used 'single' and 'markdown' respectively and it seems to work OK for me.
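The same two settings can also be tried from the command line, where every option is spelled out explicitly. A minimal sketch, assuming calibre's `ebook-convert` tool is on the PATH and a file named `x.md` sits in the current directory; the helper name `build_convert_command` is hypothetical. `--paragraph-type` and `--formatting-type` are the command-line counterparts of 'Paragraph style' and 'Formatting style'.

```python
import os
import shutil
import subprocess


def build_convert_command(src, dst, paragraph_type="single", formatting_type="markdown"):
    """Build an ebook-convert argument list with explicit TXT input options."""
    return [
        "ebook-convert", src, dst,
        "--paragraph-type", paragraph_type,
        "--formatting-type", formatting_type,
    ]


if __name__ == "__main__":
    cmd = build_convert_command("x.md", "x.epub")
    print(" ".join(cmd))
    # Only attempt the conversion if the tool and the input file both exist.
    if shutil.which("ebook-convert") and os.path.exists("x.md"):
        subprocess.run(cmd, check=True)
```

If the output from this differs from what the GUI produces for the same file, that points at a saved GUI setting rather than at the input text.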
Attached thumbnail: markdown.jpg (118.4 KB)
Old 02-06-2019, 07:04 PM   #5
PeterT
Sorry, I have no further interest in helping. Had you been prepared to post a sample MD file, I would have experimented.

Good luck
Old 02-06-2019, 07:16 PM   #6
wyoung
As I wrote in the original message, I've tried a lot of settings. The above screenshot was produced with the default settings.

I've previously tried the settings you give, but just out of completeness, I've tried them again, with those two changes being the only two differences from the defaults. Same result.

That suggests a platform-specific bug, so I decided to try the Debug option in the conversion dialog.

Immediately I see that the input/*.html file is being produced incorrectly: the whole document text is inside a single HTML <p> tag.
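That symptom can be confirmed without reading the HTML by eye, by counting the `<p>` start tags in the debug output. A minimal sketch using only Python's standard library; the class and function names are illustrative, and the two HTML strings are made-up stand-ins for the file found under the debug folder's input/ directory.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Count <p> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "p":
            self.count += 1


def count_paragraphs(html: str) -> int:
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.count


if __name__ == "__main__":
    run_together = "<html><body><p>First paragraph. Second paragraph.</p></body></html>"
    separated = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
    print(count_paragraphs(run_together), count_paragraphs(separated))
```

A count of 1 for a multi-paragraph source means the paragraphs were already merged before the EPUB output stage ran.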

I found a log by clicking the Jobs button in the lower right corner, but it doesn't tell me what I really want to know, which is who produces that HTML file, and according to what rules?

I guess the "input plugin" it refers to is whatever's behind the "TXT input" tab in the Calibre Convert dialog, so is it entirely internal to Calibre? It isn't doing something like calling out to pandoc or similar, which would open us to platform-specific behavior?
Old 02-06-2019, 07:18 PM   #7
wyoung
Quote:
Originally Posted by PeterT View Post
Sorry I have no further interest in helping. Had you been prepared to post sample MD file I would have experimented.
I gave you detailed instructions for producing as many Markdown files as you like.

But if you must have one pre-prepared, I've attached the one I've been testing with.

EDIT: ...which this forum appears to have eaten, perhaps because I'm too new to be trusted to create attachments? No matter, I can fake it. Cut and paste the following text into a file called x.md:

Code:
This is a markdown file...

...with multiple paragraphs.
Result:


Last edited by wyoung; 02-06-2019 at 07:24 PM.
Old 02-06-2019, 08:30 PM   #8
davidfor
Grand Sorcerer
Posts: 16,210
Karma: 26150342
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo,Aura H2O,Glo HD,Aura ONE,Clara HD,Forma;tolino epos
Quote:
Originally Posted by wyoung View Post
I gave you detailed instructions for producing as many Markdown files as you like.

But if you must have one pre-prepared, I've attached the one I've been testing with.

EDIT: ...which this forum appears to have eaten, perhaps because I'm too new to be trusted to create attachments? No matter, I can fake it. Cut and paste the following text into a file called x.md:

Code:
This is a markdown file...

...with multiple paragraphs.
Result:

The issue is that you are seeing a problem and others are not. There are two possibilities: you are using different options that mess things up, or something is different about the files being tested. If you post a sample that doesn't work for you, then when others test it, we know that we are starting from the same place. Suggesting we take text from the forum page doesn't always work, as the problem might be a code page issue, line endings, or something else that the browser obscures.

And you should be able to attach a file as a new user. But, the forum doesn't allow .md files, so that might be it. There would have been a message when you tried to attach it, but I know I missed it now. And I usually forget to hit the "Upload" button on the attachments dialog and wonder why the attachment isn't there later.

In any case, I did what you said and put the three lines into a file. I added that to calibre (3.39.1 on a Linux box) and ran the conversion without changing any options. I looked at the generated epub and I have two paragraphs. Then I repeated the test using a .txt file, as .md cannot be attached here. Same results. I have attached the input .txt file and the generated epub for you to see.

Below is the output log from the conversion. This includes the options used for the conversion. Comparing yours to this might give a hint for what is going on.

Code:
Convert book 1 of 1 (Markdown)
Conversion options changed from defaults:
  read_metadata_from_opf: u'/tmp/calibre_3.39.1_tmp_B2IXMv/7LBSRa.opf'
  duplicate_links_in_toc: True
  verbose: 2
  output_profile: 'tablet'
  markdown_extensions: u'tables, footnotes, toc'
Resolved conversion options
calibre version: 3.39.1
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0.0,
 'book_producer': None,
 'change_justification': u'original',
 'chapter': u"//*[((name()='h1' or name()='h2') and re:test(., '\\s*((chapter|book|section|part)\\s+)|((prolog|prologue|epilogue)(\\s+|$))', 'i')) or @class = 'chapter']",
 'chapter_mark': u'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'dehyphenate': True,
 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False,
 'dont_split_on_page_breaks': False,
 'duplicate_links_in_toc': True,
 'embed_all_fonts': False,
 'embed_font_family': None,
 'enable_heuristics': False,
 'epub_flatten': False,
 'epub_inline_toc': False,
 'epub_toc_at_end': False,
 'epub_version': u'2',
 'expand_css': False,
 'extra_css': None,
 'extract_to': None,
 'filter_css': u'',
 'fix_indents': True,
 'flow_size': 260,
 'font_size_mapping': None,
 'format_scene_breaks': True,
 'formatting_type': u'auto',
 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x7ffb6b0119d0>,
 'insert_blank_line': False,
 'insert_blank_line_size': 0.5,
 'insert_metadata': False,
 'isbn': None,
 'italicize_common_cases': True,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0.0,
 'linearize_tables': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'markdown_extensions': u'tables, footnotes, toc',
 'markup_chapter_headings': True,
 'max_toc_links': 50,
 'minimum_line_height': 120.0,
 'no_chapters_in_toc': False,
 'no_default_epub_cover': False,
 'no_inline_navbars': False,
 'no_svg_cover': False,
 'output_profile': <calibre.customize.profiles.TabletOutput object at 0x7ffb6b02c610>,
 'page_breaks_before': u"//*[name()='h1' or name()='h2']",
 'paragraph_type': u'auto',
 'prefer_metadata_cover': False,
 'preserve_cover_aspect_ratio': False,
 'preserve_spaces': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': u'/tmp/calibre_3.39.1_tmp_B2IXMv/7LBSRa.opf',
 'remove_fake_margins': True,
 'remove_first_image': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True,
 'replace_scene_breaks': u'',
 'search_replace': '[]',
 'series': None,
 'series_index': None,
 'smarten_punctuation': False,
 'sr1_replace': None,
 'sr1_search': None,
 'sr2_replace': None,
 'sr2_search': None,
 'sr3_replace': None,
 'sr3_search': None,
 'start_reading_at': None,
 'subset_embedded_fonts': False,
 'tags': None,
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'toc_title': None,
 'transform_css_rules': '[]',
 'txt_in_remove_indents': False,
 'unsmarten_punctuation': False,
 'unwrap_lines': True,
 'use_auto_toc': False,
 'verbose': 2}
InputFormatPlugin: TXT Input running
on /tmp/calibre_3.39.1_tmp_B2IXMv/bhxWev.txt
Reading text from file...
Detected input encoding as ascii with a confidence of 100.0%
Auto detected paragraph type as unformatted
Auto detected formatting as heuristic
Running text through basic conversion...
Language not specified
Creator not specified
Building file list...
	Found files...
		 HTMLFile:0:a:/tmp/calibre_3.39.1_tmp_B2IXMv/index-2.html
Normalizing filename cases
Rewriting HTML links
Parsing index-2.html ...
*********  Heuristic processing HTML  *********
flow is too short, not running heuristics
Forcing index-2.html into XHTML namespace
Merging user specified metadata...
Detecting structure...
Auto generated TOC with 0 entries.
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 3 items of level: p_1
Ignoring level p_1
Cleaning up manifest...
Trimming unused files from manifest...
Creating EPUB Output...
Splitting markup on page breaks and flow limits, if any...
	Looking for large trees in index-2.html...
	No large trees found
Generating default cover
This EPUB file has no Table of Contents. Creating a default TOC
EPUB output written to /tmp/calibre_3.39.1_tmp_B2IXMv/YKChvq.epub



As for how calibre does the conversion: it uses internal libraries. For Markdown, it looks to be a third-party Python library, but it is included in the calibre codebase for all platforms.
Attached Files
File Type: epub Markdown - Unknown.epub (134.9 KB, 13 views)
File Type: txt Markdown - Unknown.txt (57 Bytes, 11 views)
Old 02-06-2019, 09:13 PM   #9
wyoung
Quote:
Originally Posted by davidfor View Post
you are using different options
By resetting my Calibre settings per these instructions and diffing my old settings against that default set, I've managed to narrow the problem down to a single setting: Prefs > Input options > TXT input > Remove indents at the beginning of lines.

Apparently the "Restore defaults" button is page-specific, and I didn't manage to reset this particular page in my less drastic settings resets earlier.

Anyway, with this setting enabled, you get the symptom I've shown above.

This has got to be a bug: there are no "indents at the beginning of lines" to remove! I've placed my example text above on a public web server in case someone feels the need to have a byte-for-byte perfect input source to test this with. But, I really need to stress this: the bug affects pretty much any Markdown input: I've been seeing this for quite a while now, and I've got hundreds of Markdown files in my Calibre library from many sources.

I've got my solution, but I hope someone fixes this problem so it doesn't bite anyone else in the future.
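One way a "remove indents" step could produce exactly this symptom, offered purely as an illustration and not as Calibre's actual code: in Python regular expressions, `\s` also matches newline characters, so a multiline pattern such as `^\s+` swallows blank lines along with leading spaces. Both function names below are hypothetical.

```python
import re

SAMPLE = "This is a markdown file...\n\n...with multiple paragraphs.\n"


def strip_indents_greedy(text: str) -> str:
    """Hypothetical indent stripper: \\s also matches '\\n', so blank lines vanish."""
    return re.sub(r"(?m)^\s+", "", text)


def strip_indents_safe(text: str) -> str:
    """Hypothetical alternative: only spaces and tabs count as indentation."""
    return re.sub(r"(?m)^[ \t]+", "", text)


if __name__ == "__main__":
    print(repr(strip_indents_greedy(SAMPLE)))
    print(repr(strip_indents_safe(SAMPLE)))
```

With the blank line gone, a Markdown processor sees the two lines as one paragraph, which would match the single `<p>` seen in the debug output even though the source has no indents at all.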
Old 02-06-2019, 09:22 PM   #10
wyoung
Incidentally, if there's a setting one of the developers of Calibre can enable that will sort the keys of the JSON settings objects, that'd make it a lot easier to do this sort of settings directory diffing.

I get that these JSON files are serialized from Python dictionaries, whose keys are stored in a semi-unpredictable order, but some JSON serialization libraries offer an option to sort the keys for this very sort of reason. Perl's JSON module calls this "canonical" form, for instance.
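In the meantime, the files can be normalised before diffing. A minimal sketch using Python's standard `json` module, whose `sort_keys=True` option gives a stable key order. It assumes the settings files parse as plain JSON; the function name `canonical_json` and the sample settings are invented for illustration and are not Calibre's actual file contents.

```python
import difflib
import json


def canonical_json(text: str) -> str:
    """Re-serialise JSON with sorted keys and fixed indentation for stable diffs."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Same keys in a different order, with one value changed.
    old = '{"b": 1, "a": {"y": true, "x": false}}'
    new = '{"a": {"x": false, "y": false}, "b": 1}'
    diff = difflib.unified_diff(
        canonical_json(old).splitlines(),
        canonical_json(new).splitlines(),
        "old", "new", lineterm="",
    )
    print("\n".join(diff))
```

After normalising, only real value changes show up in the diff; differences in key order disappear.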
Old 02-06-2019, 10:01 PM   #11
davidfor
Quote:
Originally Posted by wyoung View Post
By resetting my Calibre settings per these instructions and diffing my old settings against that default set, I've managed to narrow the problem down to a single setting: Prefs > Input options > TXT input > Remove indents at the beginning of lines.

Apparently the "Restore defaults" button is page-specific, and I didn't manage to reset this particular page in my less drastic settings resets earlier.
Yes, I see that behaviour as well.
Quote:
Anyway, with this setting enabled, you get the symptom I've shown above.

This has got to be a bug: there are no "indents at the beginning of lines" to remove! I've placed my example text above on a public web server in case someone feels the need to have a byte-for-byte perfect input source to test this with. But, I really need to stress this: the bug affects pretty much any Markdown input: I've been seeing this for quite a while now, and I've got hundreds of Markdown files in my Calibre library from many sources.

I've got my solution, but I hope someone fixes this problem so it doesn't bite anyone else in the future.
Yes, it is probably a bug. If you report it at https://bugs.launchpad.net/calibre, it should get fixed.
Old 02-06-2019, 10:20 PM   #12
wyoung
Quote:
Originally Posted by davidfor View Post
report it
Done.
All times are GMT -4.