MobileRead Forums - View Single Post - reformatting: text with unwanted linebreaks

ldolse · 12-22-2010, 08:14 PM

Quote:

Originally Posted by kiwidude

I know I am at risk of going O/T with Calibre discussion in a Sigil forum here but this is all related to recommended ways of conversion to make the Sigil editing work easier...

One thing I have found with Calibre is that, because of the way it stores the conversion metadata, I have to be careful to "unselect" options when doing different conversions. That is, I always want EPUB to be my "master copy" since it converts so easily to other formats. So the first conversion will be from something else to EPUB for tidying up in Sigil. After that I need to convert to MOBI for use on my Kindle. However, I found I need to make sure I deselect any Calibre conversion options before I do the EPUB->MOBI conversion, or else some of my careful Sigil work gets undone.

Is this what you would expect or am I doing something wrong? Because of this I don't really set much in the way of "global defaults" for conversions since so many settings are common to all formats but you actually only want them to be applied to the first conversion. The "re-run" factor to other formats becomes an issue when you turn these things on. Maybe I just got unlucky or imagined it...

It might be worth opening a thread in the Calibre forum with more details about what it changes. Calibre shouldn't do much, but some of the 'look and feel' options can alter your content: in particular font size rescaling and line spacing handling, and margins may get added based on your output profile. The other thing is that when you go from EPUB to MOBI you're downgrading from HTML 4 to HTML 3.2. There are a lot of things you can do in EPUB that aren't supported in MOBI, and Calibre has to change the content to cope with that.

Quote:

Originally Posted by tscamera

If you live in a non-English-speaking country and are cleaning up text in your own language, don't forget to add the additional characters that words may end with to the formula!
E.g. German (ß)

The full list of characters I've put together so far would be:

Code:

([a-zäëïöüàèìòùáćéíóńśúâêîôûçąężı,:)\IA\u00DF]|(?<!\&\w{4});)
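
For anyone adding characters for their own language, here is a minimal Python 3 sketch of how this character class can be tested and extended. The names `looks_wrapped` and `extra_chars` are made up for illustration. It also assumes that `\IA` was meant as the literal letters I and A, since current Python 3 rejects `\I` in a pattern as a bad escape.

```python
import re

# Base class copied from the list above. Assumption: "\IA" is written as
# plain "IA", because Python 3's re module rejects "\I" as a bad escape.
BASE_CHARS = r"a-zäëïöüàèìòùáćéíóńśúâêîôûçąężı,:)IA\u00DF"


def looks_wrapped(line, extra_chars=""):
    """Return True if the line ends in a character that suggests a hard wrap.

    extra_chars is a hypothetical hook for language-specific letters that
    are missing from the base list.
    """
    pattern = (
        r"([" + BASE_CHARS + re.escape(extra_chars) + r"]"
        # A semicolon also counts, unless it closes an entity like &nbsp;
        r"|(?<!&\w{4});)$"
    )
    return re.search(pattern, line) is not None
```

Usage would be something like `looks_wrapped("proč", extra_chars="č")` for Czech text, where `č` is not in the base list.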

This is the full regex that I use for unwrapping:

Code:

(?<=.{85}([a-zäëïöüàèìòùáćéíóńśúâêîôûçąężı,:)\IA\u00DF]|(?<!\&\w{4});))\s*</(span|p|div)>\s*(</(p|span|div)>)?\s*(?P<up2threeblanks><(p|span|div)[^>]*>\s*(<(p|span|div)[^>]*>\s*</(span|p|div)>\s*)</(span|p|div)>\s*){0,3}\s*<(span|div|p)[^>]*>\s*(<(span|div|p)[^>]*>)?\s*

The number 85 at the beginning should be changed to the median line length for your document. This doesn't require any \1 \2 replacement, as it uses a zero-width lookbehind on the last character of the first line and doesn't bother matching the first letter of the second line, so the whole match can be replaced with a single space. I use this one with Python, so it may need a few tweaks to work with Sigil's regex engine.
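
To show how the pieces fit together, here is a rough Python 3 sketch that builds the unwrap regex for a given line length and replaces each match with a single space. The helper names (`build_unwrap_regex`, `median_line_length`, `unwrap`) are hypothetical, the median estimate is deliberately crude, and `\IA` is again assumed to mean the literal letters I and A. Treat it as a starting point rather than the exact code Calibre runs.

```python
import re
from statistics import median

# Line-ending class from the post. Assumption: "\IA" becomes plain "IA",
# since Python 3 rejects "\I" as a bad escape.
LINE_END = r"([a-zäëïöüàèìòùáćéíóńśúâêîôûçąężı,:)IA\u00DF]|(?<!&\w{4});)"


def build_unwrap_regex(length):
    """Compile the unwrap regex with `length` in place of the hard-coded 85."""
    return re.compile(
        # Lookbehind: at least `length` characters, then a "wrapped" ending
        r"(?<=.{%d}%s)" % (length, LINE_END)
        # Closing tag(s) of the first line
        + r"\s*</(span|p|div)>\s*(</(p|span|div)>)?\s*"
        # Up to three empty paragraphs in between
        + r"(?P<up2threeblanks><(p|span|div)[^>]*>\s*"
        + r"(<(p|span|div)[^>]*>\s*</(span|p|div)>\s*)</(span|p|div)>\s*){0,3}"
        # Opening tag(s) of the second line
        + r"\s*<(span|div|p)[^>]*>\s*(<(span|div|p)[^>]*>)?\s*"
    )


def median_line_length(html):
    """Crude estimate: median length of non-empty source lines, tags stripped."""
    lengths = [len(re.sub(r"<[^>]+>", "", line).strip()) for line in html.splitlines()]
    lengths = [n for n in lengths if n > 0]
    return int(median(lengths)) if lengths else 0


def unwrap(html, length):
    """Join hard-wrapped lines by replacing the tags between them with a space."""
    return build_unwrap_regex(length).sub(" ", html)
```

Something like `unwrap(html, median_line_length(html))` would be the obvious way to call it. Note that the lookbehind counts every character on the line, tags included, so the right number may need some tuning per document.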