Old 12-04-2017, 09:37 PM   #1
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Writing spaces for regex

Hi

I am looking for advice on this.

\x20 Plain space

Recently a friend of mine told me it could be convenient to use \x20 in regex to refer to plain spaces.

The \x20 code is simply the plain space (U+0020) written in hexadecimal. It is more restrictive than \s, which also matches tabs, line breaks and other whitespace.
In regex mode you can use it literally in both the search and replacement fields, which means you can publish a regex exactly as you use it, whereas:
- \s does not work in the replacement field
- mimicking a space (for example with an underscore _) when you publish a regex means that you must add a warning about it
- a plain blank space can easily be overlooked, especially at the end of a regex

The \x20 code is recognized by Sigil (PCRE) and by the Calibre editor (Python) in regex mode, and you can test it on the regex101 site (https://regex101.com/) in both flavours.
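
To make the difference between \x20 and \s concrete, here is a minimal sketch using Python's standard re module. It only illustrates the Python flavour; Sigil's PCRE engine may treat \s differently, and the sample string and the helper name strip_trailing_spaces are made up for the illustration.

```python
import re

# One character of each kind: plain space, tab, no-break space, newline.
sample = "a b\tc\u00a0d\ne"

# \x20 matches the plain space U+0020 and nothing else.
plain_only = re.findall(r"\x20", sample)

# On a Python 3 str pattern, \s matches any Unicode whitespace,
# which includes the no-break space U+00A0.
any_whitespace = re.findall(r"\s", sample)

def strip_trailing_spaces(line):
    """Remove plain spaces at the end of a line (hypothetical helper).

    Writing the space as \\x20 keeps it visible in the published regex.
    """
    return re.sub(r"\x20+$", "", line)

print(plain_only)      # [' ']
print(any_whitespace)  # [' ', '\t', '\xa0', '\n']
```

Note that in this flavour \s would also catch no-break spaces, so a search on \s followed by a replacement with a plain space could remove them; \x20 avoids that.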

\xa0 No-break space

The hexadecimal \xa0 code refers to the no-break space. Sigil has a problem with it.

It works with the Calibre editor in both the search and replacement fields in regex mode. Replacing &#160; with \xa0 using the PCRE flavour also works on regex101.

Unfortunately, it becomes just a plain space when used in the replacement field of a regex in Sigil. This is quite dangerous, because a single replacement could make all your no-break spaces disappear (longtime users have been there before).
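
One way to notice this kind of loss is to count no-break spaces before and after a replacement. The following is only a sketch in Python, run outside Sigil on the text of a file; the names count_nbsp and lost_nbsp are hypothetical.

```python
import re

NBSP = "\u00a0"
# Decimal, hexadecimal and named forms of the no-break space entity.
NBSP_ENTITY = re.compile(r"&#0*160;|&#[xX]0*[aA]0;|&nbsp;")

def count_nbsp(xhtml):
    """Count no-break spaces stored as a raw character or as an entity."""
    return xhtml.count(NBSP) + len(NBSP_ENTITY.findall(xhtml))

def lost_nbsp(before, after):
    """Return how many no-break spaces vanished between two versions."""
    return max(0, count_nbsp(before) - count_nbsp(after))

before = "Louis&#160;XIV dit&#160;: oui"
after = "Louis XIV dit : oui"   # what a lossy replacement would leave behind
print(lost_nbsp(before, after))
```

A non-zero result means that the replacement turned some no-break spaces into something else.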

As the \x20 (see above) seems to work quite well, I wonder why the \xa0 does not play well with Sigil. Is this due to the infamous upstream Qt bug? Is it possible to make it work?

Last edited by roger64; 12-04-2017 at 09:47 PM. Reason: Oldest
Old 12-05-2017, 05:44 AM   #2
DiapDealer
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
The no-break space character has long been a thorn in Sigil's side. A big part of the reason why is WYSIWYG editing in Book View. When that finally changes, we'll shout it from the rooftops and have a parade--believe me. But until such time, users will just have to choose which is more important in their workflow: Sigil, or no-break space characters.
Old 12-05-2017, 08:54 AM   #3
roger64
@DiapDealer
You explained to us not long ago where the trouble comes from (emphasis mine):
Quote:
I (and I suspect Kevin, as well) would love to be able to support the use of the unicode character for the nobreak space in Sigil. The current parser (Google's Gumbo) could handle them just fine. Unfortunately, the unicode character cannot survive Qt's QTextEdit environment, which Sigil uses for Code View/Book View. They get changed to "normal" spaces. Hence the entity to keep that from happening.
At that time you spoke about "the unicode character". That is why I had hoped that using a hexadecimal code could provide a workaround for writing the no-break space with a discreet display. It could also have let users share the same character between the Calibre editor and Sigil.

Does the now-notorious QTextEdit bug really have such long arms that it extends to every representation of the no-break space?
Old 12-05-2017, 09:57 AM   #4
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
Yes, as I explained the last time this came up.

If you store any non-breaking space in a QTextEdit widget (such as Codeview) and ask for the text from the widget it will automatically convert that non-breaking space to a normal space forever losing it. That is why numeric or named entities are always used to represent the non-breaking space in xhtml code in Sigil.
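
The effect described above can be modelled in a few lines of Python. This is not Qt code: lossy_plain_text is a hypothetical stand-in that only mimics the behaviour described here (a raw U+00A0 comes back as a plain space), to show why the entity form survives and the raw character does not.

```python
NBSP = "\u00a0"

def lossy_plain_text(widget_contents):
    """Hypothetical model of reading text back from the widget:
    every raw no-break space is returned as a plain space."""
    return widget_contents.replace(NBSP, " ")

raw = "Louis\u00a0XIV"        # no-break space stored as a character
as_entity = "Louis&#160;XIV"  # no-break space stored as a numeric entity

# The raw character is gone after the round trip ...
print(NBSP in lossy_plain_text(raw))             # False
# ... while the entity is ordinary ASCII text and passes through untouched.
print(lossy_plain_text(as_entity) == as_entity)  # True
```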

Calibre has worked around this bug by subclassing the entire QTextEdit widget and overriding the routine that converts its contents to plain text. The replacement routine effectively highlights the entire file behind the scenes and copies the highlighted text out, instead of using the standard Qt function.

Sigil has no plans to use that approach. Leaving non-breaking spaces in entity form makes the most sense, since there is no visual difference between a normal space and non-breaking space in most applications (including Sigil), which leads to lots of issues down the road (especially for newbies). The use of the entity for a non-breaking space in no way prevents you from using regex in a proper way. If you decide to circumvent the entity encoding by using a regex to replace it with the actual character, you end up getting bitten by the Qt bug.

If you simply can not live with non-breaking space as an entity (numeric or named), then use an output plugin to convert them on the fly on the way out of Sigil. If you reload that file in Sigil, it will convert all non-breaking spaces to their numeric or named entity equivalent (depending on epub version).
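
The conversion step of such an output pass could look like the sketch below. It is plain Python working on a string and deliberately does not use Sigil's plugin interface, so how the text is read from and written back to the book is left out; the name entities_to_nbsp is hypothetical.

```python
import re

# Decimal, hexadecimal and named forms of the no-break space entity.
NBSP_ENTITY = re.compile(r"&#0*160;|&#[xX]0*[aA]0;|&nbsp;")

def entities_to_nbsp(xhtml):
    """Replace every no-break space entity with the raw U+00A0 character."""
    # The replacement is a real character, not a backslash escape,
    # so no escape handling is needed in the replacement string.
    return NBSP_ENTITY.sub("\u00a0", xhtml)

exported = entities_to_nbsp("<p>Louis&#160;XIV et Airbus&nbsp;320</p>")
print(exported == "<p>Louis\u00a0XIV et Airbus\u00a0320</p>")  # True
```

The result should only be written to the exported copy; as noted above, reloading it in Sigil converts the characters back to entities.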

Last edited by KevinH; 12-05-2017 at 10:03 AM.
Old 12-05-2017, 07:33 PM   #5
DiapDealer
For the record ...

Quote:
Originally Posted by roger64 View Post
At that time you spoke about "the unicode character". That is why I had hoped that using a hexadecimal code could provide a workaround for writing the no-break space with a discreet display.
There's no such thing as a no-break space "hexadecimal character." \xa0 IS the unicode character (and vice versa). It's just another way of referencing the exact same thing. Used in a regular expression, \xa0 matches the unicode no-break space character. In a regex replace expression, \xa0 inserts the unicode no-break space character. \xa0 is useful in regex mainly because one can't actually type the no-break space character with a key (though it could be copied and pasted into an expression).

Quote:
Originally Posted by roger64 View Post
Does the now-notorious QTextEdit bug really have such long arms that it extends to every representation of the no-break space?
No. Just the unicode character ... which \xa0 represents. Entity representations of the no-break space character are fine.
Old 12-05-2017, 08:17 PM   #6
roger64
@DiapDealer

Forgive my mistakes on this point, and thank you for the much-needed explanations.

Quote:
Originally Posted by KevinH View Post
.../.. there is no visual difference between a normal space and non-breaking space in most applications .../...

If you simply can not live with non-breaking space as an entity (numeric or named), then use an output plugin to convert them on the fly on the way out of Sigil. If you reload that file in Sigil, it will convert all non-breaking spaces to their numeric or named entity equivalent (depending on epub version).
Absurdistan

You said that "there is no visual difference between a normal space and non-breaking space in most applications".

It's certainly true for many applications or languages where the nobreak space is a sparsely used character. But this statement cannot apply to French language texts.

In a book that usually contains thousands of no-break spaces, this confusion causes a lot of visual defects. Many punctuation marks need a no-break space before them. If it is replaced by a plain space, you will find many of these marks at the beginning of a line, which makes for really awful reading. Not to mention the unfortunate breaks in compound names like Louis XIV or Airbus 320 (the same goes for Boeings). In a French book at least, this causes big visual differences and real typographical damage.

This situation (rendering one character as another) is all the more absurd because the narrow no-break space is handled quite normally in Sigil (it is not highlighted in Code View). I use it, and it replaces about 3/4 of my no-break spaces, which for me alleviates the current situation in that proportion. But this does not change the nature of the problem.
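
As an illustration of the kind of defect described here, the following Python sketch looks for plain spaces next to French punctuation that could end up at a line break. The set of punctuation marks and the choice of &#160; as the replacement are assumptions made for the example (some of these positions may call for a narrow no-break space instead), and the pattern is meant for running text only, since it would also match a space before a colon inside markup.

```python
import re

# A plain space before ; : ! ? » or right after « can be broken by the reader.
BREAKABLE = re.compile(r"\x20(?=[;:!?»])|(?<=«)\x20")

def find_breakable_spaces(text):
    """Return the positions of plain spaces that should be no-break spaces."""
    return [m.start() for m in BREAKABLE.finditer(text)]

def protect_spaces(text, entity="&#160;"):
    """Replace those plain spaces with a no-break space entity."""
    return BREAKABLE.sub(lambda m: entity, text)

sentence = "Vraiment ? Oui : « bien » !"
print(find_breakable_spaces(sentence))
print(protect_spaces(sentence))
```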

Finding a way out: plugin or option?

I currently use a regex to reprocess files coming out of Sigil in this regard. It is manual, though, and sometimes I forget to run it.

I had not thought about the idea of a plugin (see above). It would need to be coupled with the export or save-as functions, as well as with the open or import functions, to convert in both directions between the &#160; that Sigil uses for its internal processing and something else (the unicode character or its hexadecimal escape). It would need to be automatic, because forgetting to use it even once could wreak havoc on all no-break characters (as you say, "get bitten by the Qt bug").

As you explained, Kovid Goyal took an impressive approach to circumvent this bug. As this situation may last some more years, could Sigil, if not following the same path, try to do something about it too? Let me remind you that Sigil already uses a de facto plugin of sorts that automatically converts no-break unicode characters on the way in and out of Sigil...

As this plugin would need to be very² tightly linked with Sigil, maybe adding a different import/export option to mainline Sigil (could we name it the "French" option?) would be the best solution. I certainly could send you a bottle of Champagne for the opening ceremony...

² =very very

Last edited by roger64; 12-05-2017 at 08:24 PM.
Old 12-05-2017, 08:47 PM   #7
KevinH
I have *no* plans to implement the workaround approach done by Kovid any time soon.

When parsed by an html engine all entities (numeric and named) are converted to their character (byte sequence) equivalent during display parsing, so using entities changes nothing for the person viewing the ebook.

FYI: Sigil's gumbo parser handles that conversion from nbsp entity to byte sequence and back and not a plugin.

That only leaves the editing of html. If you do not want to deal with entities during editing, simply choose *any* other character from the huge unicode set that is not being otherwise used and replace all non-breaking spaces with that character until you are done editing the ebook. Then substitute them back to non-breaking space entities right before saving.

Both of these can be done easily with regular expressions and can be stored as searches or clips you can easily invoke.
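
The two substitutions can be sketched in Python as follows. In Sigil they would be saved searches rather than a script; the placeholder character ¤ and the function names are arbitrary choices for the illustration, not something Sigil prescribes.

```python
import re

# Decimal, hexadecimal and named forms of the no-break space entity.
NBSP_ENTITY = re.compile(r"&#0*160;|&#[xX]0*[aA]0;|&nbsp;")
PLACEHOLDER = "\u00a4"  # the currency sign, assumed to be unused in the book

def to_placeholder(xhtml, placeholder=PLACEHOLDER):
    """Before editing: swap no-break space entities for a visible placeholder."""
    if placeholder in xhtml:
        raise ValueError("placeholder already occurs in the text")
    return NBSP_ENTITY.sub(lambda m: placeholder, xhtml)

def from_placeholder(xhtml, placeholder=PLACEHOLDER, entity="&#160;"):
    """Before saving: turn every placeholder back into the entity."""
    return xhtml.replace(placeholder, entity)

source = "Louis&#160;XIV et Airbus&nbsp;320"
editing = to_placeholder(source)
restored = from_placeholder(editing)
print(restored)  # Louis&#160;XIV et Airbus&#160;320
```

The round trip normalises every entity form to the one passed as entity, and the check in to_placeholder guards against a placeholder that is already in use.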

Sorry, not being able to tell the difference between non-breaking spaces and normal spaces when editing an ebook is a real problem no matter the language and using a placeholder of some sort (entity or some other character) is the way to make things work.

Last edited by KevinH; 12-05-2017 at 08:49 PM.
Old 12-05-2017, 09:05 PM   #8
roger64
OK.

I tried; we will go on living with this thorn in our side.

I did not know that Sigil's gumbo parser was what I called the "de facto plugin of sorts". It seems that it works one way only.

Sorry for having pestered you -again- with this "historical" question and thank you for your answers.

Last edited by roger64; 12-05-2017 at 09:10 PM. Reason: Gumbo
Old 12-05-2017, 09:16 PM   #9
KevinH
FYI, any unique placeholder will work. There is even a non-breaking Figure space character in unicode (size of a numeric digit). If your text does not use the unicode word joiner character, you could use that as a placeholder as well. Or simply use a simple but easy to type two character sequence as long as it is unique. That is the whole idea behind the concept of a placeholder/entity.
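
Whether a placeholder is really unique can be checked before using it. A minimal Python sketch, assuming the candidates mentioned above plus a made-up two-character sequence; pick_placeholder is a hypothetical name.

```python
CANDIDATES = [
    "\u2007",  # FIGURE SPACE
    "\u2060",  # WORD JOINER
    "@@",      # a hypothetical, easy-to-type two-character sequence
]

def pick_placeholder(text, candidates=CANDIDATES):
    """Return the first candidate that does not occur anywhere in the text."""
    for candidate in candidates:
        if candidate not in text:
            return candidate
    raise ValueError("no unused placeholder among the candidates")

print(repr(pick_placeholder("Louis&#160;XIV")))  # '\u2007'
```

The check has to run over every xhtml file in the book, not just the current one, for the placeholder to be safe.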

Last edited by KevinH; 12-05-2017 at 09:20 PM.
Old 12-05-2017, 09:21 PM   #10
KevinH
Actually gumbo is not plugin related at all. You can control the mending / cleaning of text to set it on file open or save or both; or manually run mend on all xhtml files or just the current one. It handles just the non-breaking space to entity conversion. All other entities are controlled by the preserve entities preference settings.

So changing it to *not* mend on save and then running a regular expression search and replace of your placeholder to a real non-breaking space should work.

Last edited by KevinH; 12-05-2017 at 09:23 PM.
Old 12-05-2017, 09:26 PM   #11
DiapDealer
In all fairness: it's not really even a Sigil issue. Sigil works as advertised. Non-breaking space functionality is accommodated. No unicode non-breaking space characters will be lost. They're just converted to entities.

It's not really fair to expect us to overhaul Sigil's codebase merely because you want it to be easier to bounce back and forth between calibre's editor and Sigil when editing your epubs. There are differences between the two programs. Utterly seamless integration between Sigil and calibre is just not one of our priorities (not that we seek to create stumbling blocks either).

Last edited by DiapDealer; 12-05-2017 at 09:30 PM.