Old 12-25-2015, 07:19 AM   #31
KevinH
Sigil Developer
Posts: 8,788
Karma: 6000000
Join Date: Nov 2009
Device: many
Please verify that the words sorted before the first A entry did not include any leading whitespace including nbsp, thin spaces, etc.
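A check like this can be scripted rather than done by eye. The following is only a sketch: the helper name `has_leading_invisible` and the sample entries are made up for illustration, and it assumes the entries are already available as Python strings.

```python
# Characters that are invisible but are not reported by str.isspace().
EXTRA_INVISIBLE = {"\u200b", "\ufeff"}  # zero-width space, byte order mark


def has_leading_invisible(entry):
    """Return True if the entry starts with whitespace or an invisible character."""
    if not entry:
        return False
    first = entry[0]
    # str.isspace() is True for tab, NBSP (U+00A0), thin space (U+2009), etc.
    return first.isspace() or first in EXTRA_INVISIBLE


# Hypothetical entries: a clean one, then NBSP, thin space and tab prefixes.
entries = ["agenda", "\u00a0Acton", "\u2009agrafe", "\tAgnelli"]
suspects = [e for e in entries if has_leading_invisible(e)]
print([repr(e) for e in suspects])
```

Printing with `repr()` makes the otherwise invisible prefix show up as an escape sequence.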
Old 12-25-2015, 08:56 AM   #32
Doitsu
Grand Sorcerer
Posts: 5,731
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by roger64 View Post
2. Letters with diacritics classed at the end of the alphabetical order.
That is the default sort order when words are sorted by character code, since the character code for â (226/00E2) is higher than the character code for a (97/0061). Most likely the index generation code doesn't do locale-specific sorting.
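A quick way to see the effect is Python's built-in sort, which also compares strings by code point. This is only an illustration with a made-up word list, not Sigil's code:

```python
# Plain sorted() compares strings code point by code point.
words = ["zone", "âge", "agenda", "agrafe"]

print(ord("a"), ord("â"))  # 97 and 226

by_code_point = sorted(words)
print(by_code_point)  # 'âge' ends up after 'zone'
```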

@KevinH: Does the Sigil index generation code use built-in C++ sorting functions that allow you to specify a locale for sorting? If so, would it be possible to use the language defined in the epub metadata as the locale?

@roger64: As a work-around you could add the unaccented version of the index entry in the index entries field. For example:

Code:
Text to include Index entries
âge             age
Of course, you'd have to fix the spelling of the index entry in the generated index afterwards.

BTW, there's a Python package that'll automatically transform accented characters to unaccented characters: Unidecode. (IIRC, this package is also used by Calibre for transliterating non-Latin alphabets.)

Since all index entries are stored in a text file (sigil_index.ini), you might be able to write a simple Python script that'll add the unaccented version as the second entry.
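As a rough sketch of the idea, the following builds a tab-separated "text, unaccented entry" line for each word. It uses the standard library's `unicodedata` instead of Unidecode, so it only strips combining accents (ligatures such as œ are left alone). The helper names are hypothetical, and it works on a plain word list rather than on sigil_index.ini itself.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (â -> a, é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def with_sort_entry(text_to_include):
    """Return a 'text<TAB>unaccented entry' line for a tab-separated word list."""
    return text_to_include + "\t" + strip_accents(text_to_include)


for word in ["âge", "été", "agenda"]:
    print(with_sort_entry(word))
```

The unaccented second column is only there to drive the sort order; the accented spelling still has to be restored in the generated index.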

This might also be a good first Sigil plugin project. For example, you could access sigil_index.ini and display all index entries from a Sigil plugin as follows:

Spoiler:

Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, division, absolute_import, print_function
import os, sys

PY2 = sys.version_info[0] == 2

if PY2:
    import ConfigParser as configparser
else:
    import configparser

# main routine
def run(bk):
    # get sigil_index.ini path
    index_ini = os.path.abspath(os.path.join(bk._w.usrsupdir, 'sigil_index.ini'))
    print('sigil_index.ini path:', index_ini)
    
    # read values 
    config = configparser.ConfigParser(allow_no_value = True)
    config.read(index_ini)
    number_of_entries = config.getint('index_entries', 'size')
    
    # print entries
    for index_entry in range(1, number_of_entries + 1):
        # option names look like '1\Text%20to%20Include'; escape the backslash explicitly
        key = str(index_entry) + '\\Text%20to%20Include'
        if PY2:
            entry = unicode(config.get('index_entries', key), 'unicode-escape')
        else:
            entry = bytes(config.get('index_entries', key), "utf-8").decode("unicode_escape")
        print(entry)

def main():
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())
Old 12-25-2015, 09:28 AM   #33
KevinH
Sigil Developer
Posts: 8,788
Karma: 6000000
Join Date: Nov 2009
Device: many
Doitsu,
The complete entries are never actually sorted. Entries are built "sorted" by being inserted in order into the IndexEditorModel by QString comparison (so by unicode character value).

Any fixups need to be done inside the IndexEditor by the user before the Index itself is generated.
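For readers who want to see what "built sorted by insertion" means in practice, here is a Python analogue. It is not Sigil's C++ code; `insert_in_order` is a made-up name, and Python's default string comparison stands in for the QString comparison.

```python
import bisect


def insert_in_order(entries, new_entry):
    """Insert new_entry so that the list stays ordered by code point."""
    bisect.insort(entries, new_entry)


index = []
for entry in ["agrafe", "âge", "Acton", "agenda"]:
    insert_in_order(index, entry)

print(index)  # the accented entry lands after every unaccented one
```

No separate sort step ever runs, so there is no single place where a locale could be plugged in without changing the comparison itself.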
Old 12-25-2015, 07:45 PM   #34
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Quote:
Originally Posted by KevinH View Post
Please verify that the words sorted before the first A entry did not include any leading whitespace including nbsp, thin spaces, etc.
I did check that before generating the index. You can see in the attached petiteliste.txt that the misplaced entries have no leading whitespace, NNBSP or otherwise. I had a look at the index editor window and saw that the file had been imported correctly. Then I saved and closed it before creating the index.

The entry names were scrambled during the index generation process. As this seems to affect 5 to 10% of the entries, there is no way I could consider reordering them manually.

Hopefully it would be possible to find a way to sort the entry names again once the index file has been processed (this time taking into account the locale specs and the above defect?)

As for writing a plugin, sorry but this is way beyond my technical knowledge.

Last edited by roger64; 12-25-2015 at 07:55 PM.
Old 12-25-2015, 09:20 PM   #35
KevinH
Sigil Developer
Posts: 8,788
Karma: 6000000
Join Date: Nov 2009
Device: many
roger64,
As I cannot recreate the out-of-order "before A" part at my end, will you please create a small test epub that recreates this issue, so that I can see what might be causing it? It could be caused by whitespace or newlines being captured as part of a pattern, or it could be inherent in the layout of the tags or spans involved.

I noticed that each entry generated before the A has a / in your wordlist, as opposed to a |.
How or why are you using the / in that way?

The first part before the / is supposed to be the actual category name, while the part after the / is supposed to be the entry name (which can be left blank if needed, and then the entry itself should be used).

As far as I can tell from your sample you are using it backwards (at least I think so). I almost never use indexing, so maybe I am the one backwards here. But that is my reading from the online help Doitsu pointed us at.
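To make that reading concrete, here is a small sketch of how a "category/name" entry would be split under it. This has not been checked against Sigil's source; the function name and the fallback rules are assumptions taken only from the description above.

```python
def split_index_entry(spec, text):
    """Split 'category/name' into (category, name).

    Assumed rules: no '/' means no category; a blank name falls back to
    the indexed text itself.
    """
    if "/" not in spec:
        return (None, spec or text)
    category, _, name = spec.partition("/")
    return (category.strip(), name.strip() or text)


print(split_index_entry("ages/âges de la vie", "âge"))
print(split_index_entry("Acton/", "Acton Harold"))
print(split_index_entry("", "agenda"))
```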

Last edited by KevinH; 12-25-2015 at 09:42 PM.
Old 12-25-2015, 10:55 PM   #36
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Thanks for taking care of this.

I will build one complete test EPUB around this list and also the text list to use.

I think the obvious solution, if we wish to take into account some locale specs without too much trouble, is to let the user provide an already sorted text list of entries, including delineations for the index-new-letters to be used.
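As a sketch of what that could look like, the snippet below takes a list the user has already sorted and only derives the letter headings from it, without re-sorting. The function names and the sample list are hypothetical; the heading letter is taken from the unaccented first character.

```python
import itertools
import unicodedata


def initial_letter(entry):
    """Return the upper-case base letter of the first character (â -> A)."""
    first = unicodedata.normalize("NFD", entry)[0]
    return first.upper()


def group_by_initial(sorted_entries):
    """Group consecutive entries under their initial letter, keeping the given order."""
    return [
        (letter, list(group))
        for letter, group in itertools.groupby(sorted_entries, key=initial_letter)
    ]


# Already in the order the user wants (here: French-style).
presorted = ["Acton", "âge", "agenda", "zone"]
for letter, group in group_by_initial(presorted):
    print(letter, group)
```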
Old 12-26-2015, 03:01 AM   #37
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Here it is.

This is what I did: I wanted to provide a real book and not a two-page test book. So I selected one of the EPUB2 books I produced this year (in French), inserted the words or expressions from petiteliste.txt without any regard for meaning, and saved. So I apologize to those who wish to read it: they will face some comprehension problems...

Then, opening Sigil 9.2 (Archlinux x64 build), I inserted the text file in the window of the index editor, saved and closed it. Then I generated the index.

Result: the same phenomenon occurred: this time I had two entries (instead of three) in the beginning before the a, and the French diacritic was placed at the end, so the two main problems are confirmed with this Sigil build.
Attached Files
File Type: epub L'an 330 de la Republique - Maurice Spronck.epub (281.2 KB, 169 views)
File Type: txt petiteliste.txt (593 Bytes, 119 views)
Old 12-26-2015, 03:31 AM   #38
Doitsu
Grand Sorcerer
Posts: 5,731
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by roger64 View Post
Result: the same phenomenon occurred: this time I had two entries (instead of three) in the beginning before the a, and the French diacritic was placed at the end, so the two main problems are confirmed with this Sigil build.
The problem with out-of-order index entries appears to have been caused by two tab characters in a row in the petiteliste.txt index file. Here's what it looks like after the import on a Windows machine:
[Attached thumbnail: index.png]
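Doubled tabs are hard to spot in most editors, so a small check before importing may help. This is a sketch only: `find_double_tabs` is a made-up helper and the sample lines are invented, assuming one tab-separated entry per line.

```python
def find_double_tabs(lines):
    """Return (line_number, line) pairs that contain two tabs in a row."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\r\n")
        if "\t\t" in stripped:
            problems.append((number, stripped))
    return problems


sample = ["âge\tages/âges de la vie", "agenda\t\tagenda", "agrafe\tagrafe"]
for number, line in find_double_tabs(sample):
    # Make the tabs visible in the report.
    print(number, line.replace("\t", "<TAB>"))
```

For a real file, the lines could come from `open(path, encoding="utf-8")`, assuming the word list is saved as UTF-8.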
Old 12-26-2015, 03:43 AM   #39
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
@Doitsu

So this is my mistake and for this please accept my apologies. I just hope this mistake will help other people not to repeat it.

I had checked many times the left column and saw nothing suspicious. I also thought I had inserted only one tab of fixed length per line. This is another -easy- practical tip to remember. There should be a way to colour the tabs in my text Editor.

So there is only the locale question to solve.

Last edited by roger64; 12-26-2015 at 03:50 AM.
Old 12-26-2015, 03:54 AM   #40
Jellby
frumious Bandersnatch
Posts: 7,550
Karma: 19500001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by roger64 View Post
I think the obvious solution, if we wish to take into account some locale specs without too much trouble, is to let the user provide an already sorted text list of entries, including delineations for the index-new-letters to be used.
I always wonder, on the fundamental side of the problem (i.e., about the ordering itself, not about how to achieve it), what is the best way of sorting multilingual entries.

For example, in Spanish the order is similar to French, with all accented versions of vowels being sorted as if they were the plain vowel (but "ñ" is a separate letter, sorted after "n"). In Swedish the letters "å", "ä", "ö" are separate and sorted at the end, after "z". If I have a list of words in English, Spanish and Swedish, how should I sort "nino", "niña", "ninå"? An English reader would expect "ninå"/"niña", "nino"; a Spanish speaker would expect "ninå", "nino", "niña"; a Swedish speaker would expect "niña", "nino", "ninå".
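The three expectations can be reproduced with a toy sort key. This is a deliberately simplified sketch, not a real collation algorithm: the `ALPHABETS` table and `sort_key` are invented for the example, each language gets a fixed alphabet, and any letter outside it is folded to its unaccented base letter.

```python
import unicodedata

# Simplified per-language alphabets (assumption: lower-case letters only).
ALPHABETS = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "es": "abcdefghijklmnñopqrstuvwxyz",
    "sv": "abcdefghijklmnopqrstuvwxyzåäö",
}


def strip_accents(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(word, lang):
    """Map each letter to its position in the language's alphabet."""
    alphabet = ALPHABETS[lang]
    key = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch not in alphabet:
            ch = strip_accents(ch)  # a foreign letter sorts as its base letter
        key.append(alphabet.index(ch) if ch in alphabet else len(alphabet))
    return key


words = ["nino", "niña", "ninå"]
for lang in ("en", "es", "sv"):
    print(lang, sorted(words, key=lambda w: sort_key(w, lang)))
```

For English the two accented words tie, so their relative order just follows the input order.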
Old 12-26-2015, 04:18 AM   #41
Doitsu
Grand Sorcerer
Posts: 5,731
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by Jellby View Post
I always wonder, on the fundamental side of the problem (i.e., about the ordering itself, not about how to achieve it), what is the best way of sorting multilingual entries.
AFAIK, C++ has locale-aware sort functions, e.g. collate, which produces the following sort orders, depending on the locale:

Code:
Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
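The same comparison can be tried from Python, which exposes the C library's collation through the `locale` module. This is a sketch under assumptions: the locale name `sv_SE.UTF-8` is a guess that only works where that locale is installed (names differ on Windows), so the helper returns `None` when it is unavailable. The plain `sorted()` call reproduces the "default" line above.

```python
import locale

WORDS = ["zebra", "Zebra", "ängel", "ar", "år", "förnamn", "ögrupp"]


def sort_with_locale(words, locale_name):
    """Sort using the platform's collation rules; None if the locale is missing."""
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


print(sorted(WORDS))                           # code point order
print(sort_with_locale(WORDS, "sv_SE.UTF-8"))  # platform dependent
```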
However, as KevinH has said, Sigil doesn't do any sorting, but relies on QString comparison, which is a locale-unaware Qt library function. I.e., the only way to get the desired sort order would be to add the unaccented word in the second column.

For example, if you change the last entry to:

Code:
âge    ages/âges de la vie
It'll be listed under A:

Code:
Acton
    Acton Harold 1
agenda 1
ages
    âges de la vie 1
Agnelli
    Agnelli Gianni 1
agrafe 1
Obviously, as a post-editing step, you'd have to change "ages" back to "âges". This could be handled by a Python script, either as a Sigil plugin, as suggested in this post, or a stand-alone script.
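A stand-alone version of that post-editing step might look like the sketch below. It treats the generated index as plain text and swaps whole-word sort keys for their accented spellings. `restore_accents` and the mapping are hypothetical, and a real script would have to work on the generated index file in the epub.

```python
import re


def restore_accents(index_text, replacements):
    """Replace whole-word sort keys with their accented spellings."""
    for plain, accented in replacements.items():
        # Match the key only when it is not part of a longer word.
        pattern = r"(?<!\w)" + re.escape(plain) + r"(?!\w)"
        index_text = re.sub(pattern, lambda match: accented, index_text)
    return index_text


generated = "agenda 1\nages\n    âges de la vie 1\nAgnelli\n"
print(restore_accents(generated, {"ages": "âges"}))
```

Note that this replaces every whole-word occurrence of the key, so a key that is also a legitimate word elsewhere in the index would need a narrower match.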
Old 12-26-2015, 04:26 AM   #42
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)


Not taking into account bilingual texts which probably require two separate indexes, usually a book is published in one main language, and I think the reader expects that the index will follow its rules, even if there are some foreign words.

But also, there could be, by courtesy, some hypertext refinements: The same word could be placed in several places.
Old 12-26-2015, 04:40 AM   #43
Doitsu
Grand Sorcerer
Posts: 5,731
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by roger64 View Post
Not taking into account bilingual texts which probably require two separate indexes, usually a book is published in one main language, and I think the reader expects that the index will follow its rules, even if there are some foreign words.
Ideally, sorting should be locale-aware, however, since this would require major code changes, I don't think that this is going to happen, especially since you're the only user who's asked for it.

Quote:
Originally Posted by roger64 View Post
But also, there could be, by courtesy, some hypertext refinements: The same word could be placed in several places.
What you call "courtesy" actually means a lot of work for developers. However, thanks to the very user-friendly Sigil plugin framework, you could easily add your own custom indexing plugin with all the features that you require.
Old 12-26-2015, 08:27 AM   #44
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
@Doitsu

I surely would appreciate a locale-aware index order since, right now, I cannot see myself publishing a French book index with accented words placed at the end (à, â, é, è, ê, ô, ç, etc. really?) and I can't think of any French publisher who would accept to do it.

I will study your proposal and will need some time to test it.

As for the "refinements", yes this is extra-work for everybody not only the developers. The foreign words would need to be tagged with their own language which is rarely done. For a few words, this can be done manually. For me, I really don't push for it.

Last edited by roger64; 12-26-2015 at 09:05 AM.
Old 12-26-2015, 09:17 AM   #45
roger64
Wizard
Posts: 2,625
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
Quote:
Originally Posted by KevinH View Post
Doitsu,
The complete entries are never actually sorted. Entries are built "sorted" by being inserted in order into the IndexEditorModel by QString comparison (so by unicode character value).

Any fixups need to be done inside the IndexEditor by the user before the Index itself is generated.
I'm afraid it's worse than that: there is no fixup. I did insert in the IndexEditor a big "user fixup", which was a complete alphabetic list sorted according to the French alphabetic order but this order was not taken into account since the accented letters were taken out of the list and put at the end.

Last edited by roger64; 12-26-2015 at 09:44 AM. Reason: accented