Is there a way to have index case insensitive ? We are migrating a
customer from D3 to Qm and the case sensitivity of indexes in QM is
really annoying for our customer. Is there any option for that ?
> Is there a way to have index case insensitive ? We are migrating a
> customer from D3 to Qm and the case sensitivity of indexes in QM is
> really annoying for our customer. Is there any option for that ?
Beware! A case sensitive index cannot be used by a query that uses case insensitive relational operators. Similarly, a case insensitive index cannot be used by a query that uses case sensitive relational operators.
This is because the sort order of the index tree will be different.
We have yet to come up with a way to create a dual mode index.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
> Beware! A case sensitive index cannot be used by a query that uses case > insensitive relational operators. Similarly, a case insensitive index cannot > be used by a query that uses case sensitive relational operators.
> What are case sensitive or insensitive operator ?
The default behaviour of QM is that query processor operators such as =, >, <, etc are case sensitive when working on character data. Thus if field X contains 'Martin' WITH X = 'Martin' will include the record but WITH X = 'MARTIN' will not.
You can enable case insensitive query processor select clause operations by using the OPTION QUERY.NO.CASE command. Now WITH X = 'Martin' and WITH X = 'MARTIN' will both include this record (as will the other 62 combinations of casing). Conversely (and more usefully, either of these selection clause elements will match any casing of the stored data.
Alternatively, I can leave the OPTION unset and follow the operator by NO.CASE WITH X = NO.CASE 'Martin' All of the query processor operators support this keyword... WITH X LIKE NO.CASE 'MAR...' WITH X NE NO.CASE 'Martin' etc, etc.
The fun comes when you use GT and LT (etc). If I write WITH X >= 'M' AND X < 'N' in a case sensitive query, I will get anything that begins with M. But what about WITH X >= 'M' AND X < 'n' (still case sensitive)? The sequence of the ASCII character set will result in this finding anything in which field X begins with M, N, O, P, ...... Z, [, \, ], .... a, b, c, ...., n.
If I do the same as a case insensitve query, just what do I expect it to find? There are probably many good arguable cases here for different results. What we have done in QM is to say that a case insensiive query effectively translates both items being compared to uppercase and then compares them. My query now returns anything that begins with M or m.
The case insensitivity introduced in UniVerse release 10.2 was, in my personal opinion, useless. It converted the string in the command to uppercase and compared this against the stored data. Thus WITH X = 'MARTIN' would find my record but WITH X = 'martin' or even WITH X = 'Martin' (which is what is actually stored) would not.
I like to think that we did a better job.
Going back to indices, a case sensitive index tree might store Lucy, MICHAEL, Mick, Peter, martin but the same data stored in a case insensitive index would be LUCY, MARTIN, MICHAEL, MICK, PETER
A query using a > or < type "range" operator would dive into the index at the first valid point and then walk the index to the last valid point. This will give different results depending on the case sensitivity. To resolve a selection condition where the index casing doesn't match the query casing may require the entire index to be scanned and hence loses most of the advantage of having an index.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
> > What are case sensitive or insensitive operator ?
> The default behaviour of QM is that query processor operators such as =, >,
> <, etc are case sensitive when working on character data. Thus if field X
> contains 'Martin'
> WITH X = 'Martin'
> will include the record but
> WITH X = 'MARTIN'
> will not.
> A query using a > or < type "range" operator would dive into the index at
> the first valid point and then walk the index to the last valid point. This
> will give different results depending on the case sensitivity. To resolve a
> selection condition where the index casing doesn't match the query casing
> may require the entire index to be scanned and hence loses most of the
> advantage of having an index.
Thank you very much for this detailed explanation. You should put it
somewhere in the documentation, linked to the NO.CASE instruction...
It's very valuable.
But now, I think we have a problem with our french accent. Here is
what we found :
* We set the OPTION QUERY.NO.CASE in MASTER.LOGIN
* We create our index using NO.CASE
* In the index, Gérard seems to be with located with the no accent
"e", so near Gerard
* If we look for GIR in all the indexes Gérard will be > GIR so we do
not go the the GI section.
* If we look for Gé, it will go through all the "G"
So, it seems like there is an inconsistency between where the indexes
locate the é and how the operator interprets it. Is it possible ?
> Thank you very much for this detailed explanation. You should > put it somewhere in the documentation, linked to the NO.CASE > instruction...
I will make a note to do this.
> But now, I think we have a problem with our french accent. Here > is what we found :
* We set the OPTION QUERY.NO.CASE in MASTER.LOGIN * We create our index using NO.CASE * In the index, Gérard seems to be with located with the no accent "e", so near Gerard * If we look for GIR in all the indexes Gérard will be > GIR so we do not go the the GI section. * If we look for Gé, it will go through all the "G"
This seems inconsistent with what I would expect. According to my documentation (which may be wrong), é is character 130. It therefore comes after all the other letters.
Can you please confirm the character value that you expect. I can then set up a test.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
> * We set the OPTION QUERY.NO.CASE in MASTER.LOGIN
> * We create our index using NO.CASE
> * In the index, Gérard seems to be with located with the no accent
> "e", so near Gerard
> * If we look for GIR in all the indexes Gérard will be > GIR so we do
> not go the the GI section.
> * If we look for Gé, it will go through all the "G"
> This seems inconsistent with what I would expect. According to my
> documentation (which may be wrong), é is character 130. It therefore comes
> after all the other letters.
> Can you please confirm the character value that you expect. I can then set
> up a test.
In fact, index positioning and operator are inconsistent in case
insensitive mode. In my test, index places Gérard with Gerard (so with
the "e"). But operator interpret Gerard as superior to Girard. So if I
navigate in the index from G looking for Girard it will not go through
and the operator will stop when encountering Gérard.
Do you understand my point ?
I'd like that operator and index works in the same manner. Either that
Gé is not greater than Gf, either in the index Gé is not located in
the same place as Ge.
Right now, in french CASE INSENSITIVE indexes are not usable.
On Jul 6, 2:05 pm, Cedric <cfonta...@spidmail.net> wrote:
> I'd like that operator and index works in the same manner. Either that
> Gé is not greater than Gf, either in the index Gé is not located in
> the same place as Ge.
> Right now, in french CASE INSENSITIVE indexes are not usable.
Sorting for different languages is quite a complex task. While
specific to Unicode strings, the introductory section of the following
article illustrates why sort is difficult:
As Kevin said, sorting for different languages is quite a complex task.
"Foreign" character set support has been on our wish list for a long time and still has not made it to the top as more urgent developments keep coming along. There are actually two parts to this development; alternative sorting sequences and multi-byte character sets.
Multi-byte character support is a major task as it has massive internal impact and also affects how some of the QMBasic operators and functions work. For example, substring extraction/assignment now has to work on characters not bytes though there will always be cases where the program actually still needs it to work in bytes. Multi-byte character support will happen but we are not in a position to give a date.
Alternative sorting sequences are not so hard. Much of the code is already in the master source. There is an interesting problem, however, that data already in the database in sorted order (e.g. use of LOCATE with the BY clause) may not work correctly when the alternative sort order is introduced. Any user choosing to activate this feature will have to check whether it has implications on their existing data.
A related development that is also already largely in the master source and is very relevant to this discussion is case conversion. We need to maintain a table of the corresponding characters for use by internal case conversion operations and user visible features such as UPCASE() and DOWNCASE().
Both sorting sequence and case conversion would be handled by tables provided/maintained by the user. It is reasonable to assume that we will ship some standard tables for common language variants.
Of course, full multi-national support goes much further than this. There are issues of national conventions for representation of dates, times, currencies, etc. There are text direction issues. Etc, etc......
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
If you are thinking of sticking to UTF8 due to its apparent magical backward
compatibility I have a UTF16 pitch for you.
Jbase and many database environments use UTF8 to support Unicode internally
but I believe they suffer from lack of binary transparency and fast string
character indexing. If you can step up to wide characters you will get a lot
of benefit.
If you allow illegal "surrogate pairs" in UTF16, as many do, it can handle
all binary characters whereas UTF8 has many illegal characters and sequences
which prevent you from handling binary transparently.
If you treat "surrogate pairs" as two characters in UTF16, as many do, then
you can continue to index into text by byte offset ie FAST. Indexing into
UTF8 text can be disastrously slow in most/many cases and consequently there
is a huge pressure to recode old apps to use character iterators instead of
numerical indexes.
Most important Unicode programming environments are UTF16 based
ICU/JAVA/QT/WINDOWS. gcc4 supports 2 byte wide characters.
SAP recently got a lot of flack for converting all its software to UTF32.
Regardless of whether memory is plentiful or not, cpu throughput looks to be
highly dependent on memory throughput for the foreseeable future and UTF32
requires approx 2 or 3 times the memory throughput therefore it will be
continually slow compared to UTF8/16.
QT calls whatever native operating system routines are available to do its
collation and casing. The alternative is to depend on the ubiquitous ICU
library.
> -----Original Message-----
> From: OpenQM@googlegroups.com [mailto:OpenQM@googlegroups.com] On
> Behalf Of Martin Phillips
> Sent: 07 July 2009 12:22
> To: OpenQM@googlegroups.com
> Subject: Re: Index no case sensitive
> Hi Cedric,
> As Kevin said, sorting for different languages is quite a complex task.
> "Foreign" character set support has been on our wish list for a long
> time
> and still has not made it to the top as more urgent developments keep
> coming
> along. There are actually two parts to this development; alternative
> sorting
> sequences and multi-byte character sets.
> Multi-byte character support is a major task as it has massive internal
> impact and also affects how some of the QMBasic operators and functions
> work. For example, substring extraction/assignment now has to work on
> characters not bytes though there will always be cases where the
> program
> actually still needs it to work in bytes. Multi-byte character support
> will
> happen but we are not in a position to give a date.
> Alternative sorting sequences are not so hard. Much of the code is
> already
> in the master source. There is an interesting problem, however, that
> data
> already in the database in sorted order (e.g. use of LOCATE with the BY
> clause) may not work correctly when the alternative sort order is
> introduced. Any user choosing to activate this feature will have to
> check
> whether it has implications on their existing data.
> A related development that is also already largely in the master source
> and
> is very relevant to this discussion is case conversion. We need to
> maintain
> a table of the corresponding characters for use by internal case
> conversion
> operations and user visible features such as UPCASE() and DOWNCASE().
> Both sorting sequence and case conversion would be handled by tables
> provided/maintained by the user. It is reasonable to assume that we
> will
> ship some standard tables for common language variants.
> Of course, full multi-national support goes much further than this.
> There
> are issues of national conventions for representation of dates, times,
> currencies, etc. There are text direction issues. Etc, etc......
> Martin Phillips
> Ladybridge Systems Ltd
> 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB
> +44-(0)1604-709200
> If you are thinking of sticking to UTF8 due to its apparent magical > backward > compatibility I have a UTF16 pitch for you.
Steve and I have had lots of discussions about Unicode, and Steve knows his stuff. He works with it everyday.
However, it looks like the problem for Cedric is being forgotten. Namely, the indexing routine uses a different sorting sequence from what is used by the comparison operators. Surely, this can be fixed without implementing Unicode?
> Steve and I have had lots of discussions about Unicode, and > Steve knows his stuff. He works with it everyday.
Almost certainly, when we do multi-byte character set support, it will be by using UTF-8.
> However, it looks like the problem for Cedric is being forgotten. > Namely, the indexing routine uses a different sorting sequence from > what is used by the comparison operators. Surely, this can be fixed > without implementing Unicode?
This is under investigation. Internally, all sorting should be based on simple unsigned 8-bit characters in ASCII order. There are no magic translations for "foreign" characters.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
We have found a bug that causes case insenstive indices to be built using an incorrect sort sequence if there are characters from the upper half of the ASCII character set. This will be fixed for 2.9-5. We should be able to release this later this week or very early next week. Please let me know if you need a more urgent pre-build.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
> Internally, all sorting should be based on simple > unsigned 8-bit characters in ASCII order. There are no > magic translations for "foreign" characters.
Some people may be interested to know that D3 allows for customization of the sorting order. This enhancement was added in the 90's to allow for foreign languages (without going so far as to support Unicode) and in response to requests for control over whether numbers come before letters, null sorts to the top or bottom, or where alphanumerics fit.
Tony Gravagno Nebula Research and Development remove.pleaseNebula-RnD.com/blog Visit PickWiki.com! Contribute!
Your description of what D3 does sounds very similar to how our prototyped system works. The user creates a simple record that lists the characters in the desired sort order and also shows the mapping to be used for upcase() and downcase(). We read this when the collation to be used is selected and hey presto everything works in the new character set with us applying the user defined sort order and casing rules.
Indices have to store the collation sequence internally as this could possibly be different on each index (hopefully not!).
Things like LOCATE, COMPARE(), SORT, etc all pick up the user defined rules.
Perhaps we need to raise the priority of this development.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
You are obviously well down the track with this and are obviously more
aware of "the issues", but if you COULD/DID support UTF-16 as Steve is
suggesting, could help you totally avoid re-visiting this into the
future, and does "play well" for interfacing to other environments.
As usual, all comes down to equations involving time, money, effort &
market
The sort translation table should work well for the other issues
raised
On Jul 9, 1:07 am, "Martin Phillips" <martinphill...@ladybridge.com>
wrote:
At the time when we looked at this previously, UTF-8 was the recommended solution. Perhaps things have changed. We will review this decision when the project is reawakened.
Martin Phillips Ladybridge Systems Ltd 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB +44-(0)1604-709200
2009/7/10 Martin Phillips <martinphill...@ladybridge.com>
> Hi all,
> At the time when we looked at this previously, UTF-8 was the recommended
> solution. Perhaps things have changed.
There is really no simple recommended solution. Each of the encoding
schemes have their benefits and disadvantages. You have to decide what you
are going to use where. Within Anji (the original Java version) we used
UTF-32 for memory strings, and UTF-8 in the database. That solved our
immediate problem (use of Kanji), but had numerous drawbacks (poor memory
efficency and sorting).
> We will review this decision when the
> project is reawakened.
Good plan. It's a good idea to thrash it out with this list before you
commit to a strategy.
> Martin Phillips
> Ladybridge Systems Ltd
> 17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB
> +44-(0)1604-709200