The Zorba XQuery engine implements the XQuery and XPath Full Text 1.0 specification, which, among other things, adds the ability to match text by word stem via the stemming option. For example, the query:
returns true because $x contains "Improvement", which has the same stem as "improve".
The initial implementation of the stemming option uses the Snowball stemmers and therefore can stem words in the following languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
Using the Zorba C++ API, you can provide your own stemmer by deriving from two classes: Stemmer and StemmerProvider.
The Stemmer class is:
For details about the ptr type, the destroy() function, and why the destructor is protected, see the Memory Management document.
To implement the Stemmer, you need to implement the stem() function where:
| Parameter | Description |
| --- | --- |
| word | The word to be stemmed. |
| lang | The language of the word. |
| result | The stemmed word goes here. |
Note that result must always be set. If your stemmer doesn't know how to stem the given word, set result to word. You also need to implement the properties() function and set the identifying URI of your stemmer.
A very simple stemmer that stems the word "foobar" to "foo" can be implemented as:
A real stemmer would, of course, use either a stemming algorithm or a dictionary look-up to stem many words. Although not used in this simple example, lang can be used to allow a single stemmer instance to stem words in more than one language.
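The toy stemmer described above might look like the following minimal sketch. It does not use the real Zorba headers, so that it compiles on its own: the Lang enum, the use of std::string, the public virtual destructor, the stemOf() helper, and the example.com URI are all stand-ins assumed for illustration. Only the function names stem() and properties(), the three stem() parameters, and the "set result to word when unsure" rule come from the text.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for Zorba's language type.
enum class Lang { en, de };

// Stand-in for the Stemmer base class described in the text.
class Stemmer {
public:
  struct Properties { const char* uri; };

  // Public here for simplicity; the text says Zorba's destructor is protected.
  virtual ~Stemmer() = default;

  virtual void properties(Properties* result) const = 0;
  virtual void stem(const std::string& word, Lang lang,
                    std::string* result) const = 0;
};

// Stems "foobar" to "foo" and leaves every other word unchanged.
class FoobarStemmer : public Stemmer {
public:
  void properties(Properties* result) const override {
    result->uri = "http://example.com/stemmers/foobar";  // placeholder URI
  }

  void stem(const std::string& word, Lang /*lang*/,
            std::string* result) const override {
    if (word == "foobar")
      *result = "foo";
    else
      *result = word;  // unknown word: result is still set, to the input
  }
};

// Hypothetical convenience wrapper that returns the stem by value.
std::string stemOf(const Stemmer& s, const std::string& word, Lang lang) {
  std::string result;
  s.stem(word, lang, &result);
  return result;
}

// Usage: exercises the contract described in the text.
void demoFoobarStemmer() {
  FoobarStemmer s;
  assert(stemOf(s, "foobar", Lang::en) == "foo");
  assert(stemOf(s, "bar", Lang::en) == "bar");
}
```

The lang parameter is deliberately ignored here; a stemmer serving several languages would branch on it before choosing an algorithm or dictionary.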
In addition to a Stemmer, you must also implement a StemmerProvider that, given a language, provides a Stemmer for that language:
The getStemmer() function should return true if it can provide a Stemmer for the given language and false otherwise. If the Stemmer::ptr argument is null, the caller wants to check only whether the provider can provide a stemmer for the given language and doesn't want a Stemmer instance created or returned.
A simple StemmerProvider for our simple stemmer can be implemented as:
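A minimal, self-contained sketch of such a provider follows. As before, it uses stand-in types rather than the real Zorba headers: the Lang enum, std::string, and the std::unique_ptr alias used for Stemmer::ptr are assumptions, as is the choice to support English only. The behaviour that matters is the one described above: return true or false by language, and create a Stemmer only when the caller passes a non-null pointer.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for Zorba's language type.
enum class Lang { en, de };

// Stand-in for the Stemmer base class (properties() omitted for brevity).
class Stemmer {
public:
  using ptr = std::unique_ptr<const Stemmer>;  // assumed; Zorba defines its own ptr
  virtual ~Stemmer() = default;
  virtual void stem(const std::string& word, Lang lang,
                    std::string* result) const = 0;
};

class FoobarStemmer : public Stemmer {
public:
  void stem(const std::string& word, Lang /*lang*/,
            std::string* result) const override {
    *result = (word == "foobar") ? "foo" : word;
  }
};

// Stand-in for the StemmerProvider base class described in the text.
class StemmerProvider {
public:
  virtual ~StemmerProvider() = default;
  virtual bool getStemmer(Lang lang, Stemmer::ptr* s = nullptr) const = 0;
};

// Provides a FoobarStemmer for English only (an assumption of this sketch).
class FoobarStemmerProvider : public StemmerProvider {
public:
  bool getStemmer(Lang lang, Stemmer::ptr* s = nullptr) const override {
    if (lang != Lang::en)
      return false;              // unsupported language: nothing is created
    if (s)
      s->reset(new FoobarStemmer);  // caller asked for an instance
    return true;                 // a null s means "capability check only"
  }
};

// Usage: a capability check followed by an actual request.
void demoFoobarStemmerProvider() {
  FoobarStemmerProvider provider;
  assert(provider.getStemmer(Lang::en));   // check only, no instance created
  assert(!provider.getStemmer(Lang::de));

  Stemmer::ptr s;
  assert(provider.getStemmer(Lang::en, &s) && s);
  std::string out;
  s->stem("foobar", Lang::en, &out);
  assert(out == "foo");
}
```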
To enable your stemmer to be used, you need to register it with the XmlDataManager:
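The sketch below illustrates only the general idea of registration: a manager keeps a pointer to a provider and consults it whenever a word needs stemming. It does not show the actual XmlDataManager call. ToyDataManager, its registerStemmerProvider() and stem() members, the Lang enum, and the std::unique_ptr alias are all hypothetical stand-ins, chosen so that the example compiles without Zorba.

```cpp
#include <cassert>
#include <memory>
#include <string>

enum class Lang { en, de };  // hypothetical stand-in for Zorba's language type

class Stemmer {
public:
  using ptr = std::unique_ptr<const Stemmer>;  // assumed alias
  virtual ~Stemmer() = default;
  virtual void stem(const std::string& word, Lang lang,
                    std::string* result) const = 0;
};

class StemmerProvider {
public:
  virtual ~StemmerProvider() = default;
  virtual bool getStemmer(Lang lang, Stemmer::ptr* s = nullptr) const = 0;
};

class FoobarStemmer : public Stemmer {
public:
  void stem(const std::string& word, Lang /*lang*/,
            std::string* result) const override {
    *result = (word == "foobar") ? "foo" : word;
  }
};

class FoobarStemmerProvider : public StemmerProvider {
public:
  bool getStemmer(Lang lang, Stemmer::ptr* s = nullptr) const override {
    if (lang != Lang::en) return false;
    if (s) s->reset(new FoobarStemmer);
    return true;
  }
};

// Hypothetical stand-in for a data manager; NOT the Zorba XmlDataManager.
class ToyDataManager {
public:
  // Stores a non-owning pointer, so the provider must outlive the manager.
  void registerStemmerProvider(const StemmerProvider* p) { provider_ = p; }

  // Stems via the registered provider; returns the word unchanged if no
  // provider is registered or the provider has no stemmer for lang.
  std::string stem(const std::string& word, Lang lang) const {
    Stemmer::ptr s;
    if (provider_ && provider_->getStemmer(lang, &s) && s) {
      std::string result;
      s->stem(word, lang, &result);
      return result;
    }
    return word;
  }

private:
  const StemmerProvider* provider_ = nullptr;
};

// Usage: register the provider, then stem through the manager.
void demoRegistration() {
  FoobarStemmerProvider provider;
  ToyDataManager manager;
  manager.registerStemmerProvider(&provider);
  assert(manager.stem("foobar", Lang::en) == "foo");
  assert(manager.stem("foobar", Lang::de) == "foobar");  // no German stemmer
}
```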