Arne Kröger's Blog: Using Oracle Text Search

Oracle offers with "Oracle Text" a strong and highly adjustable Text Search to their database customers. We have used Oracle for years and 2 years ago we needed a Text-Search which has the ability to create a better sort order for search results.

Requirement: It should be possible to specify different emphases in the search query.

1. Object definition
A 'Product' in our case has the following attributes:

Title
Alternate title
Feature

Feature-Title
Feature-Content

Tags
External links
Link description

2. Table defintion
We created a help table in the database. All tags assigned to the product are written space-delimited into the PRODUCT_TAG column, the same for

PRODUCT_ALTERNATE_TITLE
PRODUCT_FEATURE
FEATURE_CONTENT
EXTERNAL_LINK
LINK_DESCRIPTION

The CREATE TABLE statement looks like:

 CREATE TABLE PRODUCT (

    ID NUMBER(38,0) NOT NULL, 

    PRODUCT_TITLE VARCHAR2(256), 

    PRODUCT_ALTERNATE_TITLE clob, 

    PRODUCT_DESCRIPTION VARCHAR2(4000),

    PRODUCT_TAG clob,

    PRODUCT_FEATURE clob,

    FEATURE_CONTENT clob,

    EXTERNAL_LINK clob,

    LINK_DESCRIPTION clob

);

3. Index creation
First of all we need to grant the ctxapp to our user:

grant ctxapp to MY_USER;

Then we need to create a 'Lexer' to use the extended possibilities Oracle Text delivers:

"base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for whitespace-delimited languages that have extended character sets."

 begin

    ctx_ddl.create_preference(

        'german_lexer','basic_lexer'

    );

    ctx_ddl.set_attribute(

        'german_lexer','composite','german'

    );

    ctx_ddl.set_attribute (

        'german_lexer', 'SKIPJOINS', '-'

    );

    ctx_ddl.create_preference (

        'german_wordlist', 'BASIC_WORDLIST'

    );

    ctx_ddl.set_attribute (

        'german_wordlist', 'STEMMER', 'GERMAN'

    );

end;

/

We have to create a preference for the index so that we can create an index over more than one column (MULTI_COLUMN_DATASTORE).



begin

    ctx_ddl.create_preference(

        preference_name => 'PRODUCT_DATA_STORE',

        object_name     => 'MULTI_COLUMN_DATASTORE'

    );

    ctx_ddl.set_attribute(

        preference_name => 'PRODUCT_DATA_STORE',

        attribute_name  => 'COLUMNS',

        attribute_value => '

            PRODUCT_TITLE, 

            PRODUCT_DESCRIPTION, 

            PRODUCT_ALTERNATE_TITLE, 

            PRODUCT_TAG, 

            PRODUCT_FEATURE, 

            FEATURE_CONTENT, 

            EXTERNAL_LINK,LINK_DESCRIPTION'

    );

end;

/

The creation of the index looks like:

 CREATE INDEX PRODUCT_FT_IDX ON PRODUCT(PRODUCT_TITLE)

    INDEXTYPE IS CTXSYS.CONTEXT

    PARAMETERS('

        datastore PRODUCT_DATA_STORE 

        section group CTXSYS.AUTO_SECTION_GROUP 

        LEXER german_lexer 

        WORDLIST german_wordlist 

        STOPLIST CTXSYS.EMPTY_STOPLIST 

        SYNC (ON COMMIT)');

The 'index column' you have to specify is the first column from our MULTI_COLUMN_DATASTORE preference (here: PRODUCT_TITLE). The 'german_lexer' we created before handles all the special character stuff for us.

The 'german_wordlist' enables some extra functionality provided by oracle text:

"Use the wordlist preference to enable the query options such as stemming, fuzzy matching for your language. You can also use the wordlist preference to enable substring and prefix indexing, which improves performance for wildcard queries with CONTAINS and CATSEARCH."

The stoplist parameter specifies which words ahouldn't be indexed by Oracle; We don't want it so we specify a empty stoplist (CTXSYS.EMPTY_STOPLIST).

4. Search Query
Now the the index is ready and waits for better queries to use it. Our query looks like this:

 SELECT 

    score(1) as SCORE_VALUE, 

    ID_PRODUCT , 

    PRODUCT_TITLE, 

    PRODUCT_DESCRIPTION

FROM PRODUCT

WHERE CONTAINS(PRODUCT_TITLE,'

    (((autos) within (PRODUCT_TITLE))*10 

    ACCUM (($autos) within (PRODUCT_TITLE))*9 

    ACCUM ((%autos%) within (PRODUCT_TITLE))*2)

    ACCUM (((autos) within (PRODUCT_DESCRIPTION))*9 

    ACCUM (($autos) within (PRODUCT_DESCRIPTION))*8 

    ACCUM ((%autos%) within (PRODUCT_DESCRIPTION))*2)

    ACCUM (((autos) within (PRODUCT_ALTERNATE_TITLE))*8 

    ACCUM (($autos) within (PRODUCT_ALTERNATE_TITLE))*7 

    ACCUM ((%autos%) within (PRODUCT_ALTERNATE_TITLE))*2)

    ACCUM (((autos) within (PRODUCT_TAG))*7 

    ACCUM (($autos) within (PRODUCT_TAG))*6 

    ACCUM ((%autos%) within (PRODUCT_TAG))*2)

    ACCUM (((autos) within (PRODUCT_FEATURE))*6 

    ACCUM (($autos) within (PRODUCT_FEATURE))*5 

    ACCUM ((%autos%) within (PRODUCT_FEATURE))*1)

    ACCUM (((autos) within (FEATURE_CONTENT))*5 

    ACCUM (($autos) within (FEATURE_CONTENT))*4 

    ACCUM ((%autos%) within (FEATURE_CONTENT))*1) 

    ACCUM (((autos) within (EXTERNAL_LINK))*4 

    ACCUM (($autos) within (EXTERNAL_LINK))*3 

    ACCUM ((%autos%) within (EXTERNAL_LINK))*1)

    ACCUM (((autos) within (LINK_DESCRIPTION))*3 

    ACCUM (($autos) within (LINK_DESCRIPTION))*2.5 

    ACCUM ((%autos%) within (LINK_DESCRIPTION))',1)>0

Now it is possible to adjust the criterias:

Is the searchstring found as the 'complete phrase': (autos) within (COLUMN)
Is the baseword of the searchstring found as the 'complete phrase': ($auto) within (COLUMN); [This means: Searching for 'cars' should also find all occurrences of 'car']
Is the searchstring anywhere inside the text: (%autos%) within (COLUMN)

For all of this possiblities for every column we now can adjust the index score value: ((autos) within (PRODUCT_TITLE))*10. Oracle Text uses this value to calculate the score value (there are more criteria like: amount of occurrences).

Now we can sort the resultset after score(1) or SCORE_VALUE to get a better sort order for the search.

Arne Kröger's Blog

2. Februar 2011

Using Oracle Text Search

Keine Kommentare:

Kommentar veröffentlichen