Smart way of indexing for Better performance


Smart way of indexing for Better performance

RaviWhy
Hi,
  I have the following use case. I was able to implement a solution, but its performance suffers. I am looking for smarter ways of doing this.
Use Case :
Incoming data has two fields with values like 'WAL MART STORES INC' and 'wal-mart-stores-inc'.
Users can search the data as 'walmart', 'wal mart', or 'wal-mart', and also partially on any leading part of a word, like 'wal', 'walm', 'wal m', etc. I got a working solution by using two indexes: one plain text field for the first column ('WAL MART STORES INC') and a sub-word field for the second ('wal-mart-stores-inc', with the WordDelimiterFilterFactory filter).
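For reference, the sub-word field type is roughly along these lines (a simplified sketch; the exact attribute values may differ from what I actually have in schema.xml):

  <fieldType name="text_subword" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split on whitespace, then break 'wal-mart-stores-inc' into sub-words -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" catenateAll="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>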

Is there a smarter way of doing this, or any other technique to boost performance? I need to use it in a high-traffic application where the response-time requirement is around 50 milliseconds.
I have some control over modifying the incoming data.

Can someone suggest better ways of implementing this? I can provide more information about the tokenizers and filters I am using.

Thanks
Ravi

Re: Smart way of indexing for Better performance

hossman
:   I have the following use case. I could implement the solution but
: performance is affected. I need some smart ways of doing this.
: Use Case :
: Incoming data has two fields which have values like 'WAL MART STORES INC'
: and 'wal-mart-stores-inc'.  
: Users can search the data either in 'walmart'  'wal mart' or 'wal-mart'
: also partially on any part of the name from the start of word like 'wal',
: 'walm' 'wal m'  etc .   I could get the solution  by using two indexes, one
: as text field for the first field (wal mart ) column and sub word
: wal-mart-stores (with WordDelimiterFilterFactory filter).  

there are lots of solutions that could work, all depending on what *else*
you need to be able to match on besides just prefix queries where
whitespace/punctuation are ignored.

One example: using KeywordTokenizer, along with a PatternReplaceFilter
that throws away non-letter characters and a LowerCaseFilter, and then
issuing all your queries as PrefixQueries will get w* wa* wal* and walm*
to all match "wal mart", "WALMART", "WAL-mart", etc....  but that won't
let "mart" match a document containing "wal mart" .. but you can always use
copyField and hit one field for the first type of query, and the other
field for "normal" queries.

depending on the nature of your data (ie: how many documents, how common
certain prefixes are, etc...) you might get better performance at the
expense of a larger index if you use something like the
EdgeNGramTokenFilter or EdgeNGramTokenizer to index all the prefixes of
various sizes so you don't need to do a prefix query.
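
for example, something like this at index time (again an untested sketch,
with the gram sizes picked arbitrarily):

  <fieldType name="name_edge" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="[^a-z]" replacement="" replace="all"/>
      <!-- index every prefix from 1 to 25 chars: w, wa, wal, walm, ... -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="[^a-z]" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

that way a query like company_edge:walm is a simple term lookup instead of
a prefix query, which is usually faster at the cost of more terms in the index.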

The bottom line: there are *lots* of options; you'll need to experiment to
find the right solution that matches when you want to match, and doesn't
when you don't.



-Hoss


Re: Smart way of indexing for Better performance

RaviWhy
The data set (number of documents) is not large: around 100k. The number of fields could max out at 10, and the average size of an indexed field is about 200 characters.
I tried creating multiple indexes using copyField.
Let me see how the performance looks with EdgeNGramTokenFilter or EdgeNGramTokenizer.

Thanks for the suggestions.
