Searchproblem composite words

Lutz Steinborn-2
Hi,

I have a search problem with composite words.

For example I have the composite word "wishlist" in my document. I can
easily find the document by using the search string "wishlist" or "wish*"
but I don't get any result with "list".

I can do a fuzzy search but this gives me too many results.

Is there a better way to fix this problem?


Kind regards

Lutz Steinborn
4c GmbH

Re: Searchproblem composite words

Otis Gospodnetic-2
Hi Lutz,

That is because neither Solr nor Lucene (the indexing/searching toolkit that Solr runs on top of) knows anything about compound words.  Nothing there knows that the English word "wishlist" is a compound word.  You'd have to write your own analyzer and tokenizer that examines each token and, if it is a compound word, splits it into its constituent words.  In other words, you'd have to write something that is language-aware and language-specific.
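
As a rough illustration of the kind of splitting described above, here is a minimal Python sketch (not Lucene or Solr code). The word list, the function names `decompound` and `analyze`, and the `min_len` parameter are all assumptions made up for this example; a real analyzer would need a much larger, language-specific dictionary.

```python
# Hypothetical word list, for illustration only.
DICTIONARY = {"wish", "list", "dream", "girls"}


def decompound(token, dictionary=DICTIONARY, min_len=3):
    """Return [left, right] if token splits into two dictionary words, else []."""
    token = token.lower()
    # Try every split point that leaves at least min_len characters on each side.
    for i in range(min_len, len(token) - min_len + 1):
        left, right = token[:i], token[i:]
        if left in dictionary and right in dictionary:
            return [left, right]
    return []


def analyze(text):
    """Whitespace-tokenize, keeping each token and adding its parts if it is a compound."""
    terms = []
    for token in text.lower().split():
        terms.append(token)
        terms.extend(decompound(token))
    return terms


print(analyze("My wishlist"))
```

With this word list, "wishlist" is indexed together with "wish" and "list", so a search for "list" would find it. A compound whose parts are missing from the word list is left unsplit.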

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

In reply to Lutz Steinborn's message of Wednesday, May 2, 2007 5:41:33 AM.


Re: Searchproblem composite words

Chris Hostetter-3
In reply to this post by Lutz Steinborn-2

: For example I have the composite word "wishlist" in my document. I can
: easily find the document by using the search string "wishlist" or "wish*"
: but I don't get any result with "list".

What you are describing is basically a substring search problem ...
sometimes this can be dealt with by using something like the
WordDelimiterFilter -- but only if people are writing "WishList" (with a
case change marking the word boundary) in their documents.
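
To make the limitation concrete, here is a small Python sketch of splitting on case changes. It only mimics the idea and is not the Solr filter itself; the function name `split_on_case_change` is made up for this example.

```python
import re


def split_on_case_change(token):
    """Split a token wherever a lowercase letter is followed by an uppercase one."""
    # The zero-width pattern matches the position between e.g. "h" and "L".
    parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()
    return [part.lower() for part in parts]


print(split_on_case_change("WishList"))
print(split_on_case_change("wishlist"))
```

"WishList" yields the two parts "wish" and "list", while the all-lowercase "wishlist" has no case change and comes back as a single token, which is the situation in the original question.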

Another approach would be to use an NGram-based tokenizer (built-in
support for this will probably be added soon), but then searches for things
like "able" will match words like "cable" ... which may not be what you
want (yes, it is a substring, but it is not what anyone would consider a
"composite word").

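The n-gram trade-off can be seen in a few lines of Python. This is only a sketch of the general technique, not of any built-in tokenizer; the function names and the `min_n`/`max_n` defaults are assumptions chosen for this example.

```python
def ngrams(token, min_n=3, max_n=5):
    """Return the set of all character n-grams of token with min_n <= n <= max_n."""
    token = token.lower()
    return {
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    }


def matches(query, indexed_token):
    """A query matches if it equals one of the indexed token's n-grams."""
    return query.lower() in ngrams(indexed_token)


print(matches("list", "wishlist"))
print(matches("able", "cable"))
```

Both calls report a match: "list" is found inside "wishlist" as intended, but "able" is found inside "cable" just as readily.
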
The best way to match what you want extremely accurately would be to use
the SynonymFilter and enumerate every composite word you care about in the
synonym list ... tedious, yes, but also very accurate.
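
The enumeration idea can be sketched in Python as index-time expansion against a hand-maintained list. This is an illustration of the approach only, not the SynonymFilter or its configuration format; the `COMPOUNDS` table, its entries, and the function names are assumptions for this example.

```python
# Hand-maintained list of compounds and their parts (illustrative entries).
COMPOUNDS = {
    "wishlist": ["wish", "list"],
    "dreamgirls": ["dream", "girls"],
}


def expand(tokens, compounds=COMPOUNDS):
    """Keep every token and append the listed parts of any known compound."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(compounds.get(token, []))
    return expanded


def build_index(docs):
    """Map each expanded term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in expand(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index({1: "my wishlist", 2: "a cable"})
print(index.get("list"))
print(index.get("able"))
```

Here "list" finds document 1, while "able" finds nothing, because "cable" was never listed as a compound. Only the words you enumerate are ever split.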



-Hoss


Re: Searchproblem composite words

Walter Underwood, Netflix
I agree that multi-word synonyms are an excellent way to do this.

This may sound like a hack, but you'd end up doing this even if
you had dedicated linguistic compound decomposition software.
Those usually use a dictionary of common words and the dictionary
rarely has all the words that are important for your site.

I'll be doing this for my site to handle things like "dreamgirls"
and "dream girls".

wunder

In reply to Chris Hostetter's message of 5/2/07 11:58 AM.