[Resent] Document boosting based on .. semantics?

Markus Fischer-5
Hi,

[Resent: guess I sent the first before I completed my subscription, just
in case it comes up twice ...]

The subject may be a bit weird, but I couldn't find a better way to
describe the problem I'm trying to solve.

If I'm not mistaken, scoring takes into account the position of the word
within the document and the length of the document. I'm calling my
problem the cauliflower problem, but it applies to any "type of"
problem.

When searching for "cauliflower", I get x hits. The documents in the top
range (pos < 10) are the ones most likely to be selected by the user.
Unfortunately, although "cauliflower" appears in those documents, with
its word position around 3000 characters into the document, the
documents themselves have little to do with "cauliflower",
"vegetables", "eating", etc.

The most relevant documents come much later in the result list
(pos >= 10), because there the word "cauliflower" is positioned around
5000 characters into the document.

Judged by their content, these later documents are much more
appropriate to the search term, because they also deal with
"vegetables", "eating", etc.

I'm stuck on how to tell Lucene to boost those later documents,
because frankly I don't know what to boost on. I would probably have to
tag what each document is about (is-about vegetables, is-about eating)
and also detect that the search term is-a vegetable. This gets even more
complex with queries of more than one term.


On a related topic, I'm also looking for a way to suggest alternate
spellings to the user when a word occurs very rarely in the index or
not at all. I'm based in Austria; when I search for e.g. "retthich" (a
misspelling of "rettich", which is radish), Google suggests the
correctly spelled word. I'm trying to figure out how to accomplish
this, but again this may be off-topic for Lucene and/or I should
probably start a separate thread ...


Does anyone have advice on how to approach this kind of problem? Is
Lucene appropriate for it, and can it be solved with Lucene? Am I even
on the right list? :)

Thanks for any feedback,
- Markus


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: [Resent] Document boosting based on .. semantics?

Mathieu Lecarme
Markus Fischer wrote:

> On a related topic, I'm also looking for a way to suggest alternate
> spellings to the user when a word occurs very rarely in the index or
> not at all. I'm based in Austria; when I search for e.g. "retthich" (a
> misspelling of "rettich", which is radish), Google suggests the
> correctly spelled word. I'm trying to figure out how to accomplish
> this, but again this may be off-topic for Lucene and/or I should
> probably start a separate thread ...
You can use n-grams and Levenshtein distance to find similar words.
I am also trying it with phonetic matching and an aspell dictionary.
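
A minimal sketch of the n-gram plus Levenshtein idea, in plain Java without any Lucene classes. The class name NearWords, the trigram size, the distance limit, and the sample vocabulary are assumptions made for illustration; in practice the vocabulary would come from the index terms.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
final class NearWords {

    // Character n-grams of a word, e.g. trigrams of "rettich": ret, ett, tti, tic, ich.
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Dynamic-programming Levenshtein distance (insert, delete, substitute),
    // keeping only two rows of the table in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep vocabulary words that share at least one trigram with the input and
    // lie within maxDistance edits, ordered by edit distance (closest first).
    static List<String> suggest(String input, List<String> vocabulary, int maxDistance) {
        Set<String> inputGrams = ngrams(input, 3);
        List<String> candidates = new ArrayList<>();
        for (String word : vocabulary) {
            Set<String> shared = ngrams(word, 3);
            shared.retainAll(inputGrams);
            if (!shared.isEmpty() && levenshtein(input, word) <= maxDistance) {
                candidates.add(word);
            }
        }
        candidates.sort(Comparator.comparingInt((String w) -> levenshtein(input, w)));
        return candidates;
    }

    public static void main(String[] args) {
        // Sample vocabulary is made up for this example.
        List<String> vocabulary = List.of("rettich", "rettung", "blumenkohl", "reich");
        System.out.println(suggest("retthich", vocabulary, 2));
    }
}
```

The trigram check is only a cheap pre-filter so that the more expensive edit distance is computed for fewer words; with a real index the trigrams would typically be indexed rather than recomputed per lookup.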

M.

Re: [Resent] Document boosting based on .. semantics?

Grant Ingersoll-2
In reply to this post by Markus Fischer-5

On Feb 20, 2008, at 2:51 AM, Markus Fischer wrote:

> I'm stuck on how to tell Lucene to boost those later documents,
> because frankly I don't know what to boost on. I would probably have
> to tag what each document is about (is-about vegetables, is-about
> eating) and also detect that the search term is-a vegetable. This
> gets even more complex with queries of more than one term.

Well, this is a classic problem in IR.  The question is, how do you  
know when a user types "cauliflower" that they really are interested  
in "vegetables" and "eating" and not the other document?  There really  
is nothing in that query, by itself, that gives you or Lucene that  
information.   Your top hit has the term and has other factors, such  
as document length etc. that make it the top result.  That is not to  
say there is nothing you can do, just that it is hard and can be  
brittle.

Some suggestions (and the use of cauliflower and vegetables, etc. is  
figurative, not literal):
1. If you know cauliflower should be related to vegetables and eating,  
add those as synonyms to your query terms.  This can be hard to  
generalize.
2. If you have some user profile that suggests that user is interested  
in vegetables/eating over other things, then you could incorporate that.
3. If you see that most of your users like the vegetable documents for  
the query cauliflower by doing some log analysis, then you could use  
popularity of the document as a factor in your scoring (see  
FunctionQuery capability in Lucene)
4. You could try to do some fancy-schmancy reasoning using WordNet  
hypernyms/hyponyms
5. You could use MoreLikeThis to allow the user to choose the  
vegetable result and say "Give me more documents like this"
6. Last, but certainly not least, if you want the user to get a  
certain document as #1 or #2, then make the document #1 or #2.  You  
don't need search for this.  It's called editorial boosting.  Again,  
hard to generalize, but sometimes you just need a document to be #1  
and trying to tune the various knobs a search engine gives you is  
going to break a whole lot of other things.
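
As a rough illustration of suggestion 1, the sketch below builds a query string in Lucene query-parser syntax in which the original term is required and the related terms are optional, down-weighted clauses. The class QueryExpander, the related-terms map, and the boost value 0.5 are hypothetical; the thread does not say where such a mapping would come from.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
final class QueryExpander {

    private final Map<String, List<String>> related;
    private final float relatedBoost;

    QueryExpander(Map<String, List<String>> related, float relatedBoost) {
        this.related = related;
        this.relatedBoost = relatedBoost;
    }

    // Produces "+term rel1^boost rel2^boost": the original term must match,
    // related terms only add to the score of documents that contain them.
    // Terms are assumed to contain no query-syntax special characters.
    String expand(String term) {
        StringBuilder sb = new StringBuilder("+").append(term);
        for (String extra : related.getOrDefault(term, List.of())) {
            sb.append(' ').append(extra).append('^').append(relatedBoost);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        QueryExpander expander = new QueryExpander(
                Map.of("cauliflower", List.of("vegetables", "eating")), 0.5f);
        // prints: +cauliflower vegetables^0.5 eating^0.5
        System.out.println(expander.expand("cauliflower"));
    }
}
```

The resulting string could be handed to a QueryParser left at its default OR operator. Documents that mention cauliflower together with the related terms should then tend to rank above documents that mention it only in passing, although how much depends on the other scoring factors.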

Also, what kind of queries are you using such that the term's position  
(3K characters into the document) affects the result?  Are you using  
some type of SpanQuery or are you just referring to the effects of  
length normalization?  You might also try using a different Similarity  
implementation that doesn't punish longer documents as much.
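
To make the length-normalization effect concrete, here is a small stand-alone sketch. classicNorm uses the 1/sqrt(numTerms) formula of Lucene's DefaultSimilarity.lengthNorm; flatterNorm is a made-up alternative that only shows what "punishing longer documents less" could look like. The class name and the term counts 500 and 5000 are assumptions for the example.

```java
// Stand-alone illustration; it does not call into Lucene.
final class LengthNormDemo {

    // Same formula as DefaultSimilarity.lengthNorm: 1 / sqrt(numTerms).
    static float classicNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Hypothetical flatter curve: 1 / numTerms^0.25.
    static float flatterNorm(int numTerms) {
        return (float) (1.0 / Math.pow(numTerms, 0.25));
    }

    public static void main(String[] args) {
        int shortDoc = 500;   // assumed number of terms in a short document
        int longDoc = 5000;   // assumed number of terms in a long document
        System.out.println("classic short/long ratio: "
                + classicNorm(shortDoc) / classicNorm(longDoc));
        System.out.println("flatter short/long ratio: "
                + flatterNorm(shortDoc) / flatterNorm(longDoc));
    }
}
```

With the classic formula, a single occurrence in the short document gets roughly 3.2 times the length factor of the long one; with the flatter curve the gap shrinks to about 1.8. In a real setup a formula like this would go into a custom Similarity, and since norms are written at index time the documents would most likely need to be reindexed.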

Finally, the explain() method may help you better understand the  
factors that go into why your documents score the way they do.


> On a related topic, I'm also looking for a way to suggest alternate  
> spellings to the user when a word occurs very rarely in the index or  
> not at all. I'm based in Austria; when I search for e.g.  
> "retthich" (a misspelling of "rettich", which is radish), Google  
> suggests the correctly spelled word. I'm trying to figure out how to  
> accomplish this, but again this may be off-topic for Lucene and/or I  
> should probably start a separate thread ...

Search the archives here for "Google suggest" or "suggestions", "ajax  
suggestions", etc.  There are a couple of implementations out there  
for Lucene/Solr I believe that use the TermEnum class to go and get  
suggestions.  I believe there might even be some patches in JIRA.
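
The general idea behind such implementations, as far as I understand it, is a prefix scan over the sorted term dictionary. The sketch below imitates that with a TreeMap instead of a real TermEnum; the class PrefixSuggester and the sample terms and frequencies are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted term dictionary; not part of Lucene.
final class PrefixSuggester {

    // term -> number of documents containing it, kept in sorted term order
    private final TreeMap<String, Integer> termFreqs = new TreeMap<>();

    void add(String term, int docFreq) {
        termFreqs.put(term, docFreq);
    }

    // Walk the sorted terms starting at the prefix, stop at the first term that
    // no longer matches, then return the most frequent matches first.
    List<String> suggest(String prefix, int limit) {
        List<String> matches = new ArrayList<>();
        for (String term : termFreqs.tailMap(prefix, true).keySet()) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        matches.sort((a, b) -> Integer.compare(termFreqs.get(b), termFreqs.get(a)));
        return new ArrayList<>(matches.subList(0, Math.min(limit, matches.size())));
    }

    public static void main(String[] args) {
        PrefixSuggester suggester = new PrefixSuggester();
        suggester.add("cabbage", 30);
        suggester.add("carrot", 25);
        suggester.add("cauliflower", 12);
        suggester.add("radish", 7);
        System.out.println(suggester.suggest("ca", 2));
    }
}
```

With a Lucene index, the sorted walk would be done with the TermEnum returned by IndexReader.terms(Term), and the frequency would come from the enumeration's document frequency rather than from a map.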


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Alternate spelling suggestion (was [Resent] Document boosting based on .. semantics? )

Markus Fischer-5
In reply to this post by Mathieu Lecarme
Hi

Mathieu Lecarme wrote:

>> On a related topic, I'm also looking for a way to suggest alternate
>> spellings to the user when a word occurs very rarely in the index or
>> not at all. I'm based in Austria; when I search for e.g. "retthich" (a
>> misspelling of "rettich", which is radish), Google suggests the
>> correctly spelled word. I'm trying to figure out how to accomplish
>> this, but again this may be off-topic for Lucene and/or I should
>> probably start a separate thread ...
> You can use n-grams and Levenshtein distance to find similar words.
> I am also trying it with phonetic matching and an aspell dictionary.

Thanks for the dictionary hint. So obvious, but I hadn't thought of it
until you mentioned it!

I've had really great success with the following pattern:

* after the Lucene index has been created, I generate a myspell-compatible
dictionary from it

* when a search returns no results, every term is run through myspell's
suggestions

* the top five myspell suggestions for each term are re-sorted by their
frequency in the Lucene index

Additionally, when the user entered only a single term, I also offer the
second-best result as an alternative (did you mean x or y?).
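
A minimal sketch of the re-sorting step in this pattern, assuming the spell checker's suggestions and the index frequencies are already available as plain Java collections. The class SuggestionRanker, the sample words, and the frequencies are hypothetical; with Lucene the frequency would typically come from IndexReader.docFreq(Term).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; myspell and Lucene are not called here.
final class SuggestionRanker {

    // Take the first topN spell-checker suggestions and order them by how often
    // they occur in the index (most frequent first). Words missing from the
    // frequency map count as 0; ties keep the spell checker's order.
    static List<String> rerank(List<String> suggestions, Map<String, Integer> docFreq, int topN) {
        List<String> top = new ArrayList<>(
                suggestions.subList(0, Math.min(topN, suggestions.size())));
        top.sort((a, b) -> Integer.compare(docFreq.getOrDefault(b, 0), docFreq.getOrDefault(a, 0)));
        return top;
    }

    // For single-term queries, offer the two best candidates.
    static String didYouMean(List<String> ranked) {
        if (ranked.isEmpty()) {
            return "";
        }
        if (ranked.size() == 1) {
            return "Did you mean " + ranked.get(0) + "?";
        }
        return "Did you mean " + ranked.get(0) + " or " + ranked.get(1) + "?";
    }

    public static void main(String[] args) {
        // Made-up spell-checker output and index frequencies.
        List<String> suggestions = List.of("rettig", "rettich", "rettet", "retten", "reich", "rasch");
        Map<String, Integer> docFreq = Map.of("rettich", 40, "retten", 5, "reich", 12);
        System.out.println(didYouMean(rerank(suggestions, docFreq, 5)));
    }
}
```

Re-sorting by index frequency favours words that will actually return results, which seems to be the point of the pattern described above.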

The results have been very good so far, thanks again!

- Markus

Re: Alternate spelling suggestion (was [Resent] Document boosting based on .. semantics? )

Mathieu Lecarme

Markus Fischer wrote:

> I've had really great success with the following pattern:
>
> * after the Lucene index has been created, I generate a  
> myspell-compatible dictionary from it
>
> * when a search returns no results, every term is run through  
> myspell's suggestions
>
> * the top five myspell suggestions for each term are re-sorted by  
> their frequency in the Lucene index
>
> Additionally, when the user entered only a single term, I also  
> offer the second-best result as an alternative (did you mean x or  
> y?).
>
> The results have been very good so far, thanks again!
>
> - Markus

I submitted a patch that does this nicely:
https://issues.apache.org/jira/browse/LUCENE-1190

Here is some example code:

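// The lexicon and "Suggestive" classes below come from the LUCENE-1190 patch.
// Read the terms of the existing index in "directory".
LexiconReader lexiconReader = new DirectoryReader(directory);
// Build the lexicon in memory, analyzed with an n-gram analyzer.
Lexicon lexicon = new Lexicon(new RAMDirectory());
lexicon.addAnalyser(new NGramAnalyzer());
lexicon.read(lexiconReader);
// Query for "brawn" in field "bio"; the index only contains "brown".
QueryParser parser = new QueryParser("txt", new WhitespaceAnalyzer());
Query query = parser.parse("bio:brawn");
// Wrap a regular IndexSearcher together with the lexicon.
SuggestiveSearcher searcher = new SuggestiveSearcher(new IndexSearcher(directory), lexicon);
SuggestiveHits hits = searcher.searchWithSuggestions(query);
// Print the query suggested in place of the original one.
System.out.println(hits.getSuggestedQuery());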

In this example, a lexicon is built from a directory, with an n-gram  
analyzer.
The directory contains the term "brown" in the field "bio".
Hits is final, so you can't inherit from it; that's why there is a  
SuggestiveHits.
SuggestiveHits has a threshold: if there are too few results, a  
similar-word search is triggered.

Using a myspell dictionary is a good idea, I'll implement this.

M.
