Which file in the lucene package is used to manipulate results..

sumittyagi
Hi, I am using Lucene for the very first time and want to manipulate the search results by adding some more factors to them. Which file should I edit to do this?

Thanks,
Sumit Tyagi

Re: Which file in the lucene package is used to manipulate results..

mark harwood
I think you need to describe your "factors" in more detail. Exactly what do you want to achieve for your users?
We could be talking about any number of Lucene functions here.

----- Original Message ----
From: sumittyagi <[hidden email]>
To: [hidden email]
Sent: Friday, 21 December, 2007 4:51:09 AM
Subject: Which file in the lucene package is used to manipulate results..


--
View this message in context:
 http://www.nabble.com/Which-file-in-the-lucene-package-is-used-to-manipulate-results..-tp14450335p14450335.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Which file in the lucene package is used to manipulate results..

Zhou Qi-2
Hi Sumittyagi,

I think you can implement your factors in the scorer to obtain your desired results.

Re: Which file in the lucene package is used to manipulate results..

sumittyagi
In reply to this post by mark harwood
Actually, I am writing a module to rerank the results, so I want to edit the file that orders the results and assigns them ranks.
Or is there another way I can use my module to rerank the results?

Re: Which file in the lucene package is used to manipulate results..

mark harwood
In reply to this post by sumittyagi
Again, if you could be precise about what factors will influence the ranking that would help. Field names, what is wrong with existing ranking order and some of the thinking about your proposed re-rank logic would be useful context.

In Lucene your options include individual query-clause boosts, index-time document boosts, field-specific boosts on parsers, index-time length normalisation options, query result sorting, IndexSearcher "Similarity" settings and custom scorers, to name a few. We can't recommend which approach is best suited unless you can say more about the problem you're trying to address.

Cheers
Mark


----- Original Message ----
From: sumittyagi <[hidden email]>
To: [hidden email]
Sent: Friday, 21 December, 2007 3:09:48 PM
Subject: Re: Which file in the lucene package is used to manipulate results..


Re: Which file in the lucene package is used to manipulate results..

Erick Erickson
In reply to this post by sumittyagi
You still haven't explained *why* you want to rerank results. What
is the use-case you're trying to implement? Quite often it's turned
out for me that when I let folks on the list know what the use
case I'm trying to support is, they come up with much more elegant
solutions than I was thinking about.

For instance, does the CustomScoreQuery class have any relevance
to your problem?

If you're thinking of modifying the core Lucene code for your
special purpose, I'd advise against it unless and until you've exhausted
all the other options. It's always a maintenance headache to do this.

Best
Erick

Re: Which file in the lucene package is used to manipulate results..

sumittyagi
Actually, what I have to do is:
1.) For every query (keyword), among the results obtained, the keyword will be mapped to the page clicked, along with the number of clicks for that keyword on that page.
2.) The next time the same query (keyword) is issued, the mapped pages will be ranked higher, taking the number of clicks into account too.
3.) These steps will be repeated for every new query.
This is a very high-level view. I have written algorithms for these modules and am trying to incorporate them into Lucene, but I don't know which files I have to edit to make it work.
Please help me with this. If you need more explanation, please let me know.
Thanks,
Sumit Tyagi
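
The three steps above can be sketched outside Lucene as a small click store plus a reordering pass over an already retrieved result list. This is only an illustration under assumptions: the class name ClickReranker, its methods, and the use of a stable page identifier (such as a URL) are all hypothetical, and nothing here is part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts clicks per (query, page) and reorders a result list. */
class ClickReranker {
    // query -> (page id -> number of clicks); page ids are assumed to be stable keys such as URLs
    private final Map<String, Map<String, Integer>> clicks = new HashMap<>();

    /** Step 1: record that a user clicked `page` after searching for `query`. */
    void recordClick(String query, String page) {
        clicks.computeIfAbsent(query, q -> new HashMap<>()).merge(page, 1, Integer::sum);
    }

    /** Step 2: put pages with more clicks first; ties keep the search engine's original order. */
    List<String> rerank(String query, List<String> results) {
        Map<String, Integer> counts = clicks.getOrDefault(query, new HashMap<>());
        List<String> reordered = new ArrayList<>(results);
        // List.sort is stable, so pages with equal click counts stay in their original order
        reordered.sort(Comparator.comparingInt((String page) -> -counts.getOrDefault(page, 0)));
        return reordered;
    }
}
```

Reordering only the hits that were already retrieved keeps the core Lucene code untouched, but it cannot promote a clicked page that falls outside the retrieved window.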




Re: Which file in the lucene package is used to manipulate results..

mark harwood
In reply to this post by sumittyagi
Thanks for the context - much more useful.
The challenge here is similar to that posed by offering end-user tagging of content (see here http://www.mail-archive.com/java-user@.../msg17580.html ). The main difference is that here words are added to docs implicitly by search click-throughs rather than by any explicit tagging action.

In both cases the challenge is that the user data around documents is likely to be updated very often while the documents remain relatively static.
I suspect some additional things to think about are:
1) Cancelling out the "human laziness" bias that favours clicking results on page 1. Are clicks on page 2 worth more?
2) Spam clicks - detecting deliberate gaming of your re-ranking algorithm.
3) Lucene doc IDs are not stable - how will you associate query terms/click data with documents and join them at speed?
4) Are individual words or phrases the unit of boost? "Paris" means different things in "Paris Hilton" and "Paris, France".
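
As a rough illustration of point 1, one way to offset the bias towards top results is to let a click count for more the further down the list it occurred. The log2(1 + rank) formula and the ClickWeight name below are assumptions chosen for the sketch, not something Lucene provides or that has been tuned on real data.

```java
/** Hypothetical weighting: a click counts for more the further down the result list it occurred. */
class ClickWeight {
    /**
     * rank is the 1-based position of the clicked result.
     * The log2(1 + rank) formula is an assumption for illustration only.
     */
    static double weight(int rank) {
        if (rank < 1) {
            throw new IllegalArgumentException("rank must be >= 1");
        }
        return Math.log(1.0 + rank) / Math.log(2.0);
    }
}
```

With this weighting a click on the first result counts as 1, and clicks on later positions grow slowly rather than linearly, so a single click deep in the list does not swamp everything else.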

A simple approach might be to re-index your content with all of the additional search terms from clicks added to the associated document in a "searchClicks" field - the more clicks, the more repetitions of the same search words in the document to help with tf (Term Frequency). This additional content would need to be capped, to avoid huge documents. This has the disadvantage of requiring a re-index though.
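
A minimal sketch of building the capped text for such a "searchClicks" field might look like the following. The SearchClicksField class, the build method and the maxRepeatsPerTerm parameter are hypothetical names, and the step that actually adds the text to a Lucene document is left out.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that builds the text for a "searchClicks" field from click counts. */
class SearchClicksField {
    /**
     * Repeats each search term once per click, but never more than maxRepeatsPerTerm times,
     * so heavily clicked documents do not grow without bound.
     */
    static String build(Map<String, Integer> clicksPerTerm, int maxRepeatsPerTerm) {
        StringJoiner text = new StringJoiner(" ");
        for (Map.Entry<String, Integer> e : clicksPerTerm.entrySet()) {
            int repeats = Math.min(e.getValue(), maxRepeatsPerTerm);
            for (int i = 0; i < repeats; i++) {
                text.add(e.getKey());
            }
        }
        return text.toString();
    }
}
```

The returned string would then be indexed as an analysed field when the document is re-indexed, so that the repetitions raise the term frequency for those search words.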
Another option, which avoids reindexing everything, is to wrap IndexReader (see FilterIndexReader) and implement TermEnum/TermDocs for a fake field called "searchClicks". The idea is that Lucene looks after the usual, static document content while your implementation goes off to your more volatile storage (e.g. database/parallel index, custom file structure) to retrieve lists of doc ids, term frequencies etc. for this "searchClicks" field. All of the Lucene queries you might want to throw at this, e.g. PhraseQueries, can then test both the static Lucene fields and your new volatile "click" fields without being aware of this low-level trickery.
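
The storage side of this idea can be sketched without touching Lucene at all: a store that, for one term, returns doc ids and term frequencies in ascending doc id order, which is what a custom TermDocs would need to hand back. The ClickPostings class, its method names and the keyToDocId map are hypothetical, and the wiring into FilterIndexReader is deliberately not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical volatile store behind the fake "searchClicks" field. */
class ClickPostings {
    // term -> (stable document key -> click count); stable keys might be URLs or database ids
    private final Map<String, Map<String, Integer>> clicksByTerm = new HashMap<>();

    /** Records one click-through for `term` on the document identified by `stableKey`. */
    void addClick(String term, String stableKey) {
        clicksByTerm.computeIfAbsent(term, t -> new HashMap<>()).merge(stableKey, 1, Integer::sum);
    }

    /**
     * Returns doc id -> term frequency for one term, ordered by ascending doc id.
     * keyToDocId maps stable keys to the current (unstable) Lucene doc ids and is
     * assumed to be rebuilt whenever the index is reopened.
     */
    TreeMap<Integer, Integer> postings(String term, Map<String, Integer> keyToDocId) {
        TreeMap<Integer, Integer> result = new TreeMap<>();
        Map<String, Integer> perKey = clicksByTerm.getOrDefault(term, new HashMap<>());
        for (Map.Entry<String, Integer> e : perKey.entrySet()) {
            Integer docId = keyToDocId.get(e.getKey());
            if (docId != null) { // skip documents that are no longer in the index
                result.put(docId, e.getValue());
            }
        }
        return result;
    }
}
```

Keying the click data by a stable identifier and translating to doc ids only at read time is one way to cope with the unstable doc ids mentioned in point 3 above.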

I'm sure there will be other ways of doing this too but this seems like a conceptually clean way of modelling it - just seeing search terms as extensions to the document content.

Cheers
Mark


----- Original Message ----
From: sumittyagi <[hidden email]>
To: [hidden email]
Sent: Sunday, 23 December, 2007 5:30:55 AM
Subject: Re: Which file in the lucene package is used to manipulate results..


Re: Re: Which file in the lucene package is used to manipulate results..

Tom Roberts LUXONLINE
In reply to this post by sumittyagi
AUTOMATIC REPLY

LUX is closed until 7th January 2008

most information about LUX is available at www.lux.org.uk



Re: Which file in the lucene package is used to manipulate results..

sumittyagi
In reply to this post by mark harwood
Hi,
Thanks for the help.
Following your suggestions, I find that I do not have the package org.apache.lucene.index. Where can I download it from to start this project?

Re: Which file in the lucene package is used to manipulate results..

sumittyagi
Please ignore my previous message. I have found that package.
sumittyagi wrote
hi..
thanks for the help
following your suggestions ..
i do not have the package org.apache.lucene.index  , from where can i download it to start this project
markharw00d wrote
Thanks for the context - much more useful.
The challenge here is similar to that posed by offering end-user tagging of content (see here http://www.mail-archive.com/java-user@lucene.apache.org/msg17580.html ). The main difference here being that words are added to docs implicitly by search click-throughs rather than any explicit tagging action.

In both cases the challenge is that the user data around documents is likely to be updated very often while the documents remain relatively static.
I suspect some additional things to think about are:
1) Cancelling out the "human laziness" bias that favours clicking results on page 1. Are clicks on page 2 worth more?
2) Spam clicks - detecting deliberate gaming of your re-ranking algorithm.
3) Lucene doc IDs are not stable - how will you associate query terms/click data with documents and join them at speed?
4) Are individual words or phrases the unit of boost? "Paris" means different things in "Paris Hilton" and "Paris, France".

A simple approach might be to re-index your content with all of the additional search terms from clicks added to the associated document in a "searchClicks" field - the more clicks, the more repetitions of the same search words in the document to help with tf (Term Frequency). This additional content would need to be capped, to avoid huge documents. This has the disadvantage of requiring a re-index though.
Another option to avoid reindexing everything is to wrap IndexReader (See FilterIndexReader) and implement TermEnum/TermDocs for a fake field called "searchClicks". The idea is Lucene looks after the usual, static document content while your implementation goes off to your more volatile storage (e.g. database/parallel index, custom file structure) to retrieve lists of doc ids, term frequencies etc. for this "searchClicks" field. All of the Lucene queries you might want to throw at this e.g. PhraseQueries can then test both the static Lucene fields and your new volatile "click" fields without being aware of this low-level trickery.

I'm sure there will be other ways of doing this too but this seems like a conceptually clean way of modelling it - just seeing search terms as extensions to the document content.

Cheers
Mark


----- Original Message ----
From: sumittyagi <ping.sumit@gmail.com>
To: java-user@lucene.apache.org
Sent: Sunday, 23 December, 2007 5:30:55 AM
Subject: Re: Which file in the lucene package is used to manipulate results..


Actually what i have to do is...
1.) for every query(keyword), among the results obtained, the keyword
 will
be mapped with the page clicked, along with the no. of clicks for that
keyword on that page
2.) next time for the same query(keyword), the mapped pages will be
 ranked
higher considering the no. of clicks too..
3.) for every new query these steps will be repeated...
this was a very high level view , i have made algorithms for these
 modules
and trying to incorporate with lucene but dont know , on which files i
 have
to do edition to make it work...
please help me regarding this, if you need some more explanation,
 please let
me know...
thanks
Sumit Tyagi





Erick Erickson wrote:
>
> You still haven't explained *why* you want to rerank results. What
> is the use-case you're trying to implement? Quite often it's turned
> out for me that when I let folks on the list know what the use
> case I'm trying to support is, they come up with much more elegant
> solutions than I was thinking about.
>
> For instance, does the CustomScoreQuery class have any relevance
> to your problem?
>
> If you're thinking of modifying the core Lucene code for your
> special purpose, I'd advise against it unless and until you'd
 exhausted
> all the other options. It's always a maintenance headache to do this.
>
> Best
> Erick
>
> On Dec 21, 2007 10:09 AM, sumittyagi <ping.sumit@gmail.com> wrote:
>
>>
>> Actually, I am writing a module to rerank the results, so I want to edit
>> the file which arranges the results and gives them ranks.
>> Or is there any other way I can use my module to rerank the results?

Re: Which file in the lucene package is used to manipulate results..

sumittyagi
In reply to this post by mark harwood
Hi,
which file can I edit to change the scoring factors in Lucene results?
markharw00d wrote
Thanks for the context - much more useful.
The challenge here is similar to that posed by offering end-user tagging of content (see here http://www.mail-archive.com/java-user@lucene.apache.org/msg17580.html ). The main difference here being that words are added to docs implicitly by search click-throughs rather than any explicit tagging action.

In both cases the challenge is that the user data around documents is likely to be updated very often while the documents remain relatively static.
I suspect some additional things to think about are:
1) Cancelling out the "human laziness" bias that favours clicking results on page 1. Are clicks on page 2 worth more?
2) Spam clicks - detecting deliberate gaming of your re-ranking algorithm.
3) Lucene doc IDs are not stable - how will you associate query terms/click data with documents and join them at speed?
4) Are individual words or phrases the unit of boost? "Paris" means different things in "Paris Hilton" and "Paris, France".
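
As a rough illustration of points 1, 3 and 4, the sketch below accumulates click weights per whole query string and per stable document key (a URL here) instead of per Lucene doc ID, and gives clicks on lower-ranked results more weight. Everything in it is an assumption made for illustration: the ClickWeights class, the use of the URL as the stable key, and especially the weighting function 1 + ln(rank), which is just one arbitrary way to discount the page-1 bias.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class ClickWeights {
    // whole query string -> (stable document key, e.g. URL -> accumulated weight)
    private final Map<String, Map<String, Double>> weights = new HashMap<>();

    /**
     * Weight of a single click at the given 1-based result rank.
     * The formula is an illustrative assumption: rank 1 counts as 1.0,
     * deeper ranks count for more because they needed more user effort.
     */
    static double weightForRank(int rank) {
        if (rank < 1) {
            throw new IllegalArgumentException("rank must be >= 1");
        }
        return 1.0 + Math.log(rank);
    }

    void recordClick(String query, String stableKey, int rank) {
        weights.computeIfAbsent(query, q -> new HashMap<>())
               .merge(stableKey, weightForRank(rank), Double::sum);
    }

    /** Accumulated weight for this query/document pair, or 0.0 if never clicked. */
    double weight(String query, String stableKey) {
        return weights.getOrDefault(query, Collections.emptyMap())
                      .getOrDefault(stableKey, 0.0);
    }
}
```

Keying on the whole query keeps "paris hilton" and "paris france" apart; whether that is the right unit of boost is exactly the open question in point 4. Spam detection (point 2) is not addressed here.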

A simple approach might be to re-index your content with all of the additional search terms from clicks added to the associated document in a "searchClicks" field - the more clicks, the more repetitions of the same search words in the document to help with tf (Term Frequency). This additional content would need to be capped, to avoid huge documents. This has the disadvantage of requiring a re-index though.
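
The re-indexing approach just described could be sketched as follows. This is plain Java that only builds the text for the extra field; the SearchClicksField class and the maxRepeatsPerTerm cap are hypothetical, and adding the resulting string to the document under a "searchClicks" field at index time is left out.

```java
import java.util.Map;
import java.util.StringJoiner;

class SearchClicksField {
    /**
     * Builds the text of the extra field: each clicked search term is repeated
     * once per click, capped at maxRepeatsPerTerm so documents cannot grow without bound.
     */
    static String build(Map<String, Integer> clicksPerTerm, int maxRepeatsPerTerm) {
        StringJoiner text = new StringJoiner(" ");
        for (Map.Entry<String, Integer> entry : clicksPerTerm.entrySet()) {
            int repeats = Math.min(entry.getValue(), maxRepeatsPerTerm);
            for (int i = 0; i < repeats; i++) {
                text.add(entry.getKey());
            }
        }
        return text.toString();
    }
}
```

For example, with click counts paris=3 and hotel=1 and a cap of 2, the field text would be "paris paris hotel" (given a map that iterates in that order).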
Another option to avoid reindexing everything is to wrap IndexReader (See FilterIndexReader) and implement TermEnum/TermDocs for a fake field called "searchClicks". The idea is Lucene looks after the usual, static document content while your implementation goes off to your more volatile storage (e.g. database/parallel index, custom file structure) to retrieve lists of doc ids, term frequencies etc. for this "searchClicks" field. All of the Lucene queries you might want to throw at this e.g. PhraseQueries can then test both the static Lucene fields and your new volatile "click" fields without being aware of this low-level trickery.

I'm sure there will be other ways of doing this too but this seems like a conceptually clean way of modelling it - just seeing search terms as extensions to the document content.

Cheers
Mark



Re: Which file in the lucene package is used to manipulate results..

sumittyagi
Also,
what is the Lucene ranking (document scoring) formula?

RE: Which file in the lucene package is used to manipulate results..

steve_rowe
Hi Sumit,

Here's a good place to start:

   http://lucene.apache.org/java/docs/scoring.html

Steve
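
For orientation, the scoring described on that page for Lucene's classic Similarity is, as far as I recall it, of the form score(q,d) = coord(q,d) * queryNorm(q) * sum over terms t in q of ( tf(t in d) * idf(t)^2 * boost(t) * norm(t,d) ). The sketch below covers only the tf and idf parts, using the default definitions as I understand them (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))). It is plain Java with a hypothetical class name, not Lucene code, and the linked page remains the authoritative description.

```java
class ClassicTfIdf {
    /** Term-frequency factor: grows with the square root of the raw frequency. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms get a larger value. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** The tf * idf^2 part of one term's contribution; coord, queryNorm, boost and norm are ignored. */
    static double termWeight(int freq, int docFreq, int numDocs) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf;
    }
}
```

Because tf grows with the square root of the frequency, repeating a clicked term in a document raises its score with diminishing returns: four occurrences give twice the tf of one occurrence, not four times.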

On 12/28/2007 at 12:30 PM, sumittyagi wrote:

>
> also
> what is the lucene ranking (scoring documents) formula
>

Re: Which file in the lucene package is used to manipulate results..

sumittyagi
In reply to this post by mark harwood
Hi Mark Harwood,
I know it's been a long time, but until now I was busy developing the database to store the keyword, the document, the number of clicks on the document for the keyword, and their respective mappings.
Now I want my database to communicate with the Lucene API, and I cannot figure out where to start.
Please help me out: how can I make my database work with Lucene?
Thanks
Sumit
mark harwood wrote
Thanks for the context - much more useful.
The challenge here is similar to that posed by offering end-user tagging of content (see here http://www.mail-archive.com/java-user@lucene.apache.org/msg17580.html ). The main difference here being that words are added to docs implicitly by search click-throughs rather than any explicit tagging action.

In both cases the challenge is that the user data around documents is likely to be updated very often while the documents remain relatively static.
I suspect some additional things to think about are:
1) Cancelling out the "human laziness" bias that favours clicking results on page 1. Are clicks on page 2 worth more?
2) Spam clicks - detecting deliberate gaming of your re-ranking algorithm.
3) Lucene doc IDs are not stable - how will you associate query terms/click data with documents and join them at speed?
4) Are individual words or phrases the unit of boost? "Paris" means different things in "Paris Hilton" and "Paris, France".

A simple approach might be to re-index your content with all of the additional search terms from clicks added to the associated document in a "searchClicks" field - the more clicks, the more repetitions of the same search words in the document to help with tf (Term Frequency). This additional content would need to be capped, to avoid huge documents. This has the disadvantage of requiring a re-index though.
Another option to avoid reindexing everything is to wrap IndexReader (See FilterIndexReader) and implement TermEnum/TermDocs for a fake field called "searchClicks". The idea is Lucene looks after the usual, static document content while your implementation goes off to your more volatile storage (e.g. database/parallel index, custom file structure) to retrieve lists of doc ids, term frequencies etc. for this "searchClicks" field. All of the Lucene queries you might want to throw at this e.g. PhraseQueries can then test both the static Lucene fields and your new volatile "click" fields without being aware of this low-level trickery.

I'm sure there will be other ways of doing this too but this seems like a conceptually clean way of modelling it - just seeing search terms as extensions to the document content.

Cheers
Mark


----- Original Message ----
From: sumittyagi <ping.sumit@gmail.com>
To: java-user@lucene.apache.org
Sent: Sunday, 23 December, 2007 5:30:55 AM
Subject: Re: Which file in the lucene package is used to manipulate results..


Actually what i have to do is...
1.) for every query(keyword), among the results obtained, the keyword
 will
be mapped with the page clicked, along with the no. of clicks for that
keyword on that page
2.) next time for the same query(keyword), the mapped pages will be
 ranked
higher considering the no. of clicks too..
3.) for every new query these steps will be repeated...
this was a very high level view , i have made algorithms for these
 modules
and trying to incorporate with lucene but dont know , on which files i
 have
to do edition to make it work...
please help me regarding this, if you need some more explanation,
 please let
me know...
thanks
Sumit Tyagi





Erick Erickson wrote:
>
> You still haven't explained *why* you want to rerank results. What
> is the use-case you're trying to implement? Quite often it's turned
> out for me that when I let folks on the list know what the use
> case I'm trying to support is, they come up with much more elegant
> solutions than I was thinking about.
>
> For instance, does the CustomScoreQuery class have any relevance
> to your problem?
>
> If you're thinking of modifying the core Lucene code for your
> special purpose, I'd advise against it unless and until you'd
 exhausted
> all the other options. It's always a maintenance headache to do this.
>
> Best
> Erick
>
> On Dec 21, 2007 10:09 AM, sumittyagi <ping.sumit@gmail.com> wrote:
>
>>
>> actually i am writing a module to rerank the results, so i want to
 edit
>> the
>> file which arrange the results and give them ranks,
>> or is there any other way i can use my module to rerank the results

Re: Which file in the lucene package is used to manipulate results..

mark harwood
In reply to this post by sumittyagi
Hi Sumit,
>>now i want my database to communicate with lucene api

I would recommend doing it the other way round: see my earlier comment on using FilterIndexReader and creating "faked" TermEnum and TermDocs implementations, so that your database content appears to be part of the index when Lucene is called. If you do want the database to call Lucene, see the recent work on embedding Lucene in Oracle.
There is no simple ready-made solution here that I can post in a few lines of code. You'll need to familiarise yourself with the low-level APIs that underpin Lucene searches (they are all documented).
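
For illustration only, the routing idea can be sketched in plain Java. This is not the Lucene API and not a working FilterIndexReader: `PostingsSource`, `InMemoryPostings` and `VirtualFieldSource` are hypothetical names invented for this sketch, and the in-memory maps merely stand in for the real index and the click database. It only shows the shape of the trick, where lookups on the fake "searchClicks" field go to volatile storage and everything else goes to the static index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical postings lookup: (field, term) -> (docId -> term frequency). */
interface PostingsSource {
    Map<Integer, Integer> postings(String field, String term);
}

/** Toy in-memory source; stands in for the static index or the click database. */
final class InMemoryPostings implements PostingsSource {
    private final Map<String, Map<Integer, Integer>> data = new HashMap<>();

    void add(String field, String term, int docId, int freq) {
        data.computeIfAbsent(field + ":" + term, k -> new HashMap<>()).put(docId, freq);
    }

    @Override
    public Map<Integer, Integer> postings(String field, String term) {
        return data.getOrDefault(field + ":" + term, Collections.emptyMap());
    }
}

/** Routes the fake "searchClicks" field to volatile storage, all other fields to the index. */
final class VirtualFieldSource implements PostingsSource {
    static final String CLICK_FIELD = "searchClicks";

    private final PostingsSource staticIndex;
    private final PostingsSource clickStore;

    VirtualFieldSource(PostingsSource staticIndex, PostingsSource clickStore) {
        this.staticIndex = staticIndex;
        this.clickStore = clickStore;
    }

    @Override
    public Map<Integer, Integer> postings(String field, String term) {
        // Callers cannot tell which backend answered; that is the whole point.
        return CLICK_FIELD.equals(field)
                ? clickStore.postings(field, term)
                : staticIndex.postings(field, term);
    }
}
```

In a real implementation the same decision would presumably sit behind the reader's term and postings enumerations, which is where the low-level reading comes in.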


Cheers
Mark



Re: Which file in the lucene package is used to manipulate results..

sumittyagi
Hi Mark,
Actually, I am using an object-oriented database. Where can I find information about embedding Lucene in a database?
Thanks


Re: Which file in the lucene package is used to manipulate results..

mark harwood

>  Where can i find the information regarding embedding lucene with database.
> Thanks
>  

http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html

http://issues.apache.org/jira/browse/LUCENE-434

Cheers
Mark



Re: Which file in the lucene package is used to manipulate results..

Michael Stoppelman
In reply to this post by mark harwood
To add to what Mark is saying, it's very important that you watch out for the
first-N-results effect. If you showed a user a random set of documents with
crap relevance, I'll bet you that a good number would still click on the first
result (call it user laziness or the Google "I'm feeling lucky" effect :)).
You can A/B test results with some entropy added, or try to determine your own
result-position normalizers.
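
A rough sketch of what a result-position normalizer could look like, in plain Java. The class name `PositionNormalizer` and the click-through rates used with it are assumptions made up for illustration; in practice the per-position rates would have to be measured, for example from the A/B test with shuffled results. Each click is weighted by the inverse of the rate expected at its position, so a click far down the list counts for more than a click on the first result.

```java
/** Hypothetical helper: weights a click by how unlikely it was at that position. */
final class PositionNormalizer {
    // Assumed click-through rate per result position (index 0 = top result).
    private final double[] expectedCtr;

    PositionNormalizer(double[] expectedCtr) {
        if (expectedCtr.length == 0) {
            throw new IllegalArgumentException("need at least one position");
        }
        for (double ctr : expectedCtr) {
            if (ctr <= 0.0) {
                throw new IllegalArgumentException("rates must be positive");
            }
        }
        this.expectedCtr = expectedCtr.clone();
    }

    /** Weight of one click at a 0-based position; positions past the table reuse the last rate. */
    double clickWeight(int position) {
        int i = Math.min(position, expectedCtr.length - 1);
        return 1.0 / expectedCtr[i];
    }
}
```

Summing `clickWeight(position)` over all recorded clicks for a query/document pair, instead of counting raw clicks, is one simple way to feed this into a boost.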

You could also give each document your own stable doc ID (maybe an MD5 of the
title) and keep an external boost file that maps queries to docs.
Then at query time you boost the matching result documents accordingly.
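
Here is a minimal sketch of that idea in plain Java, with no Lucene calls. `ClickBoostSketch`, `Hit`, `addBoost` and `rerank` are hypothetical names, the in-memory map stands in for the external boost file, and it assumes titles are unique enough to serve as keys. The stable ID is the hex MD5 of the title, and re-ranking multiplies each hit's score by the boost stored for that query (1.0 if there is none).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClickBoostSketch {

    /** One search hit: the document title plus the score the engine returned. */
    static final class Hit {
        final String title;
        final double score;

        Hit(String title, double score) {
            this.title = title;
            this.score = score;
        }
    }

    // query -> (stable id -> boost); stands in for the external boost file.
    private final Map<String, Map<String, Double>> boosts = new HashMap<>();

    /** Stable document key: lowercase hex MD5 of the title. */
    static String stableId(String title) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(title.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    void addBoost(String query, String title, double boost) {
        boosts.computeIfAbsent(query, q -> new HashMap<>()).put(stableId(title), boost);
    }

    /** Returns a new list sorted by score * boost, highest first. */
    List<Hit> rerank(String query, List<Hit> hits) {
        Map<String, Double> forQuery = boosts.getOrDefault(query, new HashMap<>());
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            double boost = forQuery.getOrDefault(stableId(h.title), 1.0);
            out.add(new Hit(h.title, h.score * boost));
        }
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }
}
```

Because the key is derived from the title rather than from Lucene's internal doc number, the boost data survives re-indexing and segment merges.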

-M
