MoreLikeThis class in Lucene within Solr?


Michael Imbeault
Ok, so hopefully I resolved my problems posting to this mailing list and
this won't show up in some thread, but as a new topic!

Is it possible in any way to use the MoreLikeThis class with solr
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html)?
Right now I'm determining similar docs by just querying for the whole
body with OR between words, and it's not very efficient performance
wise. I never coded in Java so I really don't know where I should start...

Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212


Re: MoreLikeThis class in Lucene within Solr?

Chris Hostetter-3

: Is it possible in any way to use the MoreLikeThis class with solr
: (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html)?
: Right now I'm determining similar docs by just querying for the whole
: body with OR between words, and it's not very efficient performance
: wise. I never coded in Java so I really don't know where I should start...

MoreLikeThis could certainly be used in a custom request handler --
there's nothing that does it out of the box however.

if you wanted to implement it, you'd need to start by getting comfortable
compiling java classes -- if you can write a little HelloWorld.java app
and compile it then you can probably compile the solr source tree.
Find yourself a Java Basics tutorial (or a "Java for ___ developers
tutorial" if you can based on whatever language you understand the
most) -- in addition this little ant tutorial might be helpful if you
aren't very familiar with ant either..

http://ant.apache.org/manual/tutorial-HelloWorldWithAnt.html

...once you can build the Solr source, try writing your own class that
implements SolrRequestHandler ... most of the methods are just for
statistics and can just return constant values, the only one that you
really need to put any meat into is handleRequest -- you can make it work
a lot like the MoreLikeThis.main method, the biggest differences being:
  * get your input from the SolrQueryRequest.getParams()
  * put your output in the SolrQueryResponse
  * you don't need to open your own Directory, IndexReader, or
IndexSearcher, just use SolrQueryRequest.getSearcher().getReader()

(don't forget to register your new handler in your solrconfig.xml so you
can try it out)
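
The shape of such a handler can be sketched in plain Java. This is only an illustration of the three points above, not Solr code: `Request`, `Response`, `Searcher`, `SimilarDocsHandler`, `fakeRequest`, and the `id` and `rows` parameter names are all hypothetical stand-ins invented for this sketch, and the `moreLikeThis` method is a placeholder for the work that MoreLikeThis would do against the reader Solr already has open.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HandlerShapeSketch {

    // Hypothetical stand-ins for the request, response, and searcher.
    // These are NOT Solr's real types; they only mirror the roles described above.
    interface Request {
        String getParam(String name);   // input comes from the request parameters
        Searcher getSearcher();         // the already-open searcher, supplied by the server
    }

    interface Searcher {
        // Placeholder for the MoreLikeThis lookup.
        List<String> moreLikeThis(String docId, int rows);
    }

    static class Response {
        final Map<String, Object> values = new LinkedHashMap<>();
        void add(String name, Object value) { values.put(name, value); }
    }

    static class SimilarDocsHandler {
        // Informational methods can simply return constants.
        String getName() { return "SimilarDocsHandler"; }
        String getDescription() { return "returns documents similar to a given one"; }

        // The only method with real logic: read input, do the lookup, write output.
        void handleRequest(Request req, Response rsp) {
            String id = req.getParam("id");
            if (id == null) {
                rsp.add("error", "missing required parameter: id");
                return;
            }
            String rowsParam = req.getParam("rows");
            int rows = (rowsParam == null) ? 3 : Integer.parseInt(rowsParam);
            // No Directory or IndexReader is opened here; the request supplies the searcher.
            rsp.add("similar", req.getSearcher().moreLikeThis(id, rows));
        }
    }

    // In-memory request used only to exercise the handler in this sketch.
    static Request fakeRequest(final Map<String, String> params) {
        return new Request() {
            public String getParam(String name) { return params.get(name); }
            public Searcher getSearcher() {
                return (docId, rows) ->
                    Arrays.asList("doc-a", "doc-b", "doc-c", "doc-d").subList(0, Math.min(rows, 4));
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("id", "some-doc");
        params.put("rows", "2");
        Response rsp = new Response();
        new SimilarDocsHandler().handleRequest(fakeRequest(params), rsp);
        System.out.println(rsp.values);
    }
}
```

A real handler would replace the stand-ins with the Solr types named above and would still need to be registered in solrconfig.xml.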

if you run into problems, feel free to post your code and get feedback.

-Hoss


Re: MoreLikeThis class in Lucene within Solr?

Erik Hatcher
In reply to this post by Michael Imbeault

On Sep 11, 2006, at 4:54 PM, Michael Imbeault wrote:
> Is it possible in any way to use the MoreLikeThis class with solr
> (http://lucene.apache.org/java/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html)?
> Right now I'm determining similar docs by just querying for the whole
> body with OR between words, and it's not very efficient performance
> wise. I never coded in Java so I really don't know where I should start...

I use MoreLikeThis in a custom request handler for Collex, for  
example the three items shown at the bottom left here:

        <http://svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/TermQueryRequestHandler.java?revision=391&view=markup>

I would like to get MoreLikeThis hooked into the  
StandardRequestHandler just like highlighting and facets are now.  
One of these days I'll carve out time to do that if no one beats me  
to it.  It would not be difficult to do, it would just take some time  
to iron out how to parameterize it cleanly for general-purpose use.

        Erik


Re: MoreLikeThis class in Lucene within Solr?

Erik Hatcher

On Sep 12, 2006, at 12:45 PM, Erik Hatcher wrote:
> I use MoreLikeThis in a custom request handler for Collex, for  
> example the three items shown at the bottom left here:
>
> <http://svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/TermQueryRequestHandler.java?revision=391&view=markup>

oops... I meant to post the URL to Collex where we show up to 3 items  
like the selected one:

        <http://www.nines.org/permalink/detail?objid=http%3A%2F%2Fwww.swinburnearchive.org%2Fid%2Fpb1anctr00%2F>

The first URL is to the custom request handler that integrates  
Lucene's MoreLikeThis (in a not-so-general-purpose way).

        Erik


Re: MoreLikeThis class in Lucene within Solr?

Chris Hostetter-3

: oops... I meant to post the URL to Collex where we show up to 3 items
: like the selected one:
:
: <http://www.nines.org/permalink/detail?objid=http%3A%2F%2Fwww.swinburnearchive.org%2Fid%2Fpb1anctr00%2F>

Wow! ... Fragoletta and Laus Veneris sure do look a lot like Anactoria ...
they could be triplets!


-Hoss


Re: MoreLikeThis class in Lucene within Solr?

Erik Hatcher

On Sep 12, 2006, at 1:14 PM, Chris Hostetter wrote:

>
> : oops... I meant to post the URL to Collex where we show up to 3 items
> : like the selected one:
> :
> : <http://www.nines.org/permalink/detail?objid=http%3A%2F%2Fwww.swinburnearchive.org%2Fid%2Fpb1anctr00%2F>
>
> Wow! ... Fragoletta and Laus Veneris sure do look a lot like Anactoria ...
> they could be triplets!

Well, this is because not every "object" in our system has a unique  
thumbnail sadly.  We're an aggregator of other archives, and can only  
work with the metadata they provide us :)

If you want images, we got plenty of them:
<http://www.nines.org/permalink/list/tag/nudes> for example.  Yes, it's
safe for work, it's ART!  :)

        Erik


Re: MoreLikeThis class in Lucene within Solr?

Michael Imbeault
In reply to this post by Erik Hatcher
Thanks for that, Erik; it looks like a very good implementation of the
class. If you ever find time to add it to the query handlers in Solr,
I'm sure it would be wonderful for tons of users (Solr has tons of
users, right? It definitely should!).

I haven't looked at the specifics of how MoreLikeThis determines which
items are similar; I'm mainly wondering about performance here.
Yesterday I tried to code myself a poor man's similarity class (which
was nothing more than doing a search with OR between words and sorting
by score), and the performance was abysmal (well, I kinda expected it.
1000+ word queries on a 15 million doc collection, you don't expect
miracles). At first glance I think it searches for the most 'relevant'
words, am I right? What kind of performance are you getting with it?

Thanks a lot,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212




Re: MoreLikeThis class in Lucene within Solr?

Erik Hatcher

On Sep 12, 2006, at 3:41 PM, Michael Imbeault wrote:
> I haven't looked at the specifics of how MoreLikeThis determines
> which items are similar; I'm mainly wondering about performance
> here. Yesterday I tried to code myself a poor man's similarity
> class (which was nothing more than doing a search with OR between
> words and sorting by score), and the performance was abysmal (well,
> I kinda expected it. 1000+ word queries on a 15 million doc
> collection, you don't expect miracles). At first glance I think it
> searches for the most 'relevant' words, am I right? What kind of
> performance are you getting with it?

Performance with MoreLikeThis is not an issue.  It has many  
parameters to tune how many terms are used in the query it builds,  
and it pulls these terms in an extremely efficient manner from the  
Lucene index.
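
To illustrate the idea of building the query from a limited number of terms rather than from every word in the document, here is a minimal plain-Java sketch. It is not Lucene's implementation: the class name, the method name, the in-memory "corpus", and the `tf * (ln(N/df) + 1)` scoring are assumptions chosen for illustration. The parameters `minTermFreq`, `minDocFreq`, and `maxQueryTerms` are named after the corresponding MoreLikeThis settings (`setMinTermFreq`, `setMinDocFreq`, `setMaxQueryTerms`), but the selection logic here only approximates what they do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class InterestingTermsSketch {

    /**
     * Returns at most maxQueryTerms terms of doc, ranked by tf * idf over corpus.
     * Terms occurring fewer than minTermFreq times in doc, or in fewer than
     * minDocFreq corpus documents, are dropped before ranking.
     */
    static List<String> interestingTerms(List<String> doc, List<List<String>> corpus,
                                         int minTermFreq, int minDocFreq, int maxQueryTerms) {
        // Term frequency within the source document.
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }

        // Document frequency: number of corpus documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> other : corpus) {
            for (String term : new HashSet<>(other)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Score the surviving terms (illustrative formula, see lead-in).
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            if (docFreq == 0 || e.getValue() < minTermFreq || docFreq < minDocFreq) {
                continue;
            }
            double idf = Math.log((double) corpus.size() / docFreq) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }

        // Highest score first; ties broken alphabetically so the result is deterministic.
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxQueryTerms, terms.size())));
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("solr", "lucene", "search", "the", "the", "solr");
        List<List<String>> corpus = Arrays.asList(
            doc,
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "lucene", "index"),
            Arrays.asList("the", "dog"));
        // Keep only the two best terms instead of OR-ing every word.
        System.out.println(interestingTerms(doc, corpus, 1, 1, 2)); // prints [solr, search]
    }
}
```

The resulting short list of terms is what would be OR-ed together, so the query stays small no matter how long the source document is.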

I'm doing some traveling soon, which is always a good time to hack on  
something tractable like adding MoreLikeThis to Solr.  So your wish  
may be granted in a week :)

        Erik


Re: MoreLikeThis class in Lucene within Solr?

Michael Imbeault
Thanks for the answer; and try to enjoy your vacation / travel! Can't
wait to be able to interface with MoreLikeThis within Solr!

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212


