Using Lucene as a Document Comparison Tool

John Brown
Hi,

I have some questions about how to use Lucene for the specific purpose of
finding document similarities. Lucene seems to have classes that were made
for this, including ClassicSimilarity and BM25Similarity. However, I’m
fumbling a bit when it comes to implementing them.



From what I understand, to use these classes you simply set the similarity
of your IndexWriter and IndexSearcher, then submit a query. The documents
returned from your query should be ordered from highest to lowest
similarity.



My initial thought was to just use a phrase query to hold the "document" I
want to find similarities to, but phrase queries are limited in that they
will only return results that are deemed to fall within a certain slop
value. Term/Boolean queries are similarly limited in that they allow
documents to be sorted only if they contain all the terms in the query.



Ideally, I’d like to submit a query that would essentially be a document
itself. Each of my queries contains 10 or so phrases, each made up of 5-10
terms. I would like to compare this query with all the documents in my
index to see which is the most similar, and which is the least similar. I
feel as if there is an easy way to do this that I’m missing; after all, I
essentially just want to remove a step from the process. Any help would be
much appreciated.

Thank you,

-John B

Re: Using Lucene as a Document Comparison Tool

Michael Sokolov-4
Have you tried making a BooleanQuery with an optional (SHOULD) clause for
every word in the query document? You will get a lot of matches, ranked
according to the similarity.
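
A minimal sketch of that suggestion, assuming Lucene 8.x (lucene-core) on the classpath. The class name DocumentAsQuery, the field name "body", the choice of StandardAnalyzer and the in-memory ByteBuffersDirectory are illustrative assumptions, not anything specified in this thread. It sets the same Similarity on the IndexWriterConfig and the IndexSearcher, analyzes the query document with the index-time analyzer, and adds one SHOULD clause per token:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class DocumentAsQuery {
    // Hypothetical field name, used for both indexing and querying.
    static final String FIELD = "body";

    /** Analyzes the text and adds one optional (SHOULD) TermQuery per token. */
    static Query toOptionalTermsQuery(Analyzer analyzer, String text) throws IOException {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        try (TokenStream ts = analyzer.tokenStream(FIELD, text)) {
            CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                Term term = new Term(FIELD, termAttr.toString());
                builder.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
            }
            ts.end();
        }
        return builder.build();
    }

    /**
     * Indexes the corpus in memory and returns Lucene doc ids, best match first.
     * In this small, single-threaded index the doc ids follow insertion order,
     * so they double as positions in the corpus array; a real application
     * would store its own id field instead of relying on that.
     */
    static List<Integer> rank(String[] corpus, String queryDocument) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        Similarity similarity = new BM25Similarity();
        try (Directory dir = new ByteBuffersDirectory()) {
            // Same similarity at index time...
            IndexWriterConfig config = new IndexWriterConfig(analyzer).setSimilarity(similarity);
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                for (String text : corpus) {
                    Document doc = new Document();
                    doc.add(new TextField(FIELD, text, Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // ...and at search time.
                searcher.setSimilarity(similarity);
                Query query = toOptionalTermsQuery(analyzer, queryDocument);
                List<Integer> ranked = new ArrayList<>();
                for (ScoreDoc hit : searcher.search(query, corpus.length).scoreDocs) {
                    ranked.add(hit.doc);
                }
                return ranked;
            }
        }
    }
}
```

Calling DocumentAsQuery.rank(corpus, queryText) returns the matching documents from most to least similar. A query made only of SHOULD clauses still needs at least one clause to match, so documents that share no term with the query document are left out of the results rather than listed last.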

On Thu, Dec 12, 2019 at 10:47 AM John Brown <[hidden email]> wrote:

> I would like to compare this query with all the documents in my
> index to see which is the most similar, and which is the least similar.