Reverse keyword search?


Reverse keyword search?

Uncle
Hello,

I am relatively new to Lucene, so this might be a noob question; if so, please redirect me. I'd like some guidance on how to use Lucene to address a problem.

I have a set of a few hundred (and growing) user-defined keywords such as "spain" and "volkswagen", each associated with one of about 20 categories, such as "world" and "automotive". My challenge is to take the summary of a news article (title, description, caption, meta-tags, keywords, but not the entire content), such as what you might find on cnn.com, and look for those keywords in it to identify the article's category. The summary is often "dirty" with special characters, commas, hashtags, etc., and so needs to be tokenized. I would also like to use Lucene's natural language processing to match "spanish" to "spain", for example.

This appears to be somewhat the reverse of the typical Lucene use case -- rather than having a set of, say, 1000 articles which are indexed, then issuing a query using a few keywords to search on those articles, I have a set of, say, 1000 keywords and a single article, and I want to determine which keyword best fits the article's summary. How can I best use Lucene to handle this?

I have considered:

1) Creating a Lucene index of the keywords and topics, then tokenizing the summaries using Lucene's tokenizers, then issuing queries with the tokens to find the best match.
2) Indexing the article summary, then iterating over all of the keywords, issuing a query for each of them, then keeping the best match.
3) Learning how Lucene does the individual keyword-to-keyword matching and writing some custom solution.

I'd appreciate it if someone could point me in the right direction.

Randy


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Reverse keyword search?

iorixxx
> This appears to be somewhat the reverse of the typical
> Lucene use case -- rather than having a set of, say, 1000
> articles which are indexed, then issuing a query using a few
> keywords to search on those articles, I have a set of, say,
> 1000 keywords and a single article, and I want to determine
> which keyword best fits the article's summary. How can I
> best use Lucene to handle this?

I have not used it myself, but MemoryIndex seems to be what you are after.

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/memory/MemoryIndex.html
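The idea behind that suggestion (hold the single summary in memory and test each keyword against it) can be sketched in plain Java. This is only a stand-in to show the shape of the problem, not the MemoryIndex API: the class name, the keyword table, and the crude regex tokenizer are all assumptions, and there is no stemming, so "spanish" would not match "spain" here. A Lucene analyzer with a stemmer would have to take the tokenizer's place for that.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ReverseKeywordSketch {
    // Hypothetical keyword -> category table; the real one is user-defined.
    static final Map<String, String> KEYWORD_TO_CATEGORY = new HashMap<>();
    static {
        KEYWORD_TO_CATEGORY.put("spain", "world");
        KEYWORD_TO_CATEGORY.put("volkswagen", "automotive");
    }

    // Crude tokenizer: lower-case, then split on anything that is not a letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Counts keyword hits per category and returns the category with the most hits,
    // or null if no keyword occurs in the summary. Ties are broken arbitrarily.
    public static String bestCategory(String summary) {
        Map<String, Integer> hits = new HashMap<>();
        for (String token : tokenize(summary)) {
            String category = KEYWORD_TO_CATEGORY.get(token);
            if (category != null) {
                hits.merge(category, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(
            bestCategory("#Volkswagen recalls cars; Volkswagen shares fall in Spain"));
    }
}
```

Looking tokens up in a map keeps the work proportional to the length of the summary rather than the number of keywords, which matters as the keyword list grows.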


Similarity coefficient for more exact matching

sxam
In reply to this post by Uncle
Hi guys,
I have a field, Analyzed, Store.NO.
Suppose one Document has the value "Hello" in this field.
Another one has "Hello world , one, two, three, four".
Since the field is Analyzed (with norms), the "one two three four" will definitely affect the resulting score when we search for the query "Hello world". Does anyone know whether I can control some coefficients to set the weight of exact matching vs. the number of words (the norm factor)?
Thanks,
 

Maxim

Re: Similarity coefficient for more exact matching

Ian Lea
You can override org.apache.lucene.search.Similarity/DefaultSimilarity
to tweak quite a lot of stuff.

computeNorm() may be the method you are interested in.  Called at
indexing time so be sure to use the same implementation at index and
query time, using IndexWriterConfig.setSimilarity() and
IndexSearcher.setSimilarity(), unless you are clever or like being
confused.

SweetSpotSimilarity might also be worth a look.
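To make the norm effect concrete, here is a small plain-Java sketch of the arithmetic, with no Lucene calls in it. The first function is the 1/sqrt(numTerms) length normalisation that DefaultSimilarity applies in 3.x (before field boost and the lossy one-byte encoding of the norm). The second is a plateau-shaped alternative of the kind SweetSpotSimilarity is meant for; its exact formula and the min/max/steepness values used below are assumptions for illustration, not settings taken from this thread.

```java
public class LengthNormSketch {
    // Default-style length norm: shorter fields get a higher norm.
    public static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Plateau-style length norm (assumed formula): fields whose length lies in
    // [min, max] all get norm 1.0; outside that range the norm falls off.
    public static double plateauLengthNorm(int numTerms, int min, int max, double steepness) {
        double distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    public static void main(String[] args) {
        int shortDoc = 1; // "Hello"
        int longDoc = 6;  // "Hello world , one, two, three, four" -> 6 terms

        System.out.println("default, 1 term:  " + defaultLengthNorm(shortDoc));
        System.out.println("default, 6 terms: " + defaultLengthNorm(longDoc));

        // With a hypothetical plateau of 1..10 terms, both documents get the same norm,
        // so the extra words no longer penalise the longer field.
        System.out.println("plateau, 1 term:  " + plateauLengthNorm(shortDoc, 1, 10, 0.5));
        System.out.println("plateau, 6 terms: " + plateauLengthNorm(longDoc, 1, 10, 0.5));
    }
}
```

With the default-style function the six-term field scores at roughly 0.41 of the one-term field on the norm factor alone; flattening the curve over the expected range of field lengths removes that penalty.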

--
Ian.


On Fri, Apr 27, 2012 at 1:18 PM, Maxim Terletsky <[hidden email]> wrote:
> Hi guys,
> I have a field, Analyzed, Store.NO.
> Suppose one Document has the value "Hello" in this field.
> Another one has "Hello world , one, two, three, four".
> Since the field is Analyzed (with norms), the "one two three four" will definitely affect the resulting score when we search for the query "Hello world". Does anyone know whether I can control some coefficients to set the weight of exact matching vs. the number of words (the norm factor)?
> Thanks,
>
>
> Maxim


RE: Similarity coefficient for more exact matching

Paul Hill
> [use] IndexWriterConfig.setSimilarity() and
> IndexSearcher.setSimilarity(), unless you are clever or like being confused.
>
> SweetSpotSimilarity might also be worth a look.
>
> --
> Ian.

Being even less clever,  I just make sure I set:

Similarity.setDefault(new MySimilarity())  

when crawling and searching, so everything uses the same similarity strategies.

Checking the 3.4 code, IndexWriterConfig and IndexSearcher both default to Similarity.getDefault().

Any thoughts on scenarios where you'd not push a custom similarity into the default position?

-Paul



Re: Similarity coefficient for more exact matching

Ian Lea
Similarity.setDefault(new MySimilarity()) is certainly better than the
2 calls I recommended.  Thanks.

I find it hard to see why one might not want to do this in normal
usage but have a vague recollection of someone once outlining some
obscure scenarios where different similarities at index and search
time made sense.


--
Ian.


On Fri, May 4, 2012 at 5:32 PM, Paul Hill <[hidden email]> wrote:

>> [use] IndexWriterConfig.setSimilarity() and
>> IndexSearcher.setSimilarity(), unless you are clever or like being confused.
>>
>> SweetSpotSimilarity might also be worth a look.
>>
>> --
>> Ian.
>
> Being even less clever,  I just make sure I set:
>
> Similarity.setDefault(new MySimilarity())
>
> when crawling and searching, so everything uses the same similarity strategies.
>
> Checking the 3.4 code, IndexWriterConfig and IndexSearcher both default to Similarity.getDefault().
>
> Any thoughts on scenarios where you'd not push a custom similarity into the default position?
>
> -Paul
