internal Searching behavior or how to get a hit?

internal Searching behavior or how to get a hit?

Mathias Keilbach
Hi!
I have a question concerning the internal searching behavior of Lucene: how does Lucene get a hit?
If I search for a term, will each indexed document be checked for this term, or is there an internal relation between terms and Lucene documents?
Thanks for any advice.
Matt


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: internal Searching behavior or how to get a hit?

Miles Barr
On Wednesday 03 May 2006 14:56, Mathias Keilbach wrote:
> I have a question concerning the internal searching behavior of Lucene: how
> does Lucene get a hit? If I search for a term, will each indexed document
> be checked for this term, or is there an internal relation between terms and
> Lucene documents? Thanks for any advice.

AFAIK Lucene uses an inverted index, which maps tokens (terms) to the
documents that they appear in. A search therefore looks the term up in the
index rather than checking every document.

Some background info on inverted indices:

http://en.wikipedia.org/wiki/Inverted_index
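
To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy illustration under simple assumptions (lowercasing, splitting on non-alphanumeric characters, everything held in memory) and not Lucene's actual implementation or API. The class name InvertedIndexSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy class for illustration; not part of Lucene.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Splits the text into lowercase tokens and records, for each token,
    // that it occurs in this document. Returns the assigned document id.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // A search is one map lookup; no document text is scanned at query time.
    List<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Lucene is a search library");
        index.addDocument("An inverted index maps terms to documents");
        index.addDocument("Lucene uses an inverted index");
        System.out.println(index.search("lucene"));   // [0, 2]
        System.out.println(index.search("index"));    // [1, 2]
        System.out.println(index.search("missing"));  // []
    }
}
```

The cost of a lookup depends on the number of distinct terms and the length of the matching posting list, not on the total number of documents, which is why each document does not need to be checked per query.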



Miles



Re: internal Searching behavior or how to get a hit?

Mathias Keilbach

Thanks a lot for your quick support!
I found another site which describes the index structure and files.
http://lucene.apache.org/java/docs/fileformats.html



(Lucene) tools/algorithms for co-occurrence terms computation

Xiaocheng Luan
In reply to this post by Mathias Keilbach
Hi,

Are there any Lucene tools (or general tools/algorithms) that can compute the co-occurrence terms for a given query (or term)?

For example, if the user types in "avian flu", the top co-occurrence terms may include "Hong Kong", "vaccine", "H5N1", or "pandemic", depending on the underlying data set. The terms may be precomputed, or computed dynamically on a small data set. Any help will be highly appreciated.

Thanks!
Xiaocheng Luan


Re: (Lucene) tools/algorithms for co-occurrence terms computation

Karl Wettin-3
On Wed, 2006-05-10 at 10:26 -0700, Xiaocheng Luan wrote:
> Are there any Lucene tools

Not that I know of.

> (or general tools/algorithms) that can compute the co-occurrence terms
> for a given query (or term)?

It might be slow, but you can work with the TermFreqVector of each document.
It would probably be best to store the resulting data in a separate index.

I would start by building an all-in-memory index using Maps and hard
links. Then use your favorite object mapping layer to store the
information. Perhaps java.io.Serializable is enough.
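
As a rough illustration of the in-memory, Map-based approach, the following plain-Java sketch counts which terms appear in the same documents as a query term. It makes no Lucene calls: the per-document term sets are assumed to have been extracted already (for example from term vectors), and the class and method names are hypothetical. It counts document-level co-occurrence only and applies no weighting, so treat it as a starting point rather than a finished method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
class CooccurrenceSketch {
    // For every term other than queryTerm, counts the number of documents
    // in which that term appears together with queryTerm.
    static Map<String, Integer> cooccurrenceCounts(List<Set<String>> docTerms, String queryTerm) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> terms : docTerms) {
            if (!terms.contains(queryTerm)) {
                continue; // this document does not match the query term
            }
            for (String term : terms) {
                if (!term.equals(queryTerm)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns up to n terms, highest count first; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up term sets standing in for the terms of four documents.
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("avian", "flu", "h5n1", "vaccine")),
            new HashSet<>(Arrays.asList("flu", "vaccine", "pandemic")),
            new HashSet<>(Arrays.asList("avian", "flu", "h5n1")),
            new HashSet<>(Arrays.asList("stock", "market")));
        Map<String, Integer> counts = cooccurrenceCounts(docs, "flu");
        System.out.println(topTerms(counts, 3)); // [avian, h5n1, vaccine]
    }
}
```

Since HashMap and Integer are serializable, the resulting counts map could be written out with plain Java serialization if precomputing is preferred over computing per query.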
 
Weka is a really nice data mining library. You should post the same
question to them, and tell them what you are trying to achieve with this data.

Perhaps they have some really nice classifier for you.

Feel free to report back here.





Re: (Lucene) tools/algorithms for co-occurrence terms computation

Grant Ingersoll
Take a look at my ApacheCon example code at
http://www.cnlp.org/apachecon2005.  In particular there is some sample
code in the file IndexAnalysis.java that demonstrates what Karl is
talking about.  I don't think it is exactly what you want, but it shows
how to get co-occurrence information from the Index.  You may be able to
use it as a starting point.


--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886

