Re: Indexing of virtual "made up" documents


Re: Indexing of virtual "made up" documents

Erik Hatcher

On Apr 26, 2005, at 3:21 AM, Daniel Stephan wrote:
> Let's see if somebody listens on this list :-D

I doubt many are on this list, yet.  But your question is probably best
asked on the java-user@lucene list rather than here.  I'll CC java-user
this time to loop those folks in.

> I wonder if the following is possible with Lucene.

Yes, it is!

> I would like to add documents to the index, which aren't real
> documents.

That's pretty much how my use of Lucene works - there aren't real
filesystem documents to index per se.

> :-) Meaning: there is no text to parse and tokenize. What I have is a
> number of features, some are simple words, some are combinations of
> words. Those features classify an entity in my database.
>
> I also have my own parser/analyzer/tokenizer which is able to take a
> text and extract those features from it.  (Possibly a query.)
>
> So, I want to do something like (pseudocode):
>
>    Lucene.index(myEntity.getId, myEntity.getDescriptors)
>
> and then, when a query is issued:
>
>    List entityIds = Lucene.query(myQuery.convertToLuceneQueryLanguage)
>
> I was looking at the source and couldn't find a way to skip the
> analyzing stage and hand the features to Lucene myself.

You have two good options here....  First, you can add each token
individually as a Field.Keyword() with the same field name - these will
not get analyzed.
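
A minimal sketch of that first option, against the Lucene 1.4-era API
(the "id"/"features" field names and the FeatureIndexer class are made
up for illustration), so your pseudocode
Lucene.index(myEntity.getId, myEntity.getDescriptors) becomes roughly:

    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FeatureIndexer {
        // Index one entity: its id plus each pre-extracted feature as
        // an unanalyzed keyword in the same "features" field.
        // Field.Keyword values bypass the analyzer entirely; the
        // analyzer passed in is only required by the constructor.
        public static void index(String id, List features) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/feature-index",
                                                 new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("id", id));
            for (Iterator it = features.iterator(); it.hasNext();) {
                doc.add(Field.Keyword("features", (String) it.next()));
            }
            writer.addDocument(doc);
            writer.close();
        }
    }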

> One possibility would be to use an analyzer which only considers
> whitespace as delimiter and set all descriptors as one string. This
> feels suboptimal, because I have them as single tokens already and
> concatenating them first, to let lucene tokenize them again,  should
> not
> be necessary.

You're right, it's not necessary.  The second option is to create a
custom Analyzer that returns the tokens you've already established.
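
A rough sketch of that second option, again assuming the 1.4-era
TokenStream API (FeatureAnalyzer is a made-up name; it deliberately
ignores the Reader and just replays your pre-extracted features):

    import java.io.Reader;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class FeatureAnalyzer extends Analyzer {
        private final List features;   // e.g. myEntity.getDescriptors()

        public FeatureAnalyzer(List features) {
            this.features = features;
        }

        // Ignore the Reader; emit each feature string as one token.
        public TokenStream tokenStream(String fieldName, Reader reader) {
            final Iterator it = features.iterator();
            return new TokenStream() {
                public Token next() {
                    if (!it.hasNext()) return null;   // end of stream
                    String feature = (String) it.next();
                    return new Token(feature, 0, feature.length());
                }
            };
        }
    }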

> Also, the neighbour information isn't applicable in my scenario. It
> seems you use placement of terms somehow. I don't have placement
> information. Would that hurt Lucene?

Placement of terms is used in phrase queries, but it certainly isn't
something you need to concern yourself with.  You can simply emit
tokens in whatever positions you like (leaving the default position
increment at 1 is what I'd recommend).

> I am not sure how Lucene uses the placement information, but in the
> described case, where I concatenate all my features into a
> whitespace-delimited text, I fear that Lucene uses the placement of
> features in this made-up text and comes to some wrong conclusions
> (after all, the placement is arbitrary in the "made-up" text).

What wrong conclusions do you fear here?  Again, the position
information is used for phrase queries, but in your situation you
wouldn't be using phrase queries, so there's no need to concern
yourself with the position stuff at all.

> Also 2: I am not yet sure what the converter would have to look like.
> After all, the terms in the query have to be of the same form as those
> in the index; otherwise they wouldn't match.  Can I inject my own
> analyzer for just the query part, so that Lucene hands it phrases and
> lets it build features from those phrases?

Sure - you can use any analyzer you like for query parsing.  It sounds
like you aren't going to use QueryParser, though, so you may not need
an analyzer at all.  You definitely have to ensure that the terms in
the query match the terms you indexed in order to find documents.  How
you do this is really up to you.
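
And for the search side, a sketch of the no-QueryParser route, assuming
the features were indexed as keywords in a "features" field as above
(the FeatureSearcher class is made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class FeatureSearcher {
        // Build the query directly from already-extracted features,
        // bypassing QueryParser (and any analyzer) entirely.  Each
        // clause is optional - not required, not prohibited - so a
        // document matches on any shared feature and scores higher
        // the more features it shares.
        public static Hits search(String[] features) throws Exception {
            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i < features.length; i++) {
                query.add(new TermQuery(new Term("features", features[i])),
                          false, false);
            }
            IndexSearcher searcher = new IndexSearcher("/tmp/feature-index");
            return searcher.search(query);
        }
    }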

> Any info is appreciated.  I could maybe build my own simple index,
> since the analyzer is already there, but I would prefer to use a
> professional solution with a good query language and some additional
> nice-to-have features.

If you could give us something more concrete, we could help in more
detail.  But from the scenario you've described, Lucene fits fine; it
in fact matches the way I use it in some cases.

> May I use Lucene? :-)

Yes, you may!

        Erik



Re: Indexing of virtual "made up" documents

Paul Libbrecht

On Apr 26, 2005, at 3:00 PM, Erik Hatcher wrote:

>> I am not sure how Lucene uses the placement information, but in the
>> described case, where I concatenate all my features into a
>> whitespace-delimited text, I fear that Lucene uses the placement of
>> features in this made-up text and comes to some wrong conclusions
>> (after all, the placement is arbitrary in the "made-up" text).
> What wrong conclusions do you fear here?  Again, the position
> information is used for phrase queries, but in your situation you
> wouldn't be using phrase queries, so there's no need to concern
> yourself with the position stuff at all.

There are some information retrieval settings in which things that
appear early in the document are given a greater score... is there
nothing like that in Lucene's scoring?

paul



Re: Indexing of virtual "made up" documents

Erik Hatcher
On Apr 26, 2005, at 4:46 PM, Paul Libbrecht wrote:

> There are some information retrieval settings in which things that
> appear early in the document are given a greater score... is there
> nothing like that in Lucene's scoring?

No, Lucene doesn't have that feature, at least not explicitly....  it
could be hacked, sort of, by injecting multiple copies of the same
term at the same position (to get a higher term frequency) for the
earlier terms.  Back to the original question - the position
information will not adversely affect scoring.
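
A rough sketch of that hack, assuming the Token.setPositionIncrement()
API from this era of Lucene (the StackedTokens helper is made up; the
tokens would be emitted from a custom TokenStream like the one
sketched earlier in the thread):

    import org.apache.lucene.analysis.Token;

    public class StackedTokens {
        // Emit a term twice at the same position: the duplicate carries
        // a position increment of 0, so the term's frequency (and hence
        // its score contribution) doubles without shifting the positions
        // of any later tokens.
        public static Token[] weighted(String term) {
            Token original  = new Token(term, 0, term.length());
            Token duplicate = new Token(term, 0, term.length());
            duplicate.setPositionIncrement(0);
            return new Token[] { original, duplicate };
        }
    }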

        Erik



Re: Indexing of virtual "made up" documents

Morus Walter
Erik Hatcher writes:

> > There are some information retrieval settings in which things that
> > appear early in the document are given a greater score... is there
> > nothing like that in Lucene's scoring?
>
> No, Lucene doesn't have that feature, at least not explicitly....  it
> could be hacked, sort of, by injecting multiple copies of the same
> term at the same position (to get a higher term frequency) for the
> earlier terms.  Back to the original question - the position
> information will not adversely affect scoring.
>
Wouldn't it be easier to fake that with a proximity query and a document
start marker?
E.g. index `xxxstartxxx some text other text'
and search for "xxxstartxxx some"~10000000 or "xxxstartxxx other"~10000000.
If I understand proximity queries correctly, the latter should get a
lower score (given that 'some' and 'other' have equal scores).  Untested,
though.
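
In code, that might look roughly like this (untested, as said; the
StartProximity class and the field name are just the example above
spelled out):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    public class StartProximity {
        // Sloppy phrase between the document-start marker and the term:
        // sloppy matches are scored inversely to the distance bridged,
        // so the later the term occurs, the lower the score.
        public static PhraseQuery anchored(String field, String term) {
            PhraseQuery query = new PhraseQuery();
            query.add(new Term(field, "xxxstartxxx"));
            query.add(new Term(field, term));
            query.setSlop(10000000);   // effectively unbounded
            return query;
        }
    }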

Alternatively, it should be possible to write a query that does such
scoring directly (without the document start anchor) by the same means a
proximity query uses.  A proximity query uses positional information, so
it should be possible to use that information for scoring based on
document position as well.

Morus


Re: Indexing of virtual "made up" documents

Doug Cutting
Morus Walter wrote:
> Alternatively, it should be possible to write a query that does such
> scoring directly (without the document start anchor) by the same means
> a proximity query uses.  A proximity query uses positional information,
> so it should be possible to use that information for scoring based on
> document position as well.

The easiest way to do this would be to use span queries and modify
SpanScorer.  The score for a span is currently based on the inverse of
its length, but it could also or instead be based on the inverse of its
start, end or middle, so that matches nearer the start of the document
are given higher scores.
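
To make that concrete, a small sketch of the span machinery involved
(the SpanScorer change itself would live inside Lucene; the SpanStarts
class and the "body" field are made up):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanFirstQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    public class SpanStarts {
        // Walk the spans of a term: spans.start() is exactly the value
        // a modified SpanScorer could fold into the score.
        public static void dump(IndexReader reader) throws Exception {
            SpanTermQuery term = new SpanTermQuery(new Term("body", "some"));
            Spans spans = term.getSpans(reader);
            while (spans.next()) {
                System.out.println("doc " + spans.doc()
                                   + ", span starts at " + spans.start());
            }
            // Related built-in: SpanFirstQuery matches only spans that
            // end before a given position (here 10) - it filters rather
            // than rescores, but it's the closest thing out of the box.
            SpanFirstQuery nearStart = new SpanFirstQuery(term, 10);
        }
    }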

Doug


Re: Indexing of virtual "made up" documents

Daniel Stephan
In reply to this post by Erik Hatcher
Erik, thank you very much for your help!

I am not yet in a position to build the indexing (other features are
in line ahead of it), but I will try Lucene for it.  Looks very good :)

What I did not ask last time, because it just occurred to me, was: I
have in my application a metric of "fit" that measures how well a term
matches a document, and also how well the term matches the whole
collection.  That seems like a good candidate for your "boost" value,
then?

Cheers,
Daniel



Re: Indexing of virtual "made up" documents

Erik Hatcher

On Apr 30, 2005, at 7:01 AM, Daniel Stephan wrote:

> Erik, thank you very much for your help!
>
> I am not yet in a position to build the indexing (other features are
> in line ahead of it), but I will try Lucene for it.  Looks very good :)
>
> What I did not ask last time, because it just occurred to me, was: I
> have in my application a metric of "fit" that measures how well a term
> matches a document, and also how well the term matches the whole
> collection.  That seems like a good candidate for your "boost" value,
> then?

Sorry for the delay.  It sounds like a custom Similarity is what
you'll need to implement.  Check out the formula and javadocs here:

     http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html

The tf() and idf() factors seem to be what you want, though only idf()
gives you the term for context.  Boosting a term is only possible at
query time, not at indexing time, so that might not be sufficient,
though maybe it is.
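
As a rough illustration, a skeletal custom Similarity against the
1.4-era API (FitSimilarity is a made-up name, and where exactly your
"fit" metric plugs in is an assumption on my part):

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Similarity;

    public class FitSimilarity extends DefaultSimilarity {
        // Fold a per-document "fit" into the tf factor...
        public float tf(float freq) {
            return super.tf(freq);   // replace with your document-level fit
        }

        // ...and a collection-level "fit" into the idf factor.
        public float idf(int docFreq, int numDocs) {
            return super.idf(docFreq, numDocs);   // collection-level fit
        }
    }

    // Installed once at startup, it applies to indexing and searching:
    //     Similarity.setDefault(new FitSimilarity());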

Let us know how you manage to "fit" Lucene into your situation.

     Erik

