Adding to the termFreqVector

Adding to the termFreqVector

ryanskow

How would one go about adding additional terms to a field which is not
stored literally, but instead has a termFreqVector?  For example:

If DocumentA was indexed originally with:
    myTermField: red green blue

termFreqVector would look like:
   freq {myTermField: red/1, green/1, blue/1}

Now, I'd like to add some more terms (red, yellow) and desire the
termFreqVector to look like this:
   freq {myTermField: red/2, green/1, blue/1, yellow/1}

It would seem like there would be a convenient way of accomplishing this,
but I must be missing something.

Any advice would be greatly appreciated!


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Adding to the termFreqVector

Grant Ingersoll
Is your intent to persist the changed vector somehow or just use it in
your application for the immediate search?

TermFreqVector is an interface, so if you aren't persisting, I would
write a wrapper class around the one that is returned by Lucene that has
add/set methods on it for manipulating the underlying vector and pass
that around in your application.  The other option is to get the source
and modify the TermFreqVector for your needs.
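
A minimal sketch of the wrapper idea, for in-memory use only.  The class
name MutableTermFreqs and its methods are hypothetical and not part of
Lucene.  To stay self-contained it takes the parallel arrays that
TermFreqVector.getTerms() and TermFreqVector.getTermFrequencies() return,
rather than the vector itself, and it assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Lucene class: a mutable copy of a term vector's contents.
class MutableTermFreqs {
    private final String field;
    private final Map<String, Integer> freqs = new LinkedHashMap<>();

    // terms and counts are parallel arrays, assumed to come from
    // TermFreqVector.getTerms() and TermFreqVector.getTermFrequencies().
    MutableTermFreqs(String field, String[] terms, int[] counts) {
        if (terms.length != counts.length) {
            throw new IllegalArgumentException("terms and counts differ in length");
        }
        this.field = field;
        for (int i = 0; i < terms.length; i++) {
            freqs.merge(terms[i], counts[i], Integer::sum);
        }
    }

    // Adds delta occurrences of term, creating the entry if the term is new.
    void add(String term, int delta) {
        freqs.merge(term, delta, Integer::sum);
    }

    // Overwrites the count for term.
    void set(String term, int count) {
        freqs.put(term, count);
    }

    // Returns the current count, or 0 for an unknown term.
    int freq(String term) {
        return freqs.getOrDefault(term, 0);
    }

    // Mirrors the "freq {field: term/count, ...}" layout used in this thread.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("freq {").append(field).append(": ");
        boolean first = true;
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append('/').append(e.getValue());
            first = false;
        }
        return sb.append('}').toString();
    }
}
```

Starting from red/1, green/1, blue/1 and calling add("red", 1) and
add("yellow", 1) gives the counts asked for in the original question.
Nothing here touches the index; the changes live only in the wrapper.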

Persistence is a bit harder, but would probably involve manipulating
the document and then re-indexing it, adding some dummy terms onto the
document so that its new vector has the updated frequencies.

Is that what you are looking for?

>>> [hidden email] 5/30/2005 12:37:54 PM >>>


Re: Adding to the termFreqVector

ryanskow

Adding new terms and re-indexing the document is the desired behavior.

One (non-scalable) solution would be to parse the toString of the
termFreqVector (freq {myTermField: red/2, green/1, blue/1}) and create a
new string representation of the expanded terms:  (red red green blue)

This obviously isn't a good solution.  Finding a way to simply do a
document.addTerm("red"); and then re-indexing would be ideal.




Re: Adding to the termFreqVector

Grant Ingersoll
In reply to this post by ryanskow
I don't think you need to parse the toString; you have the
TermFreqVector object, which lets you access the appropriate pieces of
information (string, freq).  You could then turn around and delete the
old document and index the new one based on the vector with the
increments.  I don't know whether it would scale or not, but at some
point you have to do the work, right?  Lucene is quite fast; you might
be surprised what "scales".

Where is the information coming from that allows you to increment "red"?
Is the information known at indexing?  Perhaps your token filters could
just add the value onto the token stream when you index.  Of course,
this _may_ screw up position information, but you haven't said anything
about needing that, so I will assume you don't.

-Grant


>>> [hidden email] 5/31/2005 6:02:53 PM >>>


Re: Adding to the termFreqVector

ryanskow
Lucene's scalability is not in question.  The simple solution of
rebuilding the string of terms is what I referred to as not being
scalable.  For instance, consider the following term vector:
termFreqVector (freq {myTermField: red/699999, green/799999, blue/8999999})
Recreating a string with 699999 'reds' and a lot of greens and blues is
definitely not memory or time efficient.  The solution I proposed was
simply an example of the effect I was hoping for, not something that
one would want to do.  A spin-off of that theme could be used, though:
if the numbers were normalized, a relative number of reds, greens, and
blues could be created.  Still not convenient, but an option.
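
One way the normalization could look, as a rough sketch only.  The class
FreqScaler, the scale method and the maxCount cap are all hypothetical
names; the idea is to shrink every count by the same factor so the
largest count equals the cap, while keeping each nonzero count at least 1.
The result is lossy, since rounding changes the exact ratios.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class FreqScaler {
    // Scales counts so the largest becomes maxCount when it exceeds that cap.
    // Counts are left untouched if the largest is already within the cap.
    static Map<String, Integer> scale(Map<String, Integer> freqs, int maxCount) {
        int largest = 0;
        for (int f : freqs.values()) {
            largest = Math.max(largest, f);
        }
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            int f = e.getValue();
            int scaled = f;
            if (largest > maxCount) {
                scaled = (int) Math.round((double) f * maxCount / largest);
                if (f > 0 && scaled == 0) {
                    scaled = 1; // never drop a term that was present
                }
            }
            out.put(e.getKey(), scaled);
        }
        return out;
    }
}
```

With the vector above and a cap of 90, the string to rebuild would hold
about a hundred tokens instead of several million.  Whether the changed
absolute counts are acceptable depends on how the frequencies are used
later.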

The new information, to increment 'red' for instance, would come at
runtime, after the initial indexing.  I would have no issue with
upping the 'red' bucket in the term vector and reindexing the document
with the new term vector, but it looks like the term vector is read-only
and cannot be modified once retrieved from the IndexReader for the given
field of a document.  I'm not familiar with how to add to an existing term
frequency vector; if you have an example of doing so, that would be great!
At this point in time the term positions are not of interest.

Thanks for your replies!

Ryan


Re: Adding to the termFreqVector

Grant Ingersoll
In reply to this post by ryanskow
I don't think you need to add to a term vector.

I guess what I was thinking (and this is just a guess, not knowing your
architecture) is that you could have a TokenStreamFilter/Analyzer that
takes in the appropriate term vectors, along with your runtime
incremental information.  Then, as you are "indexing", you add things to
the token stream from the term vector (doing this the number of times
needed based on its frequency).  If a specific term is in your runtime
incremental information, you add that many more terms onto the token
stream.  If your runtime info is stored in a Map, then this lookup would
be quite fast and your indexing performance should be about what it was
for the original document.
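
The repetition logic described above might be sketched as follows.  This
is an illustration under assumptions, not Lucene code: the class
RepeatedTermIterator is hypothetical, and a plain java.util.Iterator
stands in for the token stream so the example is self-contained.  In a
real setup the same loop would presumably sit inside a custom token
stream handed to the analyzer.  It assumes Java 8 or later.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical helper, not a Lucene class: emits each term once per occurrence
// without ever building the expanded string in memory.
class RepeatedTermIterator implements Iterator<String> {
    private final Iterator<Map.Entry<String, Integer>> entries;
    private String current;
    private int remaining;

    // terms and counts are parallel arrays, assumed to come from the stored
    // term vector; increments holds the runtime additions per term.
    RepeatedTermIterator(String[] terms, int[] counts, Map<String, Integer> increments) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            merged.merge(terms[i], counts[i], Integer::sum);
        }
        // Terms that only appear in the increments are appended as new entries.
        for (Map.Entry<String, Integer> e : increments.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        this.entries = merged.entrySet().iterator();
    }

    @Override
    public boolean hasNext() {
        // Advance to the next term that still has occurrences to emit.
        while (remaining <= 0 && entries.hasNext()) {
            Map.Entry<String, Integer> e = entries.next();
            current = e.getKey();
            remaining = e.getValue();
        }
        return remaining > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remaining--;
        return current;
    }
}
```

Memory use is one map entry per distinct term, so a count such as
red/699999 costs no more than red/1; only the number of emitted tokens
grows.  The emitted order groups identical terms together, which is fine
as long as term positions are not of interest.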



>>> [hidden email] 6/1/2005 5:23:33 PM >>>