MoreLikeThis and term vectors - documentation suggestion

kkrugler
Hi all,

I was trying out the MoreLikeThis support, and getting some odd results.

I realized that unless the fields being used for the similarity
calculation have a stored term vector, the MoreLikeThis code from
Lucene will re-analyze the field using the StandardAnalyzer, which
in my case is quite different from what I'm using in the Solr schema.

So the first note is for anybody using MoreLikeThis: make sure
you also specify termVectors=true in the Solr schema for any fields
being passed to the query as mlt.fl parameters.
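
As a minimal sketch of that first note, the schema entries might look like the fragment below. The field names (title, body) and the field type name (text) are hypothetical and do not come from this thread; only the termVectors attribute is the point.

```xml
<!-- Hypothetical fields for illustration; substitute your own names and types. -->
<!-- termVectors="true" stores a term vector, so MoreLikeThis does not have to re-analyze the field. -->
<field name="title" type="text" indexed="true" stored="true" termVectors="true"/>
<field name="body" type="text" indexed="true" stored="true" termVectors="true"/>
```

A MoreLikeThis request would then name these same fields in its mlt.fl parameter.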

The second note is that the Wiki page and the example schema might
want to include some reference to the termVectors field attribute.
For example, the sample schema says:

>    <!-- Valid attributes for fields:
>      name: mandatory - the name for the field
>      type: mandatory - the name of a previously defined type from
>the <types> section
>      indexed: true if this field should be indexed (searchable or sortable)
>      stored: true if this field should be retrievable
>      compressed: [false] if this field should be stored using gzip compression
>        (this will only apply if the field type is compressable; among
>        the standard field types, only TextField and StrField are)
>      multiValued: true if this field may contain multiple values per document
>      omitNorms: (expert) set to true to omit the norms associated with
>        this field (this disables length normalization and index-time
>        boosting for the field, and saves some memory).  Only full-text
>        fields or fields that need an index-time boost need norms.

That made me think at first that these were the only valid attributes
for fields. Likewise, the wiki page at
http://wiki.apache.org/solr/SchemaXml also doesn't make any mention
of termVectors, termPositions, or termOffsets. I would edit that
page, but there currently isn't a section that talks about all the
attributes, only the common ones.
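
One possible wording for the missing entries, in the style of the sample schema comment quoted above, is sketched here. The descriptions are an assumption about how the attributes could be documented, not text taken from the Solr distribution.

```xml
<!-- Possible additions to the "Valid attributes for fields" comment:
       termVectors: [false] set to true to store the term vector for this
         field (MoreLikeThis uses a stored term vector, when present,
         instead of re-analyzing the field)
       termPositions: true to store position information with the term vector
       termOffsets: true to store offset information with the term vector
-->
```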

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: MoreLikeThis and term vectors - documentation suggestion

Bertrand Delacretaz
On 2/26/07, Ken Krugler <[hidden email]> wrote:

> ...I was trying out the MoreLikeThis support, and getting some odd results...

Thanks for the info, I have added a link to your message at
https://issues.apache.org/jira/browse/SOLR-69

-Bertrand

Re: MoreLikeThis and term vectors - documentation suggestion

Mike Klaas
On 2/26/07, Bertrand Delacretaz <[hidden email]> wrote:
> On 2/26/07, Ken Krugler <[hidden email]> wrote:
>
> > ...I was trying out the MoreLikeThis support, and getting some odd results...
>
> Thanks for the info, I have added a link to your message at
> https://issues.apache.org/jira/browse/SOLR-69

Is it possible to modify MoreLikeThis to use the schema.xml-defined
analyzer?  That's the way the highlighting code currently works (it
picks the index-time analyzer).

It would be nice for as many features as possible to work without term
vectors.  I sometimes wonder whether schema.xml exposes the right
level of abstraction (it is currently very lucene-guts-y).  Options
like compressed are nice as we are free to change the implementation.
canPerformMoreLikeThis=true gives us more flexibility in the future.

Then again, perhaps all that is needed is a nice table... something
like http://wiki.apache.org/solr/FieldOptionsByUseCase?

-Mike

Re: MoreLikeThis and term vectors - documentation suggestion

kkrugler
>On 2/26/07, Bertrand Delacretaz <[hidden email]> wrote:
>>On 2/26/07, Ken Krugler <[hidden email]> wrote:
>>
>>>  ...I was trying out the MoreLikeThis support, and getting some
>>>odd results...
>>
>>Thanks for the info, I have added a link to your message at
>>https://issues.apache.org/jira/browse/SOLR-69
>
>Is it possible to modify MoreLikeThis to use the schema.xml-defined
>analyzer?  That's the way the highlighting code currently works (it
>picks the index-time analyzer).

I looked at that briefly (passing the analyzer to use down to
MoreLikeThis), but for my fields it's a lot more than just what
analyzer is used, given all of the filters that are also in play.

Also the performance really stunk when I didn't use stored term vectors.

>It would be nice for as many features as possible to work without term
>vectors.  I sometimes wonder whether schema.xml exposes the right
>level of abstraction (it is currently very lucene-guts-y).  Options
>like compressed are nice as we are free to change the implementation.
>canPerformMoreLikeThis=true gives us more flexibility in the future.
>
>Then again, perhaps all that is needed is a nice table... something
>like http://wiki.apache.org/solr/FieldOptionsByUseCase?

That would be nice, yes.

Thanks,

-- Ken

Re: MoreLikeThis and term vectors - documentation suggestion

Chris Hostetter-3

: >Is it possible to modify MoreLikeThis to use the schema.xml-defined
: >analyzer?  That's the way the highlighting code currently works (it
: >picks the index-time analyzer).
:
: I looked at that briefly (passing the analyzer to use down to
: MoreLikeThis), but for my fields it's a lot more than just what
: analyzer is used, given all of the filters that are also in play.

that confuses me ... when dealing with the "plugin" level of things (ie:
writing java code) it's easy to access an IndexSchema instance, and from
there to get a SolrAnalyzer that already knows about all of the fields and
what token filters to use on each -- you could even access the "index"
analyzer instead of the "query" analyzer if you wanted for any field at
query time ... so if the MLT class allows some way of setting the Analyzer
to use, that should work fine.

what other problems did you run into when you looked into this Ken?

: Also the performance really stunk when I didn't use stored term vectors.

well .. i'd still rather be able to say "using termVectors to make MLT
faster" than: "if you don't use termVectors MLT doesn't work at all"


-Hoss


Re: MoreLikeThis and term vectors - documentation suggestion

kkrugler
>: >Is it possible to modify MoreLikeThis to use the schema.xml-defined
>: >analyzer?  That's the way the highlighting code currently works (it
>: >picks the index-time analyzer).
>:
>: I looked at that briefly (passing the analyzer to use down to
>: MoreLikeThis), but for my fields it's a lot more than just what
>: analyzer is used, given all of the filters that are also in play.
>
>that confuses me ... when dealing with the "plugin" level of things (ie:
>writing java code) it's easy to access an IndexSchema instance, and from
>there to get a SolrAnalyzer that already knows about all of the fields and
>what token filters to use on each -- you could even access the "index"
>analyzer instead of the "query" analyzer if you wanted for any field at
>query time ... so if the MLT class allows some way of setting the Analyzer
>to use, that should work fine.
>
>what other problems did you run into when you looked into this Ken?

No other problems - just not knowing that it was possible to set up a
SolrAnalyzer so easily :)

If that's the case, then it seems like a minor tweak to call
MoreLikeThis.setAnalyzer
(http://krugle.com/kse/files/svn/svn.apache.org/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java)
with the SolrAnalyzer.

Though I don't understand Mark's comment for the setAnalyzer() method
- he says that it's not required when using the like(docNum) method
call, but from what I can tell the analyzer (either the default
StandardAnalyzer or whatever gets set explicitly) will still get used
in that case, if there's no term vector.

>: Also the performance really stunk when I didn't use stored term vectors.
>
>well .. i'd still rather be able to say "using termVectors to make MLT
>faster" than: "if you don't use termVectors MLT doesn't work at all"

Agreed.

-- Ken