Norm Value of not existing Field

Benjamin Heilbrunn
Hi,

I'm using Lucene 2.9.1 patched with
http://issues.apache.org/jira/browse/LUCENE-1260
For a special reason I need to find all documents that contain at
least one term in a certain field.
Iterating over the norms array works for this, but only as long as the
field exists on every document.
For documents without the field, the norms array holds the byte value 124.
Where does 124 come from, and is there a way to change it to another
value such as -128 (0x80) for nonexistent fields?


Benjamin

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Norm Value of not existing Field

Michael McCandless-2
This isn't easy to change; the value is hardcoded to 1.0 in
org.apache.lucene.index.NormsWriter, and also in SegmentReader (for the
case where the field doesn't have norms stored but someone requests them
anyway).  1.0 must encode to 124.  I suppose we could empower Similarity
to define what the "undefined norm value" should be?  Wanna make a patch?

Mike
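
For readers who want to check where 124 comes from: the sketch below re-implements, in plain Java, the 8-bit float encoding that (to the best of my understanding) Lucene 2.9 uses for norms via Similarity.encodeNorm and SmallFloat.floatToByte315. The class name NormEncodingSketch is made up for illustration, and the method body is a reconstruction from memory, not a copy of the Lucene source.

```java
public class NormEncodingSketch {

    // Assumed to mirror SmallFloat.floatToByte315: 3 mantissa bits,
    // 5 exponent bits, zero-exponent point of 15.
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        // Keep the sign, the exponent, and the top 3 mantissa bits.
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            // Underflow: zero or negative maps to 0, tiny positives to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            // Overflow: clamp to the largest encodable value (0xFF).
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    public static void main(String[] args) {
        // 1.0f has raw bits 0x3F800000; 0x3F800000 >> 21 = 508; 508 - 384 = 124.
        System.out.println(floatToByte315(1.0f)); // prints 124
    }
}
```

Note that 124 is not unique to missing fields: any field whose stored norm happens to be exactly 1.0 would encode to the same byte, so 124 alone cannot reliably distinguish "field absent" from "field present".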

On Thu, Dec 3, 2009 at 11:46 AM, Benjamin Heilbrunn <[hidden email]> wrote:

> Hi,
>
> I'm using Lucene 2.9.1 patched with
> http://issues.apache.org/jira/browse/LUCENE-1260
> For a special reason I need to find all documents that contain at
> least one term in a certain field.
> Iterating over the norms array works for this, but only as long as the
> field exists on every document.
> For documents without the field, the norms array holds the byte value 124.
> Where does 124 come from, and is there a way to change it to another
> value such as -128 (0x80) for nonexistent fields?
>
>
> Benjamin

Re: Norm Value of not existing Field

Erick Erickson
In reply to this post by Benjamin Heilbrunn
It would be clumsier, but you could create a Filter by spinning
through all the terms of a field and setting the bit for every
document that contains one of them.

You could even do this at startup and keep the filters around for
all the fields you care about, or cache them when first used.

The advantage I see here is that it wouldn't depend upon
what looks like a peculiarity in field norms.

The disadvantage is that I bet it's slower.

FWIW
Erick
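
A minimal sketch of the idea above, using only the JDK so it runs without a Lucene index. The in-memory map of term to document ids stands in for what TermEnum and TermDocs would supply in Lucene 2.9; the class name FieldPresenceSketch, the method docsWithField, and the sample postings are all hypothetical.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class FieldPresenceSketch {

    // OR together the posting lists of every term in one field.
    // A set bit means "this document has at least one term in the field".
    public static BitSet docsWithField(Map<String, int[]> postingsByTerm, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] docIds : postingsByTerm.values()) {
            for (int docId : docIds) {
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Hypothetical postings for a single field in a 6-document index.
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {0, 2});
        postings.put("pear", new int[] {2, 4});

        BitSet hasField = docsWithField(postings, 6);
        System.out.println(hasField); // prints {0, 2, 4}

        // Documents with no term in the field are the complement.
        BitSet missing = (BitSet) hasField.clone();
        missing.flip(0, 6);
        System.out.println(missing); // prints {1, 3, 5}
    }
}
```

The cost is one pass over every posting of the field, which is why building the bit set once and caching it per field is the natural way to use it.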


Snowball Stemmer Question

Christopher Condit
In reply to this post by Michael McCandless-2
The Snowball Analyzer works well for certain constructs but not others. In particular I'm having a problem with things like "colossal" vs "colossus" and "hippocampus" vs "hippocampal".
Is there a way to customize the analyzer to include these rules?
Thanks,
-Chris


Re: Snowball Stemmer Question

Otis Gospodnetic-2
Chris,

You could look at KStem to see if that does a better job.
Or perhaps WordNet can be used to get the lemma of those terms instead of using stemming.
Finally.... what was I going to say... ah, yes, using synonyms may be another way this can be handled.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----

> From: Christopher Condit <[hidden email]>
> To: "[hidden email]" <[hidden email]>
> Sent: Thu, December 3, 2009 3:04:03 PM
> Subject: Snowball Stemmer Question
>
> The Snowball Analyzer works well for certain constructs but not others. In
> particular I'm having a problem with things like "colossal" vs "colossus" and
> "hippocampus" vs "hippocampal".
> Is there a way to customize the analyzer to include these rules?
> Thanks,
> -Chris

Re: Norm Value of not existing Field

Benjamin Heilbrunn
In reply to this post by Erick Erickson
Erick, I'm not sure I understand you correctly.
What do you mean by "spinning through all the terms on a field"?

One option would be to load all unique terms of a field using TermEnum,
then use TermDocs to get the docs for those terms.
The remaining docs don't contain a term, so you know that the
field doesn't exist or is empty on those docs.
Btw: Is there a distinction in Lucene between empty and nonexistent fields?

The above method would work very well, I think, but it would require
building and holding an extra data structure.
My index has about 20 fields and 4 million docs. The overhead would be too large.

I think using the norms array (which is already there for most of
the fields) would be a nice approach.


Benjamin


Re: Norm Value of not existing Field

Erick Erickson
The word "Filter" as part of a class is overloaded in Lucene <G>....

See: http://lucene.apache.org/java/2_9_1/api/all/index.html

The above filter is just a DocIdSet, one bit per document. So
in your example, you're only talking 12M or so, even if you
create one filter for every field and keep it around.
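
A back-of-the-envelope check of that estimate, assuming exactly one bit per document per field and ignoring any object overhead. The class and method names are made up for illustration; the document and field counts are the ones given earlier in the thread.

```java
public class FilterMemorySketch {

    // Bytes needed for a bit set with one bit per document, rounded up.
    public static long bitSetBytes(long numDocs) {
        return (numDocs + 7) / 8;
    }

    public static void main(String[] args) {
        long numDocs = 4_000_000L; // index size from the thread
        int numFields = 20;        // field count from the thread

        long perField = bitSetBytes(numDocs);
        long total = perField * numFields;

        System.out.println(perField); // prints 500000
        System.out.println(total);    // prints 10000000
    }
}
```

That works out to roughly 10 MB for all 20 fields, in the same ballpark as the figure quoted above.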

You *might* get some joy from, say, QueryWrapperFilter, although
I don't know if it handles pure wildcard terms (e.g. field:*)...

If that doesn't work out of the box, I *think* you can use TermDocs
with a term like field:"" and just keep marching until next() returns
false, merrily setting your Filter bits for each Doc returned by
the enumerator.....

HTH
Erick


On Fri, Dec 4, 2009 at 3:40 AM, Benjamin Heilbrunn <[hidden email]> wrote:

> Erick, I'm not sure I understand you correctly.
> What do you mean by "spinning through all the terms on a field"?
>
> One option would be to load all unique terms of a field using
> TermEnum,
> then use TermDocs to get the docs for those terms.
> The remaining docs don't contain a term, so you know that the
> field doesn't exist or is empty on those docs.
> Btw: Is there a distinction in Lucene between empty and nonexistent
> fields?
>
> The above method would work very well, I think, but it would require
> building and holding an extra data structure.
> My index has about 20 fields and 4 million docs. The overhead would be too
> large.
>
> I think using the norms array (which is already there for most of
> the fields) would be a nice approach.
>
>
> Benjamin