.sN (separate norms files) and NO_NORMS

Otis Gospodnetic-2
Hi,

I recently ran the FieldNormModifier (see http://issues.apache.org/jira/browse/LUCENE-741 ) on 8 fields that I wanted to turn into NO_NORMS fields.  I ran this on several optimized .cfs indices.  Afterwards I noticed that *some* (but not all!) indices contained 8 .sN (where N is a number) files.  Those are norm files, I believe (Lucene 2.0.0).  Meanwhile, the .cfs file remained untouched.  Does anyone know how to explain this?

What bugs me is:
- Why was the original .cfs not modified?
- Why did .sN files show up separately?

What bugs my colleague (hi Brian!) is:
- Why are there separate norms for each NO_NORMS field, and not just 1 for all of them?
(my answer is that the files still exist, just as they do for non-NO_NORMS fields; it's just that they are full of 1.0s, but I'm not absolutely sure that's the correct answer.)

I would have expected the .cfs file to get modified.  Or I'd expect to see 8 .sN files alongside the unmodified .cfs in *all* index directories I ran this against, and not just some.

The essential, index-modifying part of FieldNormModifier is this:

      reader = IndexReader.open(dir);
      for (int d = 0; d < termCounts.length; d++) {
        if (! reader.isDeleted(d)) {
          if (sim == null)
            reader.setNorm(d, fieldName, fakeNorms[0]);        // this is my case - turning existing fields into Field.NO_NORMS fields.
          else
            reader.setNorm(d, fieldName, sim.encodeNorm(sim.lengthNorm(fieldName, termCounts[d])));
        }
      }

Also, looking at http://lucene.apache.org/java/docs/fileformats.html I don't even see any mention of .sN files.

Does anyone have an explanation for this before I start digging?

Thanks,
Otis




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: .sN (separate norms files) and NO_NORMS

Otis Gospodnetic-2
Hi,

After a little digging/debugging, it seems to me that what I am seeing is actually normal and expected behaviour.  Moreover, it seems that once a Field has been indexed without being a NO_NORMS field, it is not really possible to make it a truly NO_NORMS field.  From what I can tell, one of the key methods is in DocumentWriter:

  private final void writeNorms(String segment) throws IOException {
    for(int n = 0; n < fieldInfos.size(); n++){
      FieldInfo fi = fieldInfos.fieldInfo(n);
      if(fi.isIndexed && !fi.omitNorms){                // <== here
        float norm = fieldBoosts[n] * similarity.lengthNorm(fi.name, fieldLengths[n]);
        IndexOutput norms = directory.createOutput(segment + ".f" + n);
        try {
          norms.writeByte(Similarity.encodeNorm(norm));
        } finally {
          norms.close();
        }
      }
    }
  }

This is where norms for a field are either written if the field is indexed and *not* a NO_NORMS field, or not written if the field is indexed and *is* a NO_NORMS field.
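To make that condition concrete, here is a minimal, self-contained sketch in plain Java (no Lucene on the classpath). `FieldFlags` and `NormsFileSketch` are hypothetical names, not Lucene classes. The `.fN` rule follows the writeNorms snippet above; the `.sN` name for norms that are re-written against a compound segment is an assumption based on a reading of SegmentReader in this version, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for Lucene's FieldInfo: only the two flags used here. */
final class FieldFlags {
    final String name;
    final boolean isIndexed;
    final boolean omitNorms;

    FieldFlags(String name, boolean isIndexed, boolean omitNorms) {
        this.name = name;
        this.isIndexed = isIndexed;
        this.omitNorms = omitNorms;
    }
}

final class NormsFileSketch {
    /** Index time: one segment.fN file per field that is indexed and keeps norms. */
    static List<String> indexTimeNormsFiles(String segment, List<FieldFlags> fields) {
        List<String> files = new ArrayList<>();
        for (int n = 0; n < fields.size(); n++) {
            FieldFlags fi = fields.get(n);
            if (fi.isIndexed && !fi.omitNorms) {
                files.add(segment + ".f" + n);
            }
        }
        return files;
    }

    /**
     * ASSUMPTION: norms changed via setNorm() are written to segment.sN when the
     * segment is a compound (.cfs) one, and to segment.fN otherwise.
     */
    static String updatedNormsFile(String segment, int fieldNumber, boolean compoundSegment) {
        return segment + (compoundSegment ? ".s" : ".f") + fieldNumber;
    }

    public static void main(String[] args) {
        List<FieldFlags> fields = Arrays.asList(
            new FieldFlags("title", true, false),   // indexed, keeps norms
            new FieldFlags("id", true, true),       // indexed, NO_NORMS
            new FieldFlags("blob", false, false));  // not indexed at all
        System.out.println(indexTimeNormsFiles("_1", fields));
        System.out.println(updatedNormsFile("_1", 0, true));
    }
}
```

If that assumption holds, it would also explain why the .cfs stays untouched: setNorm() never rewrites the compound file, and the new norms simply land next to it.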

I also see this in the FieldInfos class:

      if (fi.omitNorms != omitNorms) {
        fi.omitNorms = false;                // once norms are stored, always store
      }
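The effect of those lines on a field that shows up with two different settings can be isolated in a tiny sketch. `mergeOmitNorms` is a hypothetical helper name, not a Lucene method; it only mirrors the condition quoted above.

```java
final class OmitNormsMergeSketch {
    /**
     * Mirrors "if (fi.omitNorms != omitNorms) fi.omitNorms = false":
     * norms stay omitted only if every occurrence of the field omits them.
     */
    static boolean mergeOmitNorms(boolean existing, boolean incoming) {
        if (existing != incoming) {
            return false;            // once norms are stored, always store
        }
        return existing;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean existing : values) {
            for (boolean incoming : values) {
                System.out.println(existing + " + " + incoming + " -> "
                    + mergeOmitNorms(existing, incoming));
            }
        }
    }
}
```

In other words the result is a logical AND of the two flags, which is why a field that was indexed with norms even once cannot go back to omitting them through the normal add path.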

Thus, it's not really possible to completely kill field norms and make the field a genuine NO_NORMS field after the fact... is this correct?
Therefore, that FieldNormModifier call that tries to turn an existing field into a NO_NORMS field doesn't really work:

            reader.setNorm(d, fieldName, fakeNorms[0]);        // this is my case - turning existing fields into Field.NO_NORMS fields.

I think this just fakes out a norms file for a given field, and this norms file ends up containing a byte[] of encoded 1.0f's, one for each Document.  But this really is completely fake - this just makes the norms be 1.0, while NO_NORMS skips the *writing* of the norms file for a given field completely.
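A rough, self-contained illustration of what such a "fake" norms array looks like. The one-byte float packing below (5 exponent bits, 3 mantissa bits) is modeled on what Similarity.encodeNorm appears to do in this version; treat the exact bit layout as an assumption. `FakeNormsSketch` and its method names are hypothetical, not Lucene API.

```java
import java.util.Arrays;

final class FakeNormsSketch {
    /** Packs a float into one byte: 5 exponent bits, 3 mantissa bits (assumed layout). */
    static byte encodeNorm(float f) {
        if (f <= 0.0f) {
            return 0;                                  // negatives and zero collapse to 0
        }
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;
        if (exponent > 31) {                           // overflow: clamp to the largest value
            exponent = 31;
            mantissa = 7;
        }
        if (exponent < 0) {                            // underflow: clamp to the smallest positive value
            exponent = 0;
            mantissa = 1;
        }
        return (byte) ((exponent << 3) | mantissa);
    }

    /** Inverse of encodeNorm for the values it can represent. */
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    /** One encoded 1.0 per document, which is what the setNorm() loop ends up producing. */
    static byte[] fakeNorms(int maxDoc) {
        byte[] norms = new byte[maxDoc];
        Arrays.fill(norms, encodeNorm(1.0f));
        return norms;
    }

    public static void main(String[] args) {
        byte[] norms = fakeNorms(5);
        System.out.println(Arrays.toString(norms));
        System.out.println(decodeNorm(norms[0]));
    }
}
```

So even a faked norms file still costs about one byte per document per field, whereas a genuine NO_NORMS field has no such file at all.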

Is the above correct?
If so, is there any way to turn an existing field into a genuine NO_NORMS field?

Thanks,
Otis



----- Original Message ----
From: Otis Gospodnetic <[hidden email]>
To: [hidden email]
Sent: Tuesday, January 9, 2007 2:36:46 AM
Subject: .sN (separate norms files) and NO_NORMS


Re: .sN (separate norms files) and NO_NORMS

Yonik Seeley-2
On 1/9/07, Otis Gospodnetic <[hidden email]> wrote:
> After a little digging/debugging, it seems to me that what I am seeing is actually normal and expected behaviour.  Moreover, it seems that once a Field is indexed without being a NO_NORMS field, it is not really possible to make it a truly NO_NORMS field.

Correct.  As with many index changes, reindexing from scratch is the
best way to go.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: .sN (separate norms files) and NO_NORMS

Doron Cohen
In reply to this post by Otis Gospodnetic-2
Otis Gospodnetic <[hidden email]> wrote on 08/01/2007 23:36:46:
> Also, looking at http://lucene.apache.org/java/docs/fileformats.html
> I don't even see any mention of .sN files.

I am almost sure I added that info to fileformats (issue 756).
I'll check what happened with that.


Re: .sN (separate norms files) and NO_NORMS

Yonik Seeley-2
On 1/9/07, Doron Cohen <[hidden email]> wrote:
> Otis Gospodnetic <[hidden email]> wrote on 08/01/2007 23:36:46:
> > Also, looking at http://lucene.apache.org/java/docs/fileformats.html
> > I don't even see any mention of .sN files.
>
> I am almost sure I added that info to fileformats (issue 756).
> I'll check what's been with that.

It may be in the xml, but I didn't regenerate or sync the site.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: .sN (separate norms files) and NO_NORMS

Otis Gospodnetic-2
In reply to this post by Otis Gospodnetic-2
Hi,
Got this thought while eating noodles yesterday...

Couldn't one convert a non NO_NORMS field to a genuine NO_NORMS field by:
1. expanding an index to a multi-file index (if the index was a .cfs one)
2. removing the appropriate .fN file from the index directory
3. switching that omitNorms bit in FieldInfo

I'm not sure how possible or how hard 3. is, but I see this omitNorms bit in FieldInfos.read(IndexInput) and FieldInfos.write(IndexOutput).

Maybe something like:

  public void change(IndexInput input, IndexOutput output, boolean omitNorms) throws IOException {
    int size = input.readVInt();                    // number of fields
    for (int i = 0; i < size; i++) {
      String name = input.readString().intern();
      byte bits = input.readByte();
      boolean isIndexed = (bits & IS_INDEXED) != 0;
      boolean storeTermVector = (bits & STORE_TERMVECTOR) != 0;
      boolean storePositionsWithTermVector = (bits & STORE_POSITIONS_WITH_TERMVECTOR) != 0;
      boolean storeOffsetWithTermVector = (bits & STORE_OFFSET_WITH_TERMVECTOR) != 0;
      // ignore what's in the index, use what the caller says it wants
      // (note: this applies the caller's omitNorms to every field)
      //boolean omitNorms = (bits & OMIT_NORMS) != 0;

      addInternal(name, isIndexed, storeTermVector, storePositionsWithTermVector, storeOffsetWithTermVector, omitNorms);
    }
    // write the modified field infos once, after all fields have been read
    write(output);
  }

I didn't try this, of course, but I'm curious if this general approach would work, at least in the case of norms.  If it works for norms, maybe it would also work for other field attributes, if their data is stored in separate files and easily detachable from the other index files.
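The same read / flip-the-bit / write idea, as a self-contained sketch that runs without Lucene. It deliberately uses a simplified layout (an int count, then a UTF name and a flags byte per field) rather than the real field infos file format, and unlike the snippet above it changes only one named field. The flag values are assumptions that reuse the constant names from the snippet above; `OmitNormsRewriteSketch`, `table` and `rewrite` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class OmitNormsRewriteSketch {
    // Assumed flag values; only the constant names come from the snippet above.
    static final byte IS_INDEXED = 0x1;
    static final byte OMIT_NORMS = 0x10;

    /** Returns the flags byte with the OMIT_NORMS bit set or cleared. */
    static byte withOmitNorms(byte bits, boolean omitNorms) {
        return (byte) (omitNorms ? (bits | OMIT_NORMS) : (bits & ~OMIT_NORMS));
    }

    /** Builds a table in the simplified layout: int count, then (UTF name, flags byte) per field. */
    static byte[] table(String[] names, byte[] flags) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(names.length);
            for (int i = 0; i < names.length; i++) {
                out.writeUTF(names[i]);
                out.writeByte(flags[i]);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads every field, changes OMIT_NORMS for fieldName only, then writes the table once. */
    static byte[] rewrite(byte[] fieldTable, String fieldName, boolean omitNorms) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(fieldTable));
            int size = in.readInt();
            String[] names = new String[size];
            byte[] flags = new byte[size];
            for (int i = 0; i < size; i++) {
                names[i] = in.readUTF();
                byte bits = in.readByte();
                flags[i] = names[i].equals(fieldName) ? withOmitNorms(bits, omitNorms) : bits;
            }
            return table(names, flags);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = table(new String[] {"title", "body"}, new byte[] {IS_INDEXED, IS_INDEXED});
        byte[] changed = rewrite(original, "body", true);
        System.out.println(Arrays.toString(original));
        System.out.println(Arrays.toString(changed));
    }
}
```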

Thanks,
Otis


----- Original Message ----
From: Yonik Seeley <[hidden email]>
To: [hidden email]
Sent: Tuesday, January 9, 2007 1:41:33 PM
Subject: Re: .sN (separate norms files) and NO_NORMS


Re: .sN (separate norms files) and NO_NORMS

Yonik Seeley-2
On 1/9/07, Otis Gospodnetic <[hidden email]> wrote:
> Couldn't one convert a non NO_NORMS field to a genuine NO_NORMS field by:
> 1. expanding an index to a multi-file index (if the index was a .cfs one)
> 2. removing the appropriate .fN file from the index directory
> 3. switching that omitNorms bit in FieldInfo

Yes, that would work for some custom code.  Step (2) might even be
done for you if you do step 3 first and then do an optimize.

This stuff seems more like the exception than the norm :-) though, so
I think it might not be worth the burden of supporting it in the
public API.

-Yonik

Re: .sN (separate norms files) and NO_NORMS

Otis Gospodnetic-2
In reply to this post by Otis Gospodnetic-2
Hi,

I agree.  I wrote a custom class and it actually works in the 1, 3, 2 order below.  I changed FieldInfos to non-final, but I think that's it.
Actually, my class doesn't do 1 yet.  This used to work:

        // unpack cfs
        writer = new IndexWriter("/tmp/fnm", new SimpleAnalyzer(), false);
        writer.setUseCompoundFile(false);
        writer.optimize();
        writer.close();

But this doesn't seem to work any more.  I think now one has to modify the index ... holllly, nice conditional in optimize()! :)

Would adding forceOptimize() to IndexWriter be a good thing?

  public synchronized void forceOptimize() throws IOException {
      flushRamSegments();
      int minSegment = segmentInfos.size() - mergeFactor;
      mergeSegments(minSegment < 0 ? 0 : minSegment);      
  }

Then the above cfs unpacking works with a writer.forceOptimize() call.
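For what it's worth, here is a sketch of the kind of conditional that seems to guard the merge in optimize(), and of the minSegment arithmetic from forceOptimize() above. The condition is written from memory of IndexWriter in this era rather than copied from it, so its exact terms are an assumption; `OptimizeConditionSketch` and all parameter names are hypothetical.

```java
final class OptimizeConditionSketch {
    /**
     * ASSUMED shape of the check in optimize(): merge while there is more than one
     * segment, or while the single remaining segment still "needs work".
     */
    static boolean optimizeWouldMerge(int segmentCount,
                                      boolean hasDeletions,
                                      boolean inOtherDirectory,
                                      boolean useCompoundFile,
                                      boolean segmentIsCompound,
                                      boolean hasSeparateNorms) {
        if (segmentCount > 1) {
            return true;
        }
        if (segmentCount == 0) {
            return false;
        }
        return hasDeletions
            || inOtherDirectory
            || (useCompoundFile && (!segmentIsCompound || hasSeparateNorms));
    }

    /** Same arithmetic as forceOptimize(): first segment to merge, never below 0. */
    static int minSegment(int segmentCount, int mergeFactor) {
        return Math.max(0, segmentCount - mergeFactor);
    }

    public static void main(String[] args) {
        // One optimized .cfs segment, writer switched to setUseCompoundFile(false):
        System.out.println(optimizeWouldMerge(1, false, false, false, true, false));
        // The reverse case, a multi-file segment with the writer asking for compound files:
        System.out.println(optimizeWouldMerge(1, false, false, true, false, false));
        System.out.println(minSegment(1, 10));
    }
}
```

Under that reading, a single clean .cfs segment never satisfies the condition once useCompoundFile is false, which would be why optimize() leaves it packed while an unconditional merge unpacks it.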

Otis

----- Original Message ----
From: Yonik Seeley <[hidden email]>
To: [hidden email]
Sent: Tuesday, January 9, 2007 8:33:09 PM
Subject: Re: .sN (separate norms files) and NO_NORMS
