[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
GitHub user nknize opened a pull request:

    https://github.com/apache/lucene-solr/pull/451

    LUCENE-8496: Add selective indexing to BKD/POINTS codec

    PR for [LUCENE-8496](https://issues.apache.org/jira/browse/LUCENE-8496) enabling fields to use the first `numIndexDimensions` from `numDataDimensions` while creating the BKD index.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nknize/lucene-solr selectiveIndexing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucene-solr/pull/451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #451
   
----
commit a2f88a22297ca9c4a1471beb23e4c598704bc144
Author: Nicholas Knize <nicholas knize>
Date:   2018-09-14T17:39:40Z

    LUCENE-8496: Add selective indexing to BKD/POINTS codec
   
    Allows fields to use the first `numIndexDimensions` from `numDataDimensions` when creating the BKD index.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r217970618
 
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java ---
    @@ -981,15 +988,15 @@ public long finish(IndexOutput out) throws IOException {
         assert pointCount / numLeaves <= maxPointsInLeafNode: "pointCount=" + pointCount + " numLeaves=" + numLeaves + " maxPointsInLeafNode=" + maxPointsInLeafNode;
     
         // Sort all docs once by each dimension:
    -    PathSlice[] sortedPointWriters = new PathSlice[numDims];
    +    PathSlice[] sortedPointWriters = new PathSlice[numDataDims];
     
         // This is only used on exception; on normal code paths we close all files we opened:
         List<Closeable> toCloseHeroically = new ArrayList<>();
     
         boolean success = false;
         try {
           //long t0 = System.nanoTime();
    -      for(int dim=0;dim<numDims;dim++) {
    +      for(int dim=0;dim<numDataDims;dim++) {
    --- End diff --
   
    I think we should only create sorted point writers for indexed dims. There is no point sort data dims as it only creates overhead when writing the tree.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
In reply to this post by iverase
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r217971501
 
    --- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
    @@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
     
         Directory dir = newFSDirectory(createTempDir());
         int numDocs = 100000;
    -    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
    +    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
    --- End diff --
   
    Couldn't we add a mode on the random tests where dataDim != indexedDim. The test can only check that the data dims are properly propagated to the leaf nodes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
In reply to this post by iverase
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r217970947
 
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
    @@ -52,10 +53,17 @@
       /** Caller must pre-seek the provided {@link IndexInput} to the index location that {@link BKDWriter#finish} returned */
       public BKDReader(IndexInput in) throws IOException {
         version = CodecUtil.checkHeader(in, BKDWriter.CODEC_NAME, BKDWriter.VERSION_START, BKDWriter.VERSION_CURRENT);
    -    numDims = in.readVInt();
    +    int packedDims = in.readVInt();
    +    if (version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
    +      numDataDims = packedDims & 0x0000FFFF;
    +      numIndexDims = packedDims >>> 16;
    +    } else {
    +      numDataDims = packedDims;
    +      numIndexDims = numDataDims;
    +    }
    --- End diff --
   
    Should we create a new version of the tree for this change and not mix it up with VERSION_LEAF_STORES_BOUNDS?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
In reply to this post by iverase
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218112746
 
    --- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
    @@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
     
         Directory dir = newFSDirectory(createTempDir());
         int numDocs = 100000;
    -    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
    +    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
    --- End diff --
   
    I have just seen there is already a test using different index and data dimensions, still it would be worthy to check if possible that data dimensions are properly propagated.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
In reply to this post by iverase
Github user jpountz commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218197102
 
    --- Diff: lucene/core/src/java/org/apache/lucene/codecs/lucene60/Lucene60FieldInfosFormat.java ---
    @@ -149,9 +149,14 @@ public FieldInfos read(Directory directory, SegmentInfo segmentInfo, String segm
                 attributes = lastAttributes;
               }
               lastAttributes = attributes;
    -          int pointDimensionCount = input.readVInt();
    +          int packedDimensionCount = input.readVInt();
    +          int pointDataDimensionCount = 0x0000FFFF & packedDimensionCount;
    +          int pointIndexDimensionCount = packedDimensionCount >>> 16;
    +          if (pointIndexDimensionCount == 0) {
    +            pointIndexDimensionCount = pointDataDimensionCount;
    +          }
    --- End diff --
   
    could we use a new version number instead to handle backcompat?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
In reply to this post by iverase
Github user jpountz commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218184521
 
    --- Diff: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextBKDReader.java ---
    @@ -53,15 +54,16 @@
       final int version;
       protected final int packedBytesLength;
     
    -  public SimpleTextBKDReader(IndexInput in, int numDims, int maxPointsInLeafNode, int bytesPerDim, long[] leafBlockFPs, byte[] splitPackedValues,
    +  public SimpleTextBKDReader(IndexInput in, int numDataDims, int numIndexDims, int maxPointsInLeafNode, int bytesPerDim, long[] leafBlockFPs, byte[] splitPackedValues,
                                  byte[] minPackedValue, byte[] maxPackedValue, long pointCount, int docCount) throws IOException {
         this.in = in;
    -    this.numDims = numDims;
    +    this.numDataDims = numDataDims;
    +    this.numIndexDims = numIndexDims;
         this.maxPointsInLeafNode = maxPointsInLeafNode;
         this.bytesPerDim = bytesPerDim;
         // no version check here because callers of this API (SimpleText) have no back compat:
    -    bytesPerIndexEntry = numDims == 1 ? bytesPerDim : bytesPerDim + 1;
    -    packedBytesLength = numDims * bytesPerDim;
    +    bytesPerIndexEntry = numDataDims == 1 ? bytesPerDim : bytesPerDim + 1;
    +    packedBytesLength = numDataDims * bytesPerDim;
    --- End diff --
   
    can we make it numIndexDims on the two above lines?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

iverase
In reply to this post by iverase
Github user jpountz commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218199848
 
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
    @@ -52,10 +53,17 @@
       /** Caller must pre-seek the provided {@link IndexInput} to the index location that {@link BKDWriter#finish} returned */
       public BKDReader(IndexInput in) throws IOException {
         version = CodecUtil.checkHeader(in, BKDWriter.CODEC_NAME, BKDWriter.VERSION_START, BKDWriter.VERSION_CURRENT);
    -    numDims = in.readVInt();
    +    int packedDims = in.readVInt();
    +    if (version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
    +      numDataDims = packedDims & 0x0000FFFF;
    +      numIndexDims = packedDims >>> 16;
    --- End diff --
   
    let's read two vints instead?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]