In Place Updates: Can we filter on fields with only docValues="true"

Doss
Hi,

We have an index of 4 to 5 million documents.

For an NRT index, we need a field that is updated very frequently, and we
need to filter results on it. Will in-place updates help us?

<field name="status" type="pint" indexed="false" stored="false"
docValues="true" />


Thanks,
Doss.
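
For reference, an in-place update is sent to Solr as an ordinary atomic update with a "set" or "inc" modifier, and the filter is a normal fq parameter. A minimal Python sketch of the two request pieces, assuming a uniqueKey field named `id` and a document ID `doc1` (both hypothetical; only the `status` field comes from the schema line above):

```python
import json
from urllib.parse import urlencode


def in_place_update_body(doc_id, field, value, op="set"):
    """Build the JSON body for an atomic update of one numeric field.

    "set" and "inc" are the modifiers Solr can apply in place, provided the
    field itself qualifies (non-indexed, non-stored, single-valued docValues).
    """
    if op not in ("set", "inc"):
        raise ValueError("in-place updates support only 'set' and 'inc'")
    # "id" is an assumed uniqueKey name, not taken from the thread.
    return json.dumps([{"id": doc_id, field: {op: value}}])


def filter_params(field, value, main_query="*:*"):
    """Build URL-encoded parameters that filter on an exact field value."""
    return urlencode({"q": main_query, "fq": f"{field}:{value}"})


body = in_place_update_body("doc1", "status", 3)   # POST this to the /update handler
params = filter_params("status", 3)                # append this to the /select URL
print(body)
print(params)
```

Whether the filter performs acceptably on a docValues-only field is a separate question from whether the update can be done in place.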

Re: In Place Updates: Can we filter on fields with only docValues="true"

Mikhail Khludnev-2
It's worth a try. I know of folks who have built NRT systems on it. One
thing, and I might be wrong: "pint" might mean a points field, which is
hardly compatible with in-place updates. The field should be the simplest
numeric type; if you can debug Solr, check that it creates NumericDocValues,
not sorted ones. Those are updatable in place.
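
The Solr Reference Guide lists the conditions for an in-place update as: the field is non-indexed, non-stored, single-valued, and a numeric docValues field. A rough Python sketch that checks a schema <field> element against the attribute part of that list follows. The function name is made up for illustration, and it reads only the attributes written on the element; it does not resolve fieldType defaults or verify that the type is numeric.

```python
import xml.etree.ElementTree as ET

# Attribute values the field needs, per the conditions listed above.
REQUIRED = {"indexed": "false", "stored": "false", "docValues": "true"}


def in_place_problems(field_xml):
    """List the reasons a schema <field> would not qualify for in-place updates.

    Only attributes written on the element are checked. Values inherited
    from the fieldType, and whether the type is numeric, are not resolved.
    """
    attrs = ET.fromstring(field_xml).attrib
    problems = []
    for name, wanted in REQUIRED.items():
        actual = attrs.get(name)
        if actual is None:
            problems.append(f"{name} not set on the field (inherited from fieldType)")
        elif actual.lower() != wanted:
            problems.append(f'{name}="{actual}", need "{wanted}"')
    # Assumption: a field without an explicit multiValued attribute is single-valued.
    if attrs.get("multiValued", "false").lower() == "true":
        problems.append("field is multiValued")
    return problems


status = '<field name="status" type="pint" indexed="false" stored="false" docValues="true" />'
print(in_place_problems(status))  # an empty list means no problem was found
```

This only confirms the schema attributes; checking which docValues type Solr actually creates still needs the debugging step described above.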



--
Sincerely yours
Mikhail Khludnev

Re: In Place Updates: Can we filter on fields with only docValues="true"

Shawn Heisey-2
In reply to this post by Doss
On 9/10/2019 7:15 AM, Doss wrote:
> 4 to 5 million documents.
>
> For an NRT index, we need a field to be updated very frequently and filter
> results based on it. Will In-Place updates help us?
>
> <field name="status" type="pint" indexed="false" stored="false"
> docValues="true" />

Although you CAN search on docValues-only fields, the performance is
terrible.  So the answer I have for you is "maybe, but you won't like
it."  For good filtering performance, you need the field to be indexed.
Which means you can't do in-place updates.

Thanks,
Shawn

Re: In Place Updates: Can we filter on fields with only docValues="true"

Mikhail Khludnev-2
Shawn, would you mind providing some numbers?
I'm experimenting with Lucene 8.0.0.
I have a 100-shard index of 100M docs with 2000 docValues-only updatable
fields. Searching on such a field turns out to be blazingly fast:
$ curl 'localhost:39200/books/_search?pretty&size=20' -d '
{"query": {"bool": {"filter": {"range": {"subscription_0x1": {"lte": 666,
"gte": 666}}}}}}'
{
  "took" : 148,
  "timed_out" : false,
  "_shards" : {
    "total" : 100,
    "successful" : 100,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "books",
        "_id" : "28113070",
        "_score" : 0.0
      }
    ]
  }
}

I had just updated this field in that particular doc. Another 245K of the
100M docs have the value 1 in it:

$ curl -H 'Content-Type:application/json'
'localhost:39200/books/_search?pretty&size=20' -d '
{"track_total_hits": true, "query": {"bool": {"filter": {"range":
{"subscription_0x1": {"lte": 1, "gte":1}}}}}}'
{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 100,
    "successful" : 100,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 245335,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "books",
        "_id" : "30155366",
        "_score" : 0.0
      },

It's a docValues field without an index:

$ curl -s
'localhost:39200/books/_mapping/field/subscription_0x1?pretty&include_defaults=true'
{
  "books" : {
    "mappings" : {
      "subscription_0x1" : {
        "full_name" : "subscription_0x1",
        "mapping" : {
          "subscription_0x1" : {
            "type" : "integer",
            "boost" : 1.0,
            "index" : false,
            "store" : false,
            "doc_values" : true,
            "term_vector" : "no",
            "norms" : false,
            "eager_global_ordinals" : false,
            "similarity" : "BM25",
            "ignore_malformed" : false,
            "coerce" : true,
            "null_value" : null
          }
        }
      }
    }
  }
}





--
Sincerely yours
Mikhail Khludnev

Re: In Place Updates: Can we filter on fields with only docValues="true"

Shawn Heisey
On 9/14/2019 4:29 PM, Mikhail Khludnev wrote:
> Shawn, would you mind to provide some numbers?
> I'm experimenting with lucene 8.0.0.
> I have 100 shard index of 100M docs with 2000 docVals only updateable
> fields. Searching for such field turns to be blazingly fast
> $ curl 'localhost:39200/books/_search?pretty&size=20' -d '

I have no idea how to read the JSON you've pasted.  Neither that nor the
URLs look like Solr.

> I've just updated this field in this particular doc. Other 245K of 100M
> docs has 1 in it
>
> $ curl -H 'Content-Type:application/json'

<snip>

> It's dv field without index
>
> $ curl -s
> 'localhost:39200/books/_mapping/field/subscription_0x1?pretty&include_defaults=true'

What's the cardinality of the field you're searching on?  If it's small,
then even an inefficient search will be fast.  Try on a field with
millions or billions of possible values.

Thanks,
Shawn

Re: In Place Updates: Can we filter on fields with only docValues="true"

Erick Erickson
Filtering is really searching. As Shawn says, you _might_ get away with it in some circumstances, but it’s not something I’d recommend.

Here’s the problem: For most searches, you’re trying to ask “for term X, what docs contain it?”. That’s exactly what the inverted index is for: it’s an ordered list of terms, and each term has the list of documents it appears in.

DocValues is the exact opposite. It answers “For doc X, what is the value of field Y?”. When _searching_ on a DV only field, think “table scan” in DB terms.
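
To make the contrast concrete, here is a toy Python model of the two access patterns. It is only an illustration with made-up data, not how Lucene stores or searches either structure:

```python
from collections import defaultdict

# Toy data: doc ID -> value of one numeric field (hypothetical values).
# This is the docValues view: "for doc X, what is the value of field Y?"
doc_values = {0: 7, 1: 3, 2: 7, 3: 9, 4: 3}

# The inverted view: value -> list of doc IDs that contain it.
inverted = defaultdict(list)
for doc_id, value in sorted(doc_values.items()):
    inverted[value].append(doc_id)


def search_inverted(value):
    """Answer "which docs contain this value?" with a single lookup."""
    return inverted.get(value, [])


def search_doc_values(value):
    """Answer the same question by visiting every doc: the "table scan"."""
    return [doc_id for doc_id, v in sorted(doc_values.items()) if v == value]


print(search_inverted(7))
print(search_doc_values(7))
```

Both functions return the same doc IDs; the difference is that the second one touches every document, so its cost grows with the size of the index rather than with the number of matches.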

Pick a field with high cardinality (worst case, every doc has a unique value) and try searching on that. If it’s fast, then I need to go into the code and understand why it’s not doing what I expect ;).

I’ll add parenthetically that 100M docs with 100 shards seems excessively sharded. Perhaps you have so many fields that that’s warranted, but it seems high. My rule-of-thumb starting place is 50M docs/shard. Admittedly that can be low or high, I’ve seen 300M docs fit in 12G and 10M docs strain 31G. You might try testing a node to destruction, see: https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick
