Performance if there is a large number of fields


Performance if there is a large number of fields

Issei Nishigata
Hi, all


I am designing a schema.

As a trial, I calculated the number of fields I would need, and found
that it is at least 35,000. I do not use all of these fields in one
document. Each document uses at most 300 fields; the remaining 34,700
fields go unused.

Does this usage pattern affect performance for operations such as
retrieval and sorting? If it does, what alternatives should I consider?


Thanks,
Issei

--
Issei Nishigata

Re: Performance if there is a large number of fields

Shawn Heisey-2
On 5/10/2018 7:51 AM, Issei Nishigata wrote:

> I am designing a schema.
>
> As a trial, I calculated the number of fields I would need, and found
> that it is at least 35,000. I do not use all of these fields in one
> document. Each document uses at most 300 fields; the remaining 34,700
> fields go unused.
>
> Does this usage pattern affect performance for operations such as
> retrieval and sorting? If it does, what alternatives should I consider?

There are no storage efficiency degradations from having fields defined
that aren't used in particular documents.

It is likely that having so many fields is going to result in extremely
large and complex queries.  That is the potential performance problem.

The efficiency of each clause of the query will not be affected by
having several thousand fields unused in each document, but if your
queries include clauses for searching thousands of fields, then the
query will run slowly.  If you are constructing relatively simple
queries that only touch a small number of fields, then that won't be a
worry.
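
As an illustration, here is a minimal SolrJ sketch of the safe pattern: a
query that touches only a few fields, no matter how many fields the schema
defines. The core URL and field names are made up for the example:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FewFieldQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; adjust to your installation.
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {
            // Three clauses on three fields stay cheap even if the
            // schema defines 35,000 fields.
            SolrQuery q = new SolrQuery("title:solr OR body:solr OR tags:solr");
            QueryResponse resp = client.query(q);
            System.out.println("hits: " + resp.getResults().getNumFound());
        }
    }
}

The problem case would be generating one such clause per schema field,
which yields a query with thousands of clauses.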

Thanks,
Shawn


Re: Performance if there is a large number of fields

Deepak Goel
I wonder what Solr stores in the document for fields which are not being
used, and whether the queries have a performance difference.
https://lucene.apache.org/solr/guide/6_6/defining-fields.html
(A default value that will be added automatically to any document that does
not have a value in this field when it is indexed. If this property is not
specified, there is no default)





Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
[hidden email]

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Thu, May 10, 2018 at 9:10 PM, Shawn Heisey <[hidden email]> wrote:

> On 5/10/2018 7:51 AM, Issei Nishigata wrote:
>
>> I am designing a schema.
>>
>> As a trial, I calculated the number of fields I would need, and found
>> that it is at least 35,000. I do not use all of these fields in one
>> document. Each document uses at most 300 fields; the remaining 34,700
>> fields go unused.
>>
>> Does this usage pattern affect performance for operations such as
>> retrieval and sorting? If it does, what alternatives should I consider?
>>
>
> There are no storage efficiency degradations from having fields defined
> that aren't used in particular documents.
>
> It is likely that having so many fields is going to result in extremely
> large and complex queries.  That is the potential performance problem.
>
> The efficiency of each clause of the query will not be affected by having
> several thousand fields unused in each document, but if your queries
> include clauses for searching thousands of fields, then the query will run
> slowly.  If you are constructing relatively simple queries that only touch
> a small number of fields, then that won't be a worry.
>
> Thanks,
> Shawn
>
>

Re: Performance if there is a large number of fields

Shawn Heisey-2
On 5/10/2018 10:58 AM, Deepak Goel wrote:
> I wonder what Solr stores in the document for fields which are not being
> used, and whether the queries have a performance difference.
> https://lucene.apache.org/solr/guide/6_6/defining-fields.html
> (A default value that will be added automatically to any document that does
> not have a value in this field when it is indexed. If this property is not
> specified, there is no default)

If a field is missing from a document, the Lucene index doesn't contain
anything for that field.  That is why there is no storage disadvantage
to having fields that are not being used.

Lucene does not have the concept of a schema.  That is part of Solr. 
Solr uses the information in the schema to control its interaction with
Lucene.  When there is a default value specified in the schema, the
field is never missing from the document.
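
As an illustration, here is roughly what a default looks like in
schema.xml. The field name and type are made up, not from any real schema:

<field name="in_stock" type="boolean" indexed="true" stored="true"
       default="false"/>

With that definition, any document indexed without an in_stock value gets
"false" added automatically, so the field is present in every document.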

Thanks,
Shawn


Re: Performance if there is a large number of fields

Deepak Goel
Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
[hidden email]

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Thu, May 10, 2018 at 10:50 PM, Shawn Heisey <[hidden email]> wrote:

> On 5/10/2018 10:58 AM, Deepak Goel wrote:
>
>> I wonder what does Solr stores in the document for fields which are not
>> being used. And if the queries have a performance difference
>> https://lucene.apache.org/solr/guide/6_6/defining-fields.html
>> (A default value that will be added automatically to any document that
>> does
>> not have a value in this field when it is indexed. If this property is not
>> specified, there is no default)
>>
>
> If a field is missing from a document, the Lucene index doesn't contain
> anything for that field.  That is why there is no storage disadvantage to
> having fields that are not being used.
>
> Lucene does not have the concept of a schema.  That is part of Solr.  Solr
> uses the information in the schema to control its interaction with Lucene.
> When there is a default value specified in the schema, the field is never
> missing from the document.
>
Sorry, but I am unclear about this: if there is no default value and the
field does not contain anything, what does Solr pass on to Lucene? Or is
the field itself omitted from the document?

What if I want to query for documents where the field is not used? Is that
possible?

> Thanks,
> Shawn
>
>

Re: Performance if there is a large number of fields

Shawn Heisey-2
On 5/10/2018 11:49 AM, Deepak Goel wrote:
> Sorry, but I am unclear about this: if there is no default value and the
> field does not contain anything, what does Solr pass on to Lucene? Or is
> the field itself omitted from the document?

If there is no default value and the field doesn't exist in what's
indexed, then nothing is sent to Lucene for that field. The Lucene index
will have nothing in it for that field.  Pro tip: The empty string is
not the same thing as no value.

> What if I want to query for documents where the field is not used? Is that
> possible?

This is the best-performing approach for finding documents where a field
doesn't exist:

q=*:* -field:[* TO *]

Summary: all documents, minus those where the field value is in an
all-inclusive range.
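
With SolrJ, a minimal sketch of running that query might look like this.
The core URL and the "price" field are made-up examples:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MissingFieldCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; adjust to your installation.
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {
            // All documents, minus those with any value in "price".
            SolrQuery q = new SolrQuery("*:* -price:[* TO *]");
            long missing = client.query(q).getResults().getNumFound();
            System.out.println(missing + " docs have no price");
        }
    }
}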

Thanks,
Shawn


Re: Performance if there is a large number of fields

Deepak Goel
On Fri, 11 May 2018, 01:15 Shawn Heisey, <[hidden email]> wrote:

> On 5/10/2018 11:49 AM, Deepak Goel wrote:
> > Sorry, but I am unclear about this: if there is no default value and the
> > field does not contain anything, what does Solr pass on to Lucene? Or is
> > the field itself omitted from the document?
>
> If there is no default value and the field doesn't exist in what's
> indexed, then nothing is sent to Lucene for that field. The Lucene index
> will have nothing in it for that field.  Pro tip: The empty string is
> not the same thing as no value.
>
> > What if I want to query for documents where the field is not used? Is
> > that possible?
>
> This is the best-performing approach for finding documents where a field
> doesn't exist:
>
> q=*:* -field:[* TO *]
>

Are there any benchmarks for this approach? If not, I can give it a spin.
I am also wondering if there is any alternative approach (I guess Lucene
stores data in an inverted index format).

>
> Summary: all documents, minus those where the field value is in an
> all-inclusive range.
>
> Thanks,
> Shawn
>
>

Re: Performance if there is a large number of fields

Shawn Heisey-2
On 5/10/2018 2:22 PM, Deepak Goel wrote:
> Are there any benchmarks for this approach? If not, I can give it a spin.
> I am also wondering if there is any alternative approach (I guess Lucene
> stores data in an inverted index format).

Here is the only other query I know of that can find documents missing a
field:

q=*:* -field:*

The potential problem with this query is that it uses a wildcard.  On
non-point fields with very low cardinality, the performance might be
similar.  But if the field is a Point type, or has a large number of
unique values, then performance would be a lot worse than the range
query I mentioned before.  The range query is the best general purpose
option.

The *:* query, despite appearances, does not use wildcards.  It is
special query syntax.
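
On the benchmark question: a rough SolrJ sketch like the one below would
let you compare the two forms yourself, using the QTime that Solr reports
for each query. Core and field names are made up; run each query several
times and ignore the first, cold-cache result:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MissingFieldBench {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {
            String[] forms = {
                "*:* -price:[* TO *]",  // range form
                "*:* -price:*"          // wildcard form
            };
            for (String qs : forms) {
                // getQTime() is Solr's reported query time in ms.
                int qtime = client.query(new SolrQuery(qs)).getQTime();
                System.out.println(qs + " -> QTime " + qtime + " ms");
            }
        }
    }
}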

Thanks,
Shawn


Re: Performance if there is a large number of fields

Andy C
Shawn,

Why are range searches more efficient than wildcard searches? I guess I
would have expected that they just provide different mechanisms for defining
the range of unique terms that are of interest, and that the merge
processing would be identical.

Would a search such as:

field:c*

be more efficient if rewritten as:

field:[c TO d}

then?

On Fri, May 11, 2018 at 10:45 AM, Shawn Heisey <[hidden email]> wrote:

> On 5/10/2018 2:22 PM, Deepak Goel wrote:
>
>> Are there any benchmarks for this approach? If not, I can give it a spin.
>> I am also wondering if there is any alternative approach (I guess Lucene
>> stores data in an inverted index format).
>>
>
> Here is the only other query I know of that can find documents missing a
> field:
>
> q=*:* -field:*
>
> The potential problem with this query is that it uses a wildcard.  On
> non-point fields with very low cardinality, the performance might be
> similar.  But if the field is a Point type, or has a large number of unique
> values, then performance would be a lot worse than the range query I
> mentioned before.  The range query is the best general purpose option.
>
> The *:* query, despite appearances, does not use wildcards.  It is special
> query syntax.
>
> Thanks,
> Shawn
>
>

Re: Performance if there is a large number of fields

Deepak Goel
In reply to this post by Shawn Heisey-2
Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
[hidden email]

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Fri, May 11, 2018 at 8:15 PM, Shawn Heisey <[hidden email]> wrote:

> On 5/10/2018 2:22 PM, Deepak Goel wrote:
>
>> Are there any benchmarks for this approach? If not, I can give it a spin.
>> I am also wondering if there is any alternative approach (I guess Lucene
>> stores data in an inverted index format).
>>
>
> Here is the only other query I know of that can find documents missing a
> field:
>
> q=*:* -field:*
>
> The potential problem with this query is that it uses a wildcard.  On
> non-point fields with very low cardinality, the performance might be
> similar.  But if the field is a Point type, or has a large number of unique
> values, then performance would be a lot worse than the range query I
> mentioned before.  The range query is the best general purpose option.
>
>
I wonder if giving a default value would help. Since Lucene stores all the
document IDs that contain the default value (i.e., unchanged by the user) in
a single postings list (inverted index format), these could be retrieved
much faster.


> The *:* query, despite appearances, does not use wildcards.  It is special
> query syntax.
>
> Thanks,
> Shawn
>
>

Re: Performance if there is a large number of fields

Shawn Heisey-2
In reply to this post by Andy C
On 5/11/2018 9:26 AM, Andy C wrote:
> Why are range searches more efficient than wildcard searches? I guess I
> would have expected that they just provide different mechanisms for defining
> the range of unique terms that are of interest, and that the merge
> processing would be identical.

I hope I can explain the reason that wildcard queries tend to be slow. 
I will use an example field from one of my own indexes.

Choosing one of the shards of my main index, and focusing on the
"keywords" field for that Solr core:  Here's the histogram data that the
Luke handler gives for this field:

      "histogram":[
        "1",14095268,
        "2",767777,
        "4",425610,
        "8",312156,
        "16",236743,
        "32",177718,
        "64",122603,
        "128",80513,
        "256",52746,
        "512",34925,
        "1024",24770,
        "2048",17516,
        "4096",11467,
        "8192",7748,
        "16384",5210,
        "32768",3433,
        "65536",2164,
        "131072",1280,
        "262144",688,
        "524288",355,
        "1048576",163,
        "2097152",53,
        "4194304",12]}},


The first entry means that there are 14 million terms that only appear
once in the keywords field across the whole index. The last entry means
that there are twelve terms that appear 4 million times in the keywords
field across the whole index.

Adding this all up, I can see that there are a little more than 16
million unique terms in this field.
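
If you want to pull the same histogram for a field in your own index, a
Luke handler request along these lines should work (core name is made up;
check the Luke handler docs for the exact parameters):

http://localhost:8983/solr/yourcore/admin/luke?fl=keywords&wt=json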

This means that when I do a "keywords:*" query, that Solr/Lucene will
expand this query such that the query literally contains 16 million
individual terms.  It's going to take time just to make the query.  And
then that query will have to be executed.  No matter how quickly each
term in the query executes, doing 16 million of them is going to be slow.

Just for giggles, I used my dev server to execute that "keywords:*"
query on this single shard.  The reported QTime in the response was
18017 milliseconds.  Then I ran the full range query.  The reported
QTime for that was 14569 milliseconds.  Which is honestly slower than I
thought it would be, but faster than the wildcard.  The number of unique
terms in the field affects both kinds of queries, but the effect of a
large number of terms on the wildcard is usually greater than the effect
on the range.

> Would a search such as:
>
> field:c*
>
> be more efficient if rewritten as:
>
> field:[c TO d}

On most indexes, probably.  That would depend on the number of terms in
the field, I think.  But there's something to consider:  Not every
wildcard query can be easily rewritten as a range.  I think this one is
impossible to rewrite as a range:  field:abc*xyz

I tried your c* example as well on my keywords field.  The wildcard had
a QTime of 1702 milliseconds.  The range query had a QTime of 1434
milliseconds.  The numFound on both queries was identical, at 16399711.

Thanks,
Shawn


Re: Performance if there is a large number of fields

Erick Erickson
Deepak:

I would strongly urge you to consider changing your problem solution
to _not_ need 35,000 fields. What that usually indicates is that there
are much better ways of tackling the problem. As Shawn says, 35,000
fields won't make much difference for an individual search. But 35,000
fields _do_ take up meta-data space; there has to be a catalog of all
the possibilities somewhere.

The question about missing fields is tricky. For the inverted index,
consider the structure. For each _field_ the structure looks like
this:
term, doc1, doc45, doc93.....

so really, a doc not having the field is much like a doc not having a
term in that field: its entry is simply missing.

But back to your problem. Think hard about _why_ you think you need
35,000 fields. Could you tag your values instead? Say you are storing
prices for stores for some item. Instead of having a field for
store1_price, store2_price... what about a single field containing tokens
like store1_price_1.53 store2_price_2.35, etc.?

Or consider payloads: store1_price|1.53 store2_price|2.35, and use
those. See: https://lucidworks.com/2017/09/14/solr-payloads/
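
A hedged sketch of the payload variant from that article (names are made
up; this assumes a Solr version with payload support, roughly 6.6+). In
the schema, a field type that parses a float payload after the |
delimiter, plus a field using it:

<fieldType name="delimited_payloads_float" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldType>
<field name="store_prices" type="delimited_payloads_float"
       indexed="true" stored="true"/>

Index a value like "store1_price|1.53 store2_price|2.35" into
store_prices, then read one store's price back with the payload function:

fl=id,price:payload(store_prices,store1_price)

One field replaces thousands of per-store price fields.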

I've rarely seen situations where having that many fields is an
optimal solution.

Best,
Erick

On Fri, May 11, 2018 at 12:20 PM, Shawn Heisey <[hidden email]> wrote:

> On 5/11/2018 9:26 AM, Andy C wrote:
>> Why are range searches more efficient than wildcard searches? I guess I
>> would have expected that they just provide different mechanisms for defining
>> the range of unique terms that are of interest, and that the merge
>> processing would be identical.
>
> I hope I can explain the reason that wildcard queries tend to be slow.
> I will use an example field from one of my own indexes.
>
> Choosing one of the shards of my main index, and focusing on the
> "keywords" field for that Solr core:  Here's the histogram data that the
> Luke handler gives for this field:
>
>       "histogram":[
>         "1",14095268,
>         "2",767777,
>         "4",425610,
>         "8",312156,
>         "16",236743,
>         "32",177718,
>         "64",122603,
>         "128",80513,
>         "256",52746,
>         "512",34925,
>         "1024",24770,
>         "2048",17516,
>         "4096",11467,
>         "8192",7748,
>         "16384",5210,
>         "32768",3433,
>         "65536",2164,
>         "131072",1280,
>         "262144",688,
>         "524288",355,
>         "1048576",163,
>         "2097152",53,
>         "4194304",12]}},
>
>
> The first entry means that there are 14 million terms that only appear
> once in the keywords field across the whole index. The last entry means
> that there are twelve terms that appear 4 million times in the keywords
> field across the whole index.
>
> Adding this all up, I can see that there are a little more than 16
> million unique terms in this field.
>
> This means that when I do a "keywords:*" query, that Solr/Lucene will
> expand this query such that the query literally contains 16 million
> individual terms.  It's going to take time just to make the query.  And
> then that query will have to be executed.  No matter how quickly each
> term in the query executes, doing 16 million of them is going to be slow.
>
> Just for giggles, I used my dev server to execute that "keywords:*"
> query on this single shard.  The reported QTime in the response was
> 18017 milliseconds.  Then I ran the full range query.  The reported
> QTime for that was 14569 milliseconds.  Which is honestly slower than I
> thought it would be, but faster than the wildcard.  The number of unique
> terms in the field affects both kinds of queries, but the effect of a
> large number of terms on the wildcard is usually greater than the effect
> on the range.
>
>> Would a search such as:
>>
>> field:c*
>>
>> be more efficient if rewritten as:
>>
>> field:[c TO d}
>
> On most indexes, probably.  That would depend on the number of terms in
> the field, I think.  But there's something to consider:  Not every
> wildcard query can be easily rewritten as a range.  I think this one is
> impossible to rewrite as a range:  field:abc*xyz
>
> I tried your c* example as well on my keywords field.  The wildcard had
> a QTime of 1702 milliseconds.  The range query had a QTime of 1434
> milliseconds.  The numFound on both queries was identical, at 16399711.
>
> Thanks,
> Shawn
>