
[jira] [Created] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud


[jira] [Created] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org
Pluggable shard lookup mechanism for SolrCloud
----------------------------------------------

                 Key: SOLR-2592
                 URL: https://issues.apache.org/jira/browse/SOLR-2592
             Project: Solr
          Issue Type: New Feature
          Components: SolrCloud
    Affects Versions: 4.0
            Reporter: Noble Paul


If the data in a cloud can be partitioned on some criterion (say range, hash, attribute value, etc.), it will be easy to narrow the search down to a smaller subset of shards and, in effect, achieve a more efficient search.


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050243#comment-13050243 ]

Noble Paul commented on SOLR-2592:
----------------------------------

This is why I created the issue SOLR-1431. It may have a configuration as follows:

{code:xml}
<requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- other params go here -->
    <shardHandler class="CloudShardHandler"/>
</requestHandler>
{code}



The CloudShardHandler should look up ZooKeeper and return all the shards by default.


I should be able to write a custom FqFilterCloudShardHandler and narrow the requests down to one or more shards:
{code:xml}
<requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- other params go here -->
    <shardHandler class="FqFilterCloudShardHandler"/>
</requestHandler>
{code}
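
To make the idea concrete, here is a rough sketch of what such a pluggable lookup could look like. The interface, class names, and the fq-based narrowing below are hypothetical illustrations of the proposal above, not an existing Solr API.

{code:java}
import java.util.Collections;
import java.util.List;

// Hypothetical contract: given the incoming filter query and the full shard list
// from ZooKeeper, return the shards the request should be forwarded to.
interface CloudShardLookup {
    List<String> lookupShards(String fq, List<String> allShards);
}

// Default behaviour described above: return every shard in the collection.
class AllShardsLookup implements CloudShardLookup {
    public List<String> lookupShards(String fq, List<String> allShards) {
        return allShards;
    }
}

// Custom lookup that narrows the request when the fq names the partition criterion,
// e.g. fq=region:eu routes to the single shard assumed to hold the "eu" partition.
class FqFilterCloudShardLookup implements CloudShardLookup {
    public List<String> lookupShards(String fq, List<String> allShards) {
        if (fq != null && fq.startsWith("region:eu")) {
            return Collections.singletonList(allShards.get(0)); // assumption: shard 0 holds "eu"
        }
        return allShards; // unknown criterion: fall back to querying all shards
    }
}
{code}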





[jira] [Updated] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Garski updated SOLR-2592:
---------------------------------

    Attachment: pluggable_sharding.patch

This patch is intended to be a cocktail-napkin sketch to get feedback (as such, forwarding queries to the appropriate shards is not yet implemented). I can iterate on this as needed.

The attached patch is a very simple implementation of pluggable sharding which works as follows:

1. Configure a ShardingStrategy in SolrConfig under config/shardingStrategy; if none is configured, the default implementation, which shards on the document's unique id, is used.
{code:xml}
     <shardingStrategy class="solr.UniqueIdShardingStrategy"/>
{code}

2. The ShardingStrategy accepts an AddUpdateCommand, DeleteUpdateCommand, or SolrParams and returns a BytesRef that is hashed to determine the destination slice (a rough sketch follows this list).

3. I have only implemented updates at this time; queries are still distributed across all shards in the collection. I have added a 'shard.keys' parameter to common.params.ShardParams to carry the value(s) to be hashed to determine the shard(s) to query within the HttpShardHandler.checkDistributed method. If 'shard.keys' has no value, the query is distributed across all shards in the collection.
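
As a rough sketch of the plugin contract steps 1 and 2 describe (simplified to plain JDK types; the actual patch works with AddUpdateCommand/DeleteUpdateCommand/SolrParams and BytesRef, and the names below are illustrative, not the patch's API):

{code:java}
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for the strategy described above: the plugin yields the
// bytes that are hashed to pick the destination slice for an update.
interface ShardingStrategy {
    byte[] shardKeyBytes(String uniqueId);
}

// Step 1's default: shard on the document's unique id.
class UniqueIdShardingStrategy implements ShardingStrategy {
    public byte[] shardKeyBytes(String uniqueId) {
        return uniqueId.getBytes(StandardCharsets.UTF_8);
    }
}

// The node then maps the hash of those bytes onto one of the collection's slices.
class SliceRouter {
    static int pickSlice(byte[] key, int numSlices) {
        int h = java.util.Arrays.hashCode(key); // stand-in for the patch's real hash function
        return (h & 0x7fffffff) % numSlices;
    }
}
{code}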

Notes:

There are no unit tests yet; however, all existing tests pass.

I am not quite sure about the configuration location within the solr config; however, as sharding is used by both update and search requests, placing it in the updateHandler and (potentially multiple) requestHandler sections would duplicate the same information in the solr config for what I believe is more of a collection-wide configuration.

As hashing currently requires the lucene.util.BytesRef class, the solrj client cannot currently hash the request in order to send it to a specific node without adding a dependency on lucene core to solrj - something that is most likely not desired. Additionally, hashing on a unique id requires access to the schema to determine the field that contains the unique id. Are there any thoughts on how to alter the hashing to remove these dependencies and allow solrj to be a 'smart' client that submits requests directly to the nodes that contain the data?

How would solrj work when a request includes multiple updates that belong to different shards? Send the request to one of the nodes and let the server distribute them to the proper nodes? Perform concurrent requests to the specific nodes?


               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279906#comment-13279906 ]

Michael Garski commented on SOLR-2592:
--------------------------------------

I've been tinkering with this a bit more and discovered that the real-time get component needs to hash on a document's unique id to determine the shard to forward the request to, so any hashing implementation must support hashing based on the document's unique id. To account for this, any custom hashing based on an arbitrary document property or other value would have to hash to the same value as the unique document id.

As for which representation of the unique id to hash on: by hashing the string value rather than the indexed value, solrj and any other smart client can hash it and submit requests directly to the proper shard. The method oas.common.util.ByteUtils.UTF16toUTF8 would serve to generate the bytes for the murmur hash on both client and server.
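
To illustrate (a sketch only, not code from the patch): a smart client can compute the same routing as the server by hashing the bytes of the string id. CRC32 below is just a stand-in for the MurmurHash-over-UTF16toUTF8 bytes mentioned above; what matters is that client and server use the same hash.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: pick a shard from the *string* value of the unique id so that solrj
// (or any other client) and the server agree on the destination shard.
public class ClientSideRouter {
    public static int shardFor(String uniqueId, int numShards) {
        CRC32 crc = new CRC32();                              // placeholder hash function
        crc.update(uniqueId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        System.out.println(shardFor("doc-123", 4)); // same doc id -> same shard everywhere
    }
}
{code}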

On both the client and server side, updates and deletes by id are simple enough to hash; queries, however, would require an additional parameter specifying a value to hash on, so that a query can optionally be directed to a particular shard rather than broadcast to all shards.

I'm going to rework what I had in the previous patch to account for this.

               


[jira] [Updated] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Garski updated SOLR-2592:
---------------------------------

    Attachment: pluggable_sharding_V2.patch

Here is an update to my original patch that accounts for the requirement of hashing based on unique id and works as follows:

1. Configure a ShardKeyParserFactory in SolrConfig under config/shardKeyParserFactory. If none is configured, the default implementation, which shards on the document's unique id, is used. The default configuration is equivalent to:
{code:xml}
<shardKeyParserFactory class="solr.ShardKeyParserFactory"/>
{code}

2. The ShardKeyParser has two methods to parse a shard key out of either the unique id or a delete-by-query (a minimal sketch follows this list). The default implementation returns the string value of the unique id when parsing the unique id, forwarding the update to its specific shard, and returns null when parsing a delete-by-query, broadcasting the delete to the entire collection.

3. Queries can be directed to a subset of shards in the collection by specifying one or more shard keys in the request parameter 'shard.keys'.
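
A minimal sketch of the two-method contract from step 2, with illustrative names (the patch's actual class and method names may differ):

{code:java}
// Illustrative contract for step 2: one method routes adds/deletes-by-id from the
// unique id, the other tries to recover a shard key from a delete-by-query.
interface ShardKeyParser {
    String parseFromUniqueId(String uniqueId);
    String parseFromDeleteQuery(String deleteQuery);
}

// Default behaviour per the description above.
class DefaultShardKeyParser implements ShardKeyParser {
    public String parseFromUniqueId(String uniqueId) {
        return uniqueId;  // hash the id itself -> forward to its shard
    }
    public String parseFromDeleteQuery(String deleteQuery) {
        return null;      // no key recoverable -> broadcast the delete
    }
}
{code}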

Notes:

There are no distinct unit tests for this change yet; however, all current unit tests pass. The switch to hashing on the string value rather than the indexed value is how I discovered, via a failing test, that the real-time get component requires support for hashing based on the document's unique id.

By hashing on string values rather than indexed values, the solrj client can direct queries to a specific shard; however, this is not yet implemented.

I put the hashing function in the oas.common.cloud.HashPartitioner class, which encapsulates the hashing and partitioning in one place. I can see a desire for pluggable collection partitioning, where a collection could be partitioned on time periods or some other criterion, but that is outside the scope of pluggable shard hashing.

               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284051#comment-13284051 ]

Nicholas Ball commented on SOLR-2592:
-------------------------------------

What's the status on this, and does the patch work with the latest trunk version of https://issues.apache.org/jira/browse/SOLR-2358 ?

Nick
               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284187#comment-13284187 ]

Michael Garski commented on SOLR-2592:
--------------------------------------

The pluggable_sharding_V2.patch does work with the latest version of trunk; however, I would recommend running the unit tests after applying the patch to verify things on your end. I am hesitant to recommend the patch for production use, as it changes the value that is hashed from the indexed value to the string value of the unique id, which changes the behavior of the current hashing if your unique id is a numeric field type.

I am using the patch with the latest trunk from a few days ago for testing, with composite unique ids (123_456_789), so I created a ShardKeyParser that hashes on the first id in the composite (123) and parses a clause out of a delete-by-query to route updates and deletes to the appropriate shards. To direct a query to a particular shard, set the shard.keys param to the value that should be hashed (123).
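
For illustration, a sketch of the kind of parser described above (the class and method names are made up for this example, not taken from the patch):

{code:java}
// Composite unique ids look like "123_456_789"; the leading segment decides the shard.
class CompositeIdShardKeyParser {
    String parseShardKey(String uniqueId) {
        int i = uniqueId.indexOf('_');
        return i < 0 ? uniqueId : uniqueId.substring(0, i); // "123_456_789" -> "123"
    }
}
{code}

A query can then be pinned to the matching shard with the same key, e.g. {{/select?q=*:*&shard.keys=123}} (the collection name and handler path are whatever your deployment uses).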
               


[jira] [Updated] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Garski updated SOLR-2592:
---------------------------------

    Attachment: dbq_fix.patch

Here is an updated patch that fixes an issue where a delete by query was silently dropped. I have not yet added any additional unit tests for this patch, however all existing unit tests pass - which may not be that significant, as they did not catch the DBQ issue.
               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290763#comment-13290763 ]

Lance Norskog commented on SOLR-2592:
-------------------------------------

Hash-based distribution (deterministic pseudo-randomness) is only one way to divvy up documents among shards. Another common one is date-based: each month lives in a different shard, so you can search recent or older data without searching all shards.

If you're going to make pluggable distribution policies, why stick to hashing? Why not do the general case?
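
For example (a sketch only; the policy class and shard naming below are hypothetical), a date-based policy could map a document's month onto a shard, so a "recent" query only fans out to the newest shard or two:

{code:java}
import java.time.LocalDate;
import java.time.YearMonth;

// Hypothetical general distribution policy: one shard per calendar month.
class MonthlyShardPolicy {
    String shardFor(LocalDate docDate) {
        return "shard_" + YearMonth.from(docDate); // e.g. "shard_2012-06"
    }
}
{code}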


               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291120#comment-13291120 ]

Michael Garski commented on SOLR-2592:
--------------------------------------

{quote}
If you're going to make pluggable distribution policies, why stick to hashing? Why not do the general case?
{quote}

Very good point. The patches I've attached do not attempt to address other partitioning schemes such as date-based or federated, as my immediate requirement is to be able to customize the hashing. Implementing a pluggable distribution policy is the logical end state; however, I have not had the time to investigate potential approaches for such a change yet.
               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294837#comment-13294837 ]

Andy Laird commented on SOLR-2592:
----------------------------------

I have tried out Michael's patch and would like to provide some feedback to the community. We are using a very recent build from the 4x branch, but I grabbed this patch from trunk and tried it out anyway...

Our needs were driven by the fact that, currently, the counts returned when using field collapse are only accurate when the documents being collapsed together are all on the same shard (see the comments on https://issues.apache.org/jira/browse/SOLR-2066). In our case we collapse on a field, xyz, so if we want counting to work we need to ensure that all documents with the same value for xyz end up on the same shard (overall distribution is not a problem here).

I grabbed the latest patch (dbq_fix.patch) in hopes of finding a solution to our problem. The great news is that Michael's patch worked like a charm for what we needed -- thank you kindly, Michael, for this effort! The not-so-good news is that for our particular issue we needed a way to get at data other than the uniqueKey (the only data available to the ShardKeyParser) -- in our case we need access to the xyz field data. Since this implementation provides nothing but the uniqueKey, we had to encode the xyz data in our uniqueKey (e.g. newUniqueKey = what-used-to-be-our-uniqueKey + xyz), which is certainly less than ideal and adds unsavory coupling.

Nonetheless, as a fix to a last-minute gotcha (our counts with field collapse need to be accurate in a multi-shard environment), I was happily surprised at how easy it was to solve our particular problem with this patch. I would definitely like to see a second iteration that adds the ability to get at other document data; then you could do whatever you want by looking at dates and other fields, though I understand that probably goes quite a bit deeper into the codebase, especially with distributed search.

               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295390#comment-13295390 ]

Michael Garski commented on SOLR-2592:
--------------------------------------

The reason for requiring the unique id to be hashable is that the distributed real-time get component must be able to retrieve a document based on only the unique id, which in turn is required for SolrCloud. Unit tests that exercise the patch thoroughly are still needed, and I will be diving into that later this week, so please keep that in mind if you are using this outside of a test environment.
               


[jira] [Updated] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Garski updated SOLR-2592:
---------------------------------

    Attachment: SOLR-2592.patch

Attached is an updated patch that includes unit tests and applies cleanly to the latest in branch_4x. This patch corrects a bug found in unit testing where a delete by ID might not be forwarded to the correct shard. I created a shard key parser implementation for testing that uses the string value of the unique id reversed, and modified the FullSolrCloudDistribCmdsTest test to use that implementation (which is how I found the delete-by-id bug). I've also added a shard key parser, with unit tests, that parses the shard key value from an underscore-delimited unique id and from a clause in a delete query. As the shard must be parsed from the unique id for the realtime get handler, a composite id of some sort will be required for placing a document in a specific shard.

While the solrj client would be able to parse the shard key from the query and forward the request to the appropriate shard, I have not implemented that piece yet.
               


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933 ]

Andy Laird commented on SOLR-2592:
----------------------------------

I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case. As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that field collapse counting will work for distributed searches; otherwise, the returned counts are only an upper bound.

The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on. Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch -- involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we insert the extra "indexId" data in the form "id:xyz". Our custom ShardKeyParser extracts the xyz portion of the uniqueKey and returns that as the hash value for shard selection. Everything works great in terms of field collapse, counts, etc.

The problems begin when we need to change the value of the field xyz. Suppose that our document starts out with these values for the 3 fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to the value 789, say.  In other words, we want to end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard along with other docs that have xyz=789 (so that field collapse counts are correct since we use that to drive paging).

Before any of this we would simply pass in a new document and all would be good, since we weren't changing the uniqueKey. However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates. We don't know whether a given update changes the value of xyz, and we don't know what the old value of xyz was (without doing an additional lookup), so we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="indexId">123:789</field>    <!-- old indexId was 123:456 -->
  </doc>
</add>
{code}

So in addition to the "unsavory coupling" between id and xyz, there is a significant performance hit to this approach (as we're doing this in the context of NRT). The fundamental issue, of course, is that we only have the uniqueKey value (id) and score in the first phase of distributed search -- we really need the other field we are using for shard ownership, too.

One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey, and score. Perhaps the ShardKeyParser gets both the uniqueKey and shardKey data for maximum flexibility. In addition to solving our issue with field collapse counts, date-based sharding could then be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser.

               




[jira] [Comment Edited] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933 ]

Andy Laird edited comment on SOLR-2592 at 6/23/12 1:13 PM:
-----------------------------------------------------------

I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case. As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that field collapse counting will work for distributed searches; otherwise, the returned counts are only an upper bound. For us the counts returned during field collapse have to be exact, since we use that to drive paging (we can't have a user going to page 29 and finding nothing there).

The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on.  Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch -- involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we insert the extra "indexId" data in the form, "id:xyz".  Our custom ShardKeyParser extracts out the xyz portion of the uniqueKey and returns that as the hash value for shard selection.  Everything works great in terms of field collapse, counts, etc.

The problems begin when considering what happens when we need to change the value of the field, xyz.  Suppose that our document starts out with these values for the 3 fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to the value 789, say.  In other words, we want to end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard along with other docs that have xyz=789.

Before any of this we would simply pass in a new document and all would be good since we weren't changing the uniqueKey.  However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates.  We don't know whether a given update changes the value of xyz or not and we don't know what the old value for xyz was (without doing an additional lookup) so we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="indexId">123:789</field>    <!-- old value was 123:456 -->
  </doc>
</add>
{code}

So in addition to the "unsavory coupling" between id and xyz there is a significant performance hit to this approach (as we're doing this in the context of NRT).  The fundamental issue, of course, is that we only have the uniqueKey value (id) and score for the first phase of distributed search -- we really need the other field that we are using for shard ownership, too.

One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey and score.  Perhaps the ShardKeyParser gets both uniqueKey and shardKey data for more flexibility.  In addition to solving our issue with field collapse counts, date-based sharding could be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser.

               

> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Noble Paul
>         Attachments: SOLR-2592.patch, dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value etc) It will be easy to narrow down the search to a smaller subset of shards and in effect can achieve more efficient search.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Comment Edited] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933 ]

Andy Laird edited comment on SOLR-2592 at 6/23/12 1:15 PM:
-----------------------------------------------------------

I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case.  As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that the field collapse counting will work for distributed searches, otherwise the returned counts are only an upper bound.  For us the counts returned from a search using field collapse have to be exact since that drives paging logic (we can't have a user going to page 29 and finding nothing there).

The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on.  Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch -- involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we insert the extra "indexId" data in the form "id:xyz".  Our custom ShardKeyParser extracts the xyz portion of the uniqueKey and returns that as the hash value for shard selection.  Everything works great in terms of field collapse, counts, etc.

The problems begin when we need to change the value of the xyz field.  Suppose that our document starts out with these values for the 3 fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to the value 789, say.  In other words, we want to end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard along with other docs that have xyz=789.

Before any of this we would simply pass in a new document and all would be fine, since we weren't changing the uniqueKey.  However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates.  We don't know whether a given update changes the value of xyz, and we don't know what the old value of xyz was (without doing an additional lookup), so we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="indexId">123:789</field>    <-- old value was 123:456
  <doc>
</add>
{code}

So in addition to the "unsavory coupling" between id and xyz, there is a significant performance hit with this approach (we're doing this in the context of NRT).  The fundamental issue, of course, is that we only have the uniqueKey value (id) and score from the first phase of distributed search -- we really need the other field that we are using for shard ownership, too.

One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey and score.  Perhaps the ShardKeyParser gets both uniqueKey and shardKey data for more flexibility.  In addition to solving our issue with field collapse counts, date-based sharding could be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser.

               
      was (Author: clavius):
    I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case.  As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that the field collapse counting will work for distributed searches, otherwise the returned counts are only an upper bound.  For us the counts returned during field-collapse have to be exact since we use that to drive paging (we can't have a user going to page 29 and finding nothing there).

The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on.  Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch -- involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we insert the extra "indexId" data in the form, "id:xyz".  Our custom ShardKeyParser extracts out the xyz portion of the uniqueKey and returns that as the hash value for shard selection.  Everything works great in terms of field collapse, counts, etc.

The problems begin when considering what happens when we need to change the value of the field, xyz.  Suppose that our document starts out with these values for the 3 fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to the value 789, say.  In other words, we want to end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard along with other docs that have xyz=789.

Before any of this we would simply pass in a new document and all would be good since we weren't changing the uniqueKey.  However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates.  We don't know whether a given update changes the value of xyz or not and we don't know what the old value for xyz was (without doing an additional lookup) so we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="indexId">123:789</field>    <-- old value was 123:456
  <doc>
</add>
{code}

So in addition to the "unsavory coupling" between id and xyz there is a significant performance hit to this approach (as we're doing this in the context of NRT).  The fundamental issue, of course, is that we only have the uniqueKey value (id) and score for the first phase of distributed search -- we really need the other field that we are using for shard ownership, too.

One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey and score.  Perhaps the ShardKeyParser gets both uniqueKey and shardKey data for more flexibility.  In addition to solving our issue with field collapse counts, date-based sharding could be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser.

                 

> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Noble Paul
>         Attachments: SOLR-2592.patch, dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value etc) It will be easy to narrow down the search to a smaller subset of shards and in effect can achieve more efficient search.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Comment Edited] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933 ]

Andy Laird edited comment on SOLR-2592 at 6/23/12 1:18 PM:
-----------------------------------------------------------

I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case.  As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that the field collapse counting will work for distributed searches, otherwise the returned counts are only an upper bound.  For us the counts returned from a search using field collapse have to be exact since that drives paging logic (we can't have a user going to page 29 and finding nothing there).

The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on.  Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch -- involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we insert the extra {{indexId}} data in the form {{id:xyz}}.  Our custom ShardKeyParser extracts the {{xyz}} portion of the uniqueKey and returns that as the hash value for shard selection.  Everything works great in terms of field collapse, counts, etc.

Consider, however, what happens when we need to change the value of the xyz field.  Suppose that our document starts out with these values for the 3 fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to the value 789, say.  In other words, we want to end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard along with other docs that have xyz=789.

Before any of this we would simply pass in a new document and all would be fine, since we weren't changing the uniqueKey.  However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates.  We don't know whether a given update changes the value of xyz, and we don't know what the old value of xyz was (without doing an additional lookup), so we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="indexId">123:789</field>    <-- old value was 123:456
  <doc>
</add>
{code}

So in addition to the "unsavory coupling" between id and xyz, there is a significant performance hit with this approach (we're doing this in the context of NRT).  The fundamental issue, of course, is that we only have the uniqueKey value (id) and score from the first phase of distributed search -- we really need the other field that we are using for shard ownership, too.

One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey and score.  Perhaps the ShardKeyParser gets both uniqueKey and shardKey data for more flexibility.  In addition to solving our issue with field collapse counts, date-based sharding could be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser.

               
      was (Author: clavius):
    I've been using some code very similar to Michael's latest patch for a few weeks now and am liking it less and less for our use case.  As I described above, we are using this patch to ensure that all docs with the same value for a specific field end up on the same shard -- this is so that the field collapse counting will work for distributed searches, otherwise the returned counts are only an upper bound.  For us the counts returned from a search using field collapse have to be exact since that drives paging logic (we can't have a user going to page 29 and finding nothing there).

The problems we've encountered have entirely to do with our need to update the value of the field we're doing a field-collapse on.  Our approach -- conceptually similar to the CompositeIdShardKeyParserFactory in Michael's latest patch -- involved creating a new schema field, indexId, that was a combination of what used to be our uniqueKey plus the field that we collapse on:

*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}

*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="xyz" type="string" indexed="true" stored="true" multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true" multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}

During indexing we insert the extra "indexId" data in the form, "id:xyz".  Our custom ShardKeyParser extracts out the xyz portion of the uniqueKey and returns that as the hash value for shard selection.  Everything works great in terms of field collapse, counts, etc.

The problems begin when considering what happens when we need to change the value of the field, xyz.  Suppose that our document starts out with these values for the 3 fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}

We then want to change xyz to the value 789, say.  In other words, we want to end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}

...so that the doc lives on the same shard along with other docs that have xyz=789.

Before any of this we would simply pass in a new document and all would be good since we weren't changing the uniqueKey.  However, now we need to delete the old document (with the old uniqueKey) or we'll end up with duplicates.  We don't know whether a given update changes the value of xyz or not and we don't know what the old value for xyz was (without doing an additional lookup) so we must include an extra delete along with every change:

*Before*
{code:xml}
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
  </doc>
</add>
{code}

*Now*
{code:xml}
<delete>
  <query>id:123 AND NOT xyz:789</query>
</delete>
<add>
  <doc>
    <field name="id">123</field>
    <field name="xyz">789</field>
    <field name="indexId">123:789</field>    <-- old value was 123:456
  <doc>
</add>
{code}

So in addition to the "unsavory coupling" between id and xyz there is a significant performance hit to this approach (as we're doing this in the context of NRT).  The fundamental issue, of course, is that we only have the uniqueKey value (id) and score for the first phase of distributed search -- we really need the other field that we are using for shard ownership, too.

One idea is to have another standard schema field similar to uniqueKey that is used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask for uniqueKey, shardKey and score.  Perhaps the ShardKeyParser gets both uniqueKey and shardKey data for more flexibility.  In addition to solving our issue with field collapse counts, date-based sharding could be done by setting the shardKey to a date field and doing appropriate slicing in the ShardKeyParser.

                 

> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Noble Paul
>         Attachments: SOLR-2592.patch, dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value etc) It will be easy to narrow down the search to a smaller subset of shards and in effect can achieve more efficient search.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399999#comment-13399999 ]

Michael Garski commented on SOLR-2592:
--------------------------------------

One of the use cases I have is identical to yours, Andy, where shard membership is used to ensure accuracy of the numGroups in the response for a distributed grouping query.

The challenge in that use case is that during an update a document could potentially move from one shard to another, requiring deleting it from its current shard along with adding it to the shard where it will now reside. If the previous value of the shardKey is not known, the same delete by query operation you have in 'Now' would have to be broadcast to all shards to ensure there are no duplicate unique ids in the collection. It looks like that would result in the same overhead as using the composite id. Do you have any ideas on how to handle that during an update?

Adding a separate shardKey definition to the schema would also cascade the change to the real-time get handler, which currently only uses the unique document ids as an input.

Regarding date-based sharding, I look at that as being handled differently. With hashing, a document is assigned to a specific shard from a set of known shards, whereas with date-based sharding I would imagine one would want to bring up a new shard for a specific time period, perhaps daily or hourly. I can imagine that it might be desirable in some cases to merge shards with older date ranges together as well, if the use case favors recent updates at search time.
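
As a rough illustration of that distinction (the class and shard-naming scheme below are assumptions for the example, not part of any patch here), a time-partitioned setup could derive the target shard's name from the document's timestamp, so new shards simply appear as time advances rather than being chosen from a fixed hashed set:
{code:java}
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Derives a per-day shard name such as "shard_2012-06-23"; a new shard comes
// into existence for each day instead of hashing into a fixed, known set.
class DailyShardNamer {
  private static final DateTimeFormatter DAY =
      DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

  String shardFor(Instant docTimestamp) {
    return "shard_" + DAY.format(docTimestamp);
  }
}
{code}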
               

> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Noble Paul
>         Attachments: SOLR-2592.patch, dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value etc) It will be easy to narrow down the search to a smaller subset of shards and in effect can achieve more efficient search.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Updated] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Garski updated SOLR-2592:
---------------------------------

    Attachment: SOLR-2592_rev_2.patch

Updated patch (SOLR-2592_rev_2.patch) to remove custom similarities from the test configs as they were causing the test to hang.
               

> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0
>            Reporter: Noble Paul
>         Attachments: SOLR-2592.patch, SOLR-2592_rev_2.patch, dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value etc) It will be easy to narrow down the search to a smaller subset of shards and in effect can achieve more efficient search.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
