Regarding document routing

manish tanger
Hello All,

I have a question about implicit routing and couldn't find much information
about it online, so please help me out.

*About environment:*
M/c 1: Zookeeper 1 and Solr 1
M/c 2: Zookeeper 2 and Solr 2

I am running a clustered ZooKeeper and using "CloudSolrClient" from the
SolrJ API in Java.

    this.solrCloudClient = new CloudSolrClient.Builder().withZkHost(zkHostList).build();

*Requirement:*

My requirement is to store lots of data in Solr using a single collection,
so my idea is to create a new shard for every hour so that indexing doesn't
take much time.

I chose implicit document routing, but I am unable to direct the docs to a
particular shard. ZooKeeper is still distributing them across all nodes and
shards.


*What I have tried:*
1. I created a collection with implicit routing, set a custom routing
field "*dateandhour*", and added it as a field in my collection.

    While adding a Solr input doc, I set this field to the shard name.


2. I also tried adding the shard name to the id field, like:
     id="*shardName!*uniquedocumentId"


If you have an example or documentation, please share it with me.

Thanks for all your help.


Best Regards,

Manish

Re: Regarding document routing

Shawn Heisey-2
On 1/10/2018 12:18 AM, manish tanger wrote:
> I have a question about implicit routing and couldn't find much information
> about it online, so please help me out.
>
> *About environment:*
> M/c 1: Zookeeper 1 and Solr 1
> M/c 2: Zookeeper 2 and Solr 2

For redundancy with ZK, you need three hosts minimum.  A two-host ZK
ensemble is actually *less* reliable than using one server.  You aren't
protected against failure until you have at least three.  You would only
need a minimum of two Solr hosts, though.

> I am running a clustered ZooKeeper and using "CloudSolrClient" from the
> SolrJ API in Java.
>
>     this.solrCloudClient = new CloudSolrClient.Builder().withZkHost(zkHostList).build();
>
> *Requirement:*
>
> My requirement is to store lots of data in Solr using a single collection,
> so my idea is to create a new shard for every hour so that indexing doesn't
> take much time.
>
> I chose implicit document routing, but I am unable to direct the docs to a
> particular shard. ZooKeeper is still distributing them across all nodes and
> shards.

ZooKeeper isn't responsible for distributing documents between shards. 
It is Solr that does this, using information in the ZK database.  With
the implicit router, the only routing information in ZK is the shard
names.  Solr cannot make decisions about which shard gets the documents;
that information must come from the system doing the indexing.
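
As an illustration of "the system doing the indexing" supplying the routing information, here is a minimal Java sketch that derives an hourly shard name from a timestamp. It assumes shards are named with the pattern yyyyMMdd_HH (for example 20180111_04) and that hours are bucketed in UTC; the class and method names are made up for this example and are not part of SolrJ.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Derives an hourly shard name such as "20180111_04" from a timestamp. */
final class HourlyShardName {
    // Assumed naming pattern: date, underscore, two-digit hour of day.
    // UTC is an assumption; use whichever zone your hourly buckets follow.
    private static final DateTimeFormatter SHARD_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd_HH").withZone(ZoneOffset.UTC);

    /** Returns the name of the shard that should receive a document with this timestamp. */
    static String shardNameFor(Instant timestamp) {
        return SHARD_FORMAT.format(timestamp);
    }

    public static void main(String[] args) {
        // prints 20180111_04
        System.out.println(shardNameFor(Instant.parse("2018-01-11T04:30:00Z")));
    }
}
```

The indexing code would then put this value wherever the collection expects its routing information, for example in the field named by router.field.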

> *What I have tried:*
> 1. I created a collection with implicit routing, set a custom routing
> field "*dateandhour*", and added it as a field in my collection.
>
>      While adding a Solr input doc, I set this field to the shard name.

What were the precise commands or API calls that you used to create the
collection?  What is the definition of the dateandhour field?

> 2. I also tried adding the shard name to the id field, like:
>       id="*shardName!*uniquedocumentId"

If you want to use a prefix in the uniqueId field, you must be using the
compositeId router, not the implicit router.  The compositeId router
will not fit your use case, though -- you cannot add shards to a
collection if it uses compositeId.  Also, the prefix does not specify
the shard by name; the value of the prefix is hashed to determine which
shard(s) are used.
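
To make the "hashed, not named" point concrete, here is a toy Java model of prefix routing. It is deliberately simplified and is not Solr's algorithm: Solr hashes with MurmurHash3 and maps the result onto per-shard hash ranges, while this sketch substitutes String.hashCode() and a modulo. The class and method names are invented for the illustration.

```java
/** Toy model of prefix-based routing; NOT the hash function or range logic Solr uses. */
final class PrefixHashDemo {

    /** Returns the part of the id before '!', or the whole id when there is no '!'. */
    static String routeKey(String id) {
        int bang = id.indexOf('!');
        return bang < 0 ? id : id.substring(0, bang);
    }

    /** Stand-in for the real hashing: String.hashCode() modulo the shard count. */
    static int shardIndexFor(String id, int numShards) {
        return Math.floorMod(routeKey(id).hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4; // assumed shard count for the demo
        // Both ids share the prefix, so they land on the same shard index,
        // but that index comes from the hash, not from the prefix text.
        System.out.println(shardIndexFor("20180111_04!doc-1", numShards));
        System.out.println(shardIndexFor("20180111_04!doc-2", numShards));
    }
}
```

Documents with the same prefix end up together, but the prefix text never selects a shard by its name, which is why a prefix like "shardName!" has no such effect.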

Here's the documentation on document routing:

https://lucene.apache.org/solr/guide/7_2/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting

Thanks,
Shawn


Re: Regarding document routing

manish tanger
Hello Shawn,

First of all, thanks for your response.

>For redundancy with ZK, you need three hosts minimum.  A two-host ZK ensemble is actually *less* reliable than using one server.  You aren't protected against failure until you have at least three.  You would only need a minimum of two Solr hosts, though.
Yes, I have read the same somewhere, but this is my local setup, not a production setup. I'll keep your advice in mind when building the production setup.


>ZooKeeper isn't responsible for distributing documents between shards.  It is Solr that does this, using information in the ZK database.  With the implicit router, the only routing information in ZK is the shard names.  Solr cannot make decisions about which shard gets the documents, that information must come from the system doing the indexing.
Since we connect through ZooKeeper, my understanding was that routing would be done by ZooKeeper. Thanks for the clarification.


>What was the precise commands or API calls that you used to create the collection?  What is the definition of the dateandhour field?


Collection Creation Through UI:
[Inline image 3]

API calls for inserting the docs:
// Needs java.util.ArrayList, java.util.List, java.util.UUID, plus
// CloudSolrClient and SolrInputDocument from SolrJ.
List<SolrInputDocument> inputDocuments = new ArrayList<>();
solrCloudClient = new CloudSolrClient.Builder().withZkHost(ZK_HOST_LIST).build();
solrCloudClient.setDefaultCollection(COLLECTION_NAME);

SolrInputDocument inputDocument = new SolrInputDocument();
inputDocument.addField("id", UUID.randomUUID().toString());
// Routing field: its value is the name of the target shard.
inputDocument.addField(dateAndHour, "20180111_04");
inputDocument.addField(__KEY__, __VALUE__);
inputDocuments.add(inputDocument);

solrCloudClient.add(inputDocuments);


dateandhour field definition:
<field name="dateandhour" type="string" indexed="false" stored="true"/>

Here I want all the data for one hour to go into the 20180111_04 shard.

Thanks for your help.


Regards

Manish Kr. Tanger



Re: Regarding document routing

Shawn Heisey-2
On 1/10/2018 11:00 PM, manish tanger wrote:
> Since we connect through ZooKeeper, my understanding was that routing would
> be done by ZooKeeper. Thanks for the clarification.

CloudSolrClient doesn't actually connect through ZK.  When you create
the client using ZK info, the client reads information about the cloud
from ZK, and discovers where the Solr servers are.  All the actual work
that the client does is sent to those Solr servers that were discovered
by reading the ZK database.

>> What were the precise commands or API calls that you used to create the
>> collection?  What is the definition of the dateandhour field?
>
> Collection Creation Through UI:
> Inline image 3

Attachments rarely make it to the list.  Your image showing the
collection creation did not make it, so I can't see that information.  If
you want to use an image for that, you're going to need to find some
kind of website for sharing images and provide us with a link.  But as
you'll read below, sharing that may not be required.

> dateandhour field definition:
> <field name="dateandhour" type="string" indexed="false" stored="true"/>

I have discovered a problem in the admin UI on version 7.2, which may
affect other versions.  Whatever you enter into the "routerField" box
gets sent as a "routerField" parameter -- *not* as the "router.field"
parameter that is actually required.  So the collection's state.json
file does not have a router field defined.

I opened an issue for that problem:

https://issues.apache.org/jira/browse/SOLR-11843

Can you try creating a collection with the API directly, rather than
with the admin UI, and using the correct "router.field" parameter?

https://lucene.apache.org/solr/guide/7_2/collections-api.html#CollectionsAPI-Input

Thanks,
Shawn

Re: Regarding document routing

manish tanger
Hello Shawn,

Here are the UI options I filled in. For clarification, I am using
Solr 6.5.1.



name : Collection_name
config set: ber
numshards: 1
replicationfactor: 1

Advanced options:
router : Implicit
maxShardsPerNode: 1
shards: 20180111_04,20180111_05
routerField: dateandhour




Regards

Manish Kr. Tanger



Re: Regarding document routing

Erick Erickson
Shawn's point (and JIRA) is that the UI doesn't pass the "router"
parameter correctly, so it is being ignored.

Simply put: You cannot create collections with the admin UI using
implicit routing because of this bug.  Don't use it.

Either use the "bin/solr create_collection" command or put the
parameters directly on the URL, using the parameters documented here:

https://lucene.apache.org/solr/guide/6_6/collections-api.html

something like:
..../solr/admin/collections?action=CREATE&router.name=implicit&shards=shard1,shard2,shard3&replicationFactor=.....

You can check whether the collection is created correctly by going to the
admin UI>>cloud>>tree>>collections>>your_collection
You should see the data about your collection, including what router
was actually used.
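
As a rough sketch of the URL approach, the Java snippet below assembles a CREATE request with the values Manish listed (collection "Collection_name", config set "ber", shards 20180111_04 and 20180111_05, router field dateandhour), plus a CREATESHARD request for a later hour. The base URL http://localhost:8983/solr, the shard name 20180111_06, and the class and method names are assumptions for the example; the parameter names are the ones from the Collections API documentation. It only builds the strings and does not send anything.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds Collections API URLs as plain strings; class and method names are illustrative. */
final class CollectionsApiUrls {

    // Joins the parameters into a query string, URL-encoding each value
    // (uses the URLEncoder overload that takes a Charset, available from Java 10).
    static String toUrl(String solrBaseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return solrBaseUrl + "/admin/collections?" + query;
    }

    /** CREATE with the implicit router and an explicit router.field. */
    static String createImplicitCollection(String solrBaseUrl, String name, String configName,
                                           String shards, String routerField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "CREATE");
        params.put("name", name);
        params.put("collection.configName", configName);
        params.put("router.name", "implicit");
        params.put("router.field", routerField); // not "routerField"
        params.put("shards", shards);
        params.put("replicationFactor", "1");
        params.put("maxShardsPerNode", "1");
        return toUrl(solrBaseUrl, params);
    }

    /** CREATESHARD adds a named shard to an implicit-router collection. */
    static String createShard(String solrBaseUrl, String collection, String shard) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "CREATESHARD");
        params.put("collection", collection);
        params.put("shard", shard);
        return toUrl(solrBaseUrl, params);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr"; // assumed host and port
        System.out.println(createImplicitCollection(
                base, "Collection_name", "ber", "20180111_04,20180111_05", "dateandhour"));
        System.out.println(createShard(base, "Collection_name", "20180111_06"));
    }
}
```

After sending the CREATE request, the collection's entry in the cloud tree view should list the implicit router together with the router field.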

Best,
Erick
