Insert documents to a particular shard

sambasivarao giddaluri
Hi All,
I am running Solr in cloud mode locally with 2 shards and 2 replicas each,
on ports 8983 and 7574, and I am trying to figure out how to insert a
document into a particular shard. I read about the implicit and compositeId
routers, but I don't think they will work for my use case.

shard1 :  http://192.168.0.112:8983/family_shard1_replica_n1
 http://192.168.0.112:7574/family_shard1_replica_n2

shard2:   http://192.168.0.112:8983/family_shard2_replica_n3
 http://192.168.0.112:7574/family_shard2_replica_n4

We have documents with a parent-child relationship, but flattened out two
levels down and referencing each other.
Family schema documents:
{
"Id":"1",
"document_type":"parent",
"name":"John"
}
{
"Id":"2",
"document_type":"child",
"parentId":"1",
"name":"Rodney"
}
{
"Id":"3",
"document_type":"child",
"parentId":"1",
"name":"George"
}
{
"Id":"4",
"document_type":"grandchild",
"parentId":"1",
"childIdId":"2",
"name":"David"
}
We have complex queries that rely on the graph query parser, but the graph
query parser does not work on SolrCloud collections with multiple shards. I
am trying to develop logic so that whenever a document is inserted or
updated, it is saved in the same shard as its parent doc. That way the
graph query works, because all of a family's documents are in the same
shard.
Approach:
1) If a new child/grandchild is being inserted, get the parent doc's shard
details, add them to the document in a field (e.g. parentshard), and save
the doc in that shard.
2) If a document is being updated, check whether the parentshard field
exists; if so, send the update to the same shard.
But all these checks will increase response time. Currently our development
is done in cloud mode with a single shard, using SolrJ to save the data.
Also, I am unable to figure out the request for updating a doc in a
particular shard.

Any suggestions will help.

Thanks in Advance
sam

Re: Insert documents to a particular shard

Jörn Franke
You are trying to achieve data locality by having parents and children in the same shard?
Does document routing address it?

https://lucene.apache.org/solr/guide/8_5/shards-and-indexing-data-in-solrcloud.html#document-routing
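As a rough illustration of what document routing could look like with the default compositeId router: the unique key gets a "shardkey!" prefix, and all documents sharing that prefix hash to the same shard. The sketch below is only an assumption about how it might apply to the sample family documents. It assumes "Id" is the unique key and that "parentId" always holds the root parent's id, as in the sample data. The helper name routed_id is made up for the example.

```python
def routed_id(doc):
    """Prefix the document id with its family root id, e.g. '1!4'."""
    # A parent is the root of its own family; children and grandchildren
    # carry the root's id in parentId (as in the sample documents).
    root = doc.get("parentId", doc["Id"])
    return f"{root}!{doc['Id']}"


family = [
    {"Id": "1", "document_type": "parent", "name": "John"},
    {"Id": "2", "document_type": "child", "parentId": "1", "name": "Rodney"},
    {"Id": "4", "document_type": "grandchild", "parentId": "1",
     "childIdId": "2", "name": "David"},
]

# Copies of the documents with the routed id, ready to be sent to Solr.
routed = [dict(doc, Id=routed_id(doc)) for doc in family]

for doc in routed:
    print(doc["Id"], doc["document_type"])
```

One thing to watch with this approach: once the ids carry a prefix, any field that is matched against "Id" (for example in a graph traversal) would need the same prefix, or the original id would have to be kept in a separate field.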


On a side note, I don’t know your complete use case, but have you explored streaming expressions for graph traversal?

https://lucene.apache.org/solr/guide/8_5/graph-traversal.html
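For reference, a small sketch of what such a graph traversal expression might look like for the sample data. It builds a nodes() expression that starts from parent id "1", searches the parentId field for that value, and gathers the Id of every matching document. The collection name "family", the localhost URL, and the helper name nodes_expr are assumptions for the example. The request itself is only prepared, not sent.

```python
from urllib.parse import urlencode


def nodes_expr(collection, root_value, walk_field, gather_field):
    """Build a nodes() streaming expression for a one-hop traversal."""
    return (f'nodes({collection}, '
            f'walk="{root_value}->{walk_field}", '
            f'gather="{gather_field}")')


# Gather the Ids of all documents whose parentId is "1".
expr = nodes_expr("family", "1", "parentId", "Id")

# The expression is passed to the /stream handler as the "expr" parameter.
stream_url = "http://localhost:8983/solr/family/stream"  # assumed host/collection
body = urlencode({"expr": expr})

print(expr)
print(stream_url + "?" + body)
```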


Re: Insert documents to a particular shard

Jörn Franke
Hint: you can easily try out streaming expressions in the admin UI

Re: Insert documents to a particular shard

sambasivarao giddaluri
Thanks, Jörn, for your suggestions.

It was a sample schema; each document_type will have more fields.
1) Yes, I have explored graph traversal with gatherNodes using streaming
expressions, but we found a few issues.
Example: get the parent doc based on a grandchild doc filter.
Graph traversal:
{!graph from=parentId to=parentId traversalFilter='document_type:parent'
returnRoot=false}(name:David AND document_type:grandchild)
This request returns all the fields of the parent doc, but with gatherNodes
I can gather only a single field of the parent doc and then have to run a
second query to get all the fields. We also need pagination, which streams
do not support.
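To make the pagination point concrete, here is a minimal sketch of how the graph query above could be paged through the regular select handler using the standard start and rows parameters. The helper name select_params and the page numbering (starting at 0) are assumptions for the example, and it presumes the whole family lives in one shard so that the graph query can resolve it.

```python
from urllib.parse import urlencode, parse_qs

# The graph query from the post, as a single string.
GRAPH_Q = ("{!graph from=parentId to=parentId "
           "traversalFilter='document_type:parent' returnRoot=false}"
           "(name:David AND document_type:grandchild)")


def select_params(q, page, page_size):
    """URL-encode a query plus start/rows for the given zero-based page."""
    return urlencode({"q": q, "start": page * page_size, "rows": page_size})


# Third page of ten results each: start=20, rows=10.
params = select_params(GRAPH_Q, page=2, page_size=10)
print("/solr/family/select?" + params)
```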


2) I tried document routing with the implicit router, and it might work for
us, but I have to explore further what happens when we split the shards.
Example:
curl 'localhost:8983/solr/admin/collections?action=CREATE&name=family&router.name=implicit&router.field=rfield&collection.configName=base-config&shards=shard1,shard2&maxShardsPerNode=2&numShards=1&replicationFactor=2'

   - When inserting the parent doc, I can randomly pick one of the shards
   (shard1 or shard2) for the rfield.
   - When inserting any child or grandchild doc, I use the parent doc's
   rfield to keep them in the same shard.
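
The two steps above could be sketched roughly as follows. This is only an illustration, not tested against a cluster: it swaps the random pick for a stable hash of the parent id, so that re-indexing the same parent always yields the same shard. The shard names come from the CREATE call above, and the helper names pick_shard and with_route are made up for the example.

```python
import zlib

SHARDS = ["shard1", "shard2"]  # shard names from the CREATE call


def pick_shard(root_id, shards=SHARDS):
    """Pick a shard for a new parent from a stable hash of its id."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return shards[zlib.crc32(root_id.encode("utf-8")) % len(shards)]


def with_route(doc, parent=None):
    """Return a copy of doc with rfield set for the implicit router."""
    if parent is not None:
        # Children and grandchildren inherit the parent's shard.
        rfield = parent["rfield"]
    else:
        rfield = pick_shard(doc["Id"])
    return dict(doc, rfield=rfield)


parent = with_route({"Id": "1", "document_type": "parent", "name": "John"})
child = with_route({"Id": "2", "document_type": "child",
                    "parentId": "1", "name": "Rodney"}, parent)
grandchild = with_route({"Id": "4", "document_type": "grandchild",
                         "parentId": "1", "childIdId": "2",
                         "name": "David"}, parent)

print(parent["rfield"], child["rfield"], grandchild["rfield"])
```

The extra lookup of the parent doc is still needed when a child arrives on its own. If the parent id is available on every child (as parentId is here), calling pick_shard on the parent id directly would avoid that lookup.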

Regards
sam

