[jira] [Created] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

[jira] [Created] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
Documents dropping when using DistributedUpdateProcessor
--------------------------------------------------------

                 Key: SOLR-3001
                 URL: https://issues.apache.org/jira/browse/SOLR-3001
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.0
         Environment: Windows 7, Ubuntu
            Reporter: Rafał Kuć


I have a problem with distributed indexing in the solrcloud branch. I've set up a cluster with three Solr servers and I'm using DistributedUpdateProcessor to do the distributed indexing. What I've noticed is that when indexing with StreamingUpdateSolrServer or CommonsHttpSolrServer with a queue or list that holds more than one document, documents seem to be dropped. I ran tests that tried to index 450k documents. If I sent the documents one by one, the indexing executed properly and the three Solr instances held 450k documents when summed up. However, when I tried to add documents in batches (for example with StreamingUpdateSolrServer and a queue of 1000), the shard I was sending the documents to held a minimal number of documents (about 100) while the other shards held about 150k documents each.
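For illustration, a minimal model of hash-based document routing (a deliberate simplification, not the actual Solr hashing code): every document id maps to exactly one shard, so a batch sent to any node should be split across the shards without any document going missing.

```java
import java.util.*;

public class HashRoutingSketch {
    // Map a document id to one of numShards shards.
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int shardFor(String id, int numShards) {
        return Math.floorMod(id.hashCode(), numShards);
    }

    // Split a batch of ids by owning shard, as a distributed update would.
    static Map<Integer, List<String>> partition(List<String> ids, int numShards) {
        Map<Integer, List<String>> byShard = new HashMap<>();
        for (String id : ids) {
            byShard.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 450_000; i++) ids.add("doc-" + i);
        Map<Integer, List<String>> byShard = partition(ids, 3);
        int total = 0;
        for (List<String> docs : byShard.values()) total += docs.size();
        // Summed across shards, no documents may go missing.
        System.out.println(total == ids.size());
    }
}
```

The bug report above is exactly a violation of this invariant: batches sent to shard1 lose the documents that hash to the receiving shard itself.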

Each Solr was started with a single core and in Zookeeper mode. An example solr.xml file:
{noformat}
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
 <cores defaultCoreName="collection1" adminPath="/admin/cores" zkClientTimeout="10000" hostPort="8983" hostContext="solr">
  <core shard="shard1" instanceDir="." name="collection1" />
 </cores>
</solr>
{noformat}

The solrconfig.xml file on each of the shards contained the following entries:
{noformat}
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
 <lst name="defaults">
  <str name="update.chain">distrib</str>
 </lst>
</requestHandler>
{noformat}

{noformat}
<updateRequestProcessorChain name="distrib">
 <processor class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory" />
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
{noformat}

I found a solution, but I don't know if it is the proper one. I've modified the code responsible for handling the replicas in {{private List<String> setupRequest(int hash)}} of {{DistributedUpdateProcessorFactory}}.

I've added the following code:
{noformat}
if (urls == null) {
 urls = new ArrayList<String>(1);
 urls.add(leaderUrl);  
} else {
 if (!urls.contains(leaderUrl)) {
  urls.add(leaderUrl);  
 }
}
{noformat}

after:
{noformat}
urls = getReplicaUrls(req, collection, shardId, nodeName);
{noformat}
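Taken together, the intent of the added guard is simply "make sure the leader URL is always in the forward list, even when getReplicaUrls returns nothing". A self-contained sketch of just that logic (simplified names, not the actual Solr classes):

```java
import java.util.*;

public class LeaderGuard {
    // Ensures the forward list exists and contains the leader URL exactly once.
    static List<String> ensureLeader(List<String> urls, String leaderUrl) {
        if (urls == null) {
            urls = new ArrayList<>(1);
        }
        if (!urls.contains(leaderUrl)) {
            urls.add(leaderUrl);
        }
        return urls;
    }

    public static void main(String[] args) {
        // No replicas at all: the leader must still receive the documents.
        System.out.println(ensureLeader(null, "http://host1:8983/solr"));
        // Replicas present but leader missing: the leader is appended.
        List<String> replicas = new ArrayList<>(Arrays.asList("http://host2:8983/solr"));
        System.out.println(ensureLeader(replicas, "http://host1:8983/solr"));
    }
}
```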

If this is the proper approach I'll be glad to provide a patch with the modification.

--
Regards
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178823#comment-13178823 ]

Mark Miller commented on SOLR-3001:
-----------------------------------

I fixed a bug around this (adding more than one doc per request) a week or two ago. I'll check around this again, but it may be best to just try with an updated version.
               



[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178827#comment-13178827 ]

Rafał Kuć commented on SOLR-3001:
---------------------------------

Thanks for the information, Mark. That may be the case, as the solrcloud checkout I'm using is about 2-3 weeks old. I'll verify it as soon as I can.
               



[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178840#comment-13178840 ]

Mark Miller commented on SOLR-3001:
-----------------------------------

Also, it's worth noting that it's been a while since you have needed to define your own update chain - the distributed update processor is now part of the default chain. You can of course still define a custom chain, but there's no need to.
               



[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179400#comment-13179400 ]

Rafał Kuć commented on SOLR-3001:
---------------------------------

Mark, I've tried the newest solrcloud branch and I'm afraid the problem still exists. To test, I indexed 425543 documents using StreamingUpdateSolrServer (10000 queue size, 3 threads). The documents were sent to shard1. After indexing ended, the shards held the following numbers of documents:

shard1: 5 documents
shard2: 142424 documents
shard3: 141275 documents

and a query like q=*:*&distrib=true returns 283704 documents in total. So Solr dropped about 141839 documents, which should probably have ended up in the first shard, the one I'm sending the documents to.
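A quick check of those numbers (all values taken from the counts above):

```java
public class CountCheck {
    public static void main(String[] args) {
        int indexed = 425543;                      // documents sent to shard1
        int shard1 = 5, shard2 = 142424, shard3 = 141275;
        int total = shard1 + shard2 + shard3;
        System.out.println(total);                 // 283704, matching the distributed query
        System.out.println(indexed - total);       // 141839 documents dropped
    }
}
```

The shortfall equals almost exactly one shard's expected share, which fits the theory that the documents hashing to the receiving shard are the ones being lost.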

If I send the documents one by one using CommonsHttpSolrServer, the numbers are as follows:
shard1: 141725 documents
shard2: 142474 documents
shard3: 141344 documents

I'm using Solr version solr-spec-version 4.0.0.2012.01.04.10.42.06 (from the Solr admin). I ran the test both with update.chain set and without it; the behavior was the same both times.

Btw, indexing is now much, much faster with distributed indexing, since the shards are getting documents in batches.
               



[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180120#comment-13180120 ]

Mark Miller commented on SOLR-3001:
-----------------------------------

Thanks for the verification Rafal - I'll dig into this tomorrow.
               



[jira] [Updated] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-3001:
------------------------------

    Fix Version/s: 4.0
   



[jira] [Assigned] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned SOLR-3001:
---------------------------------

    Assignee: Mark Miller
   

> Documents droping when using DistributedUpdateProcessor
> -------------------------------------------------------
>
>                 Key: SOLR-3001
>                 URL: https://issues.apache.org/jira/browse/SOLR-3001
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>         Environment: Windows 7, Ubuntu
>            Reporter: Rafał Kuć
>            Assignee: Mark Miller
>             Fix For: 4.0
>
>
> I have a problem with distributed indexing in solrcloud branch. I've setup a cluster with three Solr servers. I'm using DistributedUpdateProcessor to do the distributed indexing. What I've noticed is when indexing with StreamingUpdateSolrServer or CommonsHttpSolrServer and having a queue or a list which have more than one document the documents seems to be dropped. I did some tests which tried to index 450k documents. If I was sending the documents one by one, the indexing was properly executed and the three Solr instances was holding 450k documents (when summed up). However if when I tried to add documents in batches (for example with StreamingUpdateSolrServer and a queue of 1000) the shard I was sending the documents to had a minimum number of documents (about 100) while the other shards had about 150k documents.
> Each Solr was started with a single core and in Zookeeper mode. An example solr.xml file:
> {noformat}
> <?xml version="1.0" encoding="UTF-8" ?>
> <solr persistent="true">
>  <cores defaultCoreName="collection1" adminPath="/admin/cores" zkClientTimeout="10000" hostPort="8983" hostContext="solr">
>   <core shard="shard1" instanceDir="." name="collection1" />
>  </cores>
> </solr>
> {noformat}
> The solrconfig.xml file on each of the shard consisted of the following entries:
> {noformat}
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>  <lst name="defaults">
>   <str name="update.chain">distrib</str>
>  </lst>
> </requestHandler>
> {noformat}
> {noformat}
> <updateRequestProcessorChain name="distrib">
>  <processor class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory" />
>  <processor class="solr.LogUpdateProcessorFactory" />
>  <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
> {noformat}
> I found a solution, but I don't know if it is a proper one. I've modified the code that is responsible for handling the replicas in:
> {{private List<String> setupRequest(int hash)}} of {{DistributedUpdateProcessorFactory}}
> I've added the following code:
> {noformat}
> if (urls == null) {
>  urls = new ArrayList<String>(1);
>  urls.add(leaderUrl);  
> } else {
>  if (!urls.contains(leaderUrl)) {
>   urls.add(leaderUrl);  
>  }
> }
> {noformat}
> after:
> {noformat}
> urls = getReplicaUrls(req, collection, shardId, nodeName);
> {noformat}
> If this is the proper approach, I'll be glad to provide a patch with the modification.
> --
> Regards
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
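For illustration, the essence of the proposed change — guaranteeing that the leader's URL always ends up in the forwarding list — can be sketched in plain Java. `LeaderUrlFix` and `ensureLeader` are illustrative names only, not the actual Solr code:

```java
import java.util.ArrayList;
import java.util.List;

public class LeaderUrlFix {

    // Mirrors the proposed patch: after fetching the replica URLs for a
    // shard, make sure the leader's URL is present exactly once, creating
    // the list when the shard has no other replicas.
    public static List<String> ensureLeader(List<String> urls, String leaderUrl) {
        if (urls == null) {
            urls = new ArrayList<String>(1);
        }
        if (!urls.contains(leaderUrl)) {
            urls.add(leaderUrl);
        }
        return urls;
    }

    public static void main(String[] args) {
        // No replicas at all: the leader itself still receives the documents.
        System.out.println(ensureLeader(null, "http://host1:8983/solr"));
        // Replicas present: the leader is added alongside them, exactly once.
        List<String> urls = new ArrayList<String>();
        urls.add("http://host2:8983/solr");
        System.out.println(ensureLeader(urls, "http://host1:8983/solr"));
    }
}
```

Without something like this, a missing leader URL would mean batched updates forwarded between shards never reach the leader itself, which would match the symptom of the receiving shard holding almost no documents.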

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180209#comment-13180209 ]

Mark Miller commented on SOLR-3001:
-----------------------------------

I've committed a fix to the solrcloud branch.
               

> Documents droping when using DistributedUpdateProcessor
> -------------------------------------------------------
>
>                 Key: SOLR-3001
>                 URL: https://issues.apache.org/jira/browse/SOLR-3001
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>         Environment: Windows 7, Ubuntu
>            Reporter: Rafał Kuć
>            Assignee: Mark Miller
>             Fix For: 4.0
>
>


[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180213#comment-13180213 ]

Mark Miller commented on SOLR-3001:
-----------------------------------

bq. Btw. The fact is the indexing is much, much faster right now using distributed indexing, as the shards are getting documents in batches.

You mean faster after you updated to the latest rev? There was no buffering originally, so even if you were streaming, it would use the HttpCommons-based server and send the docs around one by one. Late last week I added the buffering, though. Right now it buffers 10 docs per target shard, but I was thinking about whether or not we should make that configurable and/or raise it.
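The per-shard buffering described here can be sketched as follows. `ShardBuffer`, its callback, and the string "documents" are purely illustrative (the real code streams Solr documents over HTTP); the batch size of 10 is taken from the comment above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class ShardBuffer {
    // Documents are collected per target shard and forwarded in batches
    // of `size` instead of one request per document.
    private final int size; // e.g. 10, the value mentioned in the comment
    private final Map<String, List<String>> buffers = new HashMap<>();
    private final BiConsumer<String, List<String>> sender; // shard -> batch

    public ShardBuffer(int size, BiConsumer<String, List<String>> sender) {
        this.size = size;
        this.sender = sender;
    }

    public void add(String shard, String doc) {
        List<String> buf = buffers.computeIfAbsent(shard, s -> new ArrayList<>());
        buf.add(doc);
        if (buf.size() >= size) {
            sender.accept(shard, new ArrayList<>(buf));
            buf.clear();
        }
    }

    public void flush() { // send any remainder, e.g. on commit/close
        buffers.forEach((shard, buf) -> {
            if (!buf.isEmpty()) {
                sender.accept(shard, new ArrayList<>(buf));
                buf.clear();
            }
        });
    }

    public static void main(String[] args) {
        ShardBuffer buf = new ShardBuffer(10,
            (shard, batch) -> System.out.println(shard + " <- batch of " + batch.size()));
        for (int i = 0; i < 25; i++) {
            buf.add("shard1", "doc" + i);
        }
        buf.flush(); // sends the final partial batch of 5
    }
}
```

Making `size` configurable, as discussed below in the thread, would just mean reading it from the processor factory's init arguments instead of hard-coding it.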
               



[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180487#comment-13180487 ]

Rafał Kuć commented on SOLR-3001:
---------------------------------

Thanks Mark, it works flawlessly right now.

And about the update - yes, I was comparing the two versions: the one without buffering and the newest one. The newest version of solrcloud is much faster when indexing documents to shards. If you ask me, I would like to be able to set the size of the buffer :)
               



[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180886#comment-13180886 ]

Otis Gospodnetic commented on SOLR-3001:
----------------------------------------

+1 for controlling this.

Is this issue resolved now?

               



[jira] [Commented] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181198#comment-13181198 ]

Rafał Kuć commented on SOLR-3001:
---------------------------------

Yes Otis, closing.
               



[jira] [Resolved] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rafał Kuć resolved SOLR-3001.
-----------------------------

    Resolution: Fixed

Fixed.
               



[jira] [Closed] (SOLR-3001) Documents dropping when using DistributedUpdateProcessor

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rafał Kuć closed SOLR-3001.
---------------------------

