how to update billions of docs

Mohsin Beg Beg
Hi,

I have a requirement to replace the value of a field in 100B's of docs in 100's of cores.
The field is multiValued=false docValues=true type=StrField stored=true indexed=true.

Atomic Updates performance is on the order of 5K docs per sec per core in Solr 5.3 (the other fields are quite big).
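
For context, the updates look roughly like this (a minimal SolrJ sketch; the core URL, field, and values are placeholders):

   import java.util.Collections;
   import org.apache.solr.client.solrj.SolrClient;
   import org.apache.solr.client.solrj.impl.HttpSolrClient;
   import org.apache.solr.common.SolrInputDocument;

   public class AtomicSetSketch {
     public static void main(String[] args) throws Exception {
       SolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");
       SolrInputDocument doc = new SolrInputDocument();
       doc.addField("id", "doc-1");  // uniqueKey of an existing doc
       doc.addField("fieldx", Collections.singletonMap("set", "newValue"));  // atomic "set"
       client.add(doc);
       client.commit();
       client.close();
     }
   }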

Any suggestions?

-Mohsin

Re: how to update billions of docs

sudsport s
I think there are no in-place updates in Solr; that means an update behaves
like an insert plus marking the old version deleted, so the behavior should be
the same as indexing billions of docs.


Re: how to update billions of docs

Toke Eskildsen
In reply to this post by Mohsin Beg Beg
Mohsin Beg Beg <[hidden email]> wrote:
> I have a requirement to replace a value of a field in 100B's of docs
> in 100's of cores. The field is multiValued=false docValues=true
> type=StrField stored=true indexed=true.

If this is just a simple one-time search-replace, then don't update the value in the index. Instead, post-process the search result and do the transformation there. If you want it to happen inside of Solr, you can use the XSLT Response Writer: https://cwiki.apache.org/confluence/display/solr/Response+Writers#ResponseWriters-TheXSLTResponseWriter
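
Client-side, the post-processing could look roughly like this (a minimal SolrJ sketch; the field name and value mapping are placeholders):

   import java.util.Collections;
   import java.util.Map;
   import org.apache.solr.client.solrj.SolrClient;
   import org.apache.solr.client.solrj.SolrQuery;
   import org.apache.solr.client.solrj.impl.HttpSolrClient;
   import org.apache.solr.client.solrj.response.QueryResponse;
   import org.apache.solr.common.SolrDocument;

   public class PostProcessSketch {
     public static void main(String[] args) throws Exception {
       Map<String, String> oldToNew = Collections.singletonMap("oldValue", "newValue");
       SolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");
       QueryResponse rsp = client.query(new SolrQuery("*:*"));
       for (SolrDocument d : rsp.getResults()) {
         String v = (String) d.getFieldValue("fieldx");
         if (oldToNew.containsKey(v)) {
           d.setField("fieldx", oldToNew.get(v));  // rewrite on the way out
         }
       }
       client.close();
     }
   }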

- Toke Eskildsen

RE: how to update billions of docs

kkrugler
In reply to this post by Mohsin Beg Beg
As others noted, currently updating a field means deleting and inserting the entire document.

Depending on how you use the field, you might be able to create another core/container with that one field (plus the key field), and use join support.
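
For illustration, the query side might then look something like this (a hypothetical sketch; core and field names are made up):

   import org.apache.solr.client.solrj.SolrQuery;

   // the main core keeps its big fields; a small "fieldcore" holds only the
   // uniqueKey plus the frequently-changing field, so only it gets reindexed
   SolrQuery q = new SolrQuery("{!join fromIndex=fieldcore from=id to=id}fieldx:oldValue");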

Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an improvement, which looks like it's in the 5.x code line, though I don't see a fix version.

-- Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Re: how to update billions of docs

Jack Krupansky-3
It would be nice to have a wiki/doc for "Bulk Field Update" that listed all
of these techniques and tricks.

And, of course, it would be so much better to have an explicit Lucene
feature for this. It could work in the background like merge and process
one segment at a time as efficiently as possible.

It could have several modes:

1. Set a field of all documents to an explicit value.
2. Set a field of documents matching a query to an explicit value.
3. Increment a numeric field by n.
4. Add a new field to all documents, or maybe by query.
5. Delete an existing field for all documents.
6. Delete a field value for all documents or for a specified query.


-- Jack Krupansky


Re: how to update billions of docs

Ishan Chattopadhyaya
Hi Mohsin,
There's some work in progress on in-place updates to docValues fields:
https://issues.apache.org/jira/browse/SOLR-5944. Can you try the latest
patch there (or ping me if you need a git branch)?
It would be nice to know how fast the updates go for your use case with that
patch. Please note that for that patch, both the version field and the
updated field need to have stored=false, indexed=false, docValues=true.
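
In schema.xml terms, something like the following (a sketch; the field name is a placeholder and the type names assume the stock schema):

   <field name="fieldx" type="string" stored="false" indexed="false" docValues="true"/>
   <field name="_version_" type="long" stored="false" indexed="false" docValues="true"/>
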
Regards,
Ishan



Re: how to update billions of docs

Jack Krupansky-3
That's another great example of a mode that Bulk Field Update (my mythical
feature) needs: switching a list of fields from stored to docValues.

And maybe even the opposite, since there are scenarios in which docValues is
worse than stored, and you would only find that out after indexing...
billions of documents.

Being able to switch the indexed mode of a field (or list of fields) is also
a mode needed for bulk update (reindex).


-- Jack Krupansky


Re: how to update billions of docs

Mohsin Beg Beg
In reply to this post by Mohsin Beg Beg

An update on how I ended up implementing the requirement, in case it helps others. There is a lot of other code I did not include, but the general logic is below.

While performance is still not great, it is 10x faster than atomic updates (because RealTimeGetComponent.getInputDocument() is not needed).


1. Wrote an update request handler:
   /myupdater?q=*:*&sort=fieldx+desc&fl=fieldx,fieldy&stream.file=exampledocs/oldvalueToNewValue.properties&update.chain=myprocessor
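
   The stream.file is a plain Java properties file; its contents would look
   something like this (values made up for illustration):

   # one oldValue=newValue mapping per line
   oldValueA=newValueA
   oldValueB=newValueB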


2. In the handler, read the old->new value map from the content stream and invoke the /export handler with the query params:

   // delegate to /export so all matching docs stream back in sorted order
   SolrRequestHandler handler = core.getRequestHandler("/export");
   core.execute(handler, req, rsp);
   numFound = (Integer) req.getContext().get("totalHits");
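
   Reading the map from the content stream looks roughly like this (a sketch;
   error handling omitted):

   import java.io.Reader;
   import java.util.HashMap;
   import java.util.Map;
   import java.util.Properties;
   import org.apache.solr.common.util.ContentStream;

   Map<String, String> oldValueToNewValuesMap = new HashMap<>();
   for (ContentStream cs : req.getContentStreams()) {
     Properties props = new Properties();
     try (Reader r = cs.getReader()) {
       props.load(r);  // parses the oldValue=newValue lines
     }
     for (String key : props.stringPropertyNames()) {
       oldValueToNewValuesMap.put(key, props.getProperty(key));
     }
   }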


3. Iterate over the /export handler response, similar to the SortingResponseWriter.write() method:

   // the export handler leaves its per-segment hit bitsets in the request context
   FixedBitSet[] sets = (FixedBitSet[]) req.getContext().get("export");
   List<LeafReaderContext> leaves = req.getSearcher().getTopReaderContext().leaves();
   for (int i = 0; i < leaves.size(); i++) {
     DocIdSetIterator it = new BitSetIterator(sets[i], 0);
     int docId;
     while ((docId = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
       // get the stored fields for this segment-local doc id
       Document luceneDoc = leaves.get(i).reader().document(docId);

       // update the lucene doc with new values
       updateDoc(luceneDoc, oldValueToNewValuesMap);

       // post the lucene doc to a linked blocking queue
       queue.add(luceneDoc);
     }
   }
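
   The updateDoc() helper is essentially this (a simplified sketch; "fieldx"
   is a placeholder and the real code handles more cases):

   import java.util.Map;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.StoredField;

   void updateDoc(Document luceneDoc, Map<String, String> oldToNew) {
     String oldVal = luceneDoc.get("fieldx");
     String newVal = (oldVal == null) ? null : oldToNew.get(oldVal);
     if (newVal != null) {
       // swap the stored value; step 4 reindexes the doc from stored fields
       luceneDoc.removeField("fieldx");
       luceneDoc.add(new StoredField("fieldx", newVal));
     }
   }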


4. Have N threads waiting on the queue; each converts the doc and invokes the UpdateRequestProcessor chain named by the update.chain param:

   IndexSchema schema = req.getSchema();
   while (true) {
     Document luceneDoc = queue.take();
     SolrInputDocument doc = toSolrInputDocument(luceneDoc, schema);

     AddUpdateCommand cmd = new AddUpdateCommand(req);
     cmd.solrDoc = doc;

     // overwrite=false skips the uniqueKey lookup/delete, version 0 bypasses
     // optimistic concurrency, and the stale _version_ field is dropped
     cmd.overwrite = false;
     cmd.setVersion(0);
     doc.removeField("_version_");

     // post doc
     updateProcessor.processAdd(cmd);
   }


-Mohsin

