solr configuration issue

solr configuration issue

Danilo Tomasoni
Hello all,

We have a Solr 7.3.1 instance with around 40 million documents in it.

After the initial one-shot import, we found an issue in the import
software. We fixed it and re-ran the import, which atomically updates
(with set) the existing documents.

The import is divided into processes; each process is responsible for
updating a portion of the documents.

For every document processed, a soft commit is performed to make the
update visible to other concurrent update processes.

At the end, every process performs a hard commit.

The issue I have is that the hard commits never terminate (one has been
running for more than 3 days), and the number of segments and the size
of the Solr index keep growing.

In the past, when the commit finished, I used to optimize the index
incrementally (from 40 segments to 39, to 38 and so on), but that
process is also very slow.


Any advice on how to speed up things?

I checked system usage on the Solr machine, and neither I/O nor CPU
is heavily used.


Thanks

Danilo

--
Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
[hidden email]
http://www.cosbi.eu
 
As for the European General Data Protection Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data, we inform you that all the data we possess are object of treatment in the respect of the normative provided for by the cited GDPR.
It is your right to be informed on which of your data are used and how; you may ask for their correction, cancellation or you may oppose to their use by written request sent by recorded delivery to The Microsoft Research – University of Trento Centre for Computational and Systems Biology Scarl, Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
Please don't print this e-mail unless you really need to


Re: solr configuration issue

Paras Lehana
Hi Danilo,

> We have a solr 7.3.1 instance with around 40 MLN documents in it.


I guess you are hard committing only after a few million docs are indexed,
right? I suggest that you don't avoid hard commits entirely. Set *autoCommit*
(not autoSoftCommit) at around half a million documents (that's from my
experience with a core of 250 million documents). Obviously, you need to
find the sweet spot yourself, but you can start with this number.

Also, play with the values of *IndexConfig*
<https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html>
(merge factor, segment size, maxBufferedDocs, merge policies). We, at
Auto-Suggest, also do atomic updates daily, and changing the merge factor
specifically gave us a boost of ~4x during indexing. With our current
configuration, our core atomically updates ~423 documents per second. I
also run a few core optimizations during the full indexing.

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*


Re: solr configuration issue

Shawn Heisey-2
In reply to this post by Danilo Tomasoni
On 10/24/2019 1:52 AM, Danilo Tomasoni wrote:
> For every document processed, a soft commit is performed to make the
> update visible to other concurrent update processes.

This is not the way to do things.  Doing a commit after every document
means that Solr will spend more time doing commits than anything else.

Documents should be indexed in batches.

https://lucidworks.com/post/really-batch-updates-solr-2/
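
As a rough illustration of batching (a sketch, not taken from the linked
article): the snippet below groups atomic "set" updates into batches of
1,000 and builds one JSON body per batch, each meant for a single POST to
the /update handler. The document ids, the title_s field, and the batch
size are made-up placeholders, and nothing is actually sent to Solr.

```python
import json
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def atomic_set(doc_id, field, value):
    """Build one atomic-update document that overwrites `field` with `value`."""
    return {"id": doc_id, field: {"set": value}}


def update_bodies(updates, batch_size=1000):
    """Yield one JSON request body per batch of update documents."""
    for chunk in chunked(updates, batch_size):
        yield json.dumps(chunk)


# Hypothetical data: 2,500 documents whose `title_s` field must be rewritten.
updates = (atomic_set(f"doc-{i}", "title_s", f"fixed title {i}") for i in range(2500))
bodies = list(update_bodies(updates, batch_size=1000))
print(len(bodies))  # 3 requests instead of 2,500
```

Each body would go out as one HTTP request, with no commit parameter, so
that commit timing is left to the server-side settings.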

> Every process at the end will perform an hard commit.

Use autoCommit to do hard commits.  I would suggest NOT using maxDocs,
only use maxTime, and set it to 60000 -- one minute.  Also ensure that
openSearcher is set to false.  Commits that do not open a new searcher
are VERY fast.  These hard commits will not do anything for document
visibility, they are about data durability.

Then you can use autoSoftCommit for change visibility, and not worry
about sending commits in your indexing application.  Again, don't set
maxDocs.  Set maxTime to as long an interval as you can stand.  I would
suggest a minimum of two minutes, but make it longer if you can.
Something like 5 or 10 minutes.

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
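
For illustration only, the settings described above could be written as a
Config API request body. This sketch assumes the "set-property" command of
Solr's Config API (POSTed to /solr/<core>/config) and picks 5 minutes for
the soft commit interval. The helper name is hypothetical, and the script
only prints the JSON rather than contacting a server. The same values can
equally be set by hand in solrconfig.xml.

```python
import json


def commit_settings(hard_ms=60_000, soft_ms=300_000):
    """Build a Config API body: frequent hard commits that do not open a
    searcher, plus less frequent soft commits for visibility."""
    return {
        "set-property": {
            # Hard commit every minute, for durability only.
            "updateHandler.autoCommit.maxTime": hard_ms,
            "updateHandler.autoCommit.openSearcher": False,
            # Soft commit every five minutes, for visibility (assumed value).
            "updateHandler.autoSoftCommit.maxTime": soft_ms,
        }
    }


payload = json.dumps(commit_settings(), indent=2)
print(payload)
```

No maxDocs property is set, in line with the advice to rely on maxTime
alone.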

> The issue I have is that hard commits never terminate (it's ongoing by
> more than 3 days) and the number of segments and the solr index will
> grow a lot.

What do you mean by "terminate" here?  I cannot figure this out from
the context.  The only thing I'm aware of that a hard commit is going to
terminate is the current transaction log ... the current log is closed
and the next time a document is indexed, a new one will be created.
Hard commits are the only thing that will close a transaction log.

> In the past when the commit finished I was used to incrementally
> optimize the index (from 40 segments to 39, to 38 and so on)
> but also here the process is very slow.

If you're going to optimize, which we generally recommend NOT doing,
optimize in a single pass.  Optimizing with multiple passes means
reading the index and writing the index multiple times ... and each
forced merge will require significant system resources.  It may not
require them all, but it is significant.
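
To make the difference concrete, here is a small sketch comparing the two
plans. It assumes the optimize and maxSegments parameters of the /update
handler and a final target of one segment; the base URL and core name are
placeholders, and the script only builds request URLs without sending them.

```python
from urllib.parse import urlencode


def optimize_url(base_url, core, max_segments=1):
    """Build the URL for one forced merge down to `max_segments` segments."""
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base_url}/{core}/update?{params}"


BASE = "http://localhost:8983/solr"  # placeholder host
CORE = "mycore"                      # placeholder core name

# Incremental plan from the original post: 40 -> 39 -> 38 -> ... -> 1.
incremental = [optimize_url(BASE, CORE, n) for n in range(39, 0, -1)]
# Single-pass plan: one forced merge straight to the final target.
single_pass = [optimize_url(BASE, CORE, 1)]

print(len(incremental), "forced merges versus", len(single_pass))
print(single_pass[0])
```

Every URL in the incremental list is a separate forced merge that reads and
rewrites part of the index, which is why the single-pass form is preferred
if an optimize is done at all.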

Thanks,
Shawn

Re: solr configuration issue

Erick Erickson
In reply to this post by Danilo Tomasoni
"For every document processed, a soft commit is performed to make the update visible to other concurrent update processes."

Please do not do this! First, Real Time Get will always return the current doc, whether you’ve opened a new reader or not. Second, this is an anti-pattern. I agree with Paras, set your defaults in solrconfig and forget about it.
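
As a hedged sketch of what a Real Time Get lookup looks like: the helper below builds a request URL for the /get handler using its ids parameter. The base URL, core name, and document ids are placeholders, the function name is hypothetical, and no request is actually made.

```python
from urllib.parse import urlencode


def realtime_get_url(base_url, core, doc_ids):
    """Build a Real Time Get URL that fetches the latest version of each id,
    whether or not a new searcher has been opened since the update."""
    params = urlencode({"ids": ",".join(doc_ids)})
    return f"{base_url}/{core}/get?{params}"


url = realtime_get_url("http://localhost:8983/solr", "mycore", ["doc-1", "doc-2"])
print(url)
```

An update process that needs the current state of a document before modifying it could read it this way instead of forcing a soft commit after every write.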

I’d also set the hard commits to something like 15 seconds (openSearcher=false). Or, if you can stand 15 second latency, set openSearcher=true and leave the soft commit set to -1.

Opening a searcher is a heavyweight operation. Doing it after _every_ document is a poor choice. If you absolutely _must_, at least batch your updates up in groups of, say, 1,000 and open a new searcher after that.

Best,
Erick


Re: solr configuration issue

Danilo Tomasoni
Thank you all for your suggestions.

I have now changed my import strategy so that any updates to the same
document happen in different "batches". This way I need only a single
programmatic soft commit at the end of each batch.


On the configuration side, I enabled autoCommit with openSearcher=false
and maxTime=60000 (1 minute).


Hope this will do it.

Another question: is a soft commit sufficient to ensure visibility, or
should I call a commit to ensure that a new searcher is opened?

Does a soft commit automatically open a new searcher?


Thanks


Danilo



Re: solr configuration issue

Shawn Heisey-2
On 10/25/2019 5:44 AM, Danilo Tomasoni wrote:
> Another question, is softCommit sufficient to ensure visibility or
> should I call a commit to ensure a new searcher will be opened?
>
> softCommit automatically opens a new searcher?

There would be little point to doing a soft commit with openSearcher set
to false.  I actually don't even know if you CAN do such a commit, but
there would be no reason to ever do it even if you can.  If you're not
opening a searcher, the performance characteristics say that you might
as well use hard commit for better data durability.  Creating searchers
is the expensive part of a commit.

So in my mind, a soft commit always opens a new searcher.  It exists as
a commit that MIGHT perform faster than a hard commit that opens a new
searcher.  It also might not perform faster ... there are situations in
which it must do just as much work writing to disk as a hard commit.
But it does go faster sometimes, so it's what we recommend for
visibility commits.
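
A small sketch of the two kinds of explicit commit request discussed in
this thread, assuming the softCommit, commit, and openSearcher parameters
of the /update handler. The base URL and core name are placeholders, the
helper name is hypothetical, and the script only builds the URLs.

```python
from urllib.parse import urlencode


def commit_url(base_url, core, soft):
    """Build an explicit commit URL.

    soft=True  -> visibility commit (opens a new searcher).
    soft=False -> durability commit that leaves the current searcher alone.
    """
    if soft:
        params = {"softCommit": "true"}
    else:
        params = {"commit": "true", "openSearcher": "false"}
    return f"{base_url}/{core}/update?{urlencode(params)}"


BASE = "http://localhost:8983/solr"  # placeholder host
print(commit_url(BASE, "mycore", soft=True))
print(commit_url(BASE, "mycore", soft=False))
```

With autoCommit and autoSoftCommit configured on the server, neither
request should normally be needed from the indexing application; the soft
variant is the one that matches a single visibility commit at the end of a
batch.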

Thanks,
Shawn