SolrClient#updateByQuery?


SolrClient#updateByQuery?

Clemens Wyss DEV
SolrClient has the method deleteByQuery (which I make use of when I need to reindex).
#updateByQuery does not exist. What if I want to "update all documents matching a query"?


Thx
Clemens

Re: SolrClient#updateByQuery?

Emir Arnautović
Hi Clemens,
You are thinking too RDBMS. You can use a query to select documents, but how would you provide the updated documents? I guess you could use this approach only for incremental updates or with some scripting language. That is not supported at the moment. The best you can do is select the documents and send the updates as a single bulk.

Also use DBQ with caution - it does not work well with concurrent updates.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
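For what it's worth, a minimal SolrJ sketch of this "select, then send the updates as a single bulk" approach could look like the following. The SolrClient named client, the uniqueKey field id and the field category being set are placeholders, and atomic updates need the schema prerequisites discussed further down the thread:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class BulkUpdateByQuerySketch {
    public static void updateMatching(SolrClient client, String query, String newCategory) throws Exception {
        SolrQuery q = new SolrQuery(query);
        q.setFields("id");          // only the uniqueKey is needed
        q.setRows(1000);            // page size; use cursorMark for very large result sets

        QueryResponse rsp = client.query(q);
        List<SolrInputDocument> updates = new ArrayList<>();
        for (SolrDocument doc : rsp.getResults()) {
            SolrInputDocument update = new SolrInputDocument();
            update.addField("id", doc.getFieldValue("id"));
            // atomic "set" update; requires the schema prerequisites for atomic updates
            update.addField("category", Collections.singletonMap("set", newCategory));
            updates.add(update);
        }
        if (!updates.isEmpty()) {
            client.add(updates);    // one bulk request
            client.commit();
        }
    }
}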




AW: SolrClient#updateByQuery?

Clemens Wyss DEV
Thx Emir!

> You are thinking too RDBMS
maybe the DBQ "misled" me 😉

> The best you can do is select and send updates as a single bulk
how can I do "In-Place Updates" (https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html#UpdatingPartsofDocuments-In-PlaceUpdates) from/through SolrJ?

> Also use DBQ with caution - it does not work well with concurrent updates
we "prevent" this through sequentialization (per core)

Why do I want to do all these (dumb) things? The context is as follows:
when a document is deleted in an index/core, this deletion is not immediately reflected in the search results. Deletions are not really NRT (or has this changed?). Until now we "solved" this brutally by forcing a commit (with "expunge deletes"), until we noticed that this results in quite a "heavy load", to say the least.
Now I have the idea to add a "deleted" flag to all documents, which is filtered on in all queries.
When it comes to deletions, I would update the document's deleted flag and then effectively delete it. For a single deletion this is OK, but what if I need to re-index?


Re: SolrClient#updateByQuery?

Walter Underwood
Use a filter query to filter out all the documents marked deleted.

Don’t use “expunge deletes”; it does more than you want because it forces a merge. Just commit after sending the delete.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)
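As a rough sketch of both points, assuming a boolean "deleted" flag field and a SolrClient named client (both placeholders):

import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

public class FilterDeletedSketch {
    // Query-time: hide documents that carry the "deleted" flag.
    public static SolrQuery searchQuery(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("-deleted:true");   // filter queries are cached and cheap to reuse
        return q;
    }

    // Delete-time: plain deleteById followed by a commit; no expungeDeletes, no optimize.
    public static void delete(SolrClient client, List<String> ids) throws Exception {
        client.deleteById(ids);
        client.commit();   // opens a new searcher; deleted docs disappear from results
    }
}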



Re: SolrClient#updateByQuery?

Erick Erickson
Wait. What do you mean by: "... this deletion is not immediately
reflected in the search results..."? Like all other index operations
this change won't be "visible" until the next commit, but
expungeDeletes is (or should be) totally unnecessary. And it's very costly,
for reasons you might not be aware of; see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

If you commit after docs are deleted and _still_ see them in search
results, that's a JIRA. That should simply _not_ be the case.

Do note, however, that DBQ can take quite a long time to run. Is it
possible that the delete isn't complete yet for some reason?

As for why there's no "Update By Query": it's actually fairly
awful to deal with. Imagine, in Solr's case, what an
UpdateByQuery with set fieldX=32, q=*:* would have to do. In order for that to work:

1> It's possible that the in-place update of single-valued docValues fields
(which doesn't need all fields stored) could be made to work with that. That
functionality is so new, though, that it hasn't been addressed (and
I'm not totally sure it's possible).

Assuming the case isn't <1>, it would require
2> the use of atomic updates under the covers, meaning:
  2a> all fields must be stored (a prerequisite for Atomic Updates)
  2b> each and every document would be completely re-indexed.

Inverted indexes don't lend themselves well to bulk updates....
FWIW,
Erick
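For reference, an in-place/atomic "set" sent through SolrJ would presumably look like the sketch below. Whether Solr can actually apply it in place depends on the schema: per the 6.6 guide linked earlier, the target field has to be a single-valued, non-indexed, non-stored numeric docValues field. The field name popularity and the client variable are placeholders:

import java.util.Collections;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class InPlaceUpdateSketch {
    public static void setPopularity(SolrClient client, String docId, int value) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId);                                            // uniqueKey of the target doc
        doc.addField("popularity", Collections.singletonMap("set", value));   // atomic/in-place "set"
        client.add(doc);
        client.commit();
    }
}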



Re: AW: SolrClient#updateByQuery?

Shawn Heisey-2
In reply to this post by Clemens Wyss DEV
On 1/26/2018 9:55 AM, Clemens Wyss DEV wrote:
> Why do I want to do all these (dumb) things? The context is as follows:
> when a document is deleted in an index/core, this deletion is not immediately reflected in the search results. Deletions are not really NRT (or has this changed?). Until now we "solved" this brutally by forcing a commit (with "expunge deletes"), until we noticed that this results in quite a "heavy load", to say the least.
> Now I have the idea to add a "deleted" flag to all documents, which is filtered on in all queries.
> When it comes to deletions, I would update the document's deleted flag and then effectively delete it. For a single deletion this is OK, but what if I need to re-index?

The deleteByQuery functionality is known to have some issues getting
along with other things happening at the same time.

For best performance and compatibility with concurrent operations, I
would strongly recommend that you change all deleteByQuery calls into
two steps:  Do a standard query with fl=id (or whatever your uniqueKey
field is), gather up the ID values (possibly with start/rows pagination
or cursorMark), and then proceed to do one or more deleteById calls with
those ID values.  Both the query and the ID-based delete can coexist
with other concurrent operations very well.

I would expect that doing atomic updates to a deleted field in your
documents is going to be slower than the query/deleteById approach.  I
cannot be sure this is the case, but I think it would be.  It should be
a lot more friendly to NRT operation than deleteByQuery.

As Walter said, expungeDeletes will result in Solr doing a lot more work
than it should, slowing things down even more.  It also won't affect
search results at all.  Once the commit finishes and opens a new
searcher, Solr will not include deleted documents in search results. 
The expungeDeletes parameter can make commits take a VERY long time.

I have no idea whether the issues surrounding deleteByQuery can be fixed
or not.

Thanks,
Shawn
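A minimal SolrJ sketch of this two-step approach (cursorMark-paged ID query, then deleteById), assuming id is the uniqueKey and client is a SolrClient:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DeleteByQueryAsDeleteById {
    public static void delete(SolrClient client, String query) throws Exception {
        SolrQuery q = new SolrQuery(query);
        q.setFields("id");
        q.setRows(1000);
        q.setSort("id", SolrQuery.ORDER.asc);    // cursorMark requires a sort on the uniqueKey

        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = client.query(q);

            List<String> ids = new ArrayList<>();
            for (SolrDocument doc : rsp.getResults()) {
                ids.add(String.valueOf(doc.getFieldValue("id")));
            }
            if (!ids.isEmpty()) {
                client.deleteById(ids);          // ID-based delete, friendly to concurrent updates
            }

            String next = rsp.getNextCursorMark();
            if (next.equals(cursor)) {
                break;                           // no more pages
            }
            cursor = next;
        }
        client.commit();
    }
}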


AW: AW: SolrClient#updateByQuery?

Clemens Wyss DEV
Thanks for all these (main contributors' 😉) valuable inputs!

First thing I did was getting rid of "expungeDeletes". My "single-deletion" unit test failed until I added the optimize param
> updateRequest.setParam( "optimize", "true" );
Does this make sense or should I JIRA it?
How expensive is this "optimization"?



AW: AW: SolrClient#updateByQuery?

Clemens Wyss DEV
<again, with hopefully fewer typos>
Thanks for all these (main contributors' 😉) valuable inputs!

First thing I did was getting rid of "expungeDeletes". My "single-deletion" unit test failed until I added the optimize param
> updateRequest.setParam( "optimize", "true" );
Does this make sense or should I JIRA it?
How expensive is this "optimization"?
BTW: we are on Solr 6.6.0


Re: AW: AW: SolrClient#updateByQuery?

Shawn Heisey-2
In reply to this post by Clemens Wyss DEV
On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
> Thanks for all these (main contributors' 😉) valuable inputs!
>
> First thing I did was getting rid of "expungeDeletes". My "single-deletion" unit test failed until I added the optimize param
>> updateRequest.setParam( "optimize", "true" );
> Does this make sense or should I JIRA it?
> How expensive is this "optimization"?

An optimize operation is a complete rewrite of the entire index to one
segment.  It will typically double the size of the index.  The rewritten
index will not have any documents that were deleted in it.  It's slow
and extremely expensive.  If the index is one gigabyte, expect an
optimize to take at least half an hour, possibly longer, to complete.
The CPU and disk I/O are going to take a beating while the optimize is
occurring.

Thanks,
Shawn

AW: AW: AW: SolrClient#updateByQuery?

Clemens Wyss DEV
Erick said/wrote:
> If you commit after docs are deleted and _still_ see them in search results, that's a JIRA
should I JIRA it?


Re: AW: AW: SolrClient#updateByQuery?

Erick Erickson
Clemens:

Let's not raise a JIRA quite yet. I am 99% sure your test is not doing
what you think or you have some invalid expectations. This is such a
fundamental feature that it'd surprise me a _lot_ if it were a bug.
Also, there are a bunch of DeleteByQuery tests in the JUnit tests
that are run all the time.

Wait, are you issuing an explicit commit or not? I saw the phrase
"...brutally by forcing a commit (with "expunge deletes")..." and saw
the word "commit" and assumed you were issuing a commit, but
re-reading it, that's not clear at all. The code should look something like:

update-via-delete-by-query
solrClient.commit();
query to see if doc is gone.

So here's what I'd try next:

1> Issue an explicit commit command (SolrClient.commit()) after the
DBQ. The defaults there are openSearcher = true and waitSearcher =
true. When that returns, _then_ issue your query.
2> If that doesn't work, try (just for information gathering) waiting
several seconds after the commit before issuing your request. This should
_not_ be necessary, but it'll give us a clue what's going on.
3> Show us the code if you can.

Best,
Erick
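Concretely, the sequence from <1> might look like this SolrJ sketch (client and the query strings are placeholders):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

public class DeleteThenVerifySketch {
    public static long deleteAndCount(SolrClient client, String deleteQuery) throws Exception {
        client.deleteByQuery(deleteQuery);
        // Explicit commit; by default it opens a new searcher and waits for it (openSearcher=true,
        // waitSearcher=true), so the following query runs against the post-delete view of the index.
        client.commit();
        SolrQuery q = new SolrQuery(deleteQuery);
        q.setRows(0);
        return client.query(q).getResults().getNumFound();   // expected to be 0 after the commit
    }
}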



AW: AW: AW: SolrClient#updateByQuery?

Clemens Wyss DEV
I must clarify a few things:
the unit test I mentioned does not check/perform a DBQ but a "simple" deleteById.
The deleted document is no longer found (as expected), BUT I am still getting "suggestions" (from spellcheck.q). So my problem is not that I find deleted documents, but that I get suggestions resulting from the deleted document.

The suggestions configuration is as follows:
<searchComponent name="suggest_phrase" class="solr.SpellCheckComponent">
            <lst name="spellchecker">
                <str name="name">suggest_phrase_fuzzy</str>
                <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
                <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory</str>
                <str name="allTermsRequired">true</str>
                <str name="maxEdits">2</str>
                <str name="ignoreCase=">true</str>
                <str name="field">_my_suggest_phrase</str>
                <str name="suggestAnalyzerFieldType">string</str> <!-- suggest_phrase -->
                <!--  <str name="storeDir">suggest_phrase_fuzzy</str>  -->
                <str name="buildOnOptimize">false</str>
                <str name="buildOnStartup">false</str> <!-- ?? -->
                <str name="buildOnCommit">true</str>
            </lst>
        </searchComponent>

Most importantly: "buildOnCommit"->true.

The question hence is:
What (which commit?) do I need to do after
> solrClient.deleteById( toBeDeletedDocumentIDs );

for the suggestions to be up-to-date too (without heavy commit/optimize)?

thx and sorry for the misunderstandings


Re: AW: AW: SolrClient#updateByQuery?

Erick Erickson
bq: I am still getting "suggestions" (from spellcheck.q)

OK, this is actually expected behavior. The spellcheck is done from
the _indexed_ terms. Documents deleted from the index are marked as
deleted, but the associated terms are not purged from the index until the
segment is merged. When just checking the terms for spellcheck,
there's no good way to figure out that a term is part of a deleted
doc.

Your expungeDeletes "fix" really wouldn't have actually fixed your
problem in any kind of production environment. ExpungeDeletes merges
segments with > n% deleted docs. It only fixed your test case because,
I suspect, you have very few documents (perhaps only one) in your
segment, so it was merged away. In a situation where you had, say,
10,000 docs in a segment and you deleted the one (and only) document
with some term, expungeDeletes would skip merging the segment and
spellcheck would still return the suggestion.

Optimize on the other hand unconditionally rewrites all segments into
a single segment, so that was removing the indexed term. As discussed,
optimize is a _very_ expensive operation and, unless you're able to
optimize after every indexing session, it will not scale. The
situations where I've seen this be acceptable are ones in which the
index changes rarely, for example you update your index once a day. If
you continually update your index, optimizing will actually make this
problem worse between optimizations, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

At a higher level, you're expending a lot of effort to handle the case
where a document is deleted and it's the last "live" doc in your
entire corpus that contains a term. For a decent-sized corpus this
will be quite rare, so people often simply don't worry about it. The
scenario in your test case is somewhat artificial and makes it seem
more likely than it will probably be "in the real world".

Consider setting spellcheck's thresholdTokenFrequency to some value.
That parameter's primary purpose is to handle situations where words
are misspelled in the documents so you don't suggest those misspelled
words, but I think it would cover this situation too. Unfortunately it
will not work very well in a simple test setup either. Let's say you
set it to 2%. You index 100 documents and 3 of them contain the term.
It's now found in your spellcheck test. Now you delete two of them
(without merging any segments). The term frequency is _still_ 3% so it
still may be found after you delete and commit.

I suppose you could structure your test this way:
index 100 docs, 3 of them have a specific term.
Set your threshold to 2%
Check that the term is suggested
index 100 more docs
Check that the term is _not_ suggested.

Best,
Erick
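A rough SolrJ sketch of that test outline, assuming a /spell request handler, the _my_suggest_phrase field from the earlier config, and a thresholdTokenFrequency of 2% configured in solrconfig.xml (all of these are assumptions to adjust to the actual setup):

import java.util.UUID;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.common.SolrInputDocument;

public class SuggestThresholdCheck {

    // Index `count` docs, `withTerm` of which contain `term` in the suggest field.
    static void indexDocs(SolrClient client, int count, int withTerm, String term) throws Exception {
        for (int i = 0; i < count; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("_my_suggest_phrase", i < withTerm ? term : "filler text " + i);
            client.add(doc);
        }
        client.commit(); // buildOnCommit=true rebuilds the suggester here
    }

    // Ask the spellcheck component whether `expected` is offered as a correction for `misspelled`.
    static boolean isSuggested(SolrClient client, String misspelled, String expected) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRequestHandler("/spell");   // placeholder handler name
        q.set("spellcheck", "true");
        q.set("spellcheck.q", misspelled);
        SpellCheckResponse spell = client.query(q).getSpellCheckResponse();
        if (spell == null) {
            return false;
        }
        for (SpellCheckResponse.Suggestion s : spell.getSuggestions()) {
            if (s.getAlternatives().contains(expected)) {
                return true;
            }
        }
        return false;
    }

    public static void run(SolrClient client) throws Exception {
        indexDocs(client, 100, 3, "solrclient");   // term sits in 3% of the docs, above a 2% threshold
        System.out.println(isSuggested(client, "solrclientt", "solrclient")); // expected: true
        indexDocs(client, 100, 0, "solrclient");   // dilute the term to ~1.5% of the docs
        System.out.println(isSuggested(client, "solrclientt", "solrclient")); // expected: false
    }
}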


AW: AW: AW: SolrClient#updateByQuery?

Clemens Wyss DEV
Yet again: thanks a lot!

Spellchecking in Solr:
What are the best (up-to-date) sources/links for spellchecking and suggestions?
