different results in numFound vs using the cursor

rhys J
i am using this logic in perl:

my $decoded = decode_json( $solrResponse->{_content} );
my $numFound = $decoded->{response}{numFound};

$cursor = "*";
$prevCursor = '';

while ( $prevCursor ne $cursor )
{
  my $solrURI = "\"http://[SOLR URL]:8983/solr/";
  $solrURI .= $fdat{core};

  $solrSort = ( $fdat{core} eq 'dbtr' ) ? "debtor_id+asc" : "id+asc";
  $solrOptions = "/select?indent=on&rows=$getrows&sort=$solrSort&q=";
  $solrURI .= $solrOptions;
  $solrURI .= $query;

  $solrURI .= ( $prevCursor eq '' ) ? "&cursorMark=*\"":
                                      "&cursorMark=$cursor\"";

  print STDERR "solrURI '$solrURI'\n";
  my $solrResponse = $ua->post( $solrURI );
  my $decoded = decode_json( $solrResponse->{_content} );
  my $numFound = $decoded->{response}{numFound};

  foreach my $d ( $decoded->{response}{docs} )
  {
    my @docs = @$d;
    print STDERR "size of docs '" . scalar( @docs ) . "'\n";
    foreach my $r ( @docs )
    {
      if ( $fdat{cust_num} and $fdat{core} eq 'dbtr' )
      {
        push ( @solrResults, $r->{debtor_id} );
      }
      elsif ( $fdat{cust_num} and $fdat{core} eq 'debt' )
      {
        push ( @solrResults, $r->{debt_id} );
      }
    }
  }

  $prevCursor = ( $prevCursor eq '' ) ? "*" : $cursor;
  $cursor = $decoded->{nextCursorMark};
  print STDERR "cursor '$cursor'\n";
  print STDERR "prevCursor '$prevCursor'\n";
  print STDERR "size of solrResults '" . scalar( @solrResults ) . "'\n";
}

print out:

http://[SOLR URL]:8983/solr/debt/select?indent=on&rows=1000&sort=id+asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=AoEmMzkzMjkx

The numFound: 35008
final size of solrResults: 22006

Am I missing something I should be using with cursorMark? Or is this
expected?

I've checked my logic, and I'm using the cursors the way this page is using
them in examples:

https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html

Thanks

Rhys

Re: different results in numFound vs using the cursor

Chris Hostetter-3

Based on the info provided, it's hard to be certain, but reading between
the lines, here are the assumptions i'm making...

1) your core name is "dbtr"
2) the uniqueKey field for the "dbtr" core is "debtor_id"

..are those assumptions correct?

Two key pieces of information that don't seem to be inferable from the
info you've provided:

a) What is the fieldType of the uniqueKey field in use?
b) how are you determining that "The numFound: 35008"?

...

You show the code that prints out "size of solrResults: 22006" but nothing
in your code ever prints $numFound.  there is a snippet of code at the top
of your perl logic that seems disconnected from the rest of the code which
makes me think that before you do anything with a cursor you are already
parsing some *other* query response to get $numFound that way...

: i am using this logic in perl:
:
: my $decoded = decode_json( $solrResponse->{_content} );
: my $numFound = $decoded->{response}{numFound};
:
: $cursor = "*";
: $prevCursor = '';
:
: while ( $prevCursor ne $cursor )
: {
:   my $solrURI = "\"http://[SOLR URL]:8983/solr/";
:   $solrURI .= $fdat{core};
        ...

...what exactly does all the code *before* this look like? what is the
request that you are using to get that initial '$solrResponse' that you
are parsing to extract '$numFound'? are you sure it's exactly the same as
the query whose cursor you are iterating over?

It looks like you are (also) extracting 'my $numFound =
$decoded->{response}{numFound};' on every (cursor) request ... what do you
get if you add this to your cursor loop...

   print STDERR "numFound = $numFound at '$cursor'";


...because unless documents are being added/deleted as you iterate over
the cursor, the numFound value should be consistent on each request.


-Hoss
http://www.lucidworks.com/

Re: different results in numFound vs using the cursor

rhys J
On Mon, Nov 11, 2019 at 8:32 PM Chris Hostetter <[hidden email]>
wrote:

>
> Based on the info provided, it's hard to be certain, but reading between
> the lines, here are the assumptions i'm making...
>
> 1) your core name is "dbtr"
> 2) the uniqueKey field for the "dbtr" core is "debtor_id"
>
> ..are those assumptions correct?
>

Yes they are. Sorry I didn't provide that from the beginning.


> Two key pieces of information that don't seem to be inferable from the
> info you've provided:
>
> a) What is the fieldType of the uniqueKey field in use?
>

It is a textField


> b) how are you determining that "The numFound: 35008"?
>
>
I do a preliminary query to the solr core and print out the numFound from
this:

 my $solrResponse = $ua->post( $solrURI );

 my $decoded = decode_json( $solrResponse->{_content} );
 my $numFound = $decoded->{response}{numFound};


> ...
>
> You show the code that prints out "size of solrResults: 22006" but nothing
> in your code ever prints $numFound.  there is a snippet of code at the top
>

I am printing numFound every time it loops. This should remain constant,
because it is the total of all documents found. It's not really necessary
that I am printing it.

The number of docs is the size that I also print, and that is 1000 every
time, until the last little bit, and then it is 6 docs found.


> of your perl logic that seems disconnected from the rest of the code which
> makes me think that before you do anything with a cursor you are already
> parsing some *other* query response to get $numFound that way...
>
>
I am running this query first, to get the cursor set:

"http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=*"

This sets the cursor, and then returns a cursorMark that I start using in
order to grab 1000 documents at a time.



> ...what exactly does all the code *before* this look like? what is the
> request that you are using to get that initial '$solrResponse' that you
> are parsing to extract '$numFound'? are you sure it's exactly the same as
> the query whose cursor you are iterating over?
>
>
query from before the loop:

"http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=*"

query in the loop:

http://10.40.10.14:8983/solr/debt/select?indent=on&rows=1000&sort=id+asc&q=debt_id: 608384 OR debt_id: 393291&cursorMark=AoElMTg1MzE=

I do have some logic to make sure I grab the first 1000 from the first
query, but other than that, it's a simple loop.


> It looks like you are (also) extracting 'my $numFound =
> $decoded->{response}{numFound};' on every (cursor) request ... what do you
> get if you add this to your cursor loop...
>
>    print STDERR "numFound = $numFound at '$cursor'";
>
numFound is always 35008 because that is how many total documents are
found. The number of docs in the response is the number that I care about,
because that shows me how many came back for this slice.


> ...because unless documents are being added/deleted as you iterate over
> the cursor, the numFound value should be consistent on each request.
>
>
numFound is consistently 35008.

Thanks

Rhys

Re: different results in numFound vs using the cursor

Chris Hostetter-3

: > a) What is the fieldType of the uniqueKey field in use?
: >
:
: It is a textField

whoa... that's not normal .. what *exactly* does the fieldType declaration
(with all analyzers) look like, and what does the <field/> declaration
look like?

you should really never use TextField for a uniqueKey ... it's possible,
but incredibly tricky to get "right".

Independent from that, "sorting" on a TextField doesn't always do what you
might think (again: depending on the analysis in use)

With a cursorMark you have other factors to consider: i bet what's
happening is that the post-analysis terms for your docs result in
duplicate values, so the cursorMark is skipping all docs that have the
same (post analysis) sort value ... this could also manifest itself in
other weird ways, like trying to deleteById.

Step #1: switch to using a simple StrField for your uniqueKey field and
see if that solves all your problems.
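
To make that hypothesis concrete, here is a toy Perl model of it. This is not Solr's actual implementation: each page simply resumes strictly after the last sort value already returned. The ids, the use of `lc` as a stand-in for analysis, and the helper name `page_through` are all made up for illustration.

```perl
use strict;
use warnings;

# Walk a "cursor" over ids sorted by $key_of->($id), $rows at a time.
# Each page resumes strictly after the last sort value already returned,
# which is a simplified model of cursor paging, not Solr's real code.
sub page_through {
    my ( $ids, $key_of, $rows ) = @_;
    my @sorted = sort { $key_of->($a) cmp $key_of->($b) } @$ids;
    my ( @seen, $mark );
    while (1) {
        # Everything that sorts strictly after the current mark.
        my @after = grep { !defined $mark or $key_of->($_) gt $mark } @sorted;
        last unless @after;
        my $last = $#after < $rows - 1 ? $#after : $rows - 1;
        my @page = @after[ 0 .. $last ];
        push @seen, @page;
        $mark = $key_of->( $page[-1] );
    }
    return @seen;
}

my @ids = qw(a1 A1 b2 B2 c3 C3 d4);    # hypothetical ids, 7 docs in total

# Lowercasing stands in for an analysis chain: 'b2' and 'B2' now collide.
my @analyzed = page_through( \@ids, sub { lc $_[0] }, 3 );

# Verbatim keys stand in for a StrField: every key stays unique.
my @verbatim = page_through( \@ids, sub { $_[0] }, 3 );

printf "numFound %d, analyzed keys %d, verbatim keys %d\n",
    scalar @ids, scalar @analyzed, scalar @verbatim;
```

In this model, a doc whose key ties with the last key on a page is never returned, so the collected count falls short of the total. With verbatim keys every doc is reached.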


-Hoss
http://www.lucidworks.com/

Re: different results in numFound vs using the cursor

rhys J
On Tue, Nov 12, 2019 at 12:18 PM Chris Hostetter <[hidden email]>
wrote:

>
> : > a) What is the fieldType of the uniqueKey field in use?
> : >
> :
> : It is a textField
>
> whoa... that's not normal .. what *exactly* does the fieldType declaration
> (with all analyzers) look like, and what does the <field/> declaration
> look like?
>
>
<field name="debtor_id" type="text_general" multiValued="false"
       indexed="true" required="true" stored="true"/>

<fieldType name="text_gen_sort" class="solr.SortableTextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true"
            ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>



> you should really never use TextField for a uniqueKey ... it's possible,
> but incredibly tricky to get "right".
>
>
I am going to adjust my schema, re-index, and try again. See if that
doesn't fix this problem. I didn't know that having the uniqueKey be a
textField was a bad idea.


> Independent from that, "sorting" on a TextField doesn't always do what you
> might think (again: depending on the analysis in use)
>
> With a cursorMark you have other factors to consider: i bet what's
> happening is that the post-analysis terms for your docs result in
> duplicate values, so the cursorMark is skipping all docs that have the
> same (post analysis) sort value ... this could also manifest itself in
> other weird ways, like trying to deleteById.
>
> Step #1: switch to using a simple StrField for your uniqueKey field and
> see if that solves all your problems.
>
>
Thanks, doing this now.

Rhys

Re: different results in numFound vs using the cursor

Chris Hostetter-3

: > whoa... that's not normal .. what *exactly* does the fieldType declaration
: > (with all analyzers) look like, and what does the <field/> declaration
: > look like?
: >
: >
: <field name="debtor_id" type="text_general" multiValued="false"
: indexed="true" required="true" stored="true"/>
:
: <fieldType name="text_gen_sort" class="solr.SortableTextField"
: positionIncrementGap="100" multiValued="true">

NOTE: "text_general" != "text_gen_sort"

Assuming your "text_general" declaration looks like it does in the
_default config set, then using that for uniqueKey or sorting is definitely
not a good idea.

If you were *actually* using SortableTextField for your uniqueKeyField ...
well, that should be ok to *sort* on, but i still wouldn't suggest using
it as a uniqueKey field ... honestly not sure what behavior that might
have with things like deleteById, etc...


: I am going to adjust my schema, re-index, and try again. See if that
: doesn't fix this problem. I didn't know that having the uniqueKey be a
: textField was a bad idea.

https://lucene.apache.org/solr/guide/8_3/other-schema-elements.html#OtherSchemaElements-UniqueKey

"The fieldType of uniqueKey must not be analyzed...."

(hence my comment about "possible, but hard to get right" ... you can use
something like the KeywordTokenizer, but at that point you might as well
use StrField except in some really esoteric special situations)
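
As a minimal sketch of a non-analyzed uniqueKey, assuming the stock `string` type that ships in the default configset and a field named `id` (both names are assumptions here; substitute whatever your schema uses):

```xml
<!-- StrField indexes the value verbatim: no tokenizer, no filters -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- hypothetical field name; a uniqueKey field must be single-valued -->
<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>

<uniqueKey>id</uniqueKey>
```

Documents indexed under the old field type keep their old terms, so a change like this needs a full re-index before sorting and cursors behave consistently.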



-Hoss
http://www.lucidworks.com/

Re: different results in numFound vs using the cursor

rhys J
> : I am going to adjust my schema, re-index, and try again. See if that
> : doesn't fix this problem. I didn't know that having the uniqueKey be a
> : textField was a bad idea.
>
>
> https://lucene.apache.org/solr/guide/8_3/other-schema-elements.html#OtherSchemaElements-UniqueKey
>
> "The fieldType of uniqueKey must not be analyzed...."
>
> (hence my comment about "possible, but hard to get right" ... you can use
> something like the KeywordTokenizer, but at that point you might as well
> use StrField except in some really esoteric special situations)
>
>
Good news. I added a field called ID, and made it string. Then I deleted
documents, re-indexed my data, and tried the search again.

Now solrResults size and numFound size are exactly the same.

Thanks for your help.

Rhys
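
For anyone landing on this thread later, here is a minimal sketch of the cursor loop with two defensive touches: every query value (including the cursorMark) is percent-encoded, and the collected count is compared against numFound at the end. The helper names `uri_escape_value`, `build_select_url` and `collect_ids`, the `$fetch` callback, and the in-memory stand-in for Solr are all hypothetical. They exist only so the sketch runs without a server.

```perl
use strict;
use warnings;

# Percent-encode one query value (assumes plain ASCII/byte strings).
sub uri_escape_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf( '%%%02X', ord $1 )/ge;
    return $value;
}

# Build a /select URL from a flat name => value list.
sub build_select_url {
    my ( $base, $core, @params ) = @_;
    my @pairs;
    while ( my ( $name, $value ) = splice( @params, 0, 2 ) ) {
        push @pairs, $name . '=' . uri_escape_value($value);
    }
    return "$base/solr/$core/select?" . join( '&', @pairs );
}

# Page through a cursor. $fetch takes a cursorMark and returns the decoded
# JSON response as a hash reference. Stop when the same mark comes back.
sub collect_ids {
    my ( $fetch, $id_field ) = @_;
    my $cursor = '*';
    my ( @ids, $num_found );
    while (1) {
        my $res = $fetch->($cursor);
        $num_found = $res->{response}{numFound};
        push @ids, map { $_->{$id_field} } @{ $res->{response}{docs} };
        my $next = $res->{nextCursorMark};
        last if !defined $next or $next eq $cursor;
        $cursor = $next;
    }
    return ( $num_found, \@ids );
}

# In-memory stand-in for Solr: 7 docs, 3 per page, the "mark" is an offset.
my @all = map { +{ id => sprintf( 'doc%03d', $_ ) } } 1 .. 7;
my $fake_fetch = sub {
    my ($mark) = @_;
    my $start = $mark eq '*' ? 0 : $mark;
    my $end   = $start + 3 > @all ? scalar @all : $start + 3;
    my @docs  = @all[ $start .. $end - 1 ];
    return {
        response       => { numFound => scalar @all, docs => \@docs },
        nextCursorMark => @docs ? $end : $mark,
    };
};

my ( $num_found, $ids ) = collect_ids( $fake_fetch, 'id' );
print "numFound $num_found, collected ", scalar @$ids, "\n";
warn "cursor walk missed documents\n" if @$ids != $num_found;
```

Against a real server, `$fetch` would request the URL from `build_select_url` with the user agent and decode the body with `decode_json`, as in the original post. A collected count that differs from numFound on a static index is the symptom discussed in this thread.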