Query number of Lucene documents using Solr?

Query number of Lucene documents using Solr?

Bram Van Dam
Possibly somewhat unusual question: I'm looking for a way to query the
number of *lucene documents* from within Solr. This can be different
from the number of Solr documents (because of unmerged deletes/updates/
etc).

As a bit of background; we recently found this lovely little error
message in a Solr log, and we'd like to get a bit of an early warning
system going :-)

> Too many documents, composite IndexReaders cannot exceed 2147483647

If no way currently exists, I'm not averse to hacking one in, but I
could use a few pointers in the general direction.

As an alternative strategy, I guess I could use Lucene to walk through
each index segment and add the segment info maxDoc values. But I'm not
sure if that would be a good idea.
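To make the arithmetic concrete, here is a minimal sketch (with entirely hypothetical segment sizes) of what summing per-segment maxDoc values against Lucene's hard limit would look like:

```python
# Lucene's composite IndexReader limit: documents are addressed by a
# Java int, so the total across segments (deletions included) must
# stay below 2^31 - 1.
MAX_DOCS = 2147483647

def total_max_doc(segment_max_docs):
    """Sum the per-segment maxDoc values (live + deleted docs)."""
    return sum(segment_max_docs)

def headroom(segment_max_docs):
    """Documents remaining before the composite-reader limit is hit."""
    return MAX_DOCS - total_max_doc(segment_max_docs)

# Hypothetical segment sizes:
segments = [500_000_000, 750_000_000, 600_000_000]
print(total_max_doc(segments))  # 1850000000
print(headroom(segments))       # 297483647
```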

Thanks a bunch,

 - Bram

Re: Query number of Lucene documents using Solr?

Alexandre Rafalovitch
Luke may have it, at least for a quick check:
https://github.com/dmitrykey/luke (now part of Lucene itself, as of
Lucene 8.1).

Regards,
   Alex.

On Mon, 26 Aug 2019 at 16:20, Bram Van Dam <[hidden email]> wrote:

>
> Possibly somewhat unusual question: I'm looking for a way to query the
> number of *lucene documents* from within Solr. This can be different
> from the number of Solr documents (because of unmerged deletes/updates/
> etc).
>
> As a bit of background; we recently found this lovely little error
> message in a Solr log, and we'd like to get a bit of an early warning
> system going :-)
>
> > Too many documents, composite IndexReaders cannot exceed 2147483647
>
> If no way currently exists, I'm not averse to hacking one in, but I
> could use a few pointers in the general direction.
>
> As an alternative strategy, I guess I could use Lucene to walk through
> each index segment and add the segment info maxDoc values. But I'm not
> sure if that would be a good idea.
>
> Thanks a bunch,
>
>  - Bram

Re: Query number of Lucene documents using Solr?

Shawn Heisey-2
In reply to this post by Bram Van Dam
On 8/26/2019 2:19 PM, Bram Van Dam wrote:
> Possibly somewhat unusual question: I'm looking for a way to query the
> number of *lucene documents* from within Solr. This can be different
> from the number of Solr documents (because of unmerged deletes/updates/
> etc).
>
> As a bit of background; we recently found this lovely little error
> message in a Solr log, and we'd like to get a bit of an early warning
> system going :-)

The numbers shown in Solr's LukeRequestHandler come directly from
Lucene.  This is the URL endpoint it will normally be at, for core XXX:

http://host:port/solr/XXX/admin/luke

The specific error you encountered is why old hands recommend
staying below a billion documents in a core.  That leaves room for
deleted documents as well.
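A sketch of pulling maxDoc out of the Luke handler's JSON response. The numDocs/maxDoc/deletedDocs fields are what the handler reports in its "index" section; the payload below is abridged and purely illustrative, and a real check would fetch it over HTTP from the endpoint above:

```python
import json

# Abridged, illustrative Luke handler response; a real call would be
# e.g. GET http://host:port/solr/XXX/admin/luke?wt=json
sample_response = json.loads("""
{
  "index": {
    "numDocs": 950000000,
    "maxDoc": 1100000000,
    "deletedDocs": 150000000
  }
}
""")

LIMIT = 2147483647  # composite IndexReader ceiling

index = sample_response["index"]
max_doc = index["maxDoc"]  # Lucene docs, deletions included
print(max_doc, "of", LIMIT, "used")
print("deleted:", index["deletedDocs"])
```

Note that maxDoc (not numDocs) is the number that counts against the limit, since deleted-but-unmerged documents still occupy document IDs.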

Thanks,
Shawn

Re: Query number of Lucene documents using Solr?

Bernd Fehling
In reply to this post by Bram Van Dam
You might use Lucene's internal CheckIndex tool, included in lucene-core.
It should tell you everything you need, or at least give you a good
starting point for writing your own tool.

Copy lucene-core-x.y.z-SNAPSHOT.jar and lucene-misc-x.y.z-SNAPSHOT.jar
to a local directory.

java -cp lucene-core-x.y.z-SNAPSHOT.jar:lucene-misc-x.y.z-SNAPSHOT.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /path/to/your/index

If you append "-verbose", you will get tons of info about your index.

Regards
Bernd


Am 26.08.19 um 22:19 schrieb Bram Van Dam:

> Possibly somewhat unusual question: I'm looking for a way to query the
> number of *lucene documents* from within Solr. This can be different
> from the number of Solr documents (because of unmerged deletes/updates/
> etc).
>
> As a bit of background; we recently found this lovely little error
> message in a Solr log, and we'd like to get a bit of an early warning
> system going :-)
>
>> Too many documents, composite IndexReaders cannot exceed 2147483647
>
> If no way currently exists, I'm not averse to hacking one in, but I
> could use a few pointers in the general direction.
>
> As an alternative strategy, I guess I could use Lucene to walk through
> each index segment and add the segment info maxDoc values. But I'm not
> sure if that would be a good idea.
>
> Thanks a bunch,
>
>   - Bram
>

Re: Query number of Lucene documents using Solr?

Bram Van Dam
In reply to this post by Shawn Heisey-2
On 26/08/2019 23:12, Shawn Heisey wrote:
> The numbers shown in Solr's LukeRequestHandler come directly from
> Lucene.  This is the URL endpoint it will normally be at, for core XXX:
>
> http://host:port/solr/XXX/admin/luke

Thanks Shawn, that's a great entry point!

> The specific error you encountered is why old hands recommend
> staying below a billion documents in a core.  That leaves room for
> deleted documents as well.

Indeed, that's what we usually try. But every once in a while Stuff
Happens(TM), and so it'd be nice if we could monitor the actual count.
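An early-warning check along those lines could be as simple as comparing the reported maxDoc against a fraction of the hard limit. The 0.5 threshold below is an arbitrary choice that roughly matches the "stay below a billion" rule of thumb; in practice the value would come from polling the Luke endpoint:

```python
# Early-warning check: flag a core whose Lucene maxDoc (as reported by
# the Luke handler) is approaching the 2^31 - 1 composite-reader ceiling.
LIMIT = 2147483647

def needs_warning(max_doc, threshold=0.5):
    """True once maxDoc crosses the given fraction of the hard limit.

    threshold=0.5 roughly matches the 'stay below a billion docs'
    rule of thumb; tune to taste.
    """
    return max_doc >= threshold * LIMIT

print(needs_warning(900_000_000))    # False
print(needs_warning(1_200_000_000))  # True
```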

 - Bram

Re: Query number of Lucene documents using Solr?

Erick Erickson
Bram:

If you optimize (Solr 7.4 and earlier), that may be part of the “stuff”: an index optimized down to a single segment can accumulate far more deleted documents. Shot in the dark. See:

https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/

Plus the linked article on how Solr 7.5 is different.

Best,
Erick

> On Aug 27, 2019, at 3:23 AM, Bram Van Dam <[hidden email]> wrote:
>
> On 26/08/2019 23:12, Shawn Heisey wrote:
>> The numbers shown in Solr's LukeRequestHandler come directly from
>> Lucene.  This is the URL endpoint it will normally be at, for core XXX:
>>
>> http://host:port/solr/XXX/admin/luke
>
> Thanks Shawn, that's a great entry point!
>
>> The specific error you encountered is why old hands recommend
>> staying below a billion documents in a core.  That leaves room for
>> deleted documents as well.
>
> Indeed, that's what we usually try. But every once in a while Stuff
> Happens(TM), and so it'd be nice if we could monitor the actual count.
>
> - Bram