Info on document number limitations

Doug Tarr
Hi!

I'm working on a team that is building a Lucene-based search platform. I've been lurking on this list for a while as we are spooling up on learning the various components of Lucene. Thank you all for your amazing work!

I'm interested in learning more about what work has been done around document count limitations in the Lucene 8 codec (as described here) related to using int32 vs VInt or Int64:

"Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit."
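For readers unfamiliar with the VInt idea mentioned in the quoted passage, here is a minimal sketch of a variable-length integer encoding: seven payload bits per byte, low-order group first, with the high bit of each byte marking that more bytes follow. The class and method names (`VIntSketch`, `encode`, `decode`) are hypothetical and this is not Lucene's own code. It only illustrates why such an encoding has no fixed on-disk upper bound, whereas any value held in a Java int in memory is still capped.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of a VInt-style encoding; not taken from Lucene.
final class VIntSketch {

    // Encodes a non-negative value 7 bits at a time, low-order group first.
    // The high bit of each byte is set when more bytes follow.
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Reverses encode(): accumulates 7-bit groups until a byte without the high bit.
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] samples = {127L, 128L, Integer.MAX_VALUE, 1L << 40};
        for (long v : samples) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Small values take one byte, and values beyond the int range simply take more bytes, so the width of the on-disk encoding is not what bounds the document count in this scheme.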

I've looked through JIRA and couldn't find any discussions about it, trade-offs, difficulties, etc.  If there's any information about this, I'd appreciate any links or info that you might have.

Thanks!
- Doug
--

{ name     : "Doug Tarr",
  title    : "Director of Engineering, Search",
  location : "San Francisco, CA",
  company  : "MongoDB",
  email    : "[hidden email]",
  linkedin : "douglastarr",
  twitter  : "@doug_tarr" }

Re: Info on document number limitations

Tim Casey

Hi Doug,

I don't know the specific limits, but document numbers are going to be bounded by an int, probably signed. That works out to about 2 billion documents per Lucene index. This is fairly deeply embedded in the Lucene code. The way we have collectively solved this is through forms of sharding.

tim
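The sharding approach mentioned above can be sketched as follows: each shard is a separate index with its own int-sized document number space, and a router picks a shard from a stable document key. Everything here (`ShardRouterSketch`, `shardFor`, `minShards`, and the one-billion per-shard cap used in `main`) is a hypothetical illustration under stated assumptions, not an API from Lucene or Solr.

```java
// Hypothetical sketch of hash-based shard routing; not a Lucene or Solr API.
final class ShardRouterSketch {
    private final int shardCount;

    ShardRouterSketch(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // Maps a document key to a shard in [0, shardCount).
    // floorMod keeps the result non-negative even for negative hash codes.
    int shardFor(String documentKey) {
        return Math.floorMod(documentKey.hashCode(), shardCount);
    }

    // Smallest number of shards such that no shard needs to exceed perShardCap,
    // assuming documents are spread evenly (ceiling division).
    static int minShards(long totalDocs, long perShardCap) {
        if (totalDocs < 0 || perShardCap <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and perShardCap > 0");
        }
        return Math.toIntExact((totalDocs + perShardCap - 1) / perShardCap);
    }

    public static void main(String[] args) {
        long totalDocs = 10_000_000_000L;   // assumed corpus size
        long perShardCap = 1_000_000_000L;  // assumed, illustrative cap per shard
        int shards = minShards(totalDocs, perShardCap);
        ShardRouterSketch router = new ShardRouterSketch(shards);
        System.out.println("shards needed: " + shards);
        System.out.println("doc-42 goes to shard " + router.shardFor("doc-42"));
    }
}
```

A query would then fan out to every shard and merge the results, which is outside the scope of this sketch.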

On Fri, Feb 7, 2020 at 11:27 AM Doug Tarr <[hidden email]> wrote:
> [...]

Re: Info on document number limitations

Erick Erickson
Also, given how people use search, they usually hit performance issues long before running out of document IDs. That said, I do know of one user who’s running in the 1.0-1.5B range per replica, so 2B is just around the corner. Of course, they have to be _very_ careful how they use Solr.

That said, there’s just not a lot of pressure to go to longs, and as Tim says, it’d be a very significant effort. There would also be memory implications for everyone to balance.

Best,
Erick

> On Feb 8, 2020, at 9:59 PM, Tim Casey <[hidden email]> wrote:
> [...]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Info on document number limitations

Adrien Grand
Lucene has a limit of 2^31-1-128 documents per index; see IndexWriter.MAX_DOCS. Users don't often run into this limit, but I've seen it happen multiple times.

I think it's unlikely that Lucene will ever remove this limit on a per-segment basis. However, there have been some discussions about allowing an index to go over this limit across multiple segments: https://issues.apache.org/jira/browse/LUCENE-8321.
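As a worked version of the arithmetic above, 2^31-1-128 is `Integer.MAX_VALUE - 128`, which is 2,147,483,519. The sketch below uses a local constant rather than referencing the Lucene field, so that it runs without Lucene on the classpath; the class name `MaxDocsSketch` and the helper `fits` are hypothetical.

```java
// Hypothetical sketch; the constant mirrors the value quoted in the thread.
final class MaxDocsSketch {
    // 2^31 - 1 - 128
    static final int MAX_DOCS = Integer.MAX_VALUE - 128;

    // Returns true if adding docsToAdd documents keeps the total at or below MAX_DOCS.
    // Uses long arithmetic so the sum cannot overflow an int.
    static boolean fits(long currentDocs, long docsToAdd) {
        if (currentDocs < 0 || docsToAdd < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return currentDocs + docsToAdd <= MAX_DOCS;
    }

    public static void main(String[] args) {
        System.out.println("MAX_DOCS = " + MAX_DOCS);
        System.out.println("1.5B + 0.5B fits: " + fits(1_500_000_000L, 500_000_000L));
        System.out.println("1.5B + 1.0B fits: " + fits(1_500_000_000L, 1_000_000_000L));
    }
}
```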

On Sun, Feb 9, 2020 at 2:29 PM Erick Erickson <[hidden email]> wrote:
> [...]

--
Adrien