64-bit document numbers

64-bit document numbers

Marvin Humphrey
Greets,

Lucene, KinoSearch, and Ferret all use 32-bit document numbers.  For present
practical index sizes, that's enough.

Since we're going to optimize for 64-bit architectures, though, I think we
ought to look forward and define document numbers as 64-bit signed integers.
That way, we won't have to worry about changing things down the road to meet
the needs of growing search clusters.

Memory and disk space are concerns, but I think we can get around most of
that by guaranteeing that no individual segment contains more than I32_MAX
docs.  That way, things like document deletion maps can stay as arrays of
i32_t.
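
To illustrate the split between 64-bit global document numbers and 32-bit
per-segment numbers, here is a rough sketch, not a settled design.  It assumes
each segment has a 64-bit base offset and that docs within a segment are
numbered from 0.  The names segment_bases, to_global, and locate are
hypothetical, and Python is used only for brevity.

```python
from bisect import bisect_right

I32_MAX = 2**31 - 1  # largest value an i32_t can hold
I64_MAX = 2**63 - 1  # largest value a signed 64-bit doc number can hold


def segment_bases(doc_counts):
    """Return the global base offset of each segment (assumed layout)."""
    bases, total = [], 0
    for count in doc_counts:
        # The guarantee from the post: no segment holds more than I32_MAX docs.
        if not 0 <= count <= I32_MAX:
            raise ValueError("segment doc count must be in [0, I32_MAX]")
        bases.append(total)
        total += count
    if total > I64_MAX:
        raise OverflowError("index exceeds the signed 64-bit doc number range")
    return bases


def to_global(bases, seg_index, local_doc):
    """Widen a 32-bit, segment-local doc number to a 64-bit global one."""
    return bases[seg_index] + local_doc


def locate(bases, global_doc):
    """Map a global doc number back to (segment index, local doc number)."""
    seg_index = bisect_right(bases, global_doc) - 1
    return seg_index, global_doc - bases[seg_index]
```

Under these assumptions only the base offsets need 64 bits; anything stored
per segment, such as a deletion map, keeps using local numbers that fit in
an i32_t.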

Marvin Humphrey

Re: [Lucy] 64-bit document numbers

Peter Karman
Marvin Humphrey wrote on 12/7/09 7:25 PM:

> Since we're going to optimize for 64-bit architectures, though, I think we
> ought to look forward and define document numbers as 64-bit signed integers.
> That way, we won't have to worry about changing things down the road to meet
> the needs of growing search clusters.

+1

--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: 64-bit document numbers

Nathan Kurz
In reply to this post by Marvin Humphrey
On Mon, Dec 7, 2009 at 5:25 PM, Marvin Humphrey <[hidden email]> wrote:
> Since we're going to optimize for 64-bit architectures, though, I think we
> ought to look forward and define document numbers as 64-bit signed integers.
> That way, we won't have to worry about changing things down the road to meet
> the needs of growing search clusters.

I think it's good to handle everything internally as 64-bit, but I'm
unsure how that should interface with the outside.  On the input side,
we want to make sure that index formats can save document numbers in
whatever format they desire (likely 32-bit).  I'm guessing this will
be taken care of by your segment max.

But on the output side, we may also want 32-bit sizes if we are doing
clustering and need to send intermediate results between computers.
At the same time, the gathering computer will need to handle the docs
as 64-bit so as to merge results properly.  This can probably be fit in
after the fact, but I just thought I'd mention it.
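
A rough sketch of what the output side could look like, under assumptions
that are not part of any agreed design: each node sends hits as a 32-bit
float score plus a 32-bit doc number, sorted by descending score, and the
gatherer knows a 64-bit base offset per node.  The wire layout and the
names encode_hits, decode_hits, and gather are hypothetical; Python is
used only for brevity.

```python
import heapq
import struct
from itertools import islice

# Assumed wire format: little-endian 32-bit float score, 32-bit signed doc number.
HIT = struct.Struct("<fi")


def encode_hits(hits):
    """Pack (score, local_doc) pairs into 8 bytes each for transfer."""
    return b"".join(HIT.pack(score, doc) for score, doc in hits)


def decode_hits(payload):
    """Unpack a payload back into (score, local_doc) pairs."""
    return list(HIT.iter_unpack(payload))


def gather(payloads, node_bases, limit):
    """Widen each node's 32-bit doc numbers to 64-bit and merge by score."""
    streams = [
        [(score, base + doc) for score, doc in decode_hits(payload)]
        for payload, base in zip(payloads, node_bases)
    ]
    # Each stream is already sorted by descending score, so a k-way merge works.
    merged = heapq.merge(*streams, key=lambda hit: -hit[0])
    return list(islice(merged, limit))
```

With this layout the same local doc number from two nodes stays distinct
after gathering, because the widening happens only on the gathering side.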

--nate