
docid is just a signed int32


docid is just a signed int32

Cristian Lorenzetto
docid is a signed int32, so it is not that big. But docid does not seem to be an unmodifiable primary key; rather, it is a temporary id for the view related to a specific search.

So can a repository contain more than 2^31 documents?

Is my deduction correct? Is there a maximum size for a Lucene index?

Re: docid is just a signed int32

Adrien Grand
No, IndexWriter enforces that the number of documents cannot go over
IndexWriter.MAX_DOCS (which is a bit less than 2^31) and
BaseCompositeReader computes the number of documents in a long variable and
ensures it is less than 2^31, so you cannot have indexes that contain more
than 2^31 documents.

Larger collections should be split across multiple shards, with
TopDocs.merge used to combine results.
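
(For reference, a minimal sketch of that pattern, assuming two hypothetical shard directories and Lucene 6.x APIs; the field name and paths are made up:)

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ShardedSearch {
  public static void main(String[] args) throws Exception {
    String[] shardPaths = { "/idx/shard0", "/idx/shard1" }; // hypothetical
    Query query = new TermQuery(new Term("body", "lucene")); // hypothetical field

    // Run the same query against each shard; each shard stays under 2^31 docs.
    TopDocs[] perShard = new TopDocs[shardPaths.length];
    for (int i = 0; i < shardPaths.length; i++) {
      DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(shardPaths[i])));
      perShard[i] = new IndexSearcher(reader).search(query, 10);
    }

    // Combine the per-shard top hits into one global top-10 ranking.
    TopDocs merged = TopDocs.merge(10, perShard);
    for (ScoreDoc hit : merged.scoreDocs) {
      System.out.println("doc=" + hit.doc + " score=" + hit.score);
    }
  }
}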


Re: docid is just a signed int32

Glen Newton
Or maybe it is time Lucene re-examined this limit.

There are use cases out there where >2^31 does make sense in a single index
(huge number of tiny docs).

Also, I think the underlying hardware and the JDK have advanced enough to
make this more defensible.

Constructively,
Glen



Re: docid is just a signed int32

Cristian Lorenzetto
Maybe Lucene has a maximum size of 2^31 because result sets are Java arrays, whose length is an int.
A suggestion for a possible future change: instead of a Java array, use an Iterator. An Iterator is a more scalable ADT that does not hog memory just to return documents.
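
(For what it's worth, Lucene's existing Collector API already streams matching docids one at a time rather than materializing them; only the requested top-N hits live in an array. A rough sketch with Lucene 6.x signatures, the per-document processing step being hypothetical:)

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

// Streams every matching document to a callback without building a result array.
class StreamingCollector extends SimpleCollector {
  private int docBase; // docid offset of the current index segment

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    docBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    // Called once per matching document; nothing is accumulated in memory.
    int globalDoc = docBase + doc;
    System.out.println(globalDoc); // stand-in for real per-document processing
  }

  @Override
  public boolean needsScores() {
    return false; // docids only, so scoring work can be skipped
  }
}

// Usage: indexSearcher.search(query, new StreamingCollector());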



Re: docid is just a signed int32

Greg Bowyer-2
What are you trying to index that has more than 3 billion documents per
shard / index and cannot be split as Adrien suggests?






Re: docid is just a signed int32

Cristian Lorenzetto
Databases normally support at least a long primary key.
Ask an application like Twitter, for example, which grows by more than 4 petabytes every year :) Maybe they use storage devices bigger than a PC's :)
However, if you offer the possibility to use shards... that is one option anyway :)
For this reason my suggestion was different: it was not about the size of the repository, but the size of the search result :):):)

"A suggestion for a possible future change: instead of a Java array, use an Iterator. An Iterator is a more scalable ADT that does not hog memory just to return documents."

It is just a suggestion anyway, for my beloved Lucene :):)



Re: docid is just a signed int32

Trejkaz
In reply to this post by Adrien Grand
On Thu, Aug 18, 2016 at 11:55 PM, Adrien Grand <[hidden email]> wrote:
> Larger collections should be split across multiple shards, with
> TopDocs.merge used to combine results.

But hang on:
* TopDocs#merge still returns a TopDocs.
* TopDocs still uses an array of ScoreDoc.
* ScoreDoc still uses an int doc ID.

Looks like you're still screwed.

I wish IndexReader would use long IDs too, because one IndexReader can
span multiple shards as well. It doesn't make much sense to me that
this is restricted, although "it's hard to fix in a
backwards-compatible way" is certainly a good reason. :D

TX



Re: docid is just a signed int32

Erick Erickson
OK, I'm a little out of my league here, but I'll plow on anyway....

bq: There are use cases out there where >2^31 does make sense in a single index

OK, let's put some definition to this and define the use-case specifically rather than be vague. I've just run an experiment, for instance, where I had 200M docs in a single shard (very small docs) and tried to sort by a date on all of them: performance was on the order of 5 seconds. 3B docs is what, 75 seconds? Does the use-case involve sorting? Faceting? If so, the performance will probably be poor.

This would be huge surgery, I believe, and there hasn't been a compelling use-case in the search world for it. Unless and until that case is made, I suspect this idea will meet with a lot of resistance.

That said, I do understand that this is somewhat akin to "Nobody will ever need more than 64K of RAM", meaning that some limits are assumed and eventually become outmoded. But given Java's issues with memory and GC, I suspect it'll be really hard to justify the work this would take.

FWIW,
Erick





Re: docid is just a signed int32

Adrien Grand
In reply to this post by Trejkaz
On Fri, Aug 19, 2016 at 03:32, Trejkaz <[hidden email]> wrote:

> But hang on:
> * TopDocs#merge still returns a TopDocs.
> * TopDocs still uses an array of ScoreDoc.
> * ScoreDoc still uses an int doc ID.
>

This is why ScoreDoc has a `shardIndex`, so that you can know which index a
document came from.

I'm not saying we should not switch to long doc ids, but as outlined in
some other responses it would be a challenging change.
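
(A minimal sketch of that resolution step, assuming `searchers` is the array of per-shard IndexSearchers whose TopDocs were handed to TopDocs.merge — which fills in shardIndex — and that documents carry a hypothetical stored "id" field:)

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class MergedHits {
  // hit.doc is only meaningful relative to the shard that produced it, so we
  // resolve each merged hit against that shard's searcher via shardIndex.
  static void printHits(TopDocs merged, IndexSearcher[] searchers) throws Exception {
    for (ScoreDoc hit : merged.scoreDocs) {
      Document doc = searchers[hit.shardIndex].doc(hit.doc);
      System.out.println(doc.get("id") + " scored " + hit.score);
    }
  }
}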

Re: docid is just a signed int32

Glen Newton
Making docid an int64 is a non-trivial undertaking, and this work needs to
be compared against the use cases and how compelling they are.

That said, in the lifetime of most software projects a decision is made to
break backward compatibility to move the project forward.
When/if moving to int64 happens, it will be one of these moments. It is not
a Bad Thing (necessarily).  :-)

And as for use cases: if I am running a commercial JVM on a 64-core machine
with 3TB of RAM (we have these running), int64 for >2^32 documents in a
single index should not be a problem...  :-)

glen


Re: docid is just a signed int32

Cristian Lorenzetto
ah :)

"with 3TB of ram (we have these running), int64 for >2^32 documents in a
single index should not be a problem"

Maybe i m reasoning in bad way but normally the size of storage is not
the size of memory.
I dont know lucene in the deep, but i would aspect lucene index is
scanning a block step by step, not all in memory. For this reason in a
previous post, i mentioned about possibility to use iterator instead
array, because array load in memory all the results,instead iterator
load a single document (or a fixed number of them) for every step. In
the case you call loadAll() there is a problem with memory.







Re: docid is just a signed int32

Glen Newton
I was referring to memory (RAM).

We have machines running right now with 1TB _RAM_, and will be getting
machines with 3TB RAM (Dell R830 with 48 64GB DIMMs). (Sorry, I was
incorrect when I said we were running the 3TB machines _now_.)

Glen




RE: docid is just a signed int32

Uwe Schindler
In reply to this post by Cristian Lorenzetto
Hi,

The Lucene-internal docid is not a unique identifier; it is not even stable!
It is just a temporary property used to identify a document within an index segment / shard, and it is only valid for the lifetime of an IndexReader.

Lucene (and Solr / Elasticsearch) can hold "indexes" with much more than 2 billion documents, because they shard internally (which a database also does). Direct Lucene users are just at a lower level than "application" / "database" users. Would you care how MySQL internally addresses the rows in its tables?
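
(To make that concrete, a small sketch of looking a document up by its stable, application-level key — here a hypothetical stored "id" field in a hypothetical index path; the int docid obtained along the way is only valid for this particular reader:)

import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class StableIdLookup {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/idx/shard0")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // "id" is a stable key we indexed ourselves as a stored field.
      TopDocs hits = searcher.search(new TermQuery(new Term("id", "doc-42")), 1);
      if (hits.totalHits > 0) {
        int ephemeral = hits.scoreDocs[0].doc; // may change after merges/reopen
        Document d = searcher.doc(ephemeral);
        System.out.println("stable id: " + d.get("id") + ", docid right now: " + ephemeral);
      }
    }
  }
}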

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]





Re: docid is just a signed int32

Cristian Lorenzetto
In reply to this post by Erick Erickson
In my opinion this experiment doesn't tell us anything we didn't know before. Obviously, if you try to retrieve the whole data store in a single query, performance will not be good. Lucene is fantastic, but not magic: the laws of physics still apply. A query is designed to retrieve a small part of a big store, not the whole store. In addition, I think the time would be bad even without sorting the documents; with a persisted sorted linked list I don't see any relevant delays.

Sincerely, I also don't understand the GC memory limit with Lucene's algorithm. The amount of memory used is not proportional to the datastore size, otherwise Lucene would not be scalable. The problem to analyze, for me, is a different one: considering the trend of big data to keep growing in recent years, the typical maximum size of the databases we know, and whether or not Lucene's sharding can scale up with dynamically defined arrays, we can evaluate whether this refactoring makes sense.

Sent from my iPad




Re: docid is just a signed int32

Cristian Lorenzetto
I am looking over TopDocs.merge.

What is the difference between using multiple IndexSearchers and then
TopDocs.merge, versus using a MultiReader?


Re: docid is just a signed int32

Cristian Lorenzetto
Maybe with TopDocs.merge you can run the same query on multiple indexes,
while with MultiReader you can also perform join operations across different indexes.
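
(A minimal sketch of the MultiReader side, with hypothetical shard paths and field name. The sub-readers appear as one logical index with a single docid space, which is why BaseCompositeReader still caps the combined total at roughly 2^31 documents:)

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MultiReaderSearch {
  public static void main(String[] args) throws Exception {
    DirectoryReader r0 = DirectoryReader.open(FSDirectory.open(Paths.get("/idx/shard0")));
    DirectoryReader r1 = DirectoryReader.open(FSDirectory.open(Paths.get("/idx/shard1")));
    try (MultiReader all = new MultiReader(r0, r1)) { // closing it closes the sub-readers
      IndexSearcher searcher = new IndexSearcher(all);
      TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
      System.out.println(hits.totalHits + " total hits across both shards");
    }
  }
}

With separate per-shard searchers plus TopDocs.merge, by contrast, each shard keeps its own docid space, so the 2^31 limit applies per shard rather than to the collection as a whole.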


Re: docid is just a signed int32

Jerven Tjalling Bolleman
In reply to this post by Greg Bowyer-2
Hi All,

I too would like docids larger than int32. Not today, but in 4 years that would be very nice ;) Already we are splitting some indexes that would be nicer kept together (mostly because keeping them together would let us use more Lucene code instead of our own).

On the other hand, we are not the default Lucene use case. We index once a month and then have a frozen index. After "freezing" the index we use the Lucene docids to link search results to our document storage. We could use a stored field value instead, but for now this use of the internal Lucene id has been a nice optimization.

The closest we come to this maximum index size is in an index of how our (uniprot.org) database links to other databases. These links are stored as very small documents, and we have 892,236,174 of them. We can split this into lots of smaller indexes without too much hassle. On the other hand, it would be even nicer to merge them all into one larger index, which would have 1.5 billion documents, as that would allow us to use Lucene's document joining logic. For now we have our own cross-index joining logic, which is optimized but not optimal.

We get into this problem because we somewhat abuse Lucene to act as more than just a text retrieval engine. We actually have a number of custom query objects that allow users to integrate certain compute results into a Lucene search.

Now, I understand that splitting indexes into shards is a completely reasonable direction. On the other hand, we have more than acceptable search performance on 800-million-document indexes and see no reason why that would not also be the case on one 5 times the size, especially considering this performance is achieved today on machines with 32GB RAM (18GB heap) and 8 cores. I.e., for us it would be far cheaper to buy bigger machines than to re-architect. I expect that with improvements in the JVM and GC it would make sense to have 1 or 2 Solr/Elasticsearch nodes on one large machine instead of the 5 to 10 we are hearing about on some deployments.

Some of the decisions behind what we built we would not make today if starting from scratch. But considering we started using Lucene 10 years ago and are current with the latest release, the decision to continue with our madness makes sense, and it would remain possible for another 10 years if we had 64 bits for a docid.

Again, not something for now, but something that would be interesting in the Java 10 time frame.

Regards,
Jerven

P.S. Thank you very much for building a great search library and ecosystem.

P.P.S. If you want to see the madness in action, visit uniprot.org.



--
-------------------------------------------------------------------
Jerven Bolleman                        [hidden email]
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------

