Rebuilding parallel indexes


adb
I have a design where I will be using multiple index shards to hold approx 7.5
million documents per index per month over many years.  These will be large
static R/O indexes but the corresponding smaller parallel index will get many
frequent changes.

I understand from previous replies by Hoss that the technique to handle this is
to use parallel indexes where the parallel index gets rebuilt periodically with
the changing data.
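
For reference, the pattern as I understand it, as a sketch only (the index
paths are placeholders, and this would sit inside whatever method manages
the searchers):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ParallelReader;
  import org.apache.lucene.search.IndexSearcher;

  // Sketch: search the large static index and the small, frequently
  // rebuilt index as if they were one.  Both must hold the same docs
  // in the same docId order, which is exactly what forces the rebuilds.
  IndexReader staticReader  = IndexReader.open("/indexes/2007-01/static");
  IndexReader dynamicReader = IndexReader.open("/indexes/2007-01/dynamic");

  ParallelReader parallel = new ParallelReader();
  parallel.add(staticReader);     // big R/O fields
  parallel.add(dynamicReader);    // small, changing fields

  IndexSearcher searcher = new IndexSearcher(parallel);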

However, this 'periodically' needs to be quite frequent to keep the index
responsive to changes, potentially several times a day.  One problem is that
updates can touch data in almost any month, so an update by a user to 120
documents, one document per month over 10 years, requires a full rebuild of
the 120 index shards of 7.5m docs each...

I was wondering about the technical reasons why a 'delete+add' could not
re-use the original docId, which would keep the two parallel indexes in sync
without requiring a rebuild.
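
As far as I can tell, the obstacle is the append-only segment format.  In
code terms (a sketch, with "id" standing in for our unique key field and
'writer' an already-open IndexWriter):

  import org.apache.lucene.index.Term;

  // Sketch (Lucene 2.1+ IndexWriter delete API): the old docId is only
  // *marked* deleted, and addDocument always appends at the end, so the
  // updated document effectively gets a brand-new docId.
  writer.deleteDocuments(new Term("id", "doc-42"));
  writer.addDocument(updatedDoc);   // appended; does not reclaim the old slot
  // The hole is only reclaimed later, when segments merge, at which point
  // every docId above it shifts down anyway.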

If this could be overcome, this would make this parallel index pattern so much
more useful for large volume data sets.

Any thoughts?
Antony







Re: Rebuilding parallel indexes

Andrzej Białecki
Antony Bowesman wrote:

> I have a design where I will be using multiple index shards to hold
> approx 7.5 million documents per index per month over many years.  These
> will be large static R/O indexes but the corresponding smaller parallel
> index will get many frequent changes.
>
> I understand from previous replies by Hoss that the technique to handle
> this is to use parallel indexes where the parallel index gets rebuilt
> periodically with the changing data.
>
> However, this 'periodically' needs to be quite frequent to keep the
> index responsive to changes, potentially several times a day.  One
> problem is that updates can touch data in almost any month, so an
> update by a user to 120 documents, one document per month over 10
> years, requires a full rebuild of the 120 index shards of 7.5m docs
> each...
>
> I was wondering about the technical reasons why a 'delete+add' could
> not re-use the original docId, which would keep the two parallel
> indexes in sync without requiring a rebuild.
>
> If this could be overcome, this would make this parallel index pattern
> so much more useful for large volume data sets.
>
> Any thoughts?

I have a thought ;) Perhaps you could use a FilterIndexReader to
maintain a map between new IDs and old IDs, and remap on the fly.
Although I think that some parts of Lucene depend on the fact that in a
normal index the IDs are monotonically increasing ... this would
complicate the issue.
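
Something along these lines, perhaps (untested, and it only covers the
stored-field path; TermDocs/TermPositions, norms and deletions would all
need the same remapping, which is the hard part):

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.FilterIndexReader;
  import org.apache.lucene.index.IndexReader;

  // Untested sketch: present the rebuilt parallel index in the docId
  // order of the static index.  remap[staticId] == docId in the rebuilt
  // index; how to build that map (e.g. from a shared unique key field)
  // is left open.
  public class RemappingReader extends FilterIndexReader {
    private final int[] remap;

    public RemappingReader(IndexReader in, int[] remap) {
      super(in);
      this.remap = remap;
    }

    public Document document(int n) throws IOException {
      return in.document(remap[n]);
    }
  }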

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Rebuilding parallel indexes

adb
Andrzej Bialecki wrote:

> I have a thought ;) Perhaps you could use a FilterIndexReader to maintain a
> map between new IDs and old IDs, and remap on the fly. Although I think that
> some parts of Lucene depend on the fact that in a normal index the IDs are
> monotonically increasing ... this would complicate the issue.

Interesting thought!  I've not yet looked into the guts of ParallelReader,
but I can imagine it working, though it sounds like an effective rewrite of
ParallelReader.  Optimize would be a problem, though, as optimizing this
index would mean recreating the mapping table (I'm assuming optimization
would muck up the ids if only the parallel index was optimized).
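
For example (a sketch; 'dir' and 'analyzer' are assumed to be set up
already), a delete followed by an optimize of just the parallel index
shifts every id above the deleted slot:

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;

  // Sketch: deletes + optimize() renumber docIds, so any stored
  // new->old mapping table goes stale afterwards.
  IndexReader r = IndexReader.open(dir);
  r.deleteDocument(5);           // docId 5 is only *marked* deleted here
  r.close();

  IndexWriter w = new IndexWriter(dir, analyzer, false);
  w.optimize();                  // compacts segments: old 6 -> 5, 7 -> 6, ...
  w.close();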

You'd also need to get the new docId for each doc that is added.  Are docIds
allocated during addDocument or during the commit?
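
Either way, I assume the new docId could at least be recovered after
reopening the reader by looking up our unique key field, something like:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;

  // Sketch: 'reader' is a freshly (re)opened IndexReader and "key" is
  // our unique document key field; "doc-42" is just an example value.
  TermDocs td = reader.termDocs(new Term("key", "doc-42"));
  int newDocId = td.next() ? td.doc() : -1;   // -1 when not found
  td.close();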

Antony


