Iterating Over All Documents On a Changing Index

Matt Davis
Hi All,

I am working on implementing an in-place reindex using Lucene.  In my
case, I have a BSON document stored in a binary field and a set of rules
that pull fields out of the BSON and index them into different Lucene
fields with different analyzers.  I would like to be able to change these
rules / schema and then iterate over the documents, indexing them using the
new schema.

I have come up with the following code block:
https://gist.github.com/mdavis95/f600e0a8233d0a1232eff77645d1dc8a

I have two questions:
1) Is this a good way to iterate over the documents?
2) How can I manage documents changing while I am doing this?  New documents
coming in should be fine, I believe, but changes to existing documents could
be lost, if I understand correctly.

I hope that this is the right place to ask this question and I apologize if
this is obvious or has been asked and answered.

Thanks,
Matt

Re: Iterating Over All Documents On a Changing Index

Adrien Grand
This is the right place to ask these questions indeed.

This is a good way to iterate over documents. Regarding your 2nd
question, Lucene IndexReaders are point-in-time views of the data, so
changes made after a reader was opened won't become visible through it.
The tricky part of this kind of problem is usually dealing with documents
that get indexed after you pulled a new reader and while you are in the
process of reindexing.

(In reply to Matt Davis's message of Sat, Oct 19, 2019, 1:35 AM.)

--
Adrien


Re: Iterating Over All Documents On a Changing Index

Matt Davis
Thanks for the clarification.  I have written my own logic that tracks
changes and skips documents that have been written or deleted since the
reindex started.
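
For readers of the archive, here is a minimal sketch of what that kind of
tracking could look like. It is not the code from the thread: the class
ReindexTracker and its methods are hypothetical names, and a plain in-memory
map stands in for the index. The idea is that the normal write/delete path
records each document id it touches while a reindex is active, and the
reindex loop, which reads from an older point-in-time snapshot, skips any id
recorded that way so it cannot overwrite a newer version with stale data.
A single lock keeps the check and the write atomic, at the cost of
serializing writes.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper: remembers which ids changed while a reindex runs. */
final class ReindexTracker {
    // Ids written or deleted by live traffic since start() was called.
    private final Set<String> touched = new HashSet<>();
    private boolean active;

    public synchronized void start() { touched.clear(); active = true; }

    public synchronized void finish() { active = false; touched.clear(); }

    /** Normal write/delete path: record the id (if reindexing), then write. */
    public synchronized void liveChange(String id, Runnable write) {
        if (active) {
            touched.add(id);
        }
        write.run();
    }

    /**
     * Reindex path: run the rewrite only if no live change touched the id.
     * Check and write happen under the same lock as liveChange, so a live
     * change cannot slip in between them.
     */
    public synchronized boolean reindexIfUntouched(String id, Runnable rewrite) {
        if (touched.contains(id)) {
            return false; // a newer version (or a delete) already went through
        }
        rewrite.run();
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the index: id -> "content/schema".
        Map<String, String> index = new TreeMap<>();
        index.put("a", "a@v1/old-schema");
        index.put("b", "b@v1/old-schema");

        // Stand-in for the point-in-time reader taken when the reindex starts.
        Map<String, String> snapshot = new TreeMap<>(index);

        ReindexTracker tracker = new ReindexTracker();
        tracker.start();

        // A live update to "b" arrives mid-reindex, already in the new schema.
        tracker.liveChange("b", () -> index.put("b", "b@v2/new-schema"));

        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            String id = e.getKey();
            String rewritten = e.getValue().replace("old-schema", "new-schema");
            tracker.reindexIfUntouched(id, () -> index.put(id, rewritten));
        }
        tracker.finish();

        System.out.println(index); // {a=a@v1/new-schema, b=b@v2/new-schema}
    }
}
```

With Lucene, the two Runnables would presumably wrap calls such as
IndexWriter.updateDocument or IndexWriter.deleteDocuments. A per-id or
striped lock would reduce contention if the single lock becomes a bottleneck.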



(In reply to Adrien Grand's message of Mon, Oct 21, 2019, 4:58 PM.)