Updating specific fields of huge docs

Updating specific fields of huge docs

Luís Filipe Nassif
Hi all,

As I understand it, Lucene 7 still deletes and re-adds docs when an update
operation is performed.

When docs have dozens of fields and one of them is large text content
(extracted by Tika), and I need to update some other small fields, what
is the best approach to avoid reindexing that large text field?

Is there any better way than splitting the index in two (metadata and text
indexes) and using ParallelCompositeReader for searches?

Thanks in advance,
Luis
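For reference, the split-index approach mentioned above would look roughly like this with Lucene's ParallelCompositeReader (the index paths "meta-index" and "text-index" and the class name are placeholders, not part of the original question). Keeping the two indexes aligned by docID is the hard part: they must contain the same documents, in the same order, with matching segment structure.

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.ParallelCompositeReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class SplitIndexSearch {
    public static IndexSearcher open() throws Exception {
        // One index holds the small metadata fields, the other the large
        // Tika-extracted text. ParallelCompositeReader zips them together
        // by docID, so both must have identical doc ordering and segments.
        DirectoryReader meta = DirectoryReader.open(FSDirectory.open(Paths.get("meta-index")));
        DirectoryReader text = DirectoryReader.open(FSDirectory.open(Paths.get("text-index")));
        return new IndexSearcher(new ParallelCompositeReader(meta, text));
    }
}
```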


Re: Updating specific fields of huge docs

Erick Erickson
If (and only if) the fields you need to update are single-valued,
docValues=true, indexed=false, you can do in-place update of the DV
field only.
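In the Lucene API, this kind of in-place docValues update corresponds to IndexWriter.updateNumericDocValue (and updateBinaryDocValue). A minimal sketch, assuming hypothetical field names "id" and "reviewed" (not from the thread); the updated field must be single-valued and docValues-only:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class InPlaceUpdate {
    // Rewrites only the docValues for "reviewed" on docs matching id:docId;
    // the large stored/indexed text field is left untouched on disk.
    static void markReviewed(IndexWriter writer, String docId) throws IOException {
        writer.updateNumericDocValue(new Term("id", docId), "reviewed", 1L);
    }
}
```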

Otherwise, you'll probably have to split the docs up. The question is
whether you have evidence that reindexing is too expensive.

If you do need to split the docs up, you might find some of the
streaming capabilities useful for join kinds of operations, if other
join options don't work out or you just prefer the streaming
alternative.

Best,
Erick

On Wed, Feb 13, 2019 at 11:43 AM Luís Filipe Nassif <[hidden email]> wrote:

> ...



Re: Updating specific fields of huge docs

Luís Filipe Nassif
Thank you, Erick.

Unfortunately we need to index those fields.

Currently we do not store the text because of storage requirements, and it
is slow to extract it again.

Thank you for the tips.
Luis

On Wed, Feb 13, 2019 at 6:13 PM, Erick Erickson <[hidden email]> wrote:

> ...

Re: Updating specific fields of huge docs

Marcio Napoli
Hi Luís,

If the contents of the files don't change, one solution is to store the text
parsed by Tika in compressed form (roughly 7% of the extracted text size).
When updating a document, just fetch the old one with its content already
extracted (and compressed) and update the other fields you need.
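This store-compressed idea can be sketched with the JDK's built-in GZIP streams (the class and method names below are illustrative, not from the thread). The compressed bytes would live in a stored-only binary field and be copied over verbatim when the doc is rewritten during an update, so Tika never has to run again:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TextCompression {
    // Compress extracted text so it can be kept in a stored-only field.
    public static byte[] compress(byte[] text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text);
        }
        return bos.toByteArray();
    }

    // Decompress only when the original text is actually needed again.
    public static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }
}
```

The exact ratio depends on the text; natural-language extracts with lots of repetition compress far better than already-compressed or binary-ish content.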

Best,
Marcio

http://www.neoco.com.br


On Thu, Feb 14, 2019 at 3:09 PM, Luís Filipe Nassif <[hidden email]> wrote:

> ...