[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891256#action_12891256 ]

Michael McCandless commented on LUCENE-2324:

bq. I still see usage of docStoreOffset, but aren't we doing away with shared doc stores with the cutover to DWPT?

Do we want all segments that one DWPT writes to share the same
doc store, i.e. one doc store per DWPT, or remove doc stores entirely?

Oh good question... a single DWPT can in fact continue to share doc
store across the segments it flushes.

Hmm, but... this opto only helps in that we don't have to merge the
doc stores if we merge segments that already share their doc stores.
But if (say) I have 2 threads indexing, and I'm indexing lots of docs
and each DWPT has written 5 segments, we will then merge these 10
segments, and must merge the doc stores at that point.  So the sharing
isn't really buying us much (just not closing old files & opening new
ones, which is presumably negligible)?

bq. I think you can further simplify DocumentsWriterPerThread.DocWriter; in fact I think you can remove it & all subclasses in consumers!

I agree! Now that a high number of testcases pass it's less scary
to modify even more code; will do this next.

bq. Also, we don't need separate closeDocStore; it should just be closed during flush.

OK sounds good.

Super :)

bq. I like the ThreadAffinityDocumentsWriterThreadPool; it's the default, right (I see some tests explicitly setting it on IWC; not sure why)?

It's actually only TestStressIndexing2 and it sets it to use a different
number of max thread states than the default.

Ahh OK great.

bq. We should make the in-RAM deletes impl somehow pluggable?

Do you mean so that it's customizable how deletes are handled?

Actually I was worried about the long[] sequenceIDs (adding 8 bytes
RAM per buffered doc) -- this could be a biggish hit to RAM efficiency
for small docs.
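
To put that concern in numbers, here is a rough back-of-the-envelope sketch. The class and method names are hypothetical (nothing like this exists in the patch), and the doc counts and per-doc sizes are illustrative assumptions, not measurements from the branch.

```java
// Hypothetical helper, not part of Lucene: estimates what a long[]
// holding one sequence ID per buffered document costs in RAM.
final class SequenceIdOverhead {
    // One long per buffered doc is 8 bytes (array header ignored).
    static long overheadBytes(long bufferedDocs) {
        return bufferedDocs * Long.BYTES;
    }

    // Share of per-doc RAM spent on the sequence ID, given an assumed
    // average buffered size per doc that excludes the ID itself.
    static double overheadFraction(long avgDocBytes) {
        return (double) Long.BYTES / (avgDocBytes + Long.BYTES);
    }
}
```

Under these assumptions, 1,000,000 buffered docs carry 8,000,000 bytes of sequence IDs, and a tiny doc that otherwise needs only 24 bytes of buffer would spend a quarter of its RAM on the ID.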

{quote}
E.g. doing live deletes vs. lazy deletes on flush?
I think that's a good idea. E.g. at Twitter we'll do live deletes always
to get the lowest latency (and we don't have too many deletes),
but that's probably not the best default for everyone.
So I agree that making this customizable is a good idea.
{quote}

Yeah, this too :)

Actually deletions today are not applied on flush -- they continue to
be buffered beyond flush, and then get applied just before a merge
kicks off.  I think we should keep this (as an option and probably as
the default) -- it's important for apps w/ large indices that don't use
NRT (and don't pool readers) because it's costly to open readers.

So it sounds like we should support "lazy" (apply-before-merge like
today) and "live" (live means resolve deleted Term/Query -> docID(s)
synchronously inside deleteDocuments, right?).

Live should also be less performant because of less temporal locality
(vs lazy).
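
A minimal sketch of what such a pluggable choice might look like, using a toy in-memory segment rather than any real Lucene class. Every name here (ToySegment, DeletePolicy, LiveDeletes, LazyDeletes) is hypothetical. The lazy variant records how many docs existed when the delete arrived so that docs added later are not affected, which is an assumption about the intended semantics.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a segment: doc i carries the single term ids.get(i).
final class ToySegment {
    final List<String> ids = new ArrayList<>();
    final BitSet deleted = new BitSet();

    int add(String id) { ids.add(id); return ids.size() - 1; }
    int size() { return ids.size(); }

    // Resolve a term to docIDs below upTo and mark them deleted.
    void markDeleted(String term, int upTo) {
        for (int doc = 0; doc < upTo; doc++) {
            if (ids.get(doc).equals(term)) deleted.set(doc);
        }
    }
}

// Hypothetical plug-in point for how deletes are handled.
interface DeletePolicy {
    void delete(ToySegment seg, String term);
    void beforeMerge(ToySegment seg);
}

// "Live": resolve Term -> docID(s) synchronously inside delete().
final class LiveDeletes implements DeletePolicy {
    public void delete(ToySegment seg, String term) {
        seg.markDeleted(term, seg.size());
    }
    public void beforeMerge(ToySegment seg) { /* nothing buffered */ }
}

// "Lazy": buffer term -> doc count at delete time, apply before merge.
final class LazyDeletes implements DeletePolicy {
    private final Map<String, Integer> buffered = new LinkedHashMap<>();

    public void delete(ToySegment seg, String term) {
        buffered.put(term, seg.size()); // a later delete raises the limit
    }
    public void beforeMerge(ToySegment seg) {
        for (Map.Entry<String, Integer> e : buffered.entrySet()) {
            seg.markDeleted(e.getKey(), e.getValue());
        }
        buffered.clear();
    }
}
```

In this toy, the live policy pays the lookup cost on every delete call, while the lazy policy resolves all buffered terms in one batch just before the merge.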

It'd also be nice to have a more efficient data structure to buffer the
deletes. With many buffered deletes the java hashmap approach
will not be very efficient. Terms could be written into a byte pool,
but what should we do with queries?

I agree w/ Yonik: let's worry only about delete by Term (not Query)
for now.

Maybe we could reuse (factor out) TermsHashPerField's custom hash
here, for the buffered Terms?  It efficiently maps a BytesRef --> int.
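
To illustrate the general idea (not TermsHashPerField's actual code), here is a minimal sketch of a bytes-to-int hash that keeps all term bytes in one shared byte pool and uses open addressing instead of a java.util.HashMap. The class name TermIdHash and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: maps term bytes to dense int ids without
// allocating one object per term.
final class TermIdHash {
    private byte[] pool = new byte[64];  // all term bytes, back to back
    private int poolUsed = 0;
    private int[] starts = new int[8];   // id -> start offset in pool
    private int[] lengths = new int[8];  // id -> length in bytes
    private int[] table = new int[16];   // slot -> id, or -1 if empty
    private int count = 0;

    TermIdHash() { Arrays.fill(table, -1); }

    int size() { return count; }

    int addOrGet(String term) {
        return addOrGet(term.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the existing id for these bytes, or assigns the next id.
    int addOrGet(byte[] term) {
        int mask = table.length - 1;
        int slot = hash(term, 0, term.length) & mask;
        while (table[slot] != -1) {
            if (sameBytes(table[slot], term)) return table[slot];
            slot = (slot + 1) & mask; // linear probing
        }
        int id = count++;
        if (id == starts.length) {
            starts = Arrays.copyOf(starts, id * 2);
            lengths = Arrays.copyOf(lengths, id * 2);
        }
        if (poolUsed + term.length > pool.length) {
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, poolUsed + term.length));
        }
        System.arraycopy(term, 0, pool, poolUsed, term.length);
        starts[id] = poolUsed;
        lengths[id] = term.length;
        poolUsed += term.length;
        table[slot] = id;
        if (count * 2 > table.length) rehash(); // keep load at or below 1/2
        return id;
    }

    private boolean sameBytes(int id, byte[] term) {
        if (lengths[id] != term.length) return false;
        for (int i = 0; i < term.length; i++) {
            if (pool[starts[id] + i] != term[i]) return false;
        }
        return true;
    }

    private static int hash(byte[] b, int off, int len) {
        int h = 0;
        for (int i = 0; i < len; i++) h = 31 * h + b[off + i];
        return h;
    }

    // Double the table and reinsert every id using its pooled bytes.
    private void rehash() {
        int[] bigger = new int[table.length * 2];
        Arrays.fill(bigger, -1);
        int mask = bigger.length - 1;
        for (int id = 0; id < count; id++) {
            int slot = hash(pool, starts[id], lengths[id]) & mask;
            while (bigger[slot] != -1) slot = (slot + 1) & mask;
            bigger[slot] = id;
        }
        table = bigger;
    }
}
```

The returned int could then index parallel arrays (for example, a per-term doc-count limit) in place of boxed map values; that pairing is an assumption about how buffered delete Terms might use it.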

Another thing: it looks like finishFlushedSegment is sync'd on the IW
instance, but, it need not be sync'd for all of that?  EG
readerPool.get(), applyDeletes, building the CFS, may not need to be
inside the sync block?
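
The general pattern being suggested can be sketched as follows: do the expensive per-segment work without the lock, then take the lock only to publish the result into shared state. SegmentPublisher and its methods are hypothetical stand-ins, not IndexWriter code, and whether each of the real steps can safely move outside the sync block depends on invariants not established in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of narrowing a synchronized region.
final class SegmentPublisher {
    // Shared state, guarded by the lock on this instance.
    private final List<String> segmentInfos = new ArrayList<>();

    // Stand-in for expensive work that touches only the private
    // segment, such as building a compound file.
    private static String buildCompoundFile(String segmentName) {
        return segmentName + ".cfs";
    }

    void finishFlushedSegment(String segmentName) {
        // Expensive work runs with no lock held, so other threads
        // can flush their own segments concurrently.
        String cfs = buildCompoundFile(segmentName);

        // Only the update of shared state needs the lock.
        synchronized (this) {
            segmentInfos.add(cfs);
        }
    }

    synchronized List<String> snapshot() {
        return new ArrayList<>(segmentInfos);
    }
}
```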

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Realtime Branch
>         Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
