custom segment files

John Wang-9
Hi guys:

     I am trying to figure out how to add the ability to create custom segment files. Ideally there would be a plugin framework where one can provide a callback that adds to a segment file given a doc, along with the corresponding merge logic. This is in light of the flexible indexing effort.

     After digging through the latest trunk code in that area, I see a Writer/WriterPerThread pattern for the different types of segment files, e.g. stored data, norms, inverted docs, etc.

     Do you think it is a good idea to consolidate them? Are there intricacies, such as cross dependencies between the different types of writers?

     Merge logic seems to live in the SegmentMerger class. To support this, it would be good to split it out per writer type.

     I am still trying to understand the code, so any help is greatly appreciated.

Thoughts?

Thanks

-John

Re: custom segment files

Michael McCandless-2
I'm actively working on LUCENE-1458, to enable different codecs for
reading/writing the terms dict and doc/freq/prox/payload postings.
I'm working now towards getting PforDelta working...

However, that change doesn't [yet] do anything for norms, stored
fields or term vectors.

Can you describe more details about what kinds of customization you're
looking to do?

Mike

On Thu, Sep 17, 2009 at 10:00 AM, John Wang <[hidden email]> wrote:

> Hi guys:
>
>      I am trying to figure how to add the ability to create custom segment
> files. Hopefully it is possible to create a plugin framework where one can
> provide some sort of callback to add to a segment given a doc and provide
> some sort of merge logic. This is in light of the flexible indexing effort.
>
>      After digging thru the latest trunk code in that area, I see a
> Writer/WriterPerThread pattern for different types of segment files, e.g.
> Stored data, norms, inverted doc, etc.
>
>      Do you think it is a good idea to consolidate them? Are there
> intricacies where there are cross dependency between different types of
> writers?
>
>      Merge logic seems to be in the SegmentMerger class. Seems to do this,
> it would be good to separate it out to per writer type.
>
>       I am still trying to understand the code, any help is greatly
> appreciated.
>
> Thoughts?
>
> Thanks
>
> -John
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: custom segment files

John Wang-9
Sure.

A simple example:

Say you have a type of field with fixed-length data per doc, e.g. 8 bytes. It might be good to store it in a segment file as:
<numdocs><v1><v2>....<vn>

So if you have 1000 docs, your segment file is 8000 + 4 bytes (8 bytes per doc plus a 4-byte doc count).

Merging would be rather trivial as well.
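
To make the layout concrete, here is a minimal sketch in Java, assuming a 4-byte doc count followed by one 8-byte value per doc. The class `FixedWidthColumn` and its methods are hypothetical and not part of Lucene. The merge simply concatenates values in segment order and ignores deleted docs, which a real merge would have to drop.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-width per-doc column: <numdocs:int32><v1:int64>...<vn:int64>. */
final class FixedWidthColumn {
    static final int HEADER_BYTES = Integer.BYTES; // 4-byte doc count
    static final int VALUE_BYTES = Long.BYTES;     // 8 bytes per doc

    /** Serializes one value per doc, indexed by doc id. */
    static byte[] encode(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + VALUE_BYTES * values.length);
        buf.putInt(values.length);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    /** Reads the doc count from the header. */
    static int numDocs(byte[] file) {
        return ByteBuffer.wrap(file).getInt(0);
    }

    /** Random access by doc id: a single read at a fixed offset. */
    static long get(byte[] file, int docId) {
        return ByteBuffer.wrap(file).getLong(HEADER_BYTES + VALUE_BYTES * docId);
    }

    /** Merges segments in order by summing the counts and concatenating the values. */
    static byte[] merge(byte[]... segments) {
        int total = 0;
        for (byte[] seg : segments) {
            total += numDocs(seg);
        }
        ByteBuffer out = ByteBuffer.allocate(HEADER_BYTES + VALUE_BYTES * total);
        out.putInt(total);
        for (byte[] seg : segments) {
            // Skip each source header and copy only the values.
            out.put(seg, HEADER_BYTES, seg.length - HEADER_BYTES);
        }
        return out.array();
    }
}
```

With this layout, `encode(new long[1000])` produces 8004 bytes, and a lookup needs no per-doc parsing beyond one fixed-offset read.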

Doing this right now involves storing the value in a payload, which pays the cost of parsing a byte[] into, say, a long for every doc.

I think this problem is orthogonal to 1458.

There are other use cases, so I thought it might be a good idea to abstract this out, since at a high level they all look rather similar:

start
write per doc
end
merge
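
The four steps above could be expressed as a callback interface along these lines. This is only a sketch of the shape such a plugin might take: `SegmentFilePlugin`, `InMemoryLongColumn` and the `values()` accessor are hypothetical names, nothing here exists in Lucene, and the data is kept in memory rather than written to a file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical per-segment plugin lifecycle: start, write per doc, end, merge. */
interface SegmentFilePlugin<T> {
    void start(String segmentName);

    void writeDoc(int docId, T value);

    void end();

    /** Values in doc-id order, readable once end() has been called. */
    List<T> values();

    /** Default merge: replay the source segments, in order, into this one. */
    default void merge(String mergedName, List<? extends SegmentFilePlugin<T>> sources) {
        start(mergedName);
        int docId = 0;
        for (SegmentFilePlugin<T> src : sources) {
            for (T v : src.values()) {
                writeDoc(docId++, v);
            }
        }
        end();
    }
}

/** In-memory stand-in for a real segment file holding one long per doc. */
final class InMemoryLongColumn implements SegmentFilePlugin<Long> {
    private final List<Long> values = new ArrayList<>();
    private boolean open;

    @Override
    public void start(String segmentName) {
        values.clear();
        open = true;
    }

    @Override
    public void writeDoc(int docId, Long value) {
        if (!open) {
            throw new IllegalStateException("start() has not been called");
        }
        if (docId != values.size()) {
            throw new IllegalArgumentException("doc ids must be dense and in order");
        }
        values.add(value);
    }

    @Override
    public void end() {
        open = false;
    }

    @Override
    public List<Long> values() {
        return Collections.unmodifiableList(values);
    }
}
```

Putting a default `merge` on the interface is one possible design choice: simple formats get merging for free, while a format that can copy bytes in bulk could override it.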

Hopefully I am describing it clearly.

Thanks

-John


On Thu, Sep 17, 2009 at 10:35 PM, Michael McCandless <[hidden email]> wrote:
I'm actively working on LUCENE-1458, to enable different codecs for
reading/writing the terms dict and doc/freq/prox/payload postings.
I'm working now towards getting PforDelta working...

However, that change doesn't [yet] do anything for norms, stored
fields or term vectors.

Can you describe more details about what kinds of customization you're
looking to do?

Mike

Re: custom segment files

Jason Rutherglen
In reply to this post by John Wang-9
I believe you could override the IW.flush and IW.mergeSuccess
methods. flush unfortunately doesn't expose the new SegmentInfo;
however, it could be obtained via
IW.getReader().getSequentialSubReaders (by comparing the before
and after).

Adjacent segment files could then be maintained without hacking into
SegmentMerger.
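
The before/after comparison suggested above can be sketched as a set difference over segment names. This is only an illustration on plain strings: `SegmentDiff` and its methods are hypothetical, and the step that collects the names from the sub-readers is left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: compares two snapshots of segment names. */
final class SegmentDiff {
    /** Names present in 'after' but not in 'before', e.g. freshly flushed or merged segments. */
    static Set<String> added(List<String> before, List<String> after) {
        Set<String> result = new LinkedHashSet<>(after);
        result.removeAll(before);
        return result;
    }

    /** Names present in 'before' but not in 'after', e.g. segments consumed by a merge. */
    static Set<String> removed(List<String> before, List<String> after) {
        Set<String> result = new LinkedHashSet<>(before);
        result.removeAll(after);
        return result;
    }
}
```

After a flush, `added` would hold the one new segment that needs an adjacent file. After a merge, `removed` would list the segments whose adjacent files have to be combined into the file for the segment in `added`.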

Re: custom segment files

John Wang-9
Hi Michael:

     Is there a wiki or some sort of write-up on LUCENE-1458? It looks extremely cool!

Re: Jason: isn't flush final?

-John

On Fri, Sep 18, 2009 at 9:09 AM, Jason Rutherglen <[hidden email]> wrote:
I believe you could override the IW.flush and IW.mergeSuccess
methods. flush unfortunately doesn't expose the new SegmentInfo;
however, it could be obtained via
IW.getReader().getSequentialSubReaders (by comparing the before
and after).

Adjacent segment files could then be maintained without hacking into
SegmentMerger.

Re: custom segment files

Marvin Humphrey
In reply to this post by John Wang-9
On Fri, Sep 18, 2009 at 08:14:24AM +0800, John Wang wrote:

> Say you have a type of field with fixed length data per doc, e.g. a 8 bytes.
> It might be good to store in a segment:
> <numdocs><v1><v2>....<vn>

Heh.  You've just described this proof of concept class:

    http://www.rectangular.com/kinosearch/docs/devel/KSx/Index/ByteBufDocWriter.html
    http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KSx/Index/ByteBufDocWriter.pm

> Hopefully I am describing it clearly.

Sure, I understand exactly what you mean.

Marvin Humphrey


Re: custom segment files

Jason Rutherglen
In reply to this post by John Wang-9
Yes. I guess you could branch the code?  It probably doesn't need to
be final, Mike?

On Thu, Sep 17, 2009 at 7:16 PM, John Wang <[hidden email]> wrote:

> Hi Michael:
>
>      Is there a wiki or some sort of write up on LUCENE-1458? It looks
> extremely cool!
>
> Re: Jason: isn't flush final?
>
> -John

Re: custom segment files

Earwin Burrfoot
In reply to this post by Michael McCandless-2
I bet custom per-segment files could very well be used for the
per-segment userdata/debuginfo we introduced earlier.
With them, it could be stored neatly in a separate file instead of
being grafted onto the current ones.

--
Kirill Zakharenko/Кирилл Захаренко ([hidden email])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

Re: custom segment files

Michael McCandless-2
In reply to this post by John Wang-9
> Say you have a type of field with fixed length data per doc, e.g. a
> 8 bytes.

OK this makes sense -- thanks for the example!  This sounds like
getting column-stride-fields before that feature is added to Lucene
"for real".

For flushing, you can plug your own indexing chain into IndexWriter.
This (customizing what's indexed per-doc and what's written for the
new segment) is exactly what the pluggable indexing chain is for.
BUT: this API is still very experimental and package-private.

I suppose, for looser integration we could add a hook that's called in
IndexWriter giving you a chance to do something at flush.
Hmm... actually could you use doAfterFlush()?
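
The "hook called at flush" idea can be sketched without Lucene as a final flush method that calls an overridable no-op. `SketchWriter` below is a hypothetical stand-in for IndexWriter, and `SideFileWriter` with its extra `addDocument` overload is likewise invented for illustration. Only the hook pattern is modelled, not real flushing, threading or segment naming.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for IndexWriter; only the hook pattern is modelled. */
class SketchWriter {
    private int bufferedDocs;

    public void addDocument(String doc) {
        bufferedDocs++;
    }

    /** Final, so subclasses customize via the hook below rather than by overriding. */
    public final void flush() {
        bufferedDocs = 0; // pretend the buffered docs became a new segment
        doAfterFlush();
    }

    /** Hook called after every flush; does nothing by default. */
    protected void doAfterFlush() {
    }
}

/** Keeps one long per doc on the side and "writes" it whenever the writer flushes. */
class SideFileWriter extends SketchWriter {
    private final List<Long> pending = new ArrayList<>();
    final List<List<Long>> flushedFiles = new ArrayList<>();

    public void addDocument(String doc, long value) {
        super.addDocument(doc);
        pending.add(value);
    }

    @Override
    protected void doAfterFlush() {
        // One side file per flushed segment, holding the values buffered since the last flush.
        flushedFiles.add(new ArrayList<>(pending));
        pending.clear();
    }
}
```

Because the hook receives no arguments, the subclass has to track its own pending per-doc data, which mirrors the concern raised earlier in the thread that flush does not expose the new segment.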

Merging, however, doesn't yet have hooks / pluggability in place to do
something custom, and I agree it's sorely needed.  Patches very
welcome here!

This could enable "loose" customization on what's flushed and how it's
merged, and you'd have to make your own reader external to Lucene.

LUCENE-1458 is aiming to cover this sort of use case, but in a more
tightly integrated way.  E.g. the new enumeration API in LUCENE-1458 (to
replace TermEnum, TermDocs, TermPositions) is based on AttributeSource
so that you could add your own attribute at the field, term, doc or
positions level.  However, I haven't explored this at all yet, and e.g.
customizable merging is not done.

> It [flush] probably doesn't need to be final Mike?

I agree.  Wanna include un-final'ing it in a patch?

> Is there a wiki or some sort of write up on LUCENE-1458?

Sorry, not just yet.  I agree it's badly needed... it's an enormous set
of changes at this point.  I'll add a wiki page that I'll try to keep
current as the design iterates.

Mike

On Thu, Sep 17, 2009 at 8:14 PM, John Wang <[hidden email]> wrote:

> Sure.
>
> A simple example:
>
> Say you have a type of field with fixed length data per doc, e.g. a 8 bytes.
> It might be good to store in a segment:
> <numdocs><v1><v2>....<vn>
>
> so if you have 1000 docs, your seg file is 8k+4 bytes.
>
> Merging would be rather trivial as well.
>
> Doing this right now involves storing into payload, which pays a cost of
> parsing byte[] to say a long per doc.
>
> I think this problem is orthogonal to 1458.
>
> There are other usecases, so I thought it might be a good idea to abstract
> it out, since on a high level it is rather similar:
>
> start
> write per doc
> end
> merge
>
> Hopefully I am describing it clearly.
>
> Thanks
>
> -John

Re: custom segment files

John Wang-9
Thank you very much Michael for the information!

-John

On Fri, Sep 18, 2009 at 6:01 PM, Michael McCandless <[hidden email]> wrote:
> Say you have a type of field with fixed length data per doc, e.g. a
> 8 bytes.

OK this makes sense -- thanks for the example!  This sounds like
getting column-stride-fields before that feature is added to Lucene
"for real".

For flushing, you can plug your own indexing chain into IndexWriter.
This (customizing what's indexed per-doc and what's written for the
new segment) is exactly what the pluggable indexing chain is for.
BUT: this API is still very experimental and package-private.

I suppose, for looser integration we could add a hook that's called in
IndexWriter giving you a chance to do something at flush.
Hmm... actually could you use doAfterFlush()?

Merging, however, doesn't yet have hooks / pluggability in place to do
something custom, and I agree it's sorely needed.  Patches very
welcome here!

This could enable "loose" customization on what's flushed and how it's
merged, and you'd have to make your own reader external to Lucene.

LUCENE-1458 is aiming to cover this sort of use case, but in a more
tightly integrated way.  E.g. the new enumeration API in LUCENE-1458 (to
replace TermEnum, TermDocs, TermPositions) is based on AttributeSource
so that you could add your own attribute at the field, term, doc or
positions level.  However, I haven't explored this at all yet, and e.g.
customizable merging is not done.

> It [flush] probably doesn't need to be final Mike?

I agree.  Wanna include un-final'ing it in a patch?

> Is there a wiki or some sort of write up on LUCENE-1458?

Sorry, not just yet.  I agree it's badly needed... it's an enormous set
of changes at this point.  I'll add a wiki page that I'll try to keep
current as the design iterates.

Mike
