Merge information in segment files

Merge information in segment files

Alan Woodward
Hi all,

Is there any way of finding out if a segment is the result of a merge, or if it's just new data?  I can't find anything in SegmentInfo that records this - if it isn't there, I'll open a JIRA.

Here's the use case:  I need to reload ExternalFileField data when segments are merged, as the internal docids will all have changed, invalidating the EFF caches.  However, new segments can just use default values (the EFF is used to store things like click rates, which are all zero for new data).  At the moment, caches are refreshed after every commit.  But cache reloading is heavy - if we can restrict it to only reload after a merge, then we save a lot of wasted CPU and IO cycles.

Thanks,
Alan Woodward
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Merge information in segment files

Michael McCandless-2
We do actually record this, in the segment's "diagnostics" field ...
but that format is something that can suddenly "change" (i.e. it's not
an API w/ back compat).

Mike McCandless

http://blog.mikemccandless.com
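
A minimal sketch of what checking the "diagnostics" field mentioned above might look like. It works on a plain Map<String, String> so that it runs without Lucene on the classpath. The key "source" and the values "merge" and "flush" are assumptions about what Lucene 4.x writes into the diagnostics, and as noted above that format carries no back-compat guarantee, so check it against the version in use. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class SegmentSourceSketch {
    // Assumed diagnostics key and value; this is not a stable API, so
    // verify both against the Lucene version you actually run.
    static final String SOURCE_KEY = "source";
    static final String SOURCE_MERGE = "merge";

    /** Returns true if the diagnostics map says the segment came from a merge. */
    static boolean isMerged(Map<String, String> diagnostics) {
        if (diagnostics == null) {
            return false; // nothing recorded: treat the segment as not merged
        }
        return SOURCE_MERGE.equals(diagnostics.get(SOURCE_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> flushed = new HashMap<>();
        flushed.put(SOURCE_KEY, "flush");

        Map<String, String> merged = new HashMap<>();
        merged.put(SOURCE_KEY, "merge");

        System.out.println("flushed segment merged? " + isMerged(flushed));
        System.out.println("merged segment merged?  " + isMerged(merged));
    }
}
```

With a real index, the map would presumably come from the segment's SegmentInfo rather than being built by hand.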


Re: Merge information in segment files

Alan Woodward
Do you think it's worth promoting to a first-class API?  Just a boolean - isMerged(), or something.


Re: Merge information in segment files

Michael McCandless-2
On Fri, Nov 16, 2012 at 7:17 AM, Alan Woodward
<[hidden email]> wrote:

> Do you think it's worth promoting to a first-class API?  Just a boolean - isMerged(), or something.

I'm a little bit nervous about that ... ie it's revealing something of
Lucene's internals?

For example, long ago Lucene used to write each document as a single
segment in a RAMDir and then merge segments (still in RAMDir) and then
eventually flush them.  (The code was WONDERFULLY simple/elegant
compared to what we have today :) )

In that world, technically that flushed segment was "merged", but for
your use case I think you would want to treat it as not merged?

We could go back to doing something like this with IW some day ... it
can result in more efficient RAM usage since a written segment is much
more compact than the in-memory postings data structures... and then
what should we return for isMerged?

Could you instead wrap the MergeScheduler and note when merges had completed?

Also: is this because ExternalFileField is used on the top-level
reader?  If it's per segment it seems like you wouldn't need to track
this?

Mike McCandless

http://blog.mikemccandless.com
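
The suggestion above of wrapping the MergeScheduler and noting when merges have completed could be sketched roughly as follows. The Scheduler interface here is a hypothetical stand-in, not Lucene's MergeScheduler API, and all class and method names are invented for illustration. The only point is the wrapping pattern: delegate the merge, set a flag once it has finished, and let the cache reloader consume that flag after each commit.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class MergeTrackingSketch {
    /** Hypothetical stand-in for a merge scheduler; NOT Lucene's MergeScheduler API. */
    interface Scheduler {
        void runMerge(Runnable merge);
    }

    /** Simplest possible delegate: runs each merge on the calling thread. */
    static class SerialScheduler implements Scheduler {
        public void runMerge(Runnable merge) {
            merge.run();
        }
    }

    /** Wraps another scheduler and records that a merge has completed. */
    static class TrackingScheduler implements Scheduler {
        private final Scheduler delegate;
        private final AtomicBoolean mergedSinceLastCheck = new AtomicBoolean(false);

        TrackingScheduler(Scheduler delegate) {
            this.delegate = delegate;
        }

        public void runMerge(Runnable merge) {
            // Set the flag inside the wrapped task so it is only raised after
            // the merge itself has finished, even if the delegate runs it on
            // another thread. A merge that throws does not raise the flag.
            delegate.runMerge(() -> {
                merge.run();
                mergedSinceLastCheck.set(true);
            });
        }

        /** Returns true if a merge completed since the last call, then resets the flag. */
        boolean consumeMergeFlag() {
            return mergedSinceLastCheck.getAndSet(false);
        }
    }

    public static void main(String[] args) {
        TrackingScheduler scheduler = new TrackingScheduler(new SerialScheduler());

        // Commit with no merge: the cache reload can be skipped.
        System.out.println("reload needed? " + scheduler.consumeMergeFlag());

        // Commit after a merge: docids have changed, so reload the cache.
        scheduler.runMerge(() -> { /* merge work would happen here */ });
        System.out.println("reload needed? " + scheduler.consumeMergeFlag());
    }
}
```

An atomic flag is used because merges may well run on background threads while the commit path reads the flag.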


Re: Merge information in segment files

Mikhail Khludnev
In reply to this post by Alan Woodward
Alan,

This might be an off-topic question, but why do you consider "all zero for new data" to be the major case?
I have two counterexamples:
- I have click-through data for some product, but it was out of stock when the index was built, so it was ignored when the EFF was loaded. After it is restocked in the warehouse, we index this product but give it the default click-through rank even though its rank is present in the file. But why?
- Another issue arises if you have a click-through rank keyed not by product ID (primary key) but by brand or some other field. The problem is the same: you know that D&G products are highly clicked, but you apply the default rank instead.

I agree with Michael McC that the core problem of wasting CPU & IO is the old dispute about top-level data structures vs. per-segment ones: FieldCache vs. UnInvertedField, DocSet vs. CachingWrapperFilter, Solr vs. Lucene & ElasticSearch, etc. I hope that sooner or later we will have an alternative per-segment EFF implementation and will be able to choose the trade-off for each particular case.
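
To illustrate the per-segment side of that trade-off, here is a minimal sketch of a cache that loads values once per segment instead of once per top-level reader. Everything in it is hypothetical: segments are identified by name, the loader just returns default (zero) values, and none of this is an existing Lucene or Solr API. The point is only that after a reopen, segments already in the cache are left alone and only unseen segments (new flushes or merge results) trigger a load.

```java
import java.util.HashMap;
import java.util.Map;

class PerSegmentCacheSketch {
    // Per-segment value arrays, keyed by segment name (an assumption made
    // for this sketch; a real cache would key on the segment reader).
    private final Map<String, float[]> cache = new HashMap<>();

    int loads = 0; // how many segments have had to be loaded so far

    /** Hypothetical loader; a real one would read external values for the segment's docs. */
    private float[] load(String segment, int docCount) {
        loads++;
        return new float[docCount]; // all zeros, i.e. default values
    }

    /** Called after a reopen with the current segments (name -> doc count). */
    void refresh(Map<String, Integer> currentSegments) {
        // Drop entries for segments that no longer exist, e.g. merged away.
        cache.keySet().retainAll(currentSegments.keySet());
        // Load only the segments that have not been seen before.
        for (Map.Entry<String, Integer> e : currentSegments.entrySet()) {
            if (!cache.containsKey(e.getKey())) {
                cache.put(e.getKey(), load(e.getKey(), e.getValue()));
            }
        }
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        PerSegmentCacheSketch sketch = new PerSegmentCacheSketch();

        Map<String, Integer> segments = new HashMap<>();
        segments.put("_0", 10);
        sketch.refresh(segments);           // loads _0

        segments.put("_1", 5);
        sketch.refresh(segments);           // loads only _1; _0 is reused

        Map<String, Integer> afterMerge = new HashMap<>();
        afterMerge.put("_2", 15);
        sketch.refresh(afterMerge);         // _0 and _1 dropped, _2 loaded

        System.out.println("segment loads: " + sketch.loads);
        System.out.println("cached segments: " + sketch.size());
    }
}
```

In this shape there is no need to know whether a segment came from a merge: any segment not yet cached gets loaded, and the cost is proportional to that segment alone.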


--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

[hidden email]