Continuous stream indexing and time-based segment management

mark harwood
There are a number of scenarios where Lucene might be used to index a fixed time range on a continuous stream of data e.g. a news feed.

In these scenarios I imagine the following facilities would be useful:

a) A MergePolicy that organizes content into segments on the basis of increasing time units, e.g. 5 min -> 10 min -> 1 hour -> 1 day
b) The ability to drop entire segments, e.g. the day-level segment from exactly a week ago
c) Various new analysis functions comparing term frequencies across time, e.g. discovery of "trending" topics.

I can see that a) could be implemented using a custom MergePolicy and c) can be done via existing APIs, but I'm not sure if there is currently a way to simply drop entire segments?
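
As a rough illustration of the bucketing behind (a), here is a minimal plain-Java sketch that does not touch Lucene at all. The class `TimeTiers`, its methods, and the rule for moving content to a coarser unit once its age reaches that unit are assumptions made for illustration; only the four time units come from the list above. A real MergePolicy would additionally have to guarantee that each segment holds documents from a single bucket.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

class TimeTiers {
    // Granularities named in the post, smallest first.
    static final List<Duration> TIERS = List.of(
            Duration.ofMinutes(5), Duration.ofMinutes(10),
            Duration.ofHours(1), Duration.ofDays(1));

    // Assumed rule (not from the thread): content is held at the largest
    // unit that is no longer than its age, or at 5 min if it is younger.
    static Duration tierFor(Duration age) {
        Duration chosen = TIERS.get(0);
        for (Duration tier : TIERS) {
            if (tier.compareTo(age) <= 0) {
                chosen = tier;
            }
        }
        return chosen;
    }

    // Start of the bucket that contains ts, aligned to the Unix epoch (UTC).
    static Instant bucketStart(Instant ts, Duration unit) {
        long size = unit.toMillis();
        return Instant.ofEpochMilli(Math.floorDiv(ts.toEpochMilli(), size) * size);
    }

    public static void main(String[] args) {
        Instant doc = Instant.parse("2012-06-19T18:42:00Z");
        Duration tier = tierFor(Duration.ofHours(2));
        System.out.println(tier + " bucket starting " + bucketStart(doc, tier));
    }
}
```

Under this scheme, segments that share the same tier and bucket start would be the candidates for merging into one segment.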

Anyone else had thoughts in this area?

Cheers
Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Continuous stream indexing and time-based segment management

Simon Willnauer
On Tue, Jun 19, 2012 at 6:42 PM, mark harwood <[hidden email]> wrote:
> There are a number of scenarios where Lucene might be used to index a fixed time range on a continuous stream of data e.g. a news feed.
>
> In these scenarios I imagine the following facilities would be useful:
>
> a) A MergePolicy that organizes content into segments on the basis of increasing time units, e.g. 5 min -> 10 min -> 1 hour -> 1 day
> b) The ability to drop entire segments e.g. the day-level segment from exactly a week ago

You can do that by subclassing IndexWriter (IW) and calling some
package-private APIs / members. We can certainly make that easier, but
I personally don't want to open this up as a public API. I can certainly
imagine having a protected API that allows dropping an entire segment.

simon


Re: Continuous stream indexing and time-based segment management

Simon Willnauer
On Tue, Jun 19, 2012 at 9:44 PM, Simon Willnauer
<[hidden email]> wrote:

>> c) Various new analysis functions comparing term frequencies across time, e.g. discovery of "trending" topics.
>>
>> I can see that a) could be implemented using a custom MergePolicy and c) can be done via existing APIs, but I'm not sure if there is currently a way to simply drop entire segments?
>>
>> Anyone else had thoughts in this area?

I had some ideas about adding statistics to DocValues that get created
at index time. You can already do that and expose it via Attributes;
maybe we can add some API to DocValues that you can hook into so
that you don't need to write your own DV impl.

Re: Continuous stream indexing and time-based segment management

Michael McCandless-2
In reply to this post by Simon Willnauer
If you are willing/able to close the IndexWriter, it's easy to drop
segments by reading the SegmentInfos, editing, and writing back.

Mike McCandless

http://blog.mikemccandless.com


Re: Continuous stream indexing and time-based segment management

mark harwood
In reply to this post by Simon Willnauer
> you can do that by subclassing IW and call some package private APIs /


To date I have used separate physical indexes, combining them with a MultiReader and then dropping the outdated indexes.
At least this has the benefit that a custom MergePolicy is not required to keep content from the different dates segregated.
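
The bookkeeping for that approach might look something like the plain-Java sketch below. The one-index-per-day layout, the `index-YYYY-MM-DD` naming, the class `DailyIndexRotation`, and the seven-day retention are all hypothetical choices for illustration. Opening the live indexes and combining them with a MultiReader is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class DailyIndexRotation {
    // Hypothetical naming scheme: one physical index directory per day.
    static String directoryName(LocalDate day) {
        return "index-" + day; // LocalDate prints as ISO-8601, e.g. 2012-06-19
    }

    // Days whose index is at least retentionDays old and can be dropped.
    static List<LocalDate> expired(List<LocalDate> days, LocalDate today, int retentionDays) {
        LocalDate cutoff = today.minusDays(retentionDays);
        List<LocalDate> result = new ArrayList<>();
        for (LocalDate day : days) {
            if (!day.isAfter(cutoff)) {
                result.add(day);
            }
        }
        return result;
    }

    // Days whose index should still be searched.
    static List<LocalDate> live(List<LocalDate> days, LocalDate today, int retentionDays) {
        List<LocalDate> result = new ArrayList<>(days);
        result.removeAll(expired(days, today, retentionDays));
        return result;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2012, 6, 19);
        List<LocalDate> days = List.of(today.minusDays(8), today.minusDays(7), today);
        for (LocalDate day : expired(days, today, 7)) {
            System.out.println("drop " + directoryName(day));
        }
    }
}
```

With a retention of 7, the index from exactly a week ago falls on the cutoff and is reported as expired, which matches the "day-level segment from exactly a week ago" case earlier in the thread.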

Where I saw the potential was in looking at how stream processing technologies such as S4 or Esper try to count things in time windows.
It struck me that careful organisation of Lucene segments along time units could provide an efficient means of accessing and comparing counts of many things over time.
It looked like the "Hello World" example in S4 for counting top Twitter topics instantiated a Java object per unique topic String, and that object was then responsible for maintaining the counts. This seems a fairly inefficient way of modelling things.
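
To make "comparing counts over time" concrete, here is a small plain-Java sketch that ranks terms by how much their count grew between two time windows. The maps stand in for per-window term statistics that would in practice be read from time-aligned segments. The class `TrendingTerms` and the add-one smoothed ratio are assumptions for illustration, not something proposed in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class TrendingTerms {
    // Assumed score: add-one smoothed growth ratio between two windows.
    static double score(int current, int previous) {
        return (current + 1.0) / (previous + 1.0);
    }

    // The n terms of the current window with the highest score;
    // ties are broken alphabetically so the result is deterministic.
    static List<String> top(Map<String, Integer> current,
                            Map<String, Integer> previous, int n) {
        List<String> terms = new ArrayList<>(current.keySet());
        terms.sort(Comparator
                .comparingDouble((String t) ->
                        -score(current.get(t), previous.getOrDefault(t, 0)))
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> thisHour = Map.of("lucene", 10, "solr", 4, "java", 50);
        Map<String, Integer> lastHour = Map.of("lucene", 1, "java", 48);
        System.out.println(top(thisHour, lastHour, 2));
    }
}
```

A frequent but stable term such as "java" in this sample scores close to 1, while a term that was rare in the previous window rises to the top.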

>>If you are willing/able to close the IndexWriter, it's easy to drop segments by reading the SegmentInfos, editing, and writing back.

My assumption was that ultimately that's what it comes down to - I just wonder if this is likely to be a common requirement, deserving of a supported API.


