date range query performance

date range query performance

Alok Dhir
Hi -- using solr 1.3 -- roughly 11M docs on a 64 gig 8 core machine.

Fairly simple schema -- no large text fields, standard request  
handler.  4 small facet fields.

The index is an event log -- a primary search/retrieval requirement is  
date range queries.

A simple query without a date range subquery is ridiculously fast -  
2ms.  The same query with a date range takes up to 30s (30,000ms).

Concrete example, this query just took 18s:

        instance:client\-csm.symplicity.com AND dt:[2008-10-01T04:00:00Z TO  
2008-10-30T03:59:59Z] AND label_facet:"Added to Position"

The exact same query without the date range took 2ms.

I saw a thread from Apr 2008 which explains the problem being due to  
too much precision on the DateField type, and the range expansion  
leading to far too many elements being checked.  Proposed solution  
appears to be a hack where you index date fields as strings and  
hacking together date functions to generate proper queries/format  
results.

Does this remain the recommended solution to this issue?

Thanks

---
Alok K. Dhir
Symplicity Corporation
www.symplicity.com
(703) 351-0200 x 8080
[hidden email]


Re: date range query performance

Chris Harris-2
Do you need to search down to the minutes and seconds level? If searching by
date provides sufficient granularity, for instance, you can normalize all
the time-of-day portions of the timestamps to midnight while indexing. (So
index any event happening on Oct 01, 2008 as 2008-10-01T00:00:00Z.) That
would give Solr many fewer unique timestamp values to go through.
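A minimal sketch of that normalization, done client-side before indexing (the helper name is made up for illustration):

```python
from datetime import datetime

def normalize_to_day(ts: str) -> str:
    """Truncate a Solr-format UTC timestamp to midnight, so all
    events on the same day share one indexed term."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return dt.strftime("%Y-%m-%dT00:00:00Z")

# Any event happening on Oct 01, 2008 indexes as the same value:
print(normalize_to_day("2008-10-01T14:23:45Z"))  # 2008-10-01T00:00:00Z
```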

On Wed, Oct 29, 2008 at 1:30 PM, Alok Dhir <[hidden email]> wrote:


Re: date range query performance

Alok Dhir
Well, no - we don't care so much about the seconds, but hours &  
minutes are indeed crucial.

---
Alok K. Dhir
Symplicity Corporation
www.symplicity.com
(703) 351-0200 x 8080
[hidden email]

On Oct 29, 2008, at 4:41 PM, Chris Harris wrote:



RE: date range query performance

Feak, Todd
It strikes me that removing just the seconds could very well reduce the
overhead to 1/60th of the original: a 30-second query turns into a 500ms
query. Just a rough guess, though.

-Todd




Re: date range query performance

Erick Erickson
In reply to this post by Alok Dhir
I've also seen the suggestion (more from a pure Lucene perspective) of
breaking apart your dates. Remember that the time/space issues are due to
the number of terms, so it's possible (although I haven't tried it) to
index many fewer distinct terms, e.g. by breaking your dates into some
number of fields, say

<year><month><day><hour><minute><second><millisecond>
or maybe
<YearMonthDay><HourMinute><SecondMillisecond>
or
<YearMonthDayHour><MinuteSecond><Millisecond>
or...

and then dance fancy with the queries you generate.

I haven't tried this myself, but if you do I'd really like to get an idea
of how much it improves your results. Of course I'm not real sure how to
accomplish this in Solr, but free advice is definitely worth what you pay
for it <GG>
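As a rough sketch of that splitting idea (the field names here are invented for illustration, not from Solr), the indexing side could precompute one coarse and one fine term per document:

```python
from datetime import datetime

def split_date_fields(ts: str) -> dict:
    """Break one timestamp into a coarse YearMonthDay term and a fine
    HourMinute term, so day-level range queries touch far fewer
    distinct terms than full-precision timestamps would."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "dt_ymd": dt.strftime("%Y%m%d"),  # e.g. "20081001"
        "dt_hm": dt.strftime("%H%M"),     # e.g. "0423"
    }
```

A day-level range would then query only something like dt_ymd:[20081001 TO 20081030], and only queries needing sub-day precision would touch the finer field.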

Best
Erick

On Wed, Oct 29, 2008 at 4:47 PM, Alok Dhir <[hidden email]> wrote:


Re: date range query performance

hossman
In reply to this post by Alok Dhir
: Concrete example, this query just look 18s:
:
: instance:client\-csm.symplicity.com AND dt:[2008-10-01T04:00:00Z TO
: 2008-10-30T03:59:59Z] AND label_facet:"Added to Position"

: I saw a thread from Apr 2008 which explains the problem being due to too much
: precision on the DateField type, and the range expansion leading to far too
: many elements being checked.  Proposed solution appears to be a hack where you
: index date fields as strings and hacking together date functions to generate
: proper queries/format results.

For the record, you don't need to index as a "StrField" to get this
benefit; you can still index using DateField, you just need to round your
dates to some less granular level. If you always want to round down, you
don't even need to do the rounding yourself: just add "/SECOND",
"/MINUTE", or "/HOUR" to each of your dates before sending them to Solr.
(SOLR-741 proposes adding a config option to DateField to let this be done
server side.)

Your example query seems to be happy with hour resolution, but in theory,
if you sometimes needed hour resolution when doing "big ranges" but more
precise resolution when doing "small ranges", you could have a "coarse"
date field that you round to an hour, and redundantly index the same data
in a "fine" date field with minute or second resolution.

Also: if you frequently reuse the same ranges over and over (i.e. you have
a form widget people pick from, so on any given day there are only N
discrete ranges in use), putting them in an "fq" param will let them be
cached separately from the main query string
(instance:client\-csm.symplicity.com), so different searches using the
same date ranges will be faster. Ditto for your label_facet:"Added to
Position" clause.
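As a sketch of the reworked request (the clauses come from the example above; splitting them into fq params and rounding the endpoints to the hour are the assumed changes):

```python
from urllib.parse import urlencode

# The main q carries only the per-search clause; the reusable date range
# and facet clauses go into fq params so their filters cache separately.
params = [
    ("q", r"instance:client\-csm.symplicity.com"),
    ("fq", "dt:[2008-10-01T04:00:00Z/HOUR TO 2008-10-30T04:00:00Z/HOUR]"),
    ("fq", 'label_facet:"Added to Position"'),
]
query_string = urlencode(params)
```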

-Hoss


Re: date range query performance

Alok Dhir
We have implemented the suggested reduction in granularity by dropping
time altogether and simply disallowing time filtering.  This, in light
of other search filters we have provided, should prove sufficient
for our user base.

We did keep the fine granularity field not for filtering, but for  
sorting.  We definitely need the log entries to be presented in  
chronological order, so the finer resolution date field is useful for  
that at least.

Thanks for the detailed response.

Alok

On Oct 31, 2008, at 2:16 PM, Chris Hostetter wrote:



Re: date range query performance

Michael Lackhoff-2
In reply to this post by hossman
On 31.10.2008 19:16 Chris Hostetter wrote:


Is this also possible for the timestamp that is automatically added to
all new/updated docs? I would like to be able to search (quickly) for
everything that was added within the last week or month or whatever. And
because I update the index only once a day, a granularity of /DAY (if
that exists) would be fine.

- Michael

Re: date range query performance

Erik Hatcher

On Nov 1, 2008, at 1:07 AM, Michael Lackhoff wrote:

> Is this also possible for the timestamp that is automatically added to
> all new/updated docs? I would like to be able to search (quickly) for
> everything that was added within the last week or month or whatever. And
> because I update the index only once a day, a granularity of /DAY (if
> that exists) would be fine.

Yeah, this should work fine:

    <field name="timestamp" type="date" indexed="true" stored="true"  
default="NOW/DAY" multiValued="false"/>
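With that default in place, a "show me everything added in the last week" search becomes plain Solr date math over the day-rounded field. A sketch (the field name comes from the snippet above; the helper is hypothetical):

```python
def recent_docs_filter(days: int) -> str:
    """Solr fq matching documents whose timestamp (defaulted to
    NOW/DAY at index time) falls within the last `days` days."""
    return f"timestamp:[NOW/DAY-{days}DAYS TO NOW/DAY+1DAY]"

print(recent_docs_filter(7))  # timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]
```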

        Erik


Re: date range query performance

Michael Lackhoff-2
On 01.11.2008 06:10 Erik Hatcher wrote:

> Yeah, this should work fine:
>
>     <field name="timestamp" type="date" indexed="true" stored="true"  
> default="NOW/DAY" multiValued="false"/>

Wow, that was fast, thanks!

-Michael

SOLR Performance

Alok Dhir
In reply to this post by Alok Dhir
We've moved past this issue by reducing date precision -- thanks to  
all for the help.  Now we're at another problem.

There is relatively constant updating of the index -- new log entries  
are pumped in from several applications continuously.  Obviously, new  
entries do not appear in searches until after a commit occurs.

The problem is, issuing a commit causes searches to come to a  
screeching halt for up to 2 minutes.  We're up to around 80M docs.  
Index size is 27G.  The number of docs will soon be 800M, which  
doesn't bode well for these "pauses" in search performance.

I'd appreciate any suggestions.

---
Alok K. Dhir
Symplicity Corporation
www.symplicity.com
(703) 351-0200 x 8080
[hidden email]

On Oct 29, 2008, at 4:30 PM, Alok Dhir wrote:



RE: SOLR Performance

Feak, Todd
I believe this is one of the reasons that a master/slave configuration
comes in handy. Commits to the Master don't slow down queries on the
Slave.

-Todd




Re: SOLR Performance

Alok Dhir
I was afraid of that.  Was hoping not to need another big fat box like  
this one...

---
Alok K. Dhir
Symplicity Corporation
www.symplicity.com
(703) 351-0200 x 8080
[hidden email]

On Nov 3, 2008, at 4:53 PM, Feak, Todd wrote:



Re: SOLR Performance

Walter Underwood, Netflix
The indexing box can be much smaller, especially in terms of CPU.
It just needs one fast thread and enough disk.

wunder

On 11/3/08 2:58 PM, "Alok Dhir" <[hidden email]> wrote:



Re: SOLR Performance

Alok Dhir
In terms of RAM -- how should we size that on the indexer?

---
Alok K. Dhir
Symplicity Corporation
www.symplicity.com
(703) 351-0200 x 8080
[hidden email]

On Nov 3, 2008, at 4:07 PM, Walter Underwood wrote:



Re: SOLR Performance

Otis Gospodnetic-2
That depends largely on your ramBufferSizeMB setting in solrconfig.xml and the memory you are willing to give to the JVM via -Xmx.
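For instance (the value here is illustrative only, not a recommendation), in solrconfig.xml:

```xml
<!-- Buffer this much indexed data in RAM before flushing a segment;
     larger values mean fewer, larger flushes while indexing. -->
<ramBufferSizeMB>64</ramBufferSizeMB>
```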

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Alok Dhir <[hidden email]>
> To: [hidden email]
> Sent: Monday, November 3, 2008 5:16:27 PM
> Subject: Re: SOLR Performance
>
> in terms of RAM -- how to size that on the indexer?
>
> ---
> Alok K. Dhir
> Symplicity Corporation
> www.symplicity.com
> (703) 351-0200 x 8080
> [hidden email]
>
> On Nov 3, 2008, at 4:07 PM, Walter Underwood wrote:
>
> > The indexing box can be much smaller, especially in terms of CPU.
> > It just needs one fast thread and enough disk.
> >
> > wunder
> >
> > On 11/3/08 2:58 PM, "Alok Dhir" wrote:
> >
> >> I was afraid of that.  Was hoping not to need another big fat box like
> >> this one...
> >>
> >> ---
> >> Alok K. Dhir
> >> Symplicity Corporation
> >> www.symplicity.com
> >> (703) 351-0200 x 8080
> >> [hidden email]
> >>
> >> On Nov 3, 2008, at 4:53 PM, Feak, Todd wrote:
> >>
> >>> I believe this is one of the reasons that a master/slave configuration
> >>> comes in handy. Commits to the Master don't slow down queries on the
> >>> Slave.
> >>>
> >>> -Todd
> >>>
> >>> -----Original Message-----
> >>> From: Alok Dhir [mailto:[hidden email]]
> >>> Sent: Monday, November 03, 2008 1:47 PM
> >>> To: [hidden email]
> >>> Subject: SOLR Performance
> >>>
> >>> We've moved past this issue by reducing date precision -- thanks to
> >>> all for the help.  Now we're at another problem.
> >>>
> >>> There is relatively constant updating of the index -- new log entries
> >>> are pumped in from several applications continuously.  Obviously, new
> >>> entries do not appear in searches until after a commit occurs.
> >>>
> >>> The problem is, issuing a commit causes searches to come to a
> >>> screeching halt for up to 2 minutes.  We're up to around 80M docs.
> >>> Index size is 27G.  The number of docs will soon be 800M, which
> >>> doesn't bode well for these "pauses" in search performance.
> >>>
> >>> I'd appreciate any suggestions.
> >>>
> >>> ---
> >>> Alok K. Dhir
> >>> Symplicity Corporation
> >>> www.symplicity.com
> >>> (703) 351-0200 x 8080
> >>> [hidden email]


Re: SOLR Performance

Mike Klaas
In reply to this post by Alok Dhir
If you never execute any queries, a gig should be more than enough.

Of course, I've never played around with a 0.8-billion-doc corpus on
one machine.

-Mike

On 3-Nov-08, at 2:16 PM, Alok Dhir wrote:

> in terms of RAM -- how to size that on the indexer?


RE: SOLR Performance

Lance Norskog-2
The logistics of handling giant index files hit us before search
performance did. We switched to a set of indexes running inside one server
(Tomcat) instance with the Multicore + Distributed Search tools, with a
frozen old index and a new index actively taking updates. The smaller new
index takes much less time to recover after a commit.

The Distributed Search code does not handle cases where the new and old
indexes hold different versions of the same document, so we wrote a custom
distributed search that favored the "new" index over the "old".

Lance
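The "favor the new index over the old" rule Lance describes amounts to
deduplicating by unique key while merging shard results. A minimal Python
sketch, with hypothetical names and a simplified doc shape (dicts with an
"id" key) rather than real Solr responses:

```python
def merge_results(new_docs, old_docs):
    """Merge result lists from two indexes, preferring the 'new' one.

    Hypothetical sketch: when the same unique key appears in both the
    actively updated ("new") index and the frozen ("old") index, the
    new index's version of the document wins.
    """
    seen = {doc["id"] for doc in new_docs}
    merged = list(new_docs)
    # Admit old-index docs only if the new index has no fresher version.
    merged.extend(doc for doc in old_docs if doc["id"] not in seen)
    return merged
```

A real distributed search would also re-sort the merged list by score;
this only shows the precedence rule.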

-----Original Message-----
From: Mike Klaas [mailto:[hidden email]]
Sent: Monday, November 03, 2008 4:25 PM
To: [hidden email]
Subject: Re: SOLR Performance

If you never execute any queries, a gig should be more than enough.

Of course, I've never played around with a .8 billion doc corpus on one
machine.

-Mike




RE: SOLR Performance

Feak, Todd
In reply to this post by Mike Klaas
Most desktops nowadays have at least a dual-core CPU and 1 GB of RAM, so
you may be able to get a semi-realistic feel for performance on a local
desktop. If you have access to something meaty in a desktop, you may not
have to spend a dime to find out what it's going to take in a server.

-T

-----Original Message-----
From: Mike Klaas [mailto:[hidden email]]
Sent: Monday, November 03, 2008 4:25 PM
To: [hidden email]
Subject: Re: SOLR Performance

If you never execute any queries, a gig should be more than enough.

Of course, I've never played around with a .8 billion doc corpus on  
one machine.

-Mike




Re: SOLR Performance

Walter Underwood, Netflix
In reply to this post by Lance Norskog-2
Funny, that is exactly what Infoseek did back in 1996: a big index that
changed rarely and a small index with real-time changes. Once each week,
we merged to make a new big index and started over with the small one.

You also need to handle deletes specially.

wunder
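The "handle deletes specially" point is usually solved with tombstones:
IDs deleted after the big index was frozen are tracked on the live side
and masked out of big-index hits at query time. A hypothetical sketch
(names and doc shape are illustrative, not a Solr API):

```python
def filter_deleted(frozen_docs, tombstones):
    """Drop hits from the frozen (big) index whose IDs were deleted
    since the last merge. `tombstones` is the set of deleted IDs,
    maintained alongside the small real-time index."""
    return [doc for doc in frozen_docs if doc["id"] not in tombstones]
```

At each periodic merge the tombstones are applied for real and the set
is emptied along with the small index.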

On 11/3/08 6:44 PM, "Lance Norskog" <[hidden email]> wrote:

> The logistics of handling giant index files hit us before search
> performance. We switched to a set of indexes running inside one server
> (tomcat) instance with the Multicore+Distributed Search tools, with a frozen
> old index and a new index actively taking updates. The smaller new index
> takes much less time to recover after a commit.
>
> The DS code does not handle cases where the new and old index have different
> versions of the same document. We wrote a custom distributed search that
> favored the "new" index over the "old".
>
> Lance
