Realtime search and facets with very frequent commits

Janne Majaranta
Hello,

I have a log-search-like application which requires indexed log events to be
searchable within a minute
and uses facets and the StatsComponent.

Some stats:
- The log events are indexed every 10 seconds with a "commitWithin" of 60
seconds.
- 1M events / day (~75% are updates to previous events).
- Faceting over 14 fields ( strings ). Usually TOP5 by numdocs but facets
for all 14 fields at the same time.
- Heavy use of StatsComponent ( stats over facets of ~36M documents ).
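
To make the "commitWithin" item above concrete, here is a minimal Python sketch that renders a Solr XML update message carrying a 60-second commit deadline. The document fields (id, level, message) are hypothetical and not taken from the post; only the commitWithin attribute on the add element is the point.

```python
from xml.sax.saxutils import escape, quoteattr


def build_add_message(docs, commit_within_ms=60000):
    """Render a Solr XML <add> message that asks for a commit within the deadline."""
    parts = ['<add commitWithin="%d">' % commit_within_ms]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # Escape both the attribute and the text so log content cannot break the XML.
            parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


# Hypothetical log event; a real batch would hold ~10 seconds of events.
batch = [{"id": "evt-1", "level": "ERROR", "message": "disk <full>"}]
xml_body = build_add_message(batch)
print(xml_body)
```

The resulting body would be POSTed to the instance's update handler; with commitWithin, Solr decides when to commit inside the deadline, so the client does not need to issue explicit commits.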


The application is running a single Solr instance. All updates and queries
are sent to the same instance.
Faceting and the StatsComponent are both amazingly fast with that amount of
documents *when* the caches are warm.

The problem I'm now facing is that keeping the caches warm is too heavy
compared to the frequency of updates.
It takes over 60 seconds to warm up the caches to the level where facets and
stats are returned in milliseconds.

I have tested putting a second solr instance on the same server and sending
the updates to that new instance.
Warming up the new small instance is very fast while the large instance has
very hot caches.

I also put a third (empty) solr instance on the same server which passes the
queries to the two instances with the
"shards" parameters. This is mainly because the client app really doesn't
have to know anything about the shards.
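
As a rough illustration of the three-instance layout, the following Python sketch builds a distributed query URL using the "shards" parameter. The host names, ports and the query are assumptions made up for the example. In the setup described, the shard list would presumably be configured as a default on the coordinating instance, so the client would not send it at all; it is shown explicitly here only to make the parameter visible.

```python
from urllib.parse import urlencode

# Hypothetical addresses: an empty coordinating instance plus the two real indexes.
COORDINATOR = "http://localhost:8983/solr/select"
SHARDS = [
    "localhost:8984/solr",  # large instance, hot caches, no live updates
    "localhost:8985/solr",  # small instance, receives all updates
]


def distributed_query_url(query, rows=10):
    """Build a select URL that fans the query out to every shard."""
    # Shard addresses are given without the http:// scheme, comma-separated.
    params = {"q": query, "rows": rows, "shards": ",".join(SHARDS)}
    return COORDINATOR + "?" + urlencode(params)


url = distributed_query_url("level:ERROR")
print(url)
```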

The setup was easy to configure and responses are back in milliseconds and
the updates are visible in seconds.
That is, responses in milliseconds over 40M documents and an update frequency
of 15 seconds on a single physical server.
The (lab) server has 16 GB of RAM and is running Win2k3.

Also, what I found out is that using the sharded setup I only need half the
memory for the large instance.
When indexing to the large instance the memory usage goes very fast up to
the maximum allocated heap size and never goes down.

My question is, is there a magic switch in SOLR to have that kind of update
frequency while having the caches on fire ?
Or is it just impossible to achieve facet counts and queries in milliseconds
while updating the index every minute ?

The second question is, the setup with an empty SOLR as a "coordinating"
instance, a large SOLR instance with hot caches and a small SOLR instance
with immediate updates,
all on the same physical server, does it sound like a durable solution
(until the small instance gets big) or is it something braindead ?

And the third question is, would it be a good idea to merge the small and
the large index periodically so that a fresh and empty small instance would
be available
after the merge ?

Any ideas ?

Best Regards,

Janne Majaranta

Re: Realtime search and facets with very frequent commits

Jason Rutherglen
Janne,

I usually just turn the caches nearly off for frequent commits.

Jason


Re: Realtime search and facets with very frequent commits

Janne Majaranta
Hey Jason,

Do you use faceting with frequent commits ?
And by turning off the caches you mean setting autowarmCount to zero ?

I did try to turn off autowarming with a 36M documents instance but getting
facets over those documents takes over 10 seconds.
With a warm cache it takes 200ms ...

-Janne


2010/2/11 Jason Rutherglen <[hidden email]>

> Janne,
>
> I usually just turn the caches to next to nearly off for frequent commits.
>
> Jason

Re: Realtime search and facets with very frequent commits

Yonik Seeley-2-2
On Thu, Feb 11, 2010 at 3:21 PM, Janne Majaranta
<[hidden email]> wrote:
> Hey Jason,
>
> Do you use faceting with frequent commits ?
> And by turning off the caches you mean setting autowarmcount to zero ?
>
> I did try to turn off autowarming with a 36M documents instance but getting
> facets over those documents takes over 10 seconds.
> With a warm cache it takes 200ms ...

You can turn off autowarming and do a single static warming query that
does the typical facet request.
If that takes 10 seconds to execute (and populates the caches in the
meantime), you can still commit every minute (or better, use
commitWithin when updating to prevent unnecessary commits)
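
A minimal sketch of what "a single static warming query that does the typical facet request" could contain, written as Python that assembles the request parameters. The field names are hypothetical stand-ins for the 14 string fields mentioned earlier in the thread, and the stats field is likewise an assumption. In Solr such a query is normally registered in solrconfig.xml as a newSearcher listener rather than sent by a client; the sketch only shows the parameter set.

```python
from urllib.parse import urlencode

# Hypothetical field names; the original post facets over 14 string fields.
FACET_FIELDS = ["host", "level", "source"]


def warming_params(facet_fields, stats_field=None, top_n=5):
    """Parameters for one request that touches every facet field used in production."""
    params = [
        ("q", "*:*"),      # match everything so the facet caches are filled for the whole index
        ("rows", "0"),     # no documents needed, only the counts
        ("facet", "true"),
        ("facet.limit", str(top_n)),
    ]
    params += [("facet.field", name) for name in facet_fields]
    if stats_field:
        params += [("stats", "true"), ("stats.field", stats_field)]
    return params


query_string = urlencode(warming_params(FACET_FIELDS, stats_field="duration_ms"))
print(query_string)
```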

-Yonik
http://www.lucidimagination.com

Re: Realtime search and facets with very frequent commits

Otis Gospodnetic-2
In reply to this post by Janne Majaranta
Janne,

The answers to your last 2 questions are both yes.  I've seen that done a few times and it works.  I don't have the answer to the always-hot cache question.


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/




Re: Realtime search and facets with very frequent commits

Janne Majaranta
Ok,

Thanks Yonik and Otis.
I already had static warming queries with facets turned on and autowarming
at zero.
There were a lot of other optimizations after that however, so I'll try with
zero autowarming and static warming queries again.

If that doesn't work, I'll go with 3 instances on the same server.

BTW, does it sound normal that when running updates every minute to a
36M index, the JVM uses all the available heap after about 5 commits,
although not a single query is executed against the index and autowarming
is set to zero ? Just curious.

-Janne


2010/2/11 Otis Gospodnetic <[hidden email]>

> Janne,
>
> The answers to your last 2 questions are both yes.  I've seen that done a
> few times and it works.  I don't have the answer to the always-hot cache
> question.
>
>
> Otis

Re: Realtime search and facets with very frequent commits

Dipti Khullar
Hey Janne

Can you please let me know which other optimizations you are talking about
here? In our application we commit about every 5 minutes, but the response
times are still very slow and at times there are connection timeouts
as well.

Just wanted to confirm if you have done some major configuration changes
which have proved beneficial.

Thanks
Dipti

On Fri, Feb 12, 2010 at 3:03 AM, Janne Majaranta
<[hidden email]> wrote:

> Ok,
>
> Thanks Yonik and Otis.
> I already had static warming queries with facets turned on and autowarming
> at zero.
> There were a lot of other optimizations after that however, so I'll try
> with
> zero autowarming and static warming queries again.
>
> If that doesn't work, I'll go with 3 instances on the same server.
>
> BTW, does it sound like normal that when running updates every minute to a
> 36M index it takes all the available heap size after about 5 commits
> although there is not a single query executed to the index and autowarming
> is set to zero ? Just curious.
>
> -Janne
Re: Realtime search and facets with very frequent commits

Janne Majaranta
Hey Dipti,

Basically query optimizations + setting cache sizes to a very high level.
Other than that, the config is about the same as the out-of-the-box config
that comes with the Solr download.

I haven't found a magic switch to get very fast query responses + facet
counts with the frequency of commits I'm having using one single SOLR
instance.
Adding some TOP queries for a certain type of user to the static warming queries
just shifted the cost: the time previously spent autowarming the caches was
now spent warming them with the static queries.
I've been staging a setup where there's a small solr instance receiving all
the updates and a large instance which doesn't receive the live feed of
updates.
The small index will be merged with the large index periodically (once a
week or once a month).
The two instances are seen by the client app as one instance using the
sharding features of SOLR.
The instances are running on the same server, each inside its own JVM / Jetty.
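
One possible way to script the periodic merge is Solr's CoreAdmin "mergeindexes" action. The Python sketch below only builds the request URL. It assumes the large index is exposed as a core through the CoreAdmin handler, and the admin URL, core name and index directory are all hypothetical. As far as I understand, the source index should not be written to while it is being merged, and a commit on the target is needed afterwards, so treat this as a starting point rather than a tested procedure.

```python
from urllib.parse import urlencode


def merge_indexes_url(admin_url, target_core, index_dirs):
    """Build a CoreAdmin request that merges the given index directories into target_core."""
    params = [("action", "mergeindexes"), ("core", target_core)]
    # One indexDir parameter per source index directory.
    params += [("indexDir", path) for path in index_dirs]
    return admin_url + "?" + urlencode(params)


# Hypothetical layout: the small live index is folded into the large one.
merge_url = merge_indexes_url(
    "http://localhost:8984/solr/admin/cores",
    "large",
    ["/data/solr-small/data/index"],
)
print(merge_url)
```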

In this setup the caches are very HOT for the large index and queries are
extremely fast, and the small index is small enough to get extremely fast
queries without having to warm up the caches too much.

Basically I'm able to have a commit frequency of 10 seconds in a 40M docs
index while counting TOP5 facets over 14 fields in 200ms.
In reality the commit frequency of 10 seconds comes from the fact that the
updates are going into a 1M - 2M documents index, and the fast facet counts
from the fact that the 38M documents index has hot caches and doesn't
receive any updates.

Also, not running updates to the large index means that the SOLR instance
reading the large index uses about half the memory it used before when
running the updates to the large index. At least it does so on Win2k3.

-Janne


2010/2/15 dipti khullar <[hidden email]>

> Hey Janne
>
> Can you please let me know what other optimizations are you talking about
> here. Because in our application we are committing in about 5 mins but
> still
> the response time is very low and at times there are some connection time
> outs also.
>
> Just wanted to confirm if you have done some major configuration changes
> which have proved beneficial.
>
> Thanks
> Dipti
>
>

Re: Realtime search and facets with very frequent commits

Jan Høydahl / Cominvent
Hi,

Have you tried playing with mergeFactor or even mergePolicy?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com


Re: Realtime search and facets with very frequent commits

Janne Majaranta
Hi,

Yes, I did play with mergeFactor.
I didn't play with mergePolicy.

Wouldn't that affect indexing speed and possibly memory usage ?
I don't have any problems with indexing speed ( 1000 - 2000 docs / sec via
the standard HTTP API ).

My problem is that I need very warm caches to get fast faceting, and the
autowarming of the caches takes too long compared to the frequency of
commits I'm having.
So a commit every minute means there is less than a minute to warm the caches.

To give you an idea of what kind of queries need to be autowarmed in my app,
the log events indexed as documents have timestamps with different
granularity used for faceting.
For example, to get count of logevents for every hour using faceting there's
a timestamp field with the format yyyymmddhh ( for example: 2010021808
meaning 2010-02-18 8am).
One use case is to get hourly counts over the whole index. A non-cached
query counting the hourly counts over the 40M documents index takes a
while..
And to my understanding, autowarming means that this kind of
query would basically be re-executed against a cold cache. Probably not
exactly how it works, but it "feels" like it would.
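
The hourly timestamp field described above can be derived at index time with a one-line format call. A small Python sketch, where the function name is made up for illustration:

```python
from datetime import datetime


def hour_bucket(event_time):
    """Format a timestamp as yyyymmddhh so that faceting on the field yields hourly counts."""
    return event_time.strftime("%Y%m%d%H")


# An event logged at 08:15:42 on 2010-02-18 lands in the 8am bucket.
print(hour_bucket(datetime(2010, 2, 18, 8, 15, 42)))  # 2010021808
```

Coarser or finer granularities (day, minute) would be further fields built the same way with a shorter or longer format string.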

Moving the commits to a smaller index while using sharding to have a
transparent view to the index from the client app seems to solve my problem.

I'm not sure if the (upcoming?) NRT features would keep the caches more
persistent, probably not in an environment where docs get frequent updates /
deletes.

Also, I'm closely following the Ocean Realtime Search project and its SOLR
integration. It sounds like it has the "dream features" to enable realtime
updates to the index.

-Janne


2010/2/18 Jan Høydahl / Cominvent <[hidden email]>

> Hi,
>
> Have you tried playing with mergeFactor or even mergePolicy?
>
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
>
> On 16. feb. 2010, at 08.26, Janne Majaranta wrote:
>
> > Hey Dipti,
> >
> > Basically query optimizations + setting cache sizes to a very high level.
> > Other than that, the config is about the same as the out-of-the-box
> config
> > that comes with the Solr download.
> >
> > I haven't found a magic switch to get very fast query responses + facet
> > counts with the frequency of commits I'm having using one single SOLR
> > instance.
> > Adding some TOP queries for a certain type of user to static warming
> queries
> > just moved the time of autowarming the caches to the time it took to warm
> > the caches with static queries.
> > I've been staging a setup where there's a small solr instance receiving
> all
> > the updates and a large instance which doesn't receive the live feed of
> > updates.
> > The small index will be merged with the large index periodically (once a
> > week or once a month).
> > The two instances are seen by the client app as one instance using the
> > sharding features of SOLR.
> > The instances are running on the same server inside their own JVM /
> jetty.
> >
> > In this setup the caches are very HOT for the large index and queries are
> > extremely fast, and the small index is small enough to get extremely fast
> > queries without having to warm up the caches too much.
> >
> > Basically I'm able to have a commit frequency of 10 seconds in a 40M docs
> > index while counting TOP5 facets over 14 fields in 200ms.
> > In reality the commit frequency of 10 seconds comes from the fact that
> the
> > updates are going into a 1M - 2M documents index, and the fast facet
> counts
> > from the fact that the 38M documents index has hot caches and doesn't
> > receive any updates.
> >
> > Also, not running updates to the large index means that the SOLR instance
> > reading the large index uses about half the memory it used before when
> > running the updates to the large index. At least it does so on Win2k3.
> >
> > -Janne
> >
> >
> > 2010/2/15 dipti khullar <[hidden email]>
> >
> >> Hey Janne
> >>
> >> Can you please let me know what other optimizations you are talking about
> >> here? In our application we are committing about every 5 mins, but the
> >> response time is still very slow and at times there are some connection
> >> timeouts as well.
> >>
> >> Just wanted to confirm if you have done some major configuration changes
> >> which have proved beneficial.
> >>
> >> Thanks
> >> Dipti
> >>
> >>
>
>
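
For readers who want to try the two-shard layout described above, here is a minimal Python sketch of how a client might build one distributed facet query. The `shards`, `facet`, `facet.field`, `facet.limit` and `rows` parameters are standard Solr request parameters. The host names, ports and the field names in the usage note are placeholders and do not come from the thread.

```python
from urllib.parse import urlencode

# Placeholder addresses (assumptions): the thread does not name hosts or ports.
LARGE_SHARD = "localhost:8983/solr"  # large index, no live updates, hot caches
SMALL_SHARD = "localhost:8984/solr"  # small index receiving the frequent commits
QUERY_URL = "http://localhost:8983/solr/select"  # instance that receives the query


def build_facet_query(q, facet_fields, top_n=5):
    """Return a URL for a distributed query that only asks for facet counts."""
    params = [
        ("q", q),
        ("rows", "0"),  # no documents needed, only the facet counts
        ("shards", ",".join([LARGE_SHARD, SMALL_SHARD])),
        ("facet", "true"),
        ("facet.limit", str(top_n)),  # top N values per field
    ]
    # One facet.field parameter per faceted field.
    params += [("facet.field", name) for name in facet_fields]
    return QUERY_URL + "?" + urlencode(params)
```

Calling `build_facet_query("*:*", ["host", "level"])` gives a single URL whose results Solr merges across both shards, so the client does not need to know that the index is split.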

Re: Realtime search and facets with very frequent commits

Otis Gospodnetic-2
Hi Janne,

I *think*  Ocean Realtime Search has been superseded by Lucene NRT search.

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----

> From: Janne Majaranta <[hidden email]>
> To: [hidden email]
> Sent: Thu, February 18, 2010 2:12:37 AM
> Subject: Re: Realtime search and facets with very frequent commits
>
> Hi,
>
> Yes, I did play with mergeFactor.
> I didn't play with mergePolicy.
>
> Wouldn't that affect indexing speed and possibly memory usage ?
> I don't have any problems with indexing speed ( 1000 - 2000 docs / sec via
> the standard HTTP API ).
>
> My problem is that I need very warm caches to get fast faceting, and the
> autowarming of the caches takes too long compared to the frequency of
> commits I'm having.
> So a commit every minute means less than a minute time to warm the caches.
>
> To give you an idea of what kind of queries need to be autowarmed in my app,
> the logevents indexed as documents have timestamps with different
> granularity used for faceting.
> For example, to get count of logevents for every hour using faceting there's
> a timestamp field with the format yyyymmddhh ( for example: 2010021808
> meaning 2010-02-18 8am).
> One use case is to get hourly counts over the whole index. A non-cached
> query counting the hourly counts over the 40M documents index takes a
> while..
> As I understand it, autowarming means that this kind of query is basically
> re-executed against a cold cache. That is probably not exactly how it works,
> but it "feels" like it.
>
> Moving the commits to a smaller index while using sharding to have a
> transparent view to the index from the client app seems to solve my problem.
>
> I'm not sure if the (upcoming?) NRT features would keep the caches more
> persistent, probably not in an environment where docs get frequent updates /
> deletes.
>
> Also, I'm closely following the Ocean Realtime Search project AND its SOLR
> integration. It sounds like it has the "dream features" to enable realtime
> updates to the index.
>
> -Janne
>
>
> 2010/2/18 Jan Høydahl / Cominvent
>
> > Hi,
> >
> > Have you tried playing with mergeFactor or even mergePolicy?
> >
> > --
> > Jan Høydahl  - search architect
> > Cominvent AS - www.cominvent.com
> >
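
The yyyymmddhh field mentioned in the quoted message can be derived from each event's timestamp at indexing time. Below is a small Python sketch. Only the hour format comes from the thread; the day and month formats and all three field names are illustrative assumptions, since the message only says that several granularities are used.

```python
from datetime import datetime


def hour_bucket(ts: datetime) -> str:
    """Format a timestamp as yyyymmddhh, the hourly facet value from the thread."""
    return ts.strftime("%Y%m%d%H")


def granularity_fields(ts: datetime) -> dict:
    """Build coarse-to-fine facet values for one log event.

    The field names ts_month, ts_day and ts_hour are hypothetical.
    """
    return {
        "ts_month": ts.strftime("%Y%m"),
        "ts_day": ts.strftime("%Y%m%d"),
        "ts_hour": hour_bucket(ts),
    }
```

For an event logged at 2010-02-18 08:15, `hour_bucket` returns "2010021808", which matches the example in the message. Faceting on such a field then yields one count per hour.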


Re: Realtime search and facets with very frequent commits

Janne Majaranta
Hi Otis,

Ok, now I'm confused ;)
There seems to be a bit of activity though, judging by the "last updated"
timestamps in the Google Code project wiki:
http://code.google.com/p/oceansearch/w/list

The Tag Index feature sounds very interesting.

-Janne


2010/2/18 Otis Gospodnetic <[hidden email]>

> Hi Janne,
>
> I *think*  Ocean Realtime Search has been superseded by Lucene NRT search.
>
>  Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/

Re: Realtime search and facets with very frequent commits

Jason Rutherglen
Janne,

I don't think there's any activity happening there.

SOLR-1606 is the tracking issue for moving to per segment facets and
docsets.  I haven't had an immediate commercial need to implement
those.

Jason

On Thu, Feb 18, 2010 at 7:04 AM, Janne Majaranta
<[hidden email]> wrote:

> Hi Otis,
>
> Ok, now I'm confused ;)
> There seems to be a bit activity though when looking at the "last updated"
> timestamps in the google code project wiki:
> http://code.google.com/p/oceansearch/w/list
>
> The Tag Index feature sounds very interesting.
>
> -Janne
>
>
> 2010/2/18 Otis Gospodnetic <[hidden email]>
>
>> Hi Janne,
>>
>> I *think*  Ocean Realtime Search has been superseded by Lucene NRT search.
>>
>>  Otis
>> ----
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Hadoop ecosystem search :: http://search-hadoop.com/

Re: Realtime search and facets with very frequent commits

Janne Majaranta
Ok, thanks.

-Janne


2010/2/18 Jason Rutherglen <[hidden email]>

> Janne,
>
> I don't think there's any activity happening there.
>
> SOLR-1606 is the tracking issue for moving to per segment facets and
> docsets.  I haven't had an immediate commercial need to implement
> those.
>
> Jason

Index field untokenized

Alessandro Falasca (KCTP)
In reply to this post by Otis Gospodnetic-2
Hi All,
I want to index some data untokenized (e.g. a URL), but I can't
find a way to do it.

I know there is a way to do it in the Solr configuration, but I want
to specify this option directly in my Solr XML.

This is a fragment of the XML that I post to Solr. I want to know whether it is possible to add an extra attribute to some field (e.g.
modsCollection.name.xlink:href), or to convey in some other way, the information about how to index it.

<?xml version="1.0" encoding="UTF-8"?>
<add xmlns="http://www.fao.org/faooa/schemas/eims/v0.9" xmlns:mods="http://www.loc.gov/mods/v3"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:eims="http://www.fao.org/faooa/schemas/eims/v0.9"
        xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xalan="http://xml.apache.org/xalan"
        xmlns:l="http://lang.data" xmlns:fn="http://www.w3.org/2005/xpath-functions"
        xmlns:dcterms="http://purl.org/dc/terms/" xmlns:ags="http://www.fao.org/agris/agmes/schemas/0.1/"
        xmlns:uvalibadmin="http://dl.lib.virginia.edu/bin/admin/admin.dtd/"
        xmlns:uvalibdesc="http://dl.lib.virginia.edu/bin/dtd/descmeta/descmeta.dtd"
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:foxml="info:fedora/fedora-system:def/foxml#" xmlns:zs="http://www.loc.gov/zing/srw/">
        <doc boost="3.5">
                <field boost="2.5" name="PID">eims-document:1960
                </field>
                .....
                <field name="modsCollection.name.xlink:href">http://aims.fao.org/aos/v01/corporatebody/c_1962</field>
                ....
                <field name="modsCollection.language.languageTerm.authority">iso639-2b</field>
                ....
        </doc>
</add>

Regards,
Alessandro


<?xml version="1.0" encoding="UTF-8"?>
<add xmlns="http://www.fao.org/faooa/schemas/eims/v0.9" xmlns:mods="http://www.loc.gov/mods/v3"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:eims="http://www.fao.org/faooa/schemas/eims/v0.9"
        xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xalan="http://xml.apache.org/xalan"
        xmlns:l="http://lang.data" xmlns:fn="http://www.w3.org/2005/xpath-functions"
        xmlns:dcterms="http://purl.org/dc/terms/" xmlns:ags="http://www.fao.org/agris/agmes/schemas/0.1/"
        xmlns:uvalibadmin="http://dl.lib.virginia.edu/bin/admin/admin.dtd/"
        xmlns:uvalibdesc="http://dl.lib.virginia.edu/bin/dtd/descmeta/descmeta.dtd"
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:foxml="info:fedora/fedora-system:def/foxml#" xmlns:zs="http://www.loc.gov/zing/srw/">
        <doc boost="3.5">
                <field boost="2.5" name="PID">eims-document:1960
                </field>
                <field name="fgs.state">Active</field>
                <field name="fgs.label">Note relative à la réforme de l'ONU et de la FAO
                </field>
                <field name="fgs.ownerId" />
                <field name="fgs.createdDate">2010-03-11T13:37:44.537Z
                </field>
                <field name="fgs.lastModifiedDate">2010-03-11T13:39:15.819Z
                </field>
                <field name="xslt-version">2</field>
                <field name="audit:auditTrail.audit:record.ID.untokenized">AUDREC1</field>
                <field name="audit:auditTrail.audit:process.type.untokenized">Fedora API-M</field>
                <field name="audit:auditTrail.audit:record.audit:action">modifyDatastreamByValue
                </field>
                <field name="audit:auditTrail.audit:record.audit:componentID">DC</field>
                <field name="audit:auditTrail.audit:record.audit:responsibility">fedoraAdmin
                                                </field>
                <field name="audit:auditTrail.audit:record.audit:date">2010-03-11T13:37:44.801Z
                </field>
                <field name="audit:auditTrail.audit:record.audit:justification">Initial Import of this Object
                                                </field>

                <field name="audit:auditTrail.audit:record.ID.untokenized">AUDREC2</field>
                <field name="audit:auditTrail.audit:process.type.untokenized">Fedora API-M</field>
                <field name="audit:auditTrail.audit:record.audit:action">addDatastream</field>
                <field name="audit:auditTrail.audit:record.audit:componentID">MODS</field>
                <field name="audit:auditTrail.audit:record.audit:responsibility">fedoraAdmin
                                                </field>
                <field name="audit:auditTrail.audit:record.audit:date">2010-03-11T13:39:09.348Z
                </field>


                <field name="audit:auditTrail.audit:record.ID.untokenized">AUDREC3</field>
                <field name="audit:auditTrail.audit:process.type.untokenized">Fedora API-M</field>
                <field name="audit:auditTrail.audit:record.audit:action">addDatastream</field>
                <field name="audit:auditTrail.audit:record.audit:componentID">AGRISFO</field>
                <field name="audit:auditTrail.audit:record.audit:responsibility">fedoraAdmin
                                                </field>
                <field name="audit:auditTrail.audit:record.audit:date">2010-03-11T13:39:11.931Z
                </field>


                <field name="audit:auditTrail.audit:record.ID.untokenized">AUDREC4</field>
                <field name="audit:auditTrail.audit:process.type.untokenized">Fedora API-M</field>
                <field name="audit:auditTrail.audit:record.audit:action">addDatastream</field>
                <field name="audit:auditTrail.audit:record.audit:componentID">EIMS</field>
                <field name="audit:auditTrail.audit:record.audit:responsibility">fedoraAdmin
                                                </field>
                <field name="audit:auditTrail.audit:record.audit:date">2010-03-11T13:39:13.434Z
                </field>


                <field name="audit:auditTrail.audit:record.ID.untokenized">AUDREC5</field>
                <field name="audit:auditTrail.audit:process.type.untokenized">Fedora API-M</field>
                <field name="audit:auditTrail.audit:record.audit:action">addDatastream</field>
                <field name="audit:auditTrail.audit:record.audit:componentID">SKOS</field>
                <field name="audit:auditTrail.audit:record.audit:responsibility">fedoraAdmin
                                                </field>
                <field name="audit:auditTrail.audit:record.audit:date">2010-03-11T13:39:15.819Z
                </field>



                <field name="oai_dc:dc.dc:title.xml:lang.untokenized">fr</field>
                <field name="oai_dc:dc.dc:title">Note relative à la réforme de l'ONU et de la FAO
                </field>
                <field name="oai_dc:dc.dc:identifier">pubid.fao.org:210159</field>
                <field name="oai_dc:dc.dc:publisher">FAO</field>



                <field name="rdf:RDF.rdf:Description.rdf:about.untokenized">info:fedora/eims-document:1960
                </field>
                <field name="rdf:RDF.rdf:Description.hasFRBRType">faooa:FRBR-EXPRESSION</field>
                <field name="rdf:RDF.rdf:Description.hasJobNo">J8010</field>



                <field name="modsCollection.mods.version.untokenized">3.3</field>

                <field name="modsCollection.recordInfo.recordCreationDate">2006-06-29</field>


                <field name="modsCollection.titleInfo.xml:lang.untokenized">fr</field>
                <field name="modsCollection.titleInfo.title">Note relative à la réforme de l'ONU et de la FAO
                </field>

                <field name="modsCollection.name.authority.untokenized">fao-aos-corporatebody</field>
                <field name="modsCollection.name.type.untokenized">corporate</field>
                <field name="modsCollection.name.xlink:href.untokenized">http://aims.fao.org/aos/v01/corporatebody/c_1962
                </field>
                <field name="modsCollection.name.xml:lang.untokenized">en</field>
                <field name="modsCollection.name.namePart">FAO, Rome (Italy). Fisheries and Aquaculture
                        Dept.</field>

                <field name="modsCollection.roleTerm.authority.untokenized">marcrelator</field>
                <field name="modsCollection.roleTerm.type.untokenized">text</field>
                <field name="modsCollection.role.roleTerm">Author</field>
                <field name="modsCollection.role.roleTerm.authority.untokenized">marcrelator</field>
                <field name="modsCollection.role.roleTerm.type.untokenized">text</field>


                <field name="modsCollection.name.type.untokenized">conference</field>
                <field name="modsCollection.name.xml:lang.untokenized">en</field>
                <field name="modsCollection.name.namePart">FAO Committee on Fisheries. Sub-Committee on
                        Aquaculture (Sess. 4 : 6-10 Oct 2008 : Puerto Varas, Chile)</field>

                <field name="modsCollection.roleTerm.authority.untokenized">marcrelator</field>
                <field name="modsCollection.roleTerm.type.untokenized">text</field>
                <field name="modsCollection.role.roleTerm">Author</field>
                <field name="modsCollection.role.roleTerm.authority.untokenized">marcrelator</field>
                <field name="modsCollection.role.roleTerm.type.untokenized">text</field>


                <field name="modsCollection.genre.type.untokenized">type</field>
                <field name="modsCollection.mods.genre">Conference</field>
                <field name="modsCollection.mods.genre.type.untokenized">type</field>
                <field name="modsCollection.genre.type.untokenized">type</field>
                <field name="modsCollection.mods.genre">Non-conventional</field>
                <field name="modsCollection.mods.genre.type.untokenized">type</field>

                <field name="modsCollection.languageTerm.authority.untokenized">iso639-2b</field>
                <field name="modsCollection.languageTerm.type.untokenized">code</field>
                <field name="modsCollection.language.languageTerm">fra</field>
                <field name="modsCollection.language.languageTerm.authority.untokenized">iso639-2b</field>
                <field name="modsCollection.language.languageTerm.type.untokenized">code</field>
                <field name="modsCollection.languageTerm.type.untokenized">text</field>
                <field name="modsCollection.language.languageTerm">French</field>
                <field name="modsCollection.language.languageTerm.type.untokenized">text</field>

                <field name="modsCollection.identifier.type.untokenized">jn</field>
                <field name="modsCollection.mods.identifier">J8010</field>
                <field name="modsCollection.mods.identifier.type.untokenized">jn</field>
                <field name="modsCollection.identifier.type.untokenized">rn</field>





                <field name="eims.publication.identifier">210159</field>
                <field name="eims.publication.waicent_published">0</field>
                <field name="eims.Department.code.untokenized">3</field>
                <field name="eims.Department.xml:lang.untokenized">en</field>
                <field name="eims.Department.label">KC</field>



                <field name="eims.maintype.code.untokenized">1</field>
                <field name="eims.maintype.xml:lang.untokenized">en</field>
                <field name="eims.maintype.maintypeDescription">Publication</field>


        </doc>
</add>

Re: Index field untokenized

Chris Hostetter-3

: I want to index some data untokenized (e.g. url), but I can't
: find a way to do it.
:
: I know there is a way to do it in the Solr configuration, but I want
: to specify this option directly in my Solr XML.
:
: This is a fragment of the XML that I post to Solr. I want to know whether it
: is possible to add an extra attribute to some field (e.g.
: modsCollection.name.xlink:href), or otherwise pass along the information
: about how to index it.

No, that's not how Solr works -- the client specifies a list of
field=>value pairs for each document, and the schema.xml file tells Solr
how those field=>value pairs should be dealt with.

If the client could override this behavior, then you could end up with a
nonsensical index if some clients said a certain field name should be
tokenized, and other clients mistakenly said that same field shouldn't be
tokenized.


-Hoss
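
For illustration, the schema-side approach described above might look roughly like the fragment below. This is a minimal sketch, not the poster's actual schema: it assumes the stock solr.StrField class, which indexes each value as a single term, and the stored and multiValued settings are guesses (multiValued is assumed because some *.untokenized fields appear more than once in the posted document).

```xml
<!-- schema.xml fragment (illustrative sketch; attribute choices are assumptions) -->
<types>
  <!-- StrField keeps the whole value as one term, so nothing is tokenized -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
</types>

<fields>
  <!-- Option 1: declare the untokenized field explicitly -->
  <field name="modsCollection.name.xlink:href.untokenized" type="string"
         indexed="true" stored="true" multiValued="true"/>

  <!-- Option 2: match every field whose name ends in ".untokenized" -->
  <dynamicField name="*.untokenized" type="string"
                indexed="true" stored="true" multiValued="true"/>
</fields>
```

With a rule like the dynamicField in place, the client keeps sending plain field=>value pairs and only the field name decides how the value is indexed.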


Re: Realtime search and facets with very frequent commits

David Smiley
In reply to this post by Janne Majaranta
Janne,
        Have you found your query relevancy to deteriorate with this setup?  Something to be aware of with distributed searches is that the relevancy of each Solr core's response is based on the index local to that core.  So if your distributed Solr setup does not distribute documents randomly (as is certainly the case for you), your relevancy scores will be poor.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
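
To make the concern above concrete, here is a rough worked example. It assumes the inverse document frequency formula used by Lucene's DefaultSimilarity in that era; the document counts and term frequencies are hypothetical, apart from the roughly 36M documents mentioned earlier in the thread.

```latex
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)

% Hypothetical: the same term in a large core and in a small, freshly updated core
\text{large core: } 1 + \ln\frac{36{,}000{,}000}{36{,}000 + 1} \approx 7.91
\qquad
\text{small core: } 1 + \ln\frac{10{,}000}{1{,}000 + 1} \approx 3.30
```

Because each core computes this weight from its own counts, the same term can contribute very differently to scores coming from the two cores, and the merged ranking then mixes scores that are not directly comparable.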

On Feb 11, 2010, at 12:35 PM, Janne Majaranta wrote:
...
>
> I have tested putting a second solr instance on the same server and sending
> the updates to that new instance.
> Warming up the new small instance is very fast while the large instance has
> very hot caches.
...
>
> Best Regards,
>
> Janne Majaranta


Re: Realtime search and facets with very frequent commits

Janne Majaranta
Yeah, thanks for pointing this out.
I'm not using any relevancy functions (yet). The data indexed for my app is
basically log events.
The most relevant events are the newest ones, so sorting by timestamp is
enough.

BTW, your book is great ;)

-Janne

2010/3/31 Smiley, David W. <[hidden email]>

> Janne,
>        Have you found your query relevancy to deteriorate with this setup?
>  Something to be aware of with distributed searches is that the relevancy of
> each Solr core response is based on the local index to that core.  So if
> your distributed Solr setup does not distribute documents randomly (as is
> certainly the case for you) your relevancy scores will be poor.
>
> ~ David Smiley
> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
>
> On Feb 11, 2010, at 12:35 PM, Janne Majaranta wrote:
> ...
> >
> > I have tested putting a second solr instance on the same server and sending
> > the updates to that new instance.
> > Warming up the new small instance is very fast while the large instance has
> > very hot caches.
> ...
> >
> > Best Regards,
> >
> > Janne Majaranta
>
>