Solr for Content Management

Moenieb Davids-2
Hi All,

Background:
I am currently testing a deployment of a content management framework where
I am trying to punt Solr as the tool of choice for ingestion and searching.

Current status:
I have deployed SolrCloud across multiple servers with multiple shards and
a replication factor of 2.
In terms of collections, I have a person collection that contains details
of individuals, including address and high-level portfolio info. Structurally,
this collection contains great-grandchildren.
Then I have a few collections that deal with content. For now, content is
just emails and documents with a max size of 2MB, with certain user
exceptions that can go higher than 2MB.
Content is indexed twice in terms of the actual content, first as
binary/stream and then as readable text. Metadata is negligible.


Challenges:
When performing full-text searches without concurrently executing updates,
Solr seems to be doing well. Running updates on their own also does okay, given
the nature of the transaction. However, when I run searches and updates
simultaneously, performance drops quite significantly. I have played with
field properties, analyzers, tokenizers, shafting sizes etc.
Any advice?
I would like to know if anyone has done something similar. Please excuse the
long-winded message.


--
Sent from Gmail Mobile




Re: Solr for Content Management

David Hastings
When you send updates, you change the index segments, which pushes them
out of memory, and the index stays "cold" until it has served enough searches
to cache the various aspects of the index again.
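
As a rough illustration of the Solr-level caches involved (the OS disk cache is a separate matter), the fragment below sketches cache settings in the <query> section of solrconfig.xml. The sizes and autowarmCount values are placeholder assumptions, not recommendations from this thread, and the cache classes shown are the ones shipped with Solr 7.x; later releases use solr.CaffeineCache instead.

```xml
<!-- Sketch only: all numbers below are placeholder assumptions. -->
<query>
  <!-- Caches filter query (fq) results. autowarmCount asks Solr to
       re-populate the new searcher's cache from entries of the old
       one after a commit opens a new searcher. -->
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>

  <!-- Caches ordered lists of document ids for recent queries. -->
  <queryResultCache class="solr.LRUCache"
                    size="512"
                    initialSize="512"
                    autowarmCount="32"/>
</query>
```

Higher autowarmCount values make each new searcher take longer to open, so with frequent commits the values probably need to stay modest.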


Re: Solr for Content Management

Alexandre Rafalovitch
And in solrconfig.xml, it is possible to configure warm-up searches that run
against the index before the users see it.

Regards,
    Alex
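
A minimal sketch of what such warming might look like in the <query> section of solrconfig.xml, using Solr's QuerySenderListener. The query strings and the content_text field name are hypothetical; real warming queries should resemble what users actually search for in this deployment.

```xml
<!-- Sketch only: queries and field names are hypothetical. -->
<query>
  <!-- Runs these queries against each new searcher opened after a
       commit, before that searcher starts serving user requests. -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">content_text:invoice</str>
        <str name="rows">10</str>
      </lst>
      <lst>
        <str name="q">content_text:contract</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>

  <!-- Same idea for the very first searcher, opened at startup. -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">content_text:invoice</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>

  <!-- If no searcher is registered yet (for example at startup), make
       requests wait for warming to finish instead of being served by
       a cold searcher. -->
  <useColdSearcher>false</useColdSearcher>
</query>
```

Warming adds work to every commit that opens a new searcher, so a handful of representative queries is usually a safer starting point than a long list.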

On Thu, Jun 7, 2018, 21:27 David Hastings, <[hidden email]>
wrote:

> When you are sending updates you are adjusting the segments which take them
> out of memory and the index becomes "cold" until it gets enough searches to
> cache the various aspects of the index.

Re: Solr for Content Management

Emir Arnautović
Hi,
It is also likely that indexing is consuming resources, leaving too few for queries. Indexing can put pressure on the heap, and garbage collection pauses might be slowing Solr down, which would explain the latency you observe. Can you tell us a bit more about the size of your index, server configs, heap size, indexing rate, how you index (batch size), and query rate? That might give us better ideas for pointing you in the right direction.
Do you use anything to monitor your Solr hosts? Does the monitoring tool suggest any bottlenecks?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 8 Jun 2018, at 09:06, Alexandre Rafalovitch <[hidden email]> wrote:
>
> And in solrconfig.xml, it is possible to configure the searches to warm the
> index up before the users see it.
>
> Regards,
>    Alex

Re: Solr for Content Management

Shawn Heisey-2
In reply to this post by Moenieb Davids-2
On 6/7/2018 12:10 PM, Moenieb Davids wrote:
> Challenges:
> When performing full text searches without concurrently executing updates,
> solr seems to be doing well. Running updates also does okish given the
> nature of the transaction. However, when I run search and updates
> simultaneously, performance drops quite significantly. I have played with
> field properties, analyzers, tokenizers, shafting sizes etc.

I have absolutely no idea what a shafting size is.  If I google for it,
the only relevant thing that comes up is your message on this list.

Doing updates at the same time as queries will always have an impact on
query performance.  But if that impact is very significant, then it
sounds like the machine doesn't have enough memory to allow the OS to
effectively cache the index data.  When updates are made, all the data
that is written will end up in the disk cache, and if the cache is already
as big as it can get, the new data will push older data out of the cache.

Disks are very slow compared to memory, so if the index data required to
complete a query must be read from the disk, performance is adversely
affected.

A page discussing OS disk cache requirements:

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn