managing active/passive cores in Solr and Haystack


managing active/passive cores in Solr and Haystack

serwah sabetghadam
Hi all,

I am new to this group and very happy to join. :)
My question may have been asked before, but I could not find a way to search the previous threads.


The problem in one sentence: read from multiple cores (the archive cores and the active one), but write only to the latest active core, using Solr and Haystack.


I am designing a periodic indexing system with one core per month. Searches always run against the last two cores, and the latest core is the active one for current indexing.


We are using Haystack to manage the communication with Solr.
Defining multiple cores in Haystack's settings.py works fine. The problem is that, as I have tested, both cores then get updated with new documents.

Then I decided to use Haystack's "--using" parameter to select which backend to use when updating the index, something like:

./manage.py update_index events.Event --age=24 --workers=4 --using=default

where the 'default' connection in the settings.py file points to the active core:
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr/core_Feb',
    },
    'slave': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr/core_Jan',
    },
}

Here core_Feb is the active core (or is about to become it).

But now I am not sure whether this setup will read from both cores. I can manage the write part this way, but reading from multiple cores is still a problem. What I tested before for reading from multiple cores was:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr/core_Feb',
        'URL': 'http://127.0.0.1:8983/solr/core_Jan',  # note: duplicate dict key -- in plain Python only this last 'URL' takes effect
    },
}


But in this case it writes to both, while I want to write only to core_Feb.
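
One direction I am considering is a Haystack connection router that pins all writes to the active core. A minimal, untested sketch (the module path myapp/routers.py and the router name are just examples I made up):

# myapp/routers.py
from haystack import routers

class ActiveCoreRouter(routers.BaseRouter):
    def for_write(self, **hints):
        # send every index update to the 'default' (active) connection
        return 'default'

    def for_read(self, **hints):
        # return None so the next router decides where reads go
        return None

# settings.py
HAYSTACK_ROUTERS = ['myapp.routers.ActiveCoreRouter',
                    'haystack.routers.DefaultRouter']

But as far as I can tell this only fixes the write side; it still does not let a single query search both cores at once.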

Any help is highly appreciated,
Best,
Serwah

Re: managing active/passive cores in Solr and Haystack

Erick Erickson
I don't know much about HAYSTACK, but for the Solr URL you probably
want the "shards" parameter for searching, see:
https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding

And just use the specific core you care about for update requests.
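
Something like this, for example (host, port, and core names adjusted to yours), sent to either core:

http://localhost:8983/solr/core_Feb/select?q=*:*&shards=localhost:8983/solr/core_Feb,localhost:8983/solr/core_Jan

The shards parameter tells the core that receives the query to fan it out to the listed cores and merge the results.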

But I would suggest that you can have Solr do much of this work,
specifically SolrCloud with "implicit" routing. Combine that with
"collection aliasing" and I think you have what you need with a single
Solr URL. "implicit" routing allows you to send docs to a particular
shard based on the value of a particular field. You can add/remove
shards at will (only with the "implicit" router, not with the default
compositeId router). Etc.
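
Very roughly, something like this with the Collections API (the collection name "events", the shard names, and the routing field "month_s" are made-up examples; maxShardsPerNode just allows many shards on one node):

# one collection, one shard per month, docs routed by the value of month_s
http://localhost:8983/solr/admin/collections?action=CREATE&name=events&router.name=implicit&shards=Jan,Feb&router.field=month_s&maxShardsPerNode=12

# add next month's shard when the time comes
http://localhost:8983/solr/admin/collections?action=CREATESHARD&collection=events&shard=Mar

# point a stable alias at whatever clients should search
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=search&collections=events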

I've skimmed over lots of details here; I just didn't want you to be
unaware that a solution exists (search for "time series data" in the
Solr literature).

Best,
Erick


Re: managing active/passive cores in Solr and Haystack

serwah sabetghadam
Thanks, Erick, for the fast answer. :)

I knew about sharding, but as far as I know it works across different
servers.
I wonder whether it is possible to do something like the sharding you
mentioned on a single standalone Solr.
Can I use implicit routing on standalone then?

And I would appreciate it if someone with experience using Haystack to
drive Solr could share that experience here.

Best,
Serwah

--
Serwah Sabetghadam
Vienna University of Technology
Office phone: +43 1 58801 188633

Re: managing active/passive cores in Solr and Haystack

Shawn Heisey
On 3/15/2017 7:55 AM, serwah sabetghadam wrote:
> Thanks, Erick, for the fast answer. :)
>
> I knew about sharding, but as far as I know it works across different
> servers.
> I wonder whether it is possible to do something like the sharding you
> mentioned on a single standalone Solr.
> Can I use implicit routing on standalone then?

If you're running standalone (not SolrCloud), then everything having to
do with shards must be 100 percent managed by you.  There is no
routing.  There is no capability of automatically managing which
implicit shards belong to which logical index.  There's no automatic
replication of index data for redundancy.  You're in charge of
*everything* that SolrCloud would normally handle automatically.

https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
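
In practice, "managed by you" means your client sends the shards
parameter on every query, or you bake it into the active core's
solrconfig.xml as a handler default. A sketch of the latter (the core
names are just examples):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost:8983/solr/core_Feb,localhost:8983/solr/core_Jan</str>
  </lst>
</requestHandler>

When the months roll over, you update that list (and your indexing
target) yourself.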

Multiple shards can live in a single Solr instance, whether you use
SolrCloud or the old way described above.  If your query rate is very
low, this probably will perform well.  As the query rate increases, it's
best to only have one core per Solr instance.  Either way, it's
*usually* best to only have one Solr instance per machine.

Thanks,
Shawn
