A tool for frequent re-indexing...


Ravish Bhagdev
Hi All,

I am considering writing a small tool that would read from one Solr core
and write to another as a quick way of re-indexing data.  I have a
large-ish set (hundreds of thousands) of documents that I've already parsed
with Tika, and I often change bits and pieces of the schema and config to
try new things.  Instead of going through the whole process of
re-indexing from the docs (and some DBs), I thought it might be much faster
to just read from one core and write into a new core with a new
schema, analysers and/or settings.

Has anyone else done anything similar already?  It would be handy to use
this sort of thing to spin off another core, write to it, and then swap the
two cores, discarding the older one.
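
A rough sketch of the read side of such a tool, in Python, might look like the following. The base URL, the core names and the `/admin/cores` path are assumptions (the admin path is whatever `adminPath` is set to in solr.xml), and the helper names are made up for illustration. Only stored fields come back from a query, so this can only work if every field needed for re-indexing is stored in the source core.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL; adjust host, port and context path to your setup.
SOLR = "http://localhost:8983/solr"


def select_url(core, start, rows):
    """URL for one page of a match-all query against `core`, as JSON."""
    params = urlencode({"q": "*:*", "start": start, "rows": rows, "wt": "json"})
    return f"{SOLR}/{core}/select?{params}"


def swap_url(core, other):
    """CoreAdmin URL that swaps `core` and `other` (assumes adminPath=/admin/cores)."""
    params = urlencode({"action": "SWAP", "core": core, "other": other})
    return f"{SOLR}/admin/cores?{params}"


def http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urlopen(url) as resp:
        return json.load(resp)


def read_all(core, rows=500, fetch=http_get_json):
    """Yield every stored document from `core`, one page at a time."""
    start = 0
    while True:
        docs = fetch(select_url(core, start, rows))["response"]["docs"]
        if not docs:
            return
        yield from docs
        start += rows
```

Each document yielded by `read_all("oldcore")` would then be posted to the new core's update handler, and once the new core is committed, a GET on `swap_url("newcore", "oldcore")` exchanges the two. Paging with `start` gets slower as the offset grows, which is probably acceptable for a few hundred thousand documents.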

Thanks,
Ravish
Re: A tool for frequent re-indexing...

iorixxx
You might find these relevant:

https://issues.apache.org/jira/browse/SOLR-3246

http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor


Re: A tool for frequent re-indexing...

Valeriy Felberg
I've implemented something like what is described in
https://issues.apache.org/jira/browse/SOLR-3246. The idea is to add an
update request processor at the end of the update chain in the core
you want to copy. The processor converts each SolrInputDocument to XML
(there is a utility method for doing this) and dumps the XML into a
file, which can be fed into Solr again with curl. If you have many
documents, you will probably want to distribute the XML files across
different directories using some common prefix in the id field.
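
To illustrate just the file layout (not the update request processor itself, which would be Java code running inside Solr), here is a small Python sketch. It writes one `<add><doc>` update message per document and picks the directory from the first characters of the id. The function names, the two-character prefix length and the one-file-per-document layout are assumptions made for the example, and ids are assumed to be reasonable file names once percent-encoded.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import quote


def doc_to_add_xml(doc):
    """Serialize a dict of field values as a Solr <add><doc> update message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # Multi-valued fields become repeated <field> elements.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


def dump_path(root, doc_id, prefix_len=2):
    """Path of the dump file: a directory named after the id's first characters."""
    safe_id = quote(doc_id, safe="")  # percent-encode "/" and similar
    prefix = safe_id[:prefix_len] or "_"
    return os.path.join(root, prefix, safe_id + ".xml")


def dump_doc(root, doc, id_field="id"):
    """Write one document to its prefix directory and return the file path."""
    path = dump_path(root, str(doc[id_field]))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(doc_to_add_xml(doc))
    return path
```

A file written this way can be posted back with something like `curl http://localhost:8983/solr/newcore/update -H 'Content-Type: text/xml' --data-binary @abc123.xml` followed by a commit, where the URL and core name are again placeholders.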

On Fri, Apr 6, 2012 at 5:18 AM, Ahmet Arslan <[hidden email]> wrote:

> You might find these relevant :
>
> https://issues.apache.org/jira/browse/SOLR-3246
>
> http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
Re: A tool for frequent re-indexing...

Ravish Bhagdev
Thanks.  This is useful to know as well.

I was actually after SolrEntityProcessor
(http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor), which
I had failed to notice until the previous reply pointed it out, because I'm
still using 1.4.

Cheers,
Ravish

On Fri, Apr 6, 2012 at 11:01 AM, Valeriy Felberg <[hidden email]> wrote:

> I've implemented something like described in
> https://issues.apache.org/jira/browse/SOLR-3246. The idea is to add an
> update request processor at the end of the update chain in the core
> you want to copy. The processor converts the SolrInputDocument to XML
> (there is some utility method for doing this) and dumps the XML into a
> file which can be fed into Solr again with curl. If you have many
> documents you will probably want to distribute the XML files into
> different directories using some common prefix in the id field.