Realtime directory change...


Realtime directory change...

escher2k
Hi,
  We currently use Lucene to index user data every couple of hours: the index is completely rebuilt,
the old index is archived, and the new one is copied over to the directory. Example:

/bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/                    # stage the freshly built index
/bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}                  # clear any old archive for this date
/bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
/bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}   # archive the live index
/bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help               # swap the new index into place


This works fine since the index is read from disk every time. Is it possible to do the same with Solr?
Assuming we also use caching to speed up retrieval, is there a way to invalidate some or all caches
when this is done?

Thanks.

Re: Realtime directory change...

Yonik Seeley-2
On 12/21/06, escher2k <[hidden email]> wrote:

> Hi,
>   We currently use Lucene to do index user data every couple of hours - the
> index is completely rebuilt,
> the old index is archived and the new one copied over to the directory.
>
> Example -
>
> /bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/
> /bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}
> /bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
> /bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}
> /bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help
>
> This works fine since the index is retrieved every time from the disk. Is it
> possible to do the same with Solr ?

Yes, this will work.  This is sort of what the index distribution
scripts do to install a new index snapshot in a master/slave
configuration.

You also don't have to build in a different directory if you don't
want to.  Solr supports incremental updates.

> Assuming we also use caching to speed up the retrieval, is there a way to
> invalidate some/all caches when this is done ?

It's done automatically.  You will need to issue a <commit/> to solr
to get it to read the new index (open a new searcher), and new caches
will be associated with that new searcher.
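[A minimal sketch of issuing that commit over HTTP. The endpoint path and host/port below are assumptions; adjust them to your Solr install.]

```python
import urllib.request

# Assumed update endpoint; adjust host/port to your Solr install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def commit_request():
    """Build the POST that tells Solr to open a new searcher
    (and hence fresh caches) on the newly installed index."""
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml"},
    )

# urllib.request.urlopen(commit_request())  # requires a running Solr
```

Send the commit only after the new index files are fully in place, so the new searcher never sees a half-copied directory.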

-Yonik

Re: Realtime directory change...

escher2k
Thanks. The problem is that it is not easy to do an incremental update on the data set.
In that case, I guess the index needs to be created in a different path and we need to move
files around. However, since the documents are added over HTTP, how does one even create
the index in a different path on the same machine while the application is still running?

Ideally, what we would want is to recreate a new index from scratch and then use the master/slave
configuration to copy the indexes to other machines.


Re: Realtime directory change...

Chris Hostetter-3

: Thanks. The problem is, it is not easy to do an incremental update on the
: data set. In which case, I guess the index needs to be created in a
: different path and we need to move files around. However, since the
: documents are added over HTTP, how does one even create the index in a
: different path on the same machine while the application is still running ?

for the record, i don't think you *have* to do this ... although it will
certainly work fine if you want to (since it's just the master/slave model
starting with an empty index)

if in your current model you have an index which you never modify, and
you regularly build a new index on a new path and then replace it, you
could do the same thing with a single Solr instance: index all of your
new documents into the same index, then delete all docs older than your
newest "rebuild" (using a timestamp field), and then, and only then,
issue a commit to tell Solr to start using the new index.

as long as no one else issues a commit while you are "rebuilding", your
index will always look consistent.
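[That flow can be sketched with Solr's XML update messages. The field name "timestamp" and the cutoff value below are hypothetical; substitute whatever your schema actually uses.]

```python
# Sketch of the single-index rebuild: add all new docs, delete everything
# indexed before the rebuild started, then (and only then) commit.
REBUILD_START = "2006-12-21T00:00:00Z"  # example cutoff, in Solr date format

def delete_old_docs_xml(cutoff):
    """Delete-by-query message removing docs indexed before `cutoff`.
    Note [* TO X] is inclusive: give the new docs timestamps strictly
    after the cutoff so they survive the purge."""
    return "<delete><query>timestamp:[* TO %s]</query></delete>" % cutoff

# POST this to /solr/update, followed by a single <commit/>; until that
# commit, searchers keep seeing the old, consistent view of the index.
print(delete_old_docs_xml(REBUILD_START))
```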

But as i said: the master/slave model will work perfectly for what you
want as well -- and the snap* scripts will take care of loading it up on
your slave.



-Hoss


Re: Realtime directory change...

Thorsten Scherler-3
In reply to this post by escher2k

Did you look into
http://wiki.apache.org/solr/CollectionDistribution
http://wiki.apache.org/solr/SolrCollectionDistributionScripts
http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

I am still very new to Solr, but it sounds like exactly what you
need (as others have also said).

HTH

salu2




Re: Realtime directory change...

escher2k
In reply to this post by Chris Hostetter-3
Thanks Chris. So, assuming that we rebuild the index, delete the old data, and then execute a commit,
will the snap scripts take care of reconciling all the data? Internally, is there an update-timestamp
notion used to figure out which unique-id records have changed, and to synchronize them by executing
delete/insert ops?


Re: Realtime directory change...

Chris Hostetter-3

: Thanks Chris. So, assuming that we rebuild the index, delete the old data
: and then execute a commit, will the snap scripts take care of reconciling
: all the data ? Internally, is there an update timestamp notion used to
: figure out which unique id records have changed and then synchronize them
: by executing delete/insert ops ?

Ummm... i'm not sure i understand your question. if you've got a
uniqueKey field, then it doesn't matter what kind of timestamps you have;
Solr will automatically delete the old records as you add new records with
the same id.  if you don't have a uniqueKey field, and you want to just
reindex your corpus at moment X and then say "anything older than
timestamp X should be deleted" when you are done, then you can just do a
delete by query using a range query on the date X ... having a
timestamp field that records the moment when something was indexed is
actually very easy, just include a date field with the value of "NOW"
(this will be even easier once i get around to committing SOLR-82).

bear in mind, it doesn't have to be a date field ... you could also
record a simple "build number" that you increment each time you "rebuild"
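[A sketch of that build-number variant. The field name "build" is hypothetical, and range-querying it assumes your schema defines it with a sortable-integer field type.]

```python
def add_doc_xml(doc_id, build):
    """Add message tagging a doc with the rebuild it belongs to."""
    return ('<add><doc><field name="id">%s</field>'
            '<field name="build">%d</field></doc></add>' % (doc_id, build))

def purge_old_builds_xml(current_build):
    """Delete docs from any earlier rebuild; follow with a <commit/>."""
    return "<delete><query>build:[* TO %d]</query></delete>" % (current_build - 1)
```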



-Hoss