How do we limit the growth of a Lucene Index?


Sandeep Mahendru
Hi,

We have been developing an enterprise logging service at Wachovia bank.
The logs (business, application, error) for all the bank's applications
are consolidated at a single location in an Oracle 10g database.

In our second phase, we are now building a high-performing report viewer
over it. Our search algorithm does not go to the Oracle 10g DB, so we
avoid network and I/O. Instead, the search algorithm goes to a Lucene
index. We have Lucene indexes created for each application. These indexes
are on the same machine where the search algorithm runs. As more
applications at the bank begin to consume this service, the Lucene
indexes keep growing.

One of my team leads has suggested the following approach to resolve this
issue:

*I think the best approach to restrict the index size is to keep the index
for a limited time and then archive it. In case a user wants to search the
old files, we might need to provide some configuration through which the
Lucene searcher can point to the archived file and search its content. To
implement this we need to rename the index file with its from and to dates
before it is archived. When searching the older files, the user needs to
provide a date range, and then the app can point to the relevant archived
index files for the search. Let me know your thoughts on this.*

At present this sounds the most logical to me. But then we would begin to
store the Lucene indexes on a different machine. This might again cause the
search algorithm to make a network trip if the search is over old
archived data.

Is there a better design to resolve the above concern? Does Lucene provide
some sort of API to handle the above scenarios?

Regards,
Sandeep.
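
The date-range lookup proposed in the post above could be sketched roughly as follows. This is only an illustration, assuming Java 8 or later and a hypothetical naming convention of `<application>_<from>_<to>` for archived index directories. The class `ArchiveIndexSelector` and its methods are invented for this sketch and are not part of Lucene.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Lucene: maps a date range to archived index directories. */
final class ArchiveIndexSelector {

    /** Builds a name such as "payments_2007-09-01_2007-09-30" (assumed convention). */
    static String directoryName(String application, LocalDate from, LocalDate to) {
        return application + "_" + from + "_" + to;
    }

    /** Returns the names whose encoded [from, to] range overlaps [queryFrom, queryTo]. */
    static List<String> select(List<String> directoryNames, LocalDate queryFrom, LocalDate queryTo) {
        List<String> hits = new ArrayList<>();
        for (String name : directoryNames) {
            // Parse from the end so application names may themselves contain underscores.
            int lastSep = name.lastIndexOf('_');
            int firstSep = name.lastIndexOf('_', lastSep - 1);
            if (firstSep < 0) {
                throw new IllegalArgumentException("Not an archive name: " + name);
            }
            LocalDate from = LocalDate.parse(name.substring(firstSep + 1, lastSep));
            LocalDate to = LocalDate.parse(name.substring(lastSep + 1));
            // Two closed ranges overlap when neither one ends before the other starts.
            if (!to.isBefore(queryFrom) && !from.isAfter(queryTo)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> archives = Arrays.asList(
                directoryName("payments", LocalDate.of(2007, 8, 1), LocalDate.of(2007, 8, 31)),
                directoryName("payments", LocalDate.of(2007, 9, 1), LocalDate.of(2007, 9, 30)),
                directoryName("payments", LocalDate.of(2007, 10, 1), LocalDate.of(2007, 10, 31)));
        // Only the archives touching the requested range need to be opened and searched.
        System.out.println(select(archives, LocalDate.of(2007, 9, 15), LocalDate.of(2007, 10, 5)));
    }
}
```

The selected directories would then be opened and searched together. Lucene's `MultiReader` can present several index readers as one, which may be a starting point for the "API" part of the question.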
Re: How do we limit the growth of a Lucene Index?

Grant Ingersoll-2
You could search this list about distributing your indexes, etc.  
RemoteSearchable may be handy, but you will probably have to build  
some infrastructure around it for handling failover, etc. (would make  
for a nice contribution)

How often do you think archived data will need to be accessed?  And  
how much data are you talking?  Seems to me like the main issue will  
be in managing the searchers in light of having a lot of potential  
indexes.  Just thinking out loud, though.

-Grant
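
One way to approach the "managing the searchers" concern raised above is to cap how many indexes are open at once and close the least recently used one when the cap is exceeded. The sketch below is a rough illustration using only the JDK. `OpenIndexCache` is a hypothetical class, and the handle type `H` stands in for whatever searcher or reader object the application actually opens.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache of open index handles; H stands in for a searcher or reader. */
final class OpenIndexCache<H extends AutoCloseable> {
    private final int maxOpen;
    private final Function<String, H> opener;
    private final LinkedHashMap<String, H> open;

    OpenIndexCache(int maxOpen, Function<String, H> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.open = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, H> eldest) {
                if (size() > OpenIndexCache.this.maxOpen) {
                    closeQuietly(eldest.getValue());
                    return true; // evict the least recently used handle
                }
                return false;
            }
        };
    }

    /** Returns the open handle for the index, opening it on first use. */
    synchronized H get(String indexName) {
        H handle = open.get(indexName);
        if (handle == null) {
            handle = opener.apply(indexName);
            open.put(indexName, handle);
        }
        return handle;
    }

    synchronized int openCount() {
        return open.size();
    }

    private static void closeQuietly(AutoCloseable handle) {
        try {
            handle.close();
        } catch (Exception e) {
            // A failed close of an evicted handle is deliberately ignored in this sketch.
        }
    }

    public static void main(String[] args) {
        OpenIndexCache<AutoCloseable> cache = new OpenIndexCache<AutoCloseable>(2,
                name -> () -> System.out.println("closing " + name));
        cache.get("payments");
        cache.get("loans");
        cache.get("payments"); // marks "payments" as recently used
        cache.get("cards");    // exceeds the cap, so "loans" is closed and evicted
    }
}
```

This sketch closes a handle as soon as it is evicted, so a search still running on that handle in another thread could fail. A real service would probably need reference counting or a similar guard around eviction.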

On Nov 4, 2007, at 1:48 PM, Sandeep Mahendru wrote:

> Is there a better design to resolve the above concern? Does Lucene
> provide some sort of API to handle the above scenarios?

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



Re: How do we limit the growth of a Lucene Index?

Marcus Herou
As you suggest, you could either roll the index on the local machine or
on a remote one, gzip the content fields in the archive index, and provide
a GzipReader when you need to search old results.

If money is of the essence, then the best solution is probably to have one
good box with fast SCSI disks that serves the primary index, and a slower
"archive" machine with huge, cheap IDE disks.
You then need a servlet or similar on the archive server that understands
searching over HTTP. Did anyone say Solr? :)

//Marcus
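
The gzip idea above could look roughly like this with the JDK's `java.util.zip` classes. It is a sketch under assumptions: `FieldGzip` is a hypothetical helper written for this example, and the "GzipReader" mentioned above is taken to mean application code of this kind rather than an existing Lucene class. Only the stored field value is compressed. The searchable terms would still have to be indexed from the uncompressed text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: gzips field content for storage and restores it for display. */
final class FieldGzip {

    /** Compresses the text as UTF-8 bytes wrapped in a gzip stream. */
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return buffer.toByteArray();
    }

    /** Reverses compress(): reads the gzip stream fully and decodes it as UTF-8. */
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            log.append("2007-11-05 ERROR payments: connection timed out\n");
        }
        byte[] packed = compress(log.toString());
        System.out.println(log.length() + " chars -> " + packed.length + " bytes");
        System.out.println("round trip ok: " + decompress(packed).equals(log.toString()));
    }
}
```

Log text tends to be repetitive, so it usually compresses well, at the cost of extra CPU time whenever an archived result is displayed.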

On 11/5/07, Grant Ingersoll wrote:

>
> You could search this list about distributing your indexes, etc.
> RemoteSearchable may be handy, but you will probably have to build
> some infrastructure around it for handling failover, etc. (would make
> for a nice contribution)

--
Marcus Herou Solution Architect & Core Java developer Tailsweep AB
+46702561312
http://www.tailsweep.com
Re: How do we limit the growth of a Lucene Index?

Sandeep Mahendru
Hi Marcus,

  Thanks for providing these suggestions. I will work on these directions
with my team.

Regards,
Sandeep.


On 11/5/07, Marcus Herou wrote:

>
> As you suggest, you could either roll the index on the local machine or
> on a remote one, gzip the content fields in the archive index, and provide
> a GzipReader when you need to search old results.
Re: How do we limit the growth of a Lucene Index?

Sandeep Mahendru
In reply to this post by Grant Ingersoll-2
Hi Grant,

  Thanks for providing these suggestions. I will work on these directions
with my team.

Regards,
Sandeep.


On 11/5/07, Grant Ingersoll wrote:

>
> You could search this list about distributing your indexes, etc.
> RemoteSearchable may be handy, but you will probably have to build
> some infrastructure around it for handling failover, etc. (would make
> for a nice contribution)