Integrating dynamic data into Lucene search/ranking

Tobias Lohr
I have a more architectural question. It may be somewhat off topic, but since I want to implement it using Java and Lucene, this seems like the right forum.

I'm thinking about how to design a system that integrates dynamic information into search (and ranking) functionality using Lucene. By dynamic data I mean data that changes many times within a typical index rebuild cycle, i.e. transactional data.

A good example is sorting the products in an online store by product availability.

Does anybody know of good reading resources (approaches, whitepapers, books, etc.) on integrating such dynamic information into search/ranking functionality?

(I have already searched Google but couldn't find anything useful.)

Thanks in advance!

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Integrating dynamic data into Lucene search/ranking

Otis Gospodnetic-2
Tobias,

The question is a little too open, I think.  Perhaps start by saying what you've tried, what doesn't work, what you think won't work, the actual rate of change, the size of your index and, very importantly, how quickly you need to see index changes (adds, deletes, updates).

How about this for the bootstrap question: just update (delete+add) the whole Document and reopen the IndexSearcher every N minutes.  Would that work for you?

Does only the stored data change, or also the searchable data?  If the former, you could choose to keep that in an external data store (e.g. RDBMS, BDB...).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Integrating dynamic data into Lucene search/ranking

Tobias Lohr
The index contains several tens of thousands of documents, with about
fifty fields each. The index is rebuilt roughly once a day, though this
varies, since the searchable content doesn't change very often. Now I
face the challenge of working more dynamic data into the index and
making it searchable, or rather using it to sort the documents in the
search results. This data changes very frequently: at least every
minute, and possibly every few seconds.

I know that it is impossible to have a real-time, fully up-to-date index
and that there will be a time lag. As long as the lag is not too large,
that is OK. I've tried doing an incremental update every X seconds or
minutes. This might work for small indexes with little data to index per
document, but I don't think it is the right approach here. The other
problem is that every update has to re-collect all the static,
searchable data of a document that didn't change, even though only the
dynamic information changed.

I'm sorry, what exactly do you mean by "reopen the IndexSearcher every
N minutes"?

You say that I could also keep the data in an external data store. Do
you mean the dynamic data? If so, I still need this information inside
an index in order to sort by it, right? Or am I overlooking something
here?

Thanks,
Tobias

SV: Integrating dynamic data into Lucene search/ranking

Marcus Falk
We did this in our system, which indexes a constant flow of news articles, by doing as Otis described (reopening the IndexSearcher).

Every third minute we create a new IndexSearcher in the background. Once it has been created, we fire some warm-up queries against it, and after that we switch the reference from the old searcher to the new one. This works fine for us, and we have large indexes (several million articles).
 
/Regards
Marcus
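
The open-in-background, warm-up, swap sequence described above can be sketched in plain Java. This is only an illustration under stated assumptions: SnapshotSearcher, ListSearcher and SearcherHolder are hypothetical stand-ins invented for this sketch and are not Lucene classes. In a real system the opener would create a new IndexSearcher over the updated index.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical stand-in for a searcher bound to one snapshot of the index. */
interface SnapshotSearcher {
    int search(String query);   // number of matching documents
    void close();
}

/** Toy searcher over an in-memory list of documents (an assumption, not Lucene). */
class ListSearcher implements SnapshotSearcher {
    private final List<String> docs;
    int searches = 0;
    boolean closed = false;

    ListSearcher(List<String> docs) { this.docs = docs; }

    @Override
    public int search(String query) {
        searches++;
        int hits = 0;
        for (String doc : docs) {
            if (doc.contains(query)) hits++;
        }
        return hits;
    }

    @Override
    public void close() { closed = true; }
}

/** Publishes a new searcher only after it has been warmed up. */
class SearcherHolder {
    private final AtomicReference<SnapshotSearcher> current;

    SearcherHolder(SnapshotSearcher initial) {
        current = new AtomicReference<>(initial);
    }

    SnapshotSearcher get() { return current.get(); }

    void refresh(Supplier<SnapshotSearcher> opener, List<String> warmupQueries) {
        SnapshotSearcher fresh = opener.get();             // 1. open the new searcher
        for (String q : warmupQueries) fresh.search(q);    // 2. warm it up
        SnapshotSearcher old = current.getAndSet(fresh);   // 3. swap atomically
        old.close();                                       // 4. retire the old searcher
    }

    public static void main(String[] args) {
        SearcherHolder holder =
            new SearcherHolder(new ListSearcher(List.of("red shoe")));
        holder.refresh(() -> new ListSearcher(List.of("red shoe", "red hat")),
                       List.of("red"));
        System.out.println(holder.get().search("red")); // prints 2
    }
}
```

To run the refresh every N minutes, refresh() could be called from a java.util.concurrent.ScheduledExecutorService task. Note that this sketch closes the old searcher immediately; with queries still in flight on the old searcher, a real implementation would have to delay the close, for example with reference counting.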

Re: SV: Integrating dynamic data into Lucene search/ranking

Tobias Lohr
I'm not really sure whether this approach can cope with changes arriving every, let's say, 30 seconds.

Re: SV: Integrating dynamic data into Lucene search/ranking

Andrzej Białecki-2
Tobias Lohr wrote:
> I'm not really sure whether this approach can cope with changes arriving every, let's say, 30 seconds.

The conventional wisdom is to use RAMDirectory in such scenarios. I.e.
you commit frequent updates to a RAMDirectory and frequently reopen its
Searcher (which should be fast). Periodically, merge the RAMDirectory
index with your on-disk index - you need to open a new IndexSearcher in
the background, warm it up with the latest N queries, and when it's
ready you swap searchers, i.e. you close the old one, purge the
RAMDirectory (since it was synced to the on-disk index), and start using
the new IndexSearcher.

And again, start accumulating new docs in the RAMDirectory, etc, etc ...

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


SV: SV: Integrating dynamic data into Lucene search/ranking

Marcus Falk
In our solution we used a RAMDirectory for the newest incoming articles and an FSDirectory for older ones. We set a limit on the RAMDirectory, e.g. 10,000 documents, and when that limit was hit we used MergeSegments to move the content from the RAMDirectory to the FSDirectory. We actually had to modify the merge method, since it always seemed to run an optimize on the index after the merge. I have the code if you want it.

If you use a RAMDirectory plus an FSDirectory, you can use two IndexSearchers with one MultiSearcher on top. The IndexSearcher over the small RAMDirectory can be reopened quite often.

/
Regards
M
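
The two-tier layout discussed in the last two posts (a small in-memory tier that absorbs new documents, flushed into a large main tier once a limit is reached, with searches covering both) can be sketched as follows. This is a toy model under stated assumptions: TwoTierIndex and its methods are hypothetical names, plain lists stand in for the RAMDirectory and FSDirectory indexes, and a substring match stands in for a real query.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy two-tier index: a small buffer in front of a large main tier. */
class TwoTierIndex {
    private final List<String> mainTier = new ArrayList<>(); // stands in for the on-disk index
    private final List<String> ramTier = new ArrayList<>();  // stands in for the in-memory index
    private final int ramLimit;
    private int flushes = 0;

    TwoTierIndex(int ramLimit) { this.ramLimit = ramLimit; }

    /** New documents always go to the small tier first. */
    synchronized void add(String doc) {
        ramTier.add(doc);
        if (ramTier.size() >= ramLimit) flush();
    }

    /** Move everything from the small tier into the main tier, then purge it. */
    private void flush() {
        mainTier.addAll(ramTier);
        ramTier.clear();
        flushes++;
    }

    /** Search both tiers, like one combined searcher on top of two. */
    synchronized List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : mainTier) if (doc.contains(term)) hits.add(doc);
        for (String doc : ramTier) if (doc.contains(term)) hits.add(doc);
        return hits;
    }

    synchronized int ramSize() { return ramTier.size(); }
    synchronized int mainSize() { return mainTier.size(); }
    synchronized int flushCount() { return flushes; }

    public static void main(String[] args) {
        TwoTierIndex index = new TwoTierIndex(2);
        index.add("news one");
        index.add("news two");   // limit reached: both documents move to the main tier
        index.add("news three"); // stays in the small tier until the next flush
        System.out.println(index.search("news").size()); // prints 3
    }
}
```

The point of the split is that only the small tier has to be refreshed often, while the expensive reopen of the large tier happens only after a flush.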


Re: SV: SV: Integrating dynamic data into Lucene search/ranking

Tobias Lohr
Thanks for the hint. If possible, I would like to take a look at the code; the approach is interesting.

What would you say to this approach that I've come up with:

- Have an additional, much smaller index that holds only the dynamic data and is refreshed every N seconds with incremental index updates.
- Documents in the additional index carry the same semantic "id" field as the documents in the content index, to model the relation between them.
- A search runs against the index containing the searchable content, but the sorting/ranking is done with a SortComparatorSource, which "extracts" the dynamic information and calculates the score for the documents of the content index.

What do you say?
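
The idea in the list above, sorting content hits by dynamic values that are looked up through a shared id, can be sketched without any Lucene code. This is a minimal sketch under stated assumptions: AvailabilitySorter, updateStock and sortByAvailability are hypothetical names, a map plays the role of the small dynamic index, and the hit list is assumed to already contain the "id" values of the matching content documents. It does not use the SortComparatorSource interface itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Orders search hits by dynamic data kept outside the content index. */
class AvailabilitySorter {
    // id -> units in stock; updated independently of the content index
    private final Map<String, Integer> stockById = new ConcurrentHashMap<>();

    /** Called whenever the dynamic value for a document changes. */
    void updateStock(String id, int units) {
        stockById.put(id, units);
    }

    /** Returns the hit ids ordered by stock, highest first; unknown ids count as 0. */
    List<String> sortByAvailability(List<String> hitIds) {
        List<String> sorted = new ArrayList<>(hitIds);
        sorted.sort(Comparator
            .comparingInt((String id) -> stockById.getOrDefault(id, 0))
            .reversed());
        return sorted;
    }

    public static void main(String[] args) {
        AvailabilitySorter sorter = new AvailabilitySorter();
        sorter.updateStock("p1", 0);
        sorter.updateStock("p2", 5);
        sorter.updateStock("p3", 2);
        System.out.println(sorter.sortByAvailability(List.of("p1", "p2", "p3")));
        // prints [p2, p3, p1]
    }
}
```

Because the dynamic values live in their own structure, an availability change touches only one map entry and never forces the static document to be re-indexed. The cost moves to query time, where every hit needs one lookup before it can be ordered.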

SV: SV: SV: Integrating dynamic data into Lucene search/ranking

Marcus Falk
I think that would work, but I'm not 100% sure what you are trying to achieve.

Just a note:
Sorting results performs poorly if you have a large index. We ran into severe performance problems with just a couple of million articles, which led us to modify the ranking instead.

Code for merge without optimize:

    public virtual void AddIndexesWithoutMerge(Directory dir)
    {
        lock (this)
        {
            int start = segmentInfos.Count;

            SegmentInfos sis = new SegmentInfos(); // read infos from dir
            sis.Read(dir);
            for (int j = 0; j < sis.Count; j++)
            {
                segmentInfos.Add(sis.Info(j)); // add each info
            }

            // merge newly added segments in log(n) passes
            while (segmentInfos.Count > start + mergeFactor)
            {
                for (int base_Renamed = start + 1; base_Renamed < segmentInfos.Count; base_Renamed++)
                {
                    int end = System.Math.Min(segmentInfos.Count, base_Renamed + mergeFactor);
                    if (end - base_Renamed > 1)
                        MergeSegments(base_Renamed, end);
                }
            }

            MaybeMergeSegments();
        }
    }

(This goes in the IndexWriter class. It's C# code, as you can see, but it will probably look about the same in Java.)

There is an AddIndexes method in the original implementation of IndexWriter, but it causes a merge of the files on disk. My version causes the directory to be written to a new file (so if you pass in a RAMDirectory containing 10,000 docs, you get a new file with 10,000 docs on disk, which Lucene merges later when the mergeFactor is triggered on the FS IndexWriter).

/
Regards
Marcus

SV: SV: SV: Integrating dynamic data into Lucene search/ranking

Marcus Falk
I heard from a friend that this behavior (AddWithoutMerge) was added in Lucene 2.1 or 2.2.

/M

-----Ursprungligt meddelande-----
Från: Marcus Falk [mailto:[hidden email]]
Skickat: den 17 januari 2008 16:34
Till: [hidden email]
Ämne: SV: SV: SV: Integrating dynamic data into Lucene search/ranking

I think that would work. But I'm not 100% sure of what you are trying to achieve.

Just a notice:
Sorting on results has poor performance, if you have a large index, we ran into severe performance problems with just a coupe of million articles which lead us to modify the ranking instead.

Code for merge without optimize:

        public virtual void AddIndexesWithoutMerge(Directory dir)
        {
            lock (this)
            {
                // number of segments before the new ones are added
                int start = segmentInfos.Count;

                SegmentInfos sis = new SegmentInfos(); // read infos from dir
                sis.Read(dir);
                for (int j = 0; j < sis.Count; j++)
                {
                    segmentInfos.Add(sis.Info(j)); // add each info
                }

                // merge newly added segments in log(n) passes
                while (segmentInfos.Count > start + mergeFactor)
                {
                    for (int base_Renamed = start + 1; base_Renamed < segmentInfos.Count; base_Renamed++)
                    {
                        int end = System.Math.Min(segmentInfos.Count, base_Renamed + mergeFactor);
                        if (end - base_Renamed > 1)
                            MergeSegments(base_Renamed, end);
                    }
                }

                // no optimize here: only let the normal mergeFactor check run
                MaybeMergeSegments();
            }
        }

(This goes in the IndexWriter class. It is C# code, as you can see, but it will probably look about the same in Java.)

There is an AddIndexes method in the original IndexWriter implementation, but it causes a merge of the files on disk. My version writes the added directory to a new file instead: if you pass in a RAMDirectory containing 10,000 docs, you get a new file with 10,000 docs on disk, which Lucene merges later when the mergeFactor is triggered on the FS IndexWriter.
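
A rough way to picture the difference is the toy model below, written in plain Java rather than against Lucene. Each call appends the buffered documents as one new segment, and segments are only combined once their number reaches the merge factor. The class `SegmentedIndex`, its methods, and the "merge everything at the threshold" rule are assumptions made for illustration; Lucene's real merge logic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene API): a segment is just a list of document ids. */
final class SegmentedIndex {
    private final List<List<String>> segments = new ArrayList<>();
    private final int mergeFactor;

    SegmentedIndex(int mergeFactor) {
        this.mergeFactor = mergeFactor;
    }

    /** Appends the buffered docs as one new segment; no optimize pass. */
    void addWithoutOptimize(List<String> bufferedDocs) {
        if (bufferedDocs.isEmpty()) {
            return;
        }
        segments.add(new ArrayList<>(bufferedDocs));
        maybeMerge();
    }

    /** Simplified rule: collapse all segments once their count reaches mergeFactor. */
    private void maybeMerge() {
        if (segments.size() < mergeFactor) {
            return;
        }
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        segments.add(merged);
    }

    int segmentCount() {
        return segments.size();
    }

    int docCount() {
        int total = 0;
        for (List<String> segment : segments) {
            total += segment.size();
        }
        return total;
    }
}
```

With a merge factor of 3, the first two flushes each leave their own segment behind, and only the third one triggers a merge.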

/
Regards
Marcus

-----Original message-----
From: Tobias Lohr [mailto:[hidden email]]
Sent: 17 January 2008 15:15
To: [hidden email]
Subject: Re: SV: SV: Integrating dynamic data into Lucene search/ranking

Thanks for the hint. If it's possible, I would like to take a look at the code; the approach is interesting.

What would you say to this approach I have in mind:

- Have an additional, much smaller index where only the dynamic data resides, updated incrementally every N seconds.
- Documents in the additional index carry the same semantic "id" field as those in the content index, to model the relation between them.
- A search runs against the index containing the searchable content, but the sorting/ranking is done with a SortComparatorSource, which "extracts" the dynamic information and calculates the score for the documents of the content index.

What do you say?
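
The three points above could be sketched roughly as follows. This is plain Java with no Lucene classes: `DynamicValues` and `DynamicSort` are hypothetical names, a concurrent map stands in for the small dynamic index, and the hits are assumed to arrive as ids in relevance order. In Lucene the comparison would sit behind the SortComparatorSource mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical side store for fast-changing values, keyed by the shared "id" field. */
final class DynamicValues {
    private final Map<String, Integer> availability = new ConcurrentHashMap<>();

    /** Called by the incremental updater whenever stock changes. */
    void update(String id, int unitsInStock) {
        availability.put(id, unitsInStock);
    }

    /** Ids without an entry are treated as out of stock. */
    int lookup(String id) {
        return availability.getOrDefault(id, 0);
    }
}

final class DynamicSort {
    /** Orders hit ids by availability, highest first; ties keep their relevance order. */
    static List<String> byAvailability(List<String> hitIdsByRelevance, DynamicValues values) {
        // Snapshot the values first so concurrent updates cannot change sort keys mid-sort.
        Map<String, Integer> snapshot = new HashMap<>();
        for (String id : hitIdsByRelevance) {
            snapshot.put(id, values.lookup(id));
        }
        List<String> sorted = new ArrayList<>(hitIdsByRelevance);
        // List.sort is stable, so equal availability preserves the incoming order.
        sorted.sort(Comparator.comparingInt((String id) -> snapshot.get(id)).reversed());
        return sorted;
    }
}
```

The lookup happens once per hit, so the cost grows with the size of the result set, which is worth keeping in mind given the sorting performance problems reported earlier in this thread.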

-------- Original Message --------
> Date: Thu, 17 Jan 2008 14:26:53 +0100
> From: "Marcus Falk" <[hidden email]>
> To: [hidden email]
> Subject: SV: SV: Integrating dynamic data into Lucene search/ranking

> In our solution we used a RAMDir for the newest incoming articles and an
> FSDir for older ones. We set a limit on the RAMDir, e.g. 10,000
> documents. When that limit was hit, we used mergeSegments to move the
> content from the RAMDir to the FSDir. We actually had to modify the
> mergeSegments method, since it always seemed to optimize the index after
> the merge. I have the code if you want it.
>
> If you use RAMDir + FSDir, you can use two IndexSearchers with one
> MultiSearcher on top. The IndexSearcher that uses the small RAMDir can be
> reopened quite often.
>
> /
> Regards
> M
>
>
> -----Original message-----
> From: Andrzej Bialecki [mailto:[hidden email]]
> Sent: 17 January 2008 10:55
> To: [hidden email]
> Subject: Re: SV: Integrating dynamic data into Lucene search/ranking
>
> Tobias Lohr wrote:
> > I'm not really sure whether this approach can incorporate changes
> > every, let's say, 30 seconds!?
>
> The conventional wisdom is to use RAMDirectory in such scenarios. I.e.
> you commit frequent updates to a RAMDirectory and frequently reopen its
> Searcher (which should be fast). Periodically, merge the RAMDirectory
> index with your on-disk index - you need to open a new IndexSearcher in
> the background, warm it up with the latest N queries, and when it's
> ready you swap searchers, i.e. you close the old one, purge the
> RAMDirectory (since it was synced to the on-disk index), and start using
> the new IndexSearcher.
>
> And again, start accumulating new docs in the RAMDirectory, etc, etc ...
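
The swap described in the quoted message can be sketched like this, assuming a generic holder in which `S` stands in for IndexSearcher. `SearcherHolder`, `refresh`, and the opener, warm-up, and closer callbacks are hypothetical names chosen for the sketch, not Lucene API.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Holds the live searcher and swaps in a warmed replacement. */
final class SearcherHolder<S> {
    private final AtomicReference<S> current;
    private final Consumer<S> closer;

    SearcherHolder(S initial, Consumer<S> closer) {
        this.current = new AtomicReference<>(initial);
        this.closer = closer;
    }

    /** Query threads call this to get whichever searcher is live right now. */
    S get() {
        return current.get();
    }

    /** Opens a new searcher, warms it, publishes it, then closes the old one. */
    void refresh(Supplier<S> opener, Consumer<S> warmUp) {
        S fresh = opener.get();            // opened in the background
        warmUp.accept(fresh);              // e.g. replay the latest N queries
        S old = current.getAndSet(fresh);  // atomic swap: new searches see 'fresh'
        closer.accept(old);                // sketch only: ignores in-flight searches
    }
}
```

Real code would also have to wait for searches still running on the old searcher before closing it, and would purge the RAMDirectory after the swap.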
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
>
> > Date: Thu, 17 Jan 2008 05:35:13 +0100
> > From: "Marcus Falk" <[hidden email]>
> > To: [hidden email], [hidden email]
> > Subject: SV: Integrating dynamic data into Lucene search/ranking
>
> > We did this in our system, which indexes a constant flow of news articles,
> > by doing as Otis described (reopening the IndexSearcher).
> >
> > Every three minutes we create a new IndexSearcher in the background.
> > Once it has been created, we fire some warm-up queries against it,
> > and after that we swap the old searcher for the new one.
> > This works fine for us, and we have large indexes (several million
> > articles)...
> >  
> > /Regards
> > Marcus
> >  
> >  
>
>