Lucene Queries Over User-Editable Dynamic Categories of Documents

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Lucene Queries Over User-Editable Dynamic Categories of Documents

lucene user-2
Folks!

We are building a web-based multi-user system. Users of our system are able
to categorize items that they have found into groups of related documents.
We would like users to be able to search these document groups and rapidly
find matches. Each user might have ten of these categories and might have
perhaps a few hundred documents in each. These categories might be highly
dynamic, with users adding and deleting documents from these categories many
times a day. How might we use Lucene to perform searches limited to these
very dynamic and end-user editable categories? Any ideas for how we might do
this efficiently?

If all the data were in a SQL database, we could run a subquery that
returned the IDs of the items in categories and use that to limit the
results of the super query.

Currently we do not plan to maintain the information about the end-user's
categories in the Lucene index at all, or not in a big, main Lucene index
anyway.

What our the reasonable options for handling this? What are the performance
implications of various choices?

Thanks!
Reply | Threaded
Open this post in threaded view
|

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

mark harwood
Given the volatility in the set membership I'd be tempted to keep that grouping info in a database rather than doing the reader/writer-open/close dance in Lucene before you can see any updates. (I suspect this is the reason you've opted not to keep the info in Lucene).
You can pull a user's list of a hundred or so terms out of the database (typically primary keys) and add them as a TermsFilter to your Lucene queries.
I've found that using this approach can be pretty fast even with a large list of filter terms - it was a while ago so I can't quote stats, you'll need to try it for yourself.

Caching these filters may prove useful but if it's a big dataset Bitsets don't sound like a memory-efficient form of storing these lists as it sounds like they'll be sparsely populated.
You may be interested in the more memory-efficient options such as SortedVIntList here: http://issues.apache.org/jira/browse/LUCENE-584.
Without taking the whole of that patch on board you could have a caching strategy based on this pseudocode:

getFilter(Set primaryKeys, IndexReader reader)
{
   TermsFilter tf= new TermsFilter()
   for all primaryKeys:
       tf.addTerm(primaryKey)
  BitSet bits;
  SortedVIntList cached=lruCachedMap.get(tf);
  if(cached==null)
        bits=tf.bits(reader)
        lruCachedMap.put(tf, convertBitsToSortedVIntList(bits))
  else
        bits=convertSortedVIntListToBits(bits)
  return new Filter()
       {
                 BitSet bits(IndexReader reader)
                 {
                     return bits;
                 }
       };
}


On a bit of a lucene-dev tangent, I think the above code has the makings of an optimisation to CachingWrapperFilter - it could choose to cache SortedVIntLists or BitSets depending on the sparseness of the set and transparently handles any required conversions.



----- Original Message ----
From: lucene user <[hidden email]>
To: [hidden email]
Sent: Wednesday, 24 October, 2007 7:18:10 AM
Subject: Lucene Queries Over User-Editable Dynamic Categories of Documents

Folks!

We are building a web-based multi-user system. Users of our system are
 able
to categorize items that they have found into groups of related
 documents.
We would like users to be able to search these document groups and
 rapidly
find matches. Each user might have ten of these categories and might
 have
perhaps a few hundred documents in each. These categories might be
 highly
dynamic, with users adding and deleting documents from these categories
 many
times a day. How might we use Lucene to perform searches limited to
 these
very dynamic and end-user editable categories? Any ideas for how we
 might do
this efficiently?

If all the data were in a SQL database, we could run a subquery that
returned the IDs of the items in categories and use that to limit the
results of the super query.

Currently we do not plan to maintain the information about the
 end-user's
categories in the Lucene index at all, or not in a big, main Lucene
 index
anyway.

What our the reasonable options for handling this? What are the
 performance
implications of various choices?

Thanks!





      ___________________________________________________________
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

lucene user-2
Thanks very much. How large can my end-user's catigories grow before this
implementation you have outlined will start to bog down? If my users had
thousands of items categorized, would you still recommend using a term
filter in this way? Tens of thousands? What is a realistic max? Is there
another idea that works for even larger numbers? Frankly, we don't yet
understand how our users will use the system in the long run. When you have
done stuff like this, how large have the term filters grown?

Would it EVER make sense to maintain the end user's catigories in some sort
of Lucene data structure? If so, what data structure?

Would it EVER be wise to keep the end-user catigories in Lucene? If so, when
and how?

What are other realistic options for implementing user categorization of
documents?

When solr talks about "faceted searching," this isn't what they mean, is it?
They say "Faceted Searching based on unique field values and explicit
queries" and I'm looking to find what they mean and not getting clear.

Thanks!

On 10/24/07, mark harwood <[hidden email]> wrote:

>
> Given the volatility in the set membership I'd be tempted to keep that
> grouping info in a database rather than doing the reader/writer-open/close
> dance in Lucene before you can see any updates. (I suspect this is the
> reason you've opted not to keep the info in Lucene).
> You can pull a user's list of a hundred or so terms out of the database
> (typically primary keys) and add them as a TermsFilter to your Lucene
> queries.
> I've found that using this approach can be pretty fast even with a large
> list of filter terms - it was a while ago so I can't quote stats, you'll
> need to try it for yourself.
>
> Caching these filters may prove useful but if it's a big dataset Bitsets
> don't sound like a memory-efficient form of storing these lists as it sounds
> like they'll be sparsely populated.
> You may be interested in the more memory-efficient options such as
> SortedVIntList here: http://issues.apache.org/jira/browse/LUCENE-584.
> Without taking the whole of that patch on board you could have a caching
> strategy based on this pseudocode:
>
> getFilter(Set primaryKeys, IndexReader reader)
> {
>    TermsFilter tf= new TermsFilter()
>    for all primaryKeys:
>        tf.addTerm(primaryKey)
>   BitSet bits;
>   SortedVIntList cached=lruCachedMap.get(tf);
>   if(cached==null)
>         bits=tf.bits(reader)
>         lruCachedMap.put(tf, convertBitsToSortedVIntList(bits))
>   else
>         bits=convertSortedVIntListToBits(bits)
>   return new Filter()
>        {
>                  BitSet bits(IndexReader reader)
>                  {
>                      return bits;
>                  }
>        };
> }
>
>
> On a bit of a lucene-dev tangent, I think the above code has the makings
> of an optimisation to CachingWrapperFilter - it could choose to cache
> SortedVIntLists or BitSets depending on the sparseness of the set and
> transparently handles any required conversions.
>
>
>
> ----- Original Message ----
> From: lucene user <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, 24 October, 2007 7:18:10 AM
> Subject: Lucene Queries Over User-Editable Dynamic Categories of Documents
>
> Folks!
>
> We are building a web-based multi-user system. Users of our system are
> able
> to categorize items that they have found into groups of related
> documents.
> We would like users to be able to search these document groups and
> rapidly
> find matches. Each user might have ten of these categories and might
> have
> perhaps a few hundred documents in each. These categories might be
> highly
> dynamic, with users adding and deleting documents from these categories
> many
> times a day. How might we use Lucene to perform searches limited to
> these
> very dynamic and end-user editable categories? Any ideas for how we
> might do
> this efficiently?
>
> If all the data were in a SQL database, we could run a subquery that
> returned the IDs of the items in categories and use that to limit the
> results of the super query.
>
> Currently we do not plan to maintain the information about the
> end-user's
> categories in the Lucene index at all, or not in a big, main Lucene
> index
> anyway.
>
> What our the reasonable options for handling this? What are the
> performance
> implications of various choices?
>
> Thanks!
>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

mark harwood
In reply to this post by lucene user-2
If you were talking about a reasonably stable dataset e.g. products in a catalogue that would be manageable in Lucene because the volume of updates is comparatively low (one set of categorisations maintained by the site owner e.g. something like Cnet where Solr/faceted search came from). If you are talking about lots of users with their own categories/tags (I'm thinking of del.ici.ous/Flickr model) then the update volumes are likely to be much higher and therefore could be a problem when maintaining a single Lucene index because of the need to constantly close/reopen readers/writers.

Managing multiple per-user indexes may be an option but I've never tried that - that would require more disk, RAM and management.

As for the original suggestion: I have used thousands of primary-key terms in a terms filter before now (i.e. terms with only one doc) and was surprised at the speed. I can't recall exact stats so try it yourself.





----- Original Message ----
From: lucene user <[hidden email]>
To: [hidden email]
Sent: Wednesday, 24 October, 2007 1:12:32 PM
Subject: Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

Thanks very much. How large can my end-user's catigories grow before
 this
implementation you have outlined will start to bog down? If my users
 had
thousands of items categorized, would you still recommend using a term
filter in this way? Tens of thousands? What is a realistic max? Is
 there
another idea that works for even larger numbers? Frankly, we don't yet
understand how our users will use the system in the long run. When you
 have
done stuff like this, how large have the term filters grown?

Would it EVER make sense to maintain the end user's catigories in some
 sort
of Lucene data structure? If so, what data structure?

Would it EVER be wise to keep the end-user catigories in Lucene? If so,
 when
and how?

What are other realistic options for implementing user categorization
 of
documents?

When solr talks about "faceted searching," this isn't what they mean,
 is it?
They say "Faceted Searching based on unique field values and explicit
queries" and I'm looking to find what they mean and not getting clear.

Thanks!

On 10/24/07, mark harwood <[hidden email]> wrote:
>
> Given the volatility in the set membership I'd be tempted to keep
 that
> grouping info in a database rather than doing the
 reader/writer-open/close
> dance in Lucene before you can see any updates. (I suspect this is
 the
> reason you've opted not to keep the info in Lucene).
> You can pull a user's list of a hundred or so terms out of the
 database
> (typically primary keys) and add them as a TermsFilter to your Lucene
> queries.
> I've found that using this approach can be pretty fast even with a
 large
> list of filter terms - it was a while ago so I can't quote stats,
 you'll
> need to try it for yourself.
>
> Caching these filters may prove useful but if it's a big dataset
 Bitsets
> don't sound like a memory-efficient form of storing these lists as it
 sounds
> like they'll be sparsely populated.
> You may be interested in the more memory-efficient options such as
> SortedVIntList here: http://issues.apache.org/jira/browse/LUCENE-584.
> Without taking the whole of that patch on board you could have a
 caching

> strategy based on this pseudocode:
>
> getFilter(Set primaryKeys, IndexReader reader)
> {
>    TermsFilter tf= new TermsFilter()
>    for all primaryKeys:
>        tf.addTerm(primaryKey)
>   BitSet bits;
>   SortedVIntList cached=lruCachedMap.get(tf);
>   if(cached==null)
>         bits=tf.bits(reader)
>         lruCachedMap.put(tf, convertBitsToSortedVIntList(bits))
>   else
>         bits=convertSortedVIntListToBits(bits)
>   return new Filter()
>        {
>                  BitSet bits(IndexReader reader)
>                  {
>                      return bits;
>                  }
>        };
> }
>
>
> On a bit of a lucene-dev tangent, I think the above code has the
 makings

> of an optimisation to CachingWrapperFilter - it could choose to cache
> SortedVIntLists or BitSets depending on the sparseness of the set and
> transparently handles any required conversions.
>
>
>
> ----- Original Message ----
> From: lucene user <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, 24 October, 2007 7:18:10 AM
> Subject: Lucene Queries Over User-Editable Dynamic Categories of
 Documents
>
> Folks!
>
> We are building a web-based multi-user system. Users of our system
 are

> able
> to categorize items that they have found into groups of related
> documents.
> We would like users to be able to search these document groups and
> rapidly
> find matches. Each user might have ten of these categories and might
> have
> perhaps a few hundred documents in each. These categories might be
> highly
> dynamic, with users adding and deleting documents from these
 categories

> many
> times a day. How might we use Lucene to perform searches limited to
> these
> very dynamic and end-user editable categories? Any ideas for how we
> might do
> this efficiently?
>
> If all the data were in a SQL database, we could run a subquery that
> returned the IDs of the items in categories and use that to limit the
> results of the super query.
>
> Currently we do not plan to maintain the information about the
> end-user's
> categories in the Lucene index at all, or not in a big, main Lucene
> index
> anyway.
>
> What our the reasonable options for handling this? What are the
> performance
> implications of various choices?
>
> Thanks!
>
>
>
>





      ___________________________________________________________
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

lucene user-2
Thanks for all your help!

We are using Lucene 2.1.0 and TermsFilter seems to be new in Lucene 2.2.0.
I have not been able to find SortedVIntList in the javadocs at all.

Because both SortedVIntList and a regular BitSet are based on Lucene
Document Numbers, which are not permanent, It seems we will need to
generate these objects fresh at least once per session. Any comments,
about that? Do I have that correct?

Our application includes the following filter implementation that we use for
a
slightly different end user category problem. We could easily use it for our
current problem as well.

Is TermsFilter sufficiently better (faster, more compact, more correct,
etc.) to make upgrading
very important?

protected static class FieldFilter extends Filter {
  protected Collection values ; protected String field ;
  public FieldFilter(Collection v, String f) { super() ; values=v ; field=f
;}
  public BitSet bits(IndexReader r) {
    BitSet allow= new BitSet() ; String val="" ;
    try {
      TermEnum e=r.terms() ; TermDocs d=r.termDocs() ;
      boolean more= e.skipTo( new Term(field, "")) ;
      while (more) {
        if ( ! e.term().field().equals(field) ) break ;
        val=e.term().text() ;
        if (values.contains(val)) { d.seek(e) ; while (d.next()) allow.set(
d.doc()) ;}
        more=e.next() ;
      }
      return allow ;
    } catch (Exception x) {
      throw MessageToUser.error(x, "Error for Field="+field+" val="+val) ;
    }
  } // bits()
} // FieldFilter{}


On 10/24/07, mark harwood <[hidden email]> wrote:

>
> If you were talking about a reasonably stable dataset e.g. products in a
> catalogue that would be manageable in Lucene because the volume of updates
> is comparatively low (one set of categorisations maintained by the site
> owner e.g. something like Cnet where Solr/faceted search came from). If
> you are talking about lots of users with their own categories/tags (I'm
> thinking of del.ici.ous/Flickr model) then the update volumes are likely
> to be much higher and therefore could be a problem when maintaining a single
> Lucene index because of the need to constantly close/reopen readers/writers.
>
> Managing multiple per-user indexes may be an option but I've never tried
> that - that would require more disk, RAM and management.
>
> As for the original suggestion: I have used thousands of primary-key terms
> in a terms filter before now (i.e. terms with only one doc) and was
> surprised at the speed. I can't recall exact stats so try it yourself.
>
>
>
>
>
> ----- Original Message ----
> From: lucene user <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, 24 October, 2007 1:12:32 PM
> Subject: Re: Lucene Queries Over User-Editable Dynamic Categories of
> Documents
>
> Thanks very much. How large can my end-user's catigories grow before
> this
> implementation you have outlined will start to bog down? If my users
> had
> thousands of items categorized, would you still recommend using a term
> filter in this way? Tens of thousands? What is a realistic max? Is
> there
> another idea that works for even larger numbers? Frankly, we don't yet
> understand how our users will use the system in the long run. When you
> have
> done stuff like this, how large have the term filters grown?
>
> Would it EVER make sense to maintain the end user's catigories in some
> sort
> of Lucene data structure? If so, what data structure?
>
> Would it EVER be wise to keep the end-user catigories in Lucene? If so,
> when
> and how?
>
> What are other realistic options for implementing user categorization
> of
> documents?
>
> When solr talks about "faceted searching," this isn't what they mean,
> is it?
> They say "Faceted Searching based on unique field values and explicit
> queries" and I'm looking to find what they mean and not getting clear.
>
> Thanks!
>
> On 10/24/07, mark harwood <[hidden email]> wrote:
> >
> > Given the volatility in the set membership I'd be tempted to keep
> that
> > grouping info in a database rather than doing the
> reader/writer-open/close
> > dance in Lucene before you can see any updates. (I suspect this is
> the
> > reason you've opted not to keep the info in Lucene).
> > You can pull a user's list of a hundred or so terms out of the
> database
> > (typically primary keys) and add them as a TermsFilter to your Lucene
> > queries.
> > I've found that using this approach can be pretty fast even with a
> large
> > list of filter terms - it was a while ago so I can't quote stats,
> you'll
> > need to try it for yourself.
> >
> > Caching these filters may prove useful but if it's a big dataset
> Bitsets
> > don't sound like a memory-efficient form of storing these lists as it
> sounds
> > like they'll be sparsely populated.
> > You may be interested in the more memory-efficient options such as
> > SortedVIntList here: http://issues.apache.org/jira/browse/LUCENE-584.
> > Without taking the whole of that patch on board you could have a
> caching
> > strategy based on this pseudocode:
> >
> > getFilter(Set primaryKeys, IndexReader reader)
> > {
> >    TermsFilter tf= new TermsFilter()
> >    for all primaryKeys:
> >        tf.addTerm(primaryKey)
> >   BitSet bits;
> >   SortedVIntList cached=lruCachedMap.get(tf);
> >   if(cached==null)
> >         bits=tf.bits(reader)
> >         lruCachedMap.put(tf, convertBitsToSortedVIntList(bits))
> >   else
> >         bits=convertSortedVIntListToBits(bits)
> >   return new Filter()
> >        {
> >                  BitSet bits(IndexReader reader)
> >                  {
> >                      return bits;
> >                  }
> >        };
> > }
> >
> >
> > On a bit of a lucene-dev tangent, I think the above code has the
> makings
> > of an optimisation to CachingWrapperFilter - it could choose to cache
> > SortedVIntLists or BitSets depending on the sparseness of the set and
> > transparently handles any required conversions.
> >
> >
> >
> > ----- Original Message ----
> > From: lucene user <[hidden email]>
> > To: [hidden email]
> > Sent: Wednesday, 24 October, 2007 7:18:10 AM
> > Subject: Lucene Queries Over User-Editable Dynamic Categories of
> Documents
> >
> > Folks!
> >
> > We are building a web-based multi-user system. Users of our system
> are
> > able
> > to categorize items that they have found into groups of related
> > documents.
> > We would like users to be able to search these document groups and
> > rapidly
> > find matches. Each user might have ten of these categories and might
> > have
> > perhaps a few hundred documents in each. These categories might be
> > highly
> > dynamic, with users adding and deleting documents from these
> categories
> > many
> > times a day. How might we use Lucene to perform searches limited to
> > these
> > very dynamic and end-user editable categories? Any ideas for how we
> > might do
> > this efficiently?
> >
> > If all the data were in a SQL database, we could run a subquery that
> > returned the IDs of the items in categories and use that to limit the
> > results of the super query.
> >
> > Currently we do not plan to maintain the information about the
> > end-user's
> > categories in the Lucene index at all, or not in a big, main Lucene
> > index
> > anyway.
> >
> > What our the reasonable options for handling this? What are the
> > performance
> > implications of various choices?
> >
> > Thanks!
> >
> >
> >
> >
>
>
>
>
>
>       ___________________________________________________________
> Yahoo! Answers - Got a question? Someone out there knows the answer. Try
> it
> now.
> http://uk.answers.yahoo.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Highlighter and href fields

Cool The Breezer
Is there anyway I stop highlighting text if it is a href/url etc...? The problem occurs when the field content is a URL which contains the query e.g. my search is for .net and fields has value http://jkjsd.net. After applying highlighter, it becomes http://jkjsd<b>.net</b>, which is a wrong URL. Can I filter it out?
   
  - BR

 __________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 
Reply | Threaded
Open this post in threaded view
|

Re: Highlighter and href fields

Mark Miller-3
Nothing in the Highlighter per seh that will help you there. I see two
options off the top of my head:

1. break the text before feeding it to the highlighter and feed all but
the URL parts, and then stitch back together -- much as you might do if
highlighting an XML doc. Ugly though.

2. Use an Analyzer that recognizes URL's. That way you wont get partial
URL matches like .net. Each URL would be a full token, and would require
a search matching the entire URL to match. Even if you already indexed
with a different Analzyer, you could use this special Analyzer just for
highlighting...it would act exactly the same as your indexing Analyzer,
but would parse any URL as a single token. Of course, if you are using
TokenSource, this is not an option.

- Mark

Cool Coder wrote:
> Is there anyway I stop highlighting text if it is a href/url etc...? The problem occurs when the field content is a URL which contains the query e.g. my search is for .net and fields has value http://jkjsd.net. After applying highlighter, it becomes http://jkjsd<b>.net</b>, which is a wrong URL. Can I filter it out?
>    
>   - BR
>
>  __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com 
>  

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

mark harwood
In reply to this post by lucene user-2
lucene user wrote:
> Thanks for all your help!
>
> We are using Lucene 2.1.0 and TermsFilter seems to be new in Lucene 2.2.0.
> I have not been able to find SortedVIntList in the javadocs at all.
>  

No, SortedVIntList is in the patch I provided a link to earlier.


> Because both SortedVIntList and a regular BitSet are based on Lucene
> Document Numbers, which are not permanent, It seems we will need to
> generate these objects fresh at least once per session. Any comments,
> about that? Do I have that correct?
>  
Yes. Most caches tend to be held in WeakHashMap keyed on IndexReader so
that when a new reader takes over old caches are automatically garbage
collected.


> Our application includes the following filter implementation that we use for
> a
> slightly different end user category problem. We could easily use it for our
> current problem as well.
>
> Is TermsFilter sufficiently better (faster, more compact, more correct,
> etc.) to make upgrading
> very important?
>  
TermsFilter is in "contrib" and is stand-alone so should work with most
Lucene versions.
Your implementation looks to scan the whole termEnum whereas TermsFilter
looks up only the selected terms using reader.termDocs(term).
Benchmarking will tell you which is faster. I'd be interested to know
the results.

Cheers
Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Highlighter and href fields

Cool The Breezer
In reply to this post by Mark Miller-3
Ok I understand now that I have a big work ahead of me.
  >2. Use an Analyzer that recognizes URL's. That way you wont get partial
  BTW, Do you know any analyzer that can recognize URLs.
   
  - BR

Mark Miller <[hidden email]> wrote:
  Nothing in the Highlighter per seh that will help you there. I see two
options off the top of my head:

1. break the text before feeding it to the highlighter and feed all but
the URL parts, and then stitch back together -- much as you might do if
highlighting an XML doc. Ugly though.

2. Use an Analyzer that recognizes URL's. That way you wont get partial
URL matches like .net. Each URL would be a full token, and would require
a search matching the entire URL to match. Even if you already indexed
with a different Analzyer, you could use this special Analyzer just for
highlighting...it would act exactly the same as your indexing Analyzer,
but would parse any URL as a single token. Of course, if you are using
TokenSource, this is not an option.

- Mark

Cool Coder wrote:
> Is there anyway I stop highlighting text if it is a href/url etc...? The problem occurs when the field content is a URL which contains the query e.g. my search is for .net and fields has value http://jkjsd.net. After applying highlighter, it becomes http://jkjsd.net, which is a wrong URL. Can I filter it out?
>
> - BR
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]



 __________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 
Reply | Threaded
Open this post in threaded view
|

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

lucene user-2
In reply to this post by mark harwood
What do you means by 'Most caches are held in WeakHashMap...' is this
caching provided by CachingWrappingFilter or do we have to implement it
ourselves? I assume the former.

We will share results of our testing as soon as we have any - not sure how
generalizable they will be.

You have been super helpful! Very grateful! Thanks!

On 10/24/07, markharw00d <[hidden email]> wrote:

>
> lucene user wrote:
> > Thanks for all your help!
> >
> > We are using Lucene 2.1.0 and TermsFilter seems to be new in Lucene
> 2.2.0.
> > I have not been able to find SortedVIntList in the javadocs at all.
> >
>
> No, SortedVIntList is in the patch I provided a link to earlier.
>
>
> > Because both SortedVIntList and a regular BitSet are based on Lucene
> > Document Numbers, which are not permanent, It seems we will need to
> > generate these objects fresh at least once per session. Any comments,
> > about that? Do I have that correct?
> >
> Yes. Most caches tend to be held in WeakHashMap keyed on IndexReader so
> that when a new reader takes over old caches are automatically garbage
> collected.
>
>
> > Our application includes the following filter implementation that we use
> for
> > a
> > slightly different end user category problem. We could easily use it for
> our
> > current problem as well.
> >
> > Is TermsFilter sufficiently better (faster, more compact, more correct,
> > etc.) to make upgrading
> > very important?
> >
> TermsFilter is in "contrib" and is stand-alone so should work with most
> Lucene versions.
> Your implementation looks to scan the whole termEnum whereas TermsFilter
> looks up only the selected terms using reader.termDocs(term).
> Benchmarking will tell you which is faster. I'd be interested to know
> the results.
>
> Cheers
> Mark
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

mark harwood
In reply to this post by lucene user-2
There are 2 considerations when caching filter results:

1) What was the criteria used to produce the results?
2) What version of the index were these results taken from?

CachingWrapperFilter takes care of 2) by using WeakHashMap keyed on IndexReader.

The filter you pass to CachingWrapperFilter must play it's part in taking care of 1) by implementing hashcode/equals.

You can then maintain an LRU hashmap of CachingWrapperFilters to keep only the most popular filters.

Incidentally, contrib's XMLQueryParser handles all this for you with a simple "CachedFilter" tag:

<FilteredQuery>
    <Query>
        <UserQuery>"Brittany Spears"</UserQuery>
    </Query>    
    <Filter>
        <CachedFilter>
            <RangeFilter fieldName="date" lowerTerm="19970409" upperTerm="19970412"/>
        </CachedFilter>
    </Filter>    
</FilteredQuery>


Cheers
Mark

----- Original Message ----
From: lucene user <[hidden email]>
To: [hidden email]
Sent: Thursday, 25 October, 2007 10:36:14 AM
Subject: Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

What do you means by 'Most caches are held in WeakHashMap...' is this
caching provided by CachingWrappingFilter or do we have to implement it
ourselves? I assume the former.

We will share results of our testing as soon as we have any - not sure
 how
generalizable they will be.

You have been super helpful! Very grateful! Thanks!

On 10/24/07, markharw00d <[hidden email]> wrote:

>
> lucene user wrote:
> > Thanks for all your help!
> >
> > We are using Lucene 2.1.0 and TermsFilter seems to be new in Lucene
> 2.2.0.
> > I have not been able to find SortedVIntList in the javadocs at all.
> >
>
> No, SortedVIntList is in the patch I provided a link to earlier.
>
>
> > Because both SortedVIntList and a regular BitSet are based on
 Lucene
> > Document Numbers, which are not permanent, It seems we will need to
> > generate these objects fresh at least once per session. Any
 comments,
> > about that? Do I have that correct?
> >
> Yes. Most caches tend to be held in WeakHashMap keyed on IndexReader
 so
> that when a new reader takes over old caches are automatically
 garbage
> collected.
>
>
> > Our application includes the following filter implementation that
 we use
> for
> > a
> > slightly different end user category problem. We could easily use
 it for
> our
> > current problem as well.
> >
> > Is TermsFilter sufficiently better (faster, more compact, more
 correct,
> > etc.) to make upgrading
> > very important?
> >
> TermsFilter is in "contrib" and is stand-alone so should work with
 most
> Lucene versions.
> Your implementation looks to scan the whole termEnum whereas
 TermsFilter

> looks up only the selected terms using reader.termDocs(term).
> Benchmarking will tell you which is faster. I'd be interested to know
> the results.
>
> Cheers
> Mark
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>





      ___________________________________________________________
Want ideas for reducing your carbon footprint? Visit Yahoo! For Good  http://uk.promotions.yahoo.com/forgood/environment.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]