number of hits per document


number of hits per document

John Byrne-3
Hi,

Is there an easy way to find out the number of hits per document for a
Query, rather than just for a Term?

Let's say, for example, I have a document like this:

"here is cats near dogs and here is cats a long long way from dogs"

and I use a SpanNearQuery to find "cats" near "dogs" with a slop of 1 -
I need to be able to find out that there was 1 hit, even though there
are 2 occurrences of "cats" and 2 of "dogs" - there is still only 1 hit
that matches my Query.

Is this possible?

Thanks,
JB.






Re: number of hits per document

Grant Ingersoll-2
A SpanQuery is just a Query, so the traditional way of Querying still  
applies, i.e. you get back a list of matching documents.  Beyond that,  
if you just want to operate on the spans, just keep track of how often  
the doc() method changes.
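
A minimal sketch of that approach, assuming the Lucene 2.x Spans API (SpanQuery.getSpans(IndexReader) with next() and doc()); the index path and field name are placeholders:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class SpanHitCounter {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/path/to/index");   // placeholder path
    SpanQuery query = new SpanNearQuery(
        new SpanQuery[] {
            new SpanTermQuery(new Term("contents", "cats")),
            new SpanTermQuery(new Term("contents", "dogs")) },
        1, false);                                              // slop 1, unordered

    // Each enumerated span is one hit; group the hits by document id.
    Map<Integer, Integer> hitsPerDoc = new HashMap<Integer, Integer>();
    Spans spans = query.getSpans(reader);
    while (spans.next()) {
      Integer doc = Integer.valueOf(spans.doc());
      Integer count = hitsPerDoc.get(doc);
      hitsPerDoc.put(doc, Integer.valueOf(count == null ? 1 : count.intValue() + 1));
    }
    reader.close();
    System.out.println(hitsPerDoc);   // the example text above would yield one hit in its document
  }
}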

HTH,
Grant
On Jun 9, 2008, at 11:21 AM, John Byrne wrote:

> Hi,
>
> Is there an easy way to find out the number of hits per document for  
> a Query, rather than just for a Term?
>
> Let's say, for example, I have a document like this:
>
> "here is cats near dogs and here is cats a long long way from dogs"
>
> and I use a SpanNearQuery to find "cats" near "dogs" with a slop of  
> 1 - I need to be able to find out that there was 1 hit, even though  
> there are 2 occurrences of "cats" and 2 of "dogs" - there is still  
> only 1 hit that matches my Query.
>
> Is this possible?
>
> Thanks,
> JB.
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ










The performance of lucene searching (web environment) test

lutan

I have recently done some tests on Lucene, and I do not know whether the test results are normal.

Hardware environment: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz, 4GB RAM
Software environment: CentOS 4.6 + Sun JDK 1.5 + JBoss + Lucene 2.3.2 + je-analysis (a Chinese analyzer)
There are 10 million+ documents, totalling about 3GB.

Test steps:
1. Run a single searcher.jsp in JBoss (tuned, using 1GB RAM).
2. Use LoadRunner to test:
   simulating 10 concurrent user requests:  TPS (transactions per second) about 10
   simulating 50 concurrent user requests:  TPS about 8
   simulating 100 concurrent user requests: TPS about 2

The JSP is very simple, and the index is on the local file system:

<body>
  <center>
    <form action="lucene.jsp" method="post" name="form1">
      <input type="text" value="" name="keyword2"/>
      <input type="submit" value="searcher" onclick="SUB()"/>
      <input type="reset" value="exit"/>
    </form>
  </center>
  <hr>
  <%
    if (request.getParameter("keyword2") != null && !"".equals(request.getParameter("keyword2"))) {
      String dir = "/usr/local/index";
      String key = "name";
      String word = new String(request.getParameter("keyword2"), "utf-8");
      Searcher searcher = new IndexSearcher(FSDirectory.getDirectory(dir, false));
      Analyzer myAnalyzer = new jeasy.analysis.MMAnalyzer();
      QueryParser queryParser = new QueryParser(key, myAnalyzer);
      Query query = queryParser.parse(word);
      Hits hits = null;
      long startTime = System.nanoTime();
      hits = searcher.search(query);
      long estimatedTime = System.nanoTime() - startTime;
      BigDecimal bb = new BigDecimal(estimatedTime);
      BigDecimal ee = new BigDecimal(1000000000);
      System.out.println("Key word: " + word + " Hits:" + hits.length()
          + "  Cost time: " + bb.divide(ee) + "/s");
      searcher.close();
    }
    out.print("ABC");
  %>
</body>
---------------------------- search.jsp ----------------------------

I also tried to use a singleton IndexSearcher, but it does not seem to help:

public IndexSearcher getIndexSearcher() throws IOException {
  if (this.indexSearcher == null) {
    return new IndexSearcher(FSDirectory.getDirectory(folder, false));
  } else {
    IndexReader ir = indexSearcher.getIndexReader();
    if (!ir.isCurrent()) {
      this.indexSearcher.close();
      this.indexSearcher = new IndexSearcher(FSDirectory.getDirectory(folder, false));
      ir = indexSearcher.getIndexReader();
      if (ir.hasDeletions()) {
        if (this.indexWriter != null) {
          this.indexWriter.optimize();
        }
      }
    }
    return this.indexSearcher;
  }
}
------------------ GetsingletonIndexsearcher.java ------------------

Using the same code in a standalone application, a search takes about 0.5s on average.
So how do I improve the searching performance in a concurrent environment?
Can the hardware environment (Intel(R) Xeon(R) CPU 5110 @ 1.60GHz, 4GB RAM) give me 50+ TPS?
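
For what it's worth, one possible reason the singleton did not appear to help: in the GetsingletonIndexsearcher.java snippet above, the branch taken when indexSearcher is null returns a new IndexSearcher without ever assigning it to this.indexSearcher, so every call still opens a fresh searcher. A minimal corrected sketch, assuming the same Lucene 2.3-era API and the same field names (folder, indexSearcher); the synchronization is an assumption about how the class is called:

public synchronized IndexSearcher getIndexSearcher() throws IOException {
  if (this.indexSearcher == null) {
    // Assign the new searcher to the field so later calls actually reuse it.
    this.indexSearcher = new IndexSearcher(FSDirectory.getDirectory(folder, false));
  } else if (!this.indexSearcher.getIndexReader().isCurrent()) {
    // The index changed on disk: close the stale searcher and open a new one.
    this.indexSearcher.close();
    this.indexSearcher = new IndexSearcher(FSDirectory.getDirectory(folder, false));
  }
  return this.indexSearcher;
}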

Re: number of hits per document

John Byrne-3
In reply to this post by Grant Ingersoll-2
Hi,
 
I could do it that way, but counting the spans per document is specific
to SpanQuerys. I would still have to count hits for TermQuerys
separately. I was looking for a generic way to count hits for any
instance of Query within a document.

To put it another way, the ability to find the Term frequency in a
single document seems incomplete, since a Term does not equate to a hit.
For instance, sticking with my previous example, if my document
contained a thousand occurrences of "cats" but only one of them is near
"dogs", then the frequency of the Term "cats" in that document is
irrelevant to me.

In general, my queries will consist of a BooleanQuery containing any
number of sub-queries of any implementation - what I actually need to
know is how many hits there are for that BooleanQuery query in each
document. Maybe I will expand the BooleanQuery into all its sub-queries
recursively, and then handle them by type - counting spans per document
for SpanQuerys and using the Term frequency for TermQuerys. I was just
hoping there would be an existing (and fast) way to do this.
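
A rough sketch of that kind of per-type counting, assuming the Lucene 2.x APIs (Spans for SpanQuery, TermDocs.freq() for TermQuery); the BooleanQuery recursion is deliberately naive and other query types are simply skipped:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

public class PerDocHitCounter {

  /** Counts "hits" of query in a single document, dispatching on the query type. */
  public static int countHits(Query query, IndexReader reader, int docId) throws Exception {
    if (query instanceof SpanQuery) {
      // One hit per enumerated span that falls in the requested document.
      int hits = 0;
      Spans spans = ((SpanQuery) query).getSpans(reader);
      while (spans.next()) {
        if (spans.doc() == docId) {
          hits++;
        }
      }
      return hits;
    } else if (query instanceof TermQuery) {
      // For a plain term, the term frequency in the document is the hit count.
      TermDocs td = reader.termDocs(((TermQuery) query).getTerm());
      try {
        return (td.skipTo(docId) && td.doc() == docId) ? td.freq() : 0;
      } finally {
        td.close();
      }
    } else if (query instanceof BooleanQuery) {
      // Naive recursion: sum the hits of all sub-queries, ignoring occur flags.
      int hits = 0;
      BooleanClause[] clauses = ((BooleanQuery) query).getClauses();
      for (int i = 0; i < clauses.length; i++) {
        hits += countHits(clauses[i].getQuery(), reader, docId);
      }
      return hits;
    }
    return 0; // other query types are not handled in this sketch
  }
}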

Thanks,
John

Grant Ingersoll wrote:

> A SpanQuery is just a Query, so the traditional way of Querying still
> applies, i.e. you get back a list of matching documents.  Beyond that,
> if you just want to operate on the spans, just keep track of how often
> the doc() method changes.
>
> HTH,
> Grant
> On Jun 9, 2008, at 11:21 AM, John Byrne wrote:
>
>> Hi,
>>
>> Is there an easy way to find out the number of hits per document for
>> a Query, rather than just for a Term?
>>
>> Let's say, for example, I have a document like this:
>>
>> "here is cats near dogs and here is cats a long long way from dogs"
>>
>> and I use a SpanNearQuery to find "cats" near "dogs" with a slop of 1
>> - I need to be able to find out that there was 1 hit, even though
>> there are 2 occurrences of "cats" and 2 of "dogs" - there is still
>> only 1 hit that matches my Query.
>>
>> Is this possible?
>>
>> Thanks,
>> JB.
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>




Re: The performance of lucene searching (web environment) test

Toke Eskildsen
In reply to this post by lutan
On Tue, 2008-06-10 at 21:11 +0800, lutan wrote:
> [A lot of text with code and no newlines, making it very hard to read]

In your test you're reusing the searcher. For each search your program
performs, you will see faster response times, until the searcher is
fully warmed.

If, in your production system, you re-open your searcher every time, you do
not have the benefit of a warmed searcher.

So yes, a singleton searcher helps, as opposed to opening a searcher for
every search. Try making a test where the only thing you do is open a
searcher 100 times and you will see that it takes a non-trivial amount
of time.
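
A rough sketch of such a test, assuming the Lucene 2.3-era API and the index path from the earlier post:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class OpenCost {
  public static void main(String[] args) throws Exception {
    long start = System.nanoTime();
    for (int i = 0; i < 100; i++) {
      // Open and immediately close a searcher; only the opening cost is measured.
      IndexSearcher s = new IndexSearcher(FSDirectory.getDirectory("/usr/local/index", false));
      s.close();
    }
    System.out.println("100 opens took " + (System.nanoTime() - start) / 1000000 + " ms");
  }
}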





RE: The performance of lucene searching (web environment) test

lutan

Thanks for the reply!

In my test case, I ran LoadRunner for just 5 minutes, and the response rate grew
slowly; the TPS (transactions per second) finally seemed to stop at 10.
I will run a longer test again.
In addition, does Lucene have a bottleneck related to the number of documents or the index size?
 

Re: number of hits per document

Spencer Tickner
In reply to this post by John Byrne-3
Hi John,

Sorry I don't have a solution for you but I'm trying to do the same
thing. I would love to hear from you if you have any success with
this.

Cheers,

Spencer
[hidden email]

On Tue, Jun 10, 2008 at 6:28 AM, John Byrne <[hidden email]> wrote:

> Hi,
>
> I could do it that way, but counting the spans per document is specific to
> SpanQuerys. I would still have to count hits for TermQuerys separately. I
> was looking for a generic way to count hits for any instance of Query within
> a document.
>
> To put it another way, the ability to find the Term frequency in a single
> document seems incomplete, since a Term does not equate to a hit. For
> instance, sticking with my previous example, if my document contained a
> thousand occurrences of "cats" but only one of them is near "dogs", then the
> frequency of the Term "cats" in that document is irrelevant to me.
>
> In general, my queries will consist of a BooleanQuery containing any number
> of sub-queries of any implementation - what I actually need to know is how
> many hits there are for that BooleanQuery query in each document. Maybe I
> will expand the BooleanQuery into all its sub-queries recursively, and then
> handle them by type - counting spans per document for SpanQuerys and using
> the Term frequency for TermQuerys. I was just hoping there would be an
> existing (and fast)  way to do this.
>
> Thanks,
> John
>
> Grant Ingersoll wrote:
>>
>> A SpanQuery is just a Query, so the traditional way of Querying still
>> applies, i.e. you get back a list of matching documents.  Beyond that, if
>> you just want to operate on the spans, just keep track of how often the
>> doc() method changes.
>>
>> HTH,
>> Grant
>> On Jun 9, 2008, at 11:21 AM, John Byrne wrote:
>>
>>> Hi,
>>>
>>> Is there an easy way to find out the number of hits per document for a
>>> Query, rather than just for a Term?
>>>
>>> Let's say, for example, I have a document like this:
>>>
>>> "here is cats near dogs and here is cats a long long way from dogs"
>>>
>>> and I use a SpanNearQuery to find "cats" near "dogs" with a slop of 1 - I
>>> need to be able to find out that there was 1 hit, even though there are 2
>>> occurrences of "cats" and 2 of "dogs" - there is still only 1 hit that
>>> matches my Query.
>>>
>>> Is this possible?
>>>
>>> Thanks,
>>> JB.
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>



RE: The performance of lucene searching (web environment) test

Toke Eskildsen
In reply to this post by lutan
On Wed, 2008-06-11 at 00:17 +0800, lutan wrote:
> In my test case, I ran LoadRunner for just 5 minutes, and the response rate
> grew slowly; the TPS (transactions per second) finally seemed to stop at 10.

That's without reusing the searcher, right? In that case the increased
rate must be attributed to the disk cache being warmed. Please try and
test again with the searcher being reused.

> In addition, does Lucene have a bottleneck related to the number of documents or the index size?

Not to my knowledge. Search time and RAM consumption go up, of course,
but I'm not aware of any special point where things start to go bad at
an increased rate.

> Can the hardware environment (Intel(R) Xeon(R) CPU 5110 @ 1.60GHz, 4GB RAM)
> give me 50+ TPS?

With an index of 10M/3GB? It doesn't sound unrealistic after warm-up.
How much RAM is available for disk-cache when the machine is running?




RE: The performance of lucene searching (web environment) test

lutan

Thanks for your reply!

Yes, I have tested again in the same environment, but using a singleton
IndexSearcher, and the performance has increased: 100 concurrent user requests
with different keywords now get 60 TPS (2 TPS before).
The bottleneck now seems to be the CPU, whose usage approaches 100%, while RAM
(about 70MB used on average) and disk usage are normal.
 
Can I assume that as long as I have more RAM, I will get good performance?

I don't understand the meaning of "disk-cache" very clearly. Could you please
explain it again? Thanks a lot! (Doesn't it cache in RAM?)
Does warm-up == cache?
How many docs will Lucene cache by default, and can I control the cache size?

I am new to Lucene, so maybe my questions do not look professional;
please forgive me.

RE: The performance of lucene searching (web environment) test

Toke Eskildsen
On Wed, 2008-06-11 at 18:56 +0800, lutan wrote:
> Yes, I have tested again in the same environment, but using a singleton
> IndexSearcher, and the performance has increased: 100 concurrent user
> requests with different keywords now get 60 TPS (2 TPS before).
> The bottleneck now seems to be the CPU, whose usage approaches 100%, while
> RAM (about 70MB used on average) and disk usage are normal.

It sounds like you have found the solution to your immediate problem.
Great.

> Can I assume that as long as I have more RAM, I will get good
> performance?

Depends on your index-size (in bytes). When your index grows, less and
less of it can fit in the disk-cache and more time will be required for
proper warm-up. But the change will happen gradually, so you'll only be
surprised if you suddenly increase your index size by a factor of two or more.

> I don't understand the meaning of "disk-cache" very clearly. Could you please
> explain it again? Thanks a lot! (Doesn't it cache in RAM?)
> Does warm-up == cache?

There are (at least) two important memory mechanisms to consider.
My apologies if some of this is basic knowledge to you:

1) Disk-cache.
In general, the free RAM on your Linux-system is used for disk-cache.
With an index-size of 3GB and (just a guess) 1 GB free RAM, the
operating system is able to cache 1/3 or less of your index. If you open
the same index several times in a row, the disk-cache will be warmed to
the relevant parts of your index, so that you're not even hitting the
disk after a while. At least not for opening. This is the effect you
observed with your non-singleton based test, where the speed increased
slowly up to a not-so-high level.

2) Lucene internal structures.
I don't know much about this, so I hope somebody will correct me if I
make mistakes: Lucene has some internal structures that are initialized
when searches are performed. Depending on setup, this initialization can
be quite heavy (custom search for example). Performing warm-up, such as
searching with previously logged queries, will initialize these
structures before the real queries are received. This is the effect you
observed with your singleton searcher.

1 & 2 can be seen in combination, as the initialization of the internal
structures in Lucene requires a fair amount of seeks in the index data.
If there's nothing in the disk-cache and a conventional platter-based
harddisk is used, it takes some time. If the disk-cache is warmed from
previous use or a solid state drive setup is used, it is much faster.
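
A minimal sketch of that kind of warm-up, assuming the Lucene 2.3-era API; the query-log file name, the field name and the analyzer are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class Warmup {
  // Replay previously logged queries against a freshly opened searcher
  // before it starts serving real traffic; the results are thrown away.
  public static void warm(IndexSearcher searcher) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();          // placeholder analyzer
    QueryParser parser = new QueryParser("name", analyzer);
    BufferedReader in = new BufferedReader(new FileReader("logged-queries.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      Query q = parser.parse(line);
      searcher.search(q);                                // warm-up only, results discarded
    }
    in.close();
  }
}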

> How many docs will Lucene cache by default, and can I control the
> cache size?

I don't know. Maybe someone else will chime in?




RE: The performance of lucene searching (web environment) test

lutan

I am very grateful to Toke Eskildsen for his attention to my questions.
 
 
The performance increase came from following your suggestion.
Today I did another test, using RemoteSearchable (code like
the example supplied in "Lucene in Action").

Application steps:
1. A customer sends a keyword request to the web server (JBoss: 192.168.0.1).
2. JBoss calls the RMI server (192.168.0.2), which holds the index files.
The rest of the test environment is the same as before.

The result:
LoadRunner with 300 concurrent users (I found that each user holds one TCP/IP
connection from the web server to the RMI server) got 180+ TPS,
with a web response time of about 2 seconds on average.
Both the web server and the RMI server showed normal usage:
CPU around 50%, RAM not full.

The performance almost tripled!
It's amazing to me :)
I expected the RMI approach to have lower performance
(because of the expensive network usage),
but the result really puzzled me :(
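
For reference, a rough sketch of the server side of such a RemoteSearchable setup, assuming the Lucene 2.3-era RMI classes and wiring similar to the "Lucene in Action" example; the registry port, binding name and index path are placeholders:

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.FSDirectory;

public class RmiSearchServer {
  public static void main(String[] args) throws Exception {
    // Runs on the RMI server (192.168.0.2): expose a local searcher over RMI.
    LocateRegistry.createRegistry(1099);
    Searchable local = new IndexSearcher(FSDirectory.getDirectory("/usr/local/index", false));
    Naming.rebind("//192.168.0.2/LuceneSearcher", new RemoteSearchable(local));
    System.out.println("RemoteSearchable bound; waiting for clients...");
  }
}

// On the web server (JBoss, 192.168.0.1) the remote searcher would be looked up with
// Naming.lookup("//192.168.0.2/LuceneSearcher") and used as an ordinary Searchable.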
 
 
 
 
 
I have understood it from your reply; thanks a lot.
 

Re: number of hits per document

hossman
In reply to this post by Spencer Tickner

: > I could do it that way, but counting the spans per document is specific to
: > SpanQuerys. I would still have to count hits for TermQuerys separately. I
: > was looking for a generic way to count hits for any instance of Query within
: > a document.

the original Query, Weight, and Scorer APIs provided no mechanism for
doing this -- this is one of the reasons why the SpanQuery API exists: to
model the types of queries that (can) collect this type of information as
they score documents.  Non-Span based queries typically have no idea about
this type of information (which typically allows them to be faster).



-Hoss

