highlight problem

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

highlight problem

Jin,  Ying


Hi, All,

I use lucene highlight package to generate KWIC for our application.

The part of the code is as following:
=====================================================
        if(text != null ){
          TokenStream tokenStream = analyzer.tokenStream("contents",
              new StringReader(text));
          // Get 3 best fragments and seperate with a "..."
          result = highlighter.getBestFragments(tokenStream,
              text, 3, "...");
        }

=====================================================

However, I got a very strange problem. One of my search results from our
records contains far too much of the text of the paper. It doesn't happen
for the same paper when I changed the search criteria.

Thanks very mcuh for your help,
Ying

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: highlight problem

mark harwood
>> One of my
>> search results from our
>> records contains far too much of the text

This is a problem I haven't seen before. I suspect it
may have something to do with your choice of analyzer.
Your paper will only ever be fragmented on "token gap"
boundaries ie points in the token stream where the
current token position does not overlap with the
previous token's . If the section in your text which
contains the search terms contains a long stream of
overlapping tokens you will end up with a long
highlighted selection.

Which analyzer are using out of interest?


Cheers
Mark



--- [hidden email] wrote:
>
>
> Hi, All,
>
> I use lucene highlight package to generate KWIC for
> our application.
>
> The part of the code is as following:
>
=====================================================

>         if(text != null ){
>           TokenStream tokenStream =
> analyzer.tokenStream("contents",
>               new StringReader(text));
>           // Get 3 best fragments and seperate with
> a "..."
>           result =
> highlighter.getBestFragments(tokenStream,
>               text, 3, "...");
>         }
>
>
=====================================================

>
> However, I got a very strange problem. One of my
> search results from our
> records contains far too much of the text of the
> paper. It doesn't happen
> for the same paper when I changed the search
> criteria.
>
> Thanks very mcuh for your help,
> Ying
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> [hidden email]
> For additional commands, e-mail:
> [hidden email]
>
>


       
       
               
___________________________________________________________
Yahoo! Messenger - want a free and easy way to contact your friends online? http://uk.messenger.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: highlight problem

Jin,  Ying
Quoting mark harwood <[hidden email]>:

Hi, Mark,

I just used StandardAnalyzer and code is as following:

=====================================================
      Analyzer analyzer = new StandardAnalyzer();
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line = in.readLine();

      if (line.length() == -1)
        return;

      Query query = QueryParser.parse(line, "contents", analyzer);
      Hits hits = searcher.search(query);
      Highlighter highlighter =new Highlighter(new QueryScorer(query));
========================================================

Part of the fulltext result is:
========================================================
 sectors in South Africa have failed to incorporate many co-management
principles, such as joint... of the RNP. However, poor representation of
community interests on the joint management committee... of a joint management
committee and the improvement of infrastructure in the area. The key differences
......
========================================================
Of course the result is far more than this.

The paper itself looks normal to me. It makes me confused why this is happening.

Thanks for your help,
Ying







> >> One of my
> >> search results from our
> >> records contains far too much of the text
>
> This is a problem I haven't seen before. I suspect it
> may have something to do with your choice of analyzer.
> Your paper will only ever be fragmented on "token gap"
> boundaries ie points in the token stream where the
> current token position does not overlap with the
> previous token's . If the section in your text which
> contains the search terms contains a long stream of
> overlapping tokens you will end up with a long
> highlighted selection.
>
> Which analyzer are using out of interest?
>
>
> Cheers
> Mark
>
>
>
> --- [hidden email] wrote:
> >
> >
> > Hi, All,
> >
> > I use lucene highlight package to generate KWIC for
> > our application.
> >
> > The part of the code is as following:
> >
> =====================================================
> >         if(text != null ){
> >           TokenStream tokenStream =
> > analyzer.tokenStream("contents",
> >               new StringReader(text));
> >           // Get 3 best fragments and seperate with
> > a "..."
> >           result =
> > highlighter.getBestFragments(tokenStream,
> >               text, 3, "...");
> >         }
> >
> >
> =====================================================
> >
> > However, I got a very strange problem. One of my
> > search results from our
> > records contains far too much of the text of the
> > paper. It doesn't happen
> > for the same paper when I changed the search
> > criteria.
> >
> > Thanks very mcuh for your help,
> > Ying
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > [hidden email]
> > For additional commands, e-mail:
> > [hidden email]
> >
> >
>
>
>
>
>
> ___________________________________________________________
> Yahoo! Messenger - want a free and easy way to contact your friends online?
> http://uk.messenger.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: highlight problem

mark harwood
In reply to this post by Jin, Ying
As much as you have shown of the example output is
roughly what I would expect - using the default
simpleFragmenter you get roughly 100 character sized
fragments and you have shown 3 fragments sized 97, 100
and 105 chars long separated by "...".

> Of course the result is far more than this.

So are you saying you had even more fragments in the
"getBestFragments" return string separated by your
choice of "..." separator?

I also notice the text contains no markup - have you
removed that from the example?

Cheers
Mark






               
___________________________________________________________
How much free photo storage do you get? Store your holiday
snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: highlight problem

Jin,  Ying
Hi, Mark,

Sorry for the confusing. The complete code is here:
===============================================
Analyzer analyzer = new StandardAnalyzer();
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line = in.readLine();

      if (line.length() == -1)
        return;

      Query query = QueryParser.parse(line, "contents", analyzer);
      Hits hits = searcher.search(query);
      Highlighter highlighter =new Highlighter(new QueryScorer(query));

      String ids = "";
      for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        String text = doc.get("contents");
        String result = "";

        if(text != null ){
          TokenStream tokenStream = analyzer.tokenStream("contents",
              new StringReader(text));
          // Get 3 best fragments and seperate with a "..."                    
                                                 
          result = highlighter.getBestFragments(tokenStream,
              text, 3, "...");
        }

===============================================
Quoting mark harwood <[hidden email]>:

> As much as you have shown of the example output is
> roughly what I would expect - using the default
> simpleFragmenter you get roughly 100 character sized
> fragments and you have shown 3 fragments sized 97, 100
> and 105 chars long separated by "...".
>
> > Of course the result is far more than this.
>
> So are you saying you had even more fragments in the
> "getBestFragments" return string separated by your
> choice of "..." separator?
>
> I also notice the text contains no markup - have you
> removed that from the example?
>
> Cheers
> Mark
>
>
>
>
>
>
>
> ___________________________________________________________
> How much free photo storage do you get? Store your holiday
> snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: highlight problem

Erik Hatcher
Ying - please properly subscribe to the java-user list - I've  
moderated in each of your mails thus far.

     Erik


On May 5, 2005, at 2:18 PM, [hidden email] wrote:

> Hi, Mark,
>
> Sorry for the confusing. The complete code is here:
> ===============================================
> Analyzer analyzer = new StandardAnalyzer();
>       BufferedReader in = new BufferedReader(new InputStreamReader
> (System.in));
>       String line = in.readLine();
>
>       if (line.length() == -1)
>         return;
>
>       Query query = QueryParser.parse(line, "contents", analyzer);
>       Hits hits = searcher.search(query);
>       Highlighter highlighter =new Highlighter(new QueryScorer
> (query));
>
>       String ids = "";
>       for (int i = 0; i < hits.length(); i++) {
>         Document doc = hits.doc(i);
>         String text = doc.get("contents");
>         String result = "";
>
>         if(text != null ){
>           TokenStream tokenStream = analyzer.tokenStream("contents",
>               new StringReader(text));
>           // Get 3 best fragments and seperate with a "..."
>
>           result = highlighter.getBestFragments(tokenStream,
>               text, 3, "...");
>         }
>
> ===============================================
> Quoting mark harwood <[hidden email]>:
>
>
>> As much as you have shown of the example output is
>> roughly what I would expect - using the default
>> simpleFragmenter you get roughly 100 character sized
>> fragments and you have shown 3 fragments sized 97, 100
>> and 105 chars long separated by "...".
>>
>>
>>> Of course the result is far more than this.
>>>
>>
>> So are you saying you had even more fragments in the
>> "getBestFragments" return string separated by your
>> choice of "..." separator?
>>
>> I also notice the text contains no markup - have you
>> removed that from the example?
>>
>> Cheers
>> Mark
>>
>>
>>
>>
>>
>>
>>
>> ___________________________________________________________
>> How much free photo storage do you get? Store your holiday
>> snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: highlight problem

Jin,  Ying
In reply to this post by mark harwood
Hi, Mark,

Please ignore my previous posting. I sent it by accident.

Sorry for the confusing. The complete code is here:
===================================================
      Analyzer analyzer = new StandardAnalyzer();
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line = in.readLine();

      if (line.length() == -1)
        return;

      Query query = QueryParser.parse(line, "contents", analyzer);
      Hits hits = searcher.search(query);
      Highlighter highlighter =new Highlighter(new QueryScorer(query));

      String ids = "";
      for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        String text = doc.get("contents");
        String result = "";

        if(text != null ){
          TokenStream tokenStream = analyzer.tokenStream("contents",
              new StringReader(text));
          // Get 3 best fragments and seperate with a "..."                    
                                                 
          result = highlighter.getBestFragments(tokenStream,
              text, 3, "...");
        }
===================================================

The result is as following. I only get three fragments, but the 3rd one contains
too much text and I don't copy all of them here. Just for example:
===================================================
 market, TunisiaThe uncertainty about resource availability plays a large role
in water <B>management</B>..., and at all different scales of water
<B>management</B>. First, at the irrigation scheme level... <B>management</B> of
the collective water resource is of ma jor imp ortance for these irrigation
schemes. The aim is here to understand the global risk taken at the scheme
level, by comparing the actual cropping choices and allocation rules to other
cropping possibilities and potential rules. Given an allocation rule, does a
collective cropping choice correspond to a high or a low risk taking ? In our
opinion, it would have been very difficult to estimate the individual risk
tolerances and, with them, the collective risk tolerance because the farms
diversity appeared to be very high and there are complex informal risk sharing
networks. That is why we did not try to determine the optimal risk taking The
study remains at the scale of the scheme: we consider, for example in El Melalsa
scheme, three global fields put under crops, one with wheat, another with
pepper-bean and the last one with melon. A model calculates the different yields
taking into account the water stresses impact, and hence the global profit, for
a given rainfall and a given allocation rule (see appendix).195.0.1A big risk
taking on El Melalsa irrigation schemeThe El Melalsa irrigation scheme is one of
the two studied schemes. It irrigates 160 ha owned by 54 farmers, using a pump
that delivers a flow of 24 l/s. The cropping pattern is based on the following
rotation: wheat then pepper associated with bean and finally melon. There is no
control over the area put under crops. Besides, there is a water turn but when a
farmer gets his turn, he can irrigate as long as he wants: the water turn length
lasts up to three weeks during spring, when both beans and melon have to be
irrigated. These two facts lead also to an over-cropping situation that can be
viewed as a Nash equilibrium (Faysse, &#65533; 2001). Two allocation rules were
simulated on this scheme (a brief description of the model is given in
appendix): first, the actual one, with a free individual irrigation length, and
then an ex ante allocation rule where the water turn length is fixed and each
crop receives during its water turn in proportion to its needs. By definition, a
"safe" cropping choice is one for which crops are sufficiently irrigated during
median rainfall year and with an irrigation set up when the soil reservoir
becomes lower than 0.85 than the Usable Water Reserve. The calculated cropping
choice is made of 40 ha of wheat, 20 of pepper-bean and 15 of melon. For a risky
cropping choice, we used the actual one on year 98-99, i.e. 60 ha of wheat, 21
of pepper-bean and 27 of melon. The choice between two cropping repartitions and
two allocation rules leads to four scenarios. On figure 9, the left table
presents the scenario description while the right table gives the total
valorization made on the whole scheme for different rainfalls: quinquennial
drought year, quinquennial wet year, the median year and the year 98-99 (which
stands between a median year and a quinquennial dry one). Moreover, the study
made on the field for the 98-99 campaign showed that the total profit made had
not exceeded 50 000 DT7 .water turnyear 98-99 quinquennial dry median
quinquennial wet50000 60000 70000 80000 90000cropping
patternscenariopepperbeanmelonwheatProduction on the schem e (Tunisan
Dinar)unlimited irrigation 0 (r&#65533;el) risky 61 21 27 period unlimited irrigation 1
period risky 2 61 21 27 daily repartition100000 110000120000
<Much more text>
===================================================
My search criteria is "joint management", but I get three markups on management.

Is there anything wrong with the text?

Thanks,
Ying


> As much as you have shown of the example output is
> roughly what I would expect - using the default
> simpleFragmenter you get roughly 100 character sized
> fragments and you have shown 3 fragments sized 97, 100
> and 105 chars long separated by "...".
>
> > Of course the result is far more than this.
>
> So are you saying you had even more fragments in the
> "getBestFragments" return string separated by your
> choice of "..." separator?
>
> I also notice the text contains no markup - have you
> removed that from the example?
>
> Cheers
> Mark
>
>
>
>
>
>
>
> ___________________________________________________________
> How much free photo storage do you get? Store your holiday
> snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: highlight problem

mark harwood
In reply to this post by Jin, Ying
All looks OK with that bit.
At the risk of sounding obvious - are you mistaking the results from
multiple documents as the highlighted content from just one document?
eg the end of your "for" loop looks like this:
       System.out.print(result);
   }
and you assume the printed display is from just one document?

That would explain why a different search produces different results -
you are seeing more documents.



[hidden email] wrote:

>Hi, Mark,
>
>Sorry for the confusing. The complete code is here:
>===============================================
>Analyzer analyzer = new StandardAnalyzer();
>      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
>      String line = in.readLine();
>
>      if (line.length() == -1)
>        return;
>
>      Query query = QueryParser.parse(line, "contents", analyzer);
>      Hits hits = searcher.search(query);
>      Highlighter highlighter =new Highlighter(new QueryScorer(query));
>
>      String ids = "";
>      for (int i = 0; i < hits.length(); i++) {
>        Document doc = hits.doc(i);
>        String text = doc.get("contents");
>        String result = "";
>
>        if(text != null ){
>          TokenStream tokenStream = analyzer.tokenStream("contents",
>              new StringReader(text));
>          // Get 3 best fragments and seperate with a "..."                    
>                                                  
>          result = highlighter.getBestFragments(tokenStream,
>              text, 3, "...");
>        }
>
>===============================================
>Quoting mark harwood <[hidden email]>:
>
>  
>
>>As much as you have shown of the example output is
>>roughly what I would expect - using the default
>>simpleFragmenter you get roughly 100 character sized
>>fragments and you have shown 3 fragments sized 97, 100
>>and 105 chars long separated by "...".
>>
>>    
>>
>>>Of course the result is far more than this.
>>>      
>>>
>>So are you saying you had even more fragments in the
>>"getBestFragments" return string separated by your
>>choice of "..." separator?
>>
>>I also notice the text contains no markup - have you
>>removed that from the example?
>>
>>Cheers
>>Mark
>>
>>
>>
>>
>>
>>
>>
>>___________________________________________________________
>>How much free photo storage do you get? Store your holiday
>>snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: [hidden email]
>>For additional commands, e-mail: [hidden email]
>>
>>
>>    
>>
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
>  
>




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Highlight problem

mark harwood
In reply to this post by Jin, Ying
Thanks for pointing out this issue.
The bug was related to having a doc bigger than the maxNumDocsToAnalyze
setting.  In this situation, the last fragment created was always sized
from maxNumDocsToAnalyze position to the remainder of the doc (in your
case, quite large!)

I have fixed this in SVN and Junit tests are clean. If you want to patch
your version comment out these lines in the Highlighter.java code

// append text after end of last token
            if (lastEndOffset < text.length())
               
newText.append(encoder.encodeText(text.substring(lastEndOffset)));

I believe the above code was trying to retain any non-token text after
the last token eg appending ?! or . so I have removed it to be on the
safe side.

Cheers
Mark





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]