NO_NORMS and TOKENIZED?


Nadav Har'El
Hi,

When adding a field to a document, Field.Index gives me four options: NO,
NO_NORMS, TOKENIZED and UN_TOKENIZED.

NO_NORMS means, according to the documentation "index the field's value
without an Analyzer, and disable the storing of norms."

What can I do if I want to index the field's value *with* an Analyzer, but
still disable the storing of norms (because the field length should not be
considered in scoring)? Can't I do that? Was this intentional, or is this
an oversight and a fifth option should be added?

Thanks,
Nadav.

--
Nadav Har'El                        |      Tuesday, Jan 23 2007, 4 Shevat 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |echo '[q]sa[ln0=aln256%Pln256/snlbx]
http://nadav.harel.org.il           |sb3135071790101768542287578439snlbxq'|dc

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: NO_NORMS and TOKENIZED?

Yonik Seeley-2
On 1/23/07, Nadav Har'El <[hidden email]> wrote:

> Hi,
>
> When adding a field to a document, Field.Index gives me four options: NO,
> NO_NORMS, TOKENIZED and UN_TOKENIZED.
>
> NO_NORMS means, according to the documentation "index the field's value
> without an Analyzer, and disable the storing of norms."
>
> What can I do if I want to index the field's value *with* an Analyzer, but
> still disable the storing of norms (because the field length should not be
> considered in scoring)?

That works fine.

> Can't I do that? Was this intentional, or is this
> an oversight and a fifth option should be added?

Yes, that was intentional.
see http://issues.apache.org/jira/browse/LUCENE-448

I hadn't added a Field.Index option at all, and Doug suggested
NO_NORMS, probably because it's mostly harmless to new users who might
disable length normalization without realizing the implications.

For other fields, I prefer use of setOmitNorms()

-Yonik
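
For anyone following along, here is a minimal sketch of why omitting norms takes field length out of scoring. It assumes the default length normalization of 1/sqrt(number of terms in the field), and it reduces the score to the one factor that depends on the norm. NormSketch and its methods are made-up names for illustration, not Lucene classes. In Lucene itself, the approach described above is to construct the Field with Field.Index.TOKENIZED and then call setOmitNorms(true) on it before adding it to the document.

```java
// Illustrative only: NormSketch and its methods are hypothetical names, not Lucene classes.
final class NormSketch {

    // Assumed default length normalization: 1 / sqrt(number of terms in the field).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Contribution of one matching term, reduced to the part that depends on the norm.
    // With norms omitted the factor is a constant 1.0, so field length no longer matters.
    static double termScore(double tfIdf, int numTerms, boolean omitNorms) {
        double norm = omitNorms ? 1.0 : lengthNorm(numTerms);
        return tfIdf * norm;
    }

    public static void main(String[] args) {
        double tfIdf = 2.0; // same raw match weight in both documents
        System.out.println(termScore(tfIdf, 4, false));   // short field, norms stored
        System.out.println(termScore(tfIdf, 100, false)); // long field, norms stored
        System.out.println(termScore(tfIdf, 4, true));    // short field, norms omitted
        System.out.println(termScore(tfIdf, 100, true));  // long field, norms omitted
    }
}
```

With norms stored, the same match is worth less in the longer field; with norms omitted, both fields score the same.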


Re: NO_NORMS and TOKENIZED?

Nadav Har'El
On Tue, Jan 23, 2007, Yonik Seeley wrote about "Re: NO_NORMS and TOKENIZED?":

> >When adding a field to a document, Field.Index gives me four options: NO,
> >NO_NORMS, TOKENIZED and UN_TOKENIZED.
>..
> >What can I do if I want to index the field's value *with* an Analyzer, but
> >still disable the storing of norms (because the field length should not be
> >considered in scoring)?
>...
> I hadn't added a Field.Index option at all, and Doug suggested
> NO_NORMS, probably because it's mostly harmless to new users who might
> disable length normalization without realizing the implications.
>
> For other fields, I prefer use of setOmitNorms()

Thanks! I wasn't aware of this method. It's exactly what I needed.

I never thought of trying to modify the Field after construction...
Perhaps the NO_NORMS javadoc should refer to setOmitNorms()? (Or I should
learn to search the documentation better :-)).

--
Nadav Har'El                        |      Tuesday, Jan 23 2007, 4 Shevat 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |Today is the tomorrow you worried about
http://nadav.harel.org.il           |yesterday, and now you know why.


Re: NO_NORMS and TOKENIZED?

Otis Gospodnetic-2
In reply to this post by Nadav Har'El
Funny, I was looking to do the same thing the other day and gave up, thinking it wasn't possible, not being aware of setOmitNorms(). Yeah, a javadoc patch would be welcome.

Otis


Multiword Highlighting

Anne Conger
Hi,

I'm wondering what the best way is to do highlighting of multiword phrases.
For example, if a search is for "president kennedy", how can I make sure
that "president" is highlighted only when it is next to "kennedy", and
that "president" in "president clinton" is not?
I haven't figured out where in the process the phrases are being split into
separate words.
Would restructuring the query that is passed to the scorer help with this?
It's currently a set of boolean queries with each phrase as a separate
query.  Or should the exact phrases be set up as WeightedTerms?

Thanks!

Anne
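
One way to picture the requirement is as a match on consecutive token positions rather than on individual terms. The sketch below is plain Java and does not use the Lucene highlighter; it assumes whitespace tokenization and case-insensitive comparison, and the class name PhraseMarkup is invented for the example.

```java
import java.util.Locale;

// Illustrative only: not the contrib Highlighter, just the adjacency rule it would need.
final class PhraseMarkup {

    // Wraps every occurrence of the full phrase in <b>...</b>; a lone phrase word is left alone.
    static String highlight(String text, String phrase) {
        String[] tokens = text.trim().split("\\s+");
        String[] wanted = phrase.toLowerCase(Locale.ROOT).trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokens.length) {
            if (out.length() > 0) out.append(' ');
            if (matchesAt(tokens, i, wanted)) {
                out.append("<b>");
                for (int k = 0; k < wanted.length; k++) {
                    if (k > 0) out.append(' ');
                    out.append(tokens[i + k]);
                }
                out.append("</b>");
                i += wanted.length;
            } else {
                out.append(tokens[i]);
                i++;
            }
        }
        return out.toString();
    }

    // True when the phrase words occupy consecutive positions starting at 'start'.
    private static boolean matchesAt(String[] tokens, int start, String[] wanted) {
        if (start + wanted.length > tokens.length) return false;
        for (int k = 0; k < wanted.length; k++) {
            if (!tokens[start + k].toLowerCase(Locale.ROOT).equals(wanted[k])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(highlight("President Clinton met President Kennedy", "president kennedy"));
    }
}
```

Punctuation attached to a word (for example "Kennedy.") would defeat this naive tokenizer; a real implementation would work from analyzed tokens and their offsets.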



Re: Multiword Highlighting

mark harwood
This is a deficiency in the highlighter functionality that has been
discussed several times before. In short: it is not a trivial fix.

See here for background:

http://marc2.theaimsgroup.com/?l=lucene-user&m=114631181214303&w=1

http://www.gossamer-threads.com/lists/engine?do=post_view_printable;post=42014;list=lucene


Cheers,
Mark


Re: Multiword Highlighting

Mark Miller-3
Isn't it semi trivial if you are not interested in the fragments (I
swear it seems that most people are not)? Wasn't it you who suggested
turning the query into a SpanQuery, extracting the spans and then doing
the highlighting after a rewrite? This seems somewhat trivial, so what am
I missing? I have started a simple implementation of this, but stopped
short of combining the highlight spans. That seems like a nasty n^2 problem
that I don't know a good algorithm for: every new highlight has to
be compared against every previous highlight for overlap. I am sure
you're the man to ask about this. I plan on getting back into this soon.
Not trivial? Or do you just mean with the fragments? You seem to be
deeply interested in fragments, but a lot of people seem to just want to
highlight the source text.

Any words of wisdom would be sorely appreciated.

- Mark
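
On the overlap question, one standard approach (an assumption here, not something taken from the highlighter code) is to sort the spans by start offset and merge them in a single pass, which costs O(n log n) instead of comparing every pair. A minimal sketch, with spans modelled as {start, end} int pairs and an exclusive end; SpanMerger is a hypothetical name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: spans are {start, end} pairs with an exclusive end offset.
final class SpanMerger {

    static List<int[]> merge(List<int[]> spans) {
        int[][] sorted = spans.toArray(new int[0][]);
        // Sort by start offset so any overlapping span must follow the one it overlaps.
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[0], b[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] span : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] <= last[1]) {
                // Overlapping or touching: extend the previous highlight.
                last[1] = Math.max(last[1], span[1]);
            } else {
                // Copy so the caller's arrays are never modified.
                merged.add(new int[] {span[0], span[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> spans = Arrays.asList(new int[] {10, 20}, new int[] {0, 5}, new int[] {15, 30});
        for (int[] s : merge(spans)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```

After sorting, each new span only ever needs to be compared with the most recently merged one, which is what removes the pairwise comparison.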


Re: Multiword Highlighting

mark harwood
>> Isn't it semi trivial if you are not interested in the fragments (I
>> swear it seems that most people are not)?

I haven't conducted a survey but it's the typical web search engine
scenario - select only a small subset of the matching document content
for display in SERPs. I would expect that to be a pretty commonplace
requirement for which we should retain a solution.

Maybe a new highlighter with no attempt at summarising could more easily
address phrase support for small pieces of content. It will always be
hard to faithfully represent all possible query match logic -
especially if there are NOTs, ANDs and ORs mixed in with all the term
proximity logic e.g. NotNear. Some compromise is required. I did suggest
that spans may be a better basis for highlighting than terms and pointed
at some existing code to get you along this path - see here
http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2

There are also a couple of other Highlighter packages contributed
recently which I listed in my previous mail but I simply haven't had the
time to look at in detail so they may be useful. Anyone had any
experience of those?

>> every new highlight has to be compared against every previous
>> highlight for overlap

Yes, Analyzers that produce overlapping tokens are an added complication
when implementing highlighting logic. I think we have a reasonable JUnit
test containing several of the more exotic analyzer scenarios which you
could/should use for testing any other highlighter implementation.

Cheers,
Mark




Re: Multiword Highlighting

Mark Miller-3


markharw00d wrote:
> >> Isn't it semi trivial if you are not interested in the fragments (I
> >> swear it seems that most people are not)?
>
> I haven't conducted a survey but it's the typical web search engine
> scenario - select only a small subset of the matching document content
> for display in SERPs. I would expect that to be a pretty commonplace
> requirement for which we should retain a solution.
No doubt. I certainly am not suggesting you ditch fragments, and I have
no evidence that more people just want to highlight a doc. It's just the
impression I get from the mailing list that most people just
want to highlight the returned doc. I am sure plenty of people need
Google-style results too, but my experience with Lucene has not often
been in the area of web search engines. I bet a lot of users would
benefit from a highlighter that highlights actual hits and doesn't
summarize, though (both would be great). I wouldn't claim to be an
authority on any of this, though. Take my opinion for what it's worth:
very little.
>
> Maybe a new highlighter with no attempt at summarising could more
> easily address phrase support for small pieces of content. It will
> always be hard to faithfully represent all possible query match logic
> - especially if there are NOTs, ANDs and ORs mixed in with all the
> term proximity logic e.g. NotNear. Some compromise is required. I did
> suggest that spans may be a better basis for highlighting than terms
> and pointed at some existing code to get you along this path - see
> here http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2
I have some code that you wrote that seems to turn almost any query into
a series of spans. Perhaps it is not as robust as my limited testing
made it seem.
>
> There are also a couple of other Highlighter packages contributed
> recently which I listed in my previous mail but I simply haven't had
> the time to look at in detail so they may be useful. Anyone had any
> experience of those?
None of them seem to do full span highlighting, again based on my
limited investigation.
>
> >> every new highlight has to be compared against every previous
> >> highlight for overlap
> Yes, Analyzers that produce overlapping tokens are an added
> complication when implementing highlighting logic. I think we have a
> reasonable JUnit test containing several of the more exotic analyzer
> scenarios which you could/should use for testing any other highlighter
> implementation.
Thanks for the tip.

I appreciate your response, Mark. I will continue to look at your span
extractor. I thought that it alone was enough to do what I wanted, but
your comments seem to suggest that I may need more. I hope not <g>. If I
do manage something, I will be sure to post my results.


- Mark


Re: Multiword Highlighting

Mark Miller-3
In reply to this post by mark harwood

>
> Maybe a new highlighter with no attempt at summarising could more
> easily address phrase support for small pieces of content. It will
> always be hard to faithfully represent all possible query match logic
> - especially if there are NOTs, ANDs and ORs mixed in with all the
> term proximity logic e.g. NotNear. Some compromise is required. I did
> suggest that spans may be a better basis for highlighting than terms
> and pointed at some existing code to get you along this path - see
> here http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2
>
Had not explored the link yet. That is the code I found and am using. It
will not extract spans in all cases then? I tried some basic tests and
it seemed to work great...I implemented a simple highlighter that does
not deal with highlighting overlap. I assume now that a complicated
query will not properly be extracted then? I guess I have some testing
to do.

- Mark


Re: Multiword Highlighting

Otis Gospodnetic-2
In reply to this post by Anne Conger
For what it's worth Mark (Miller), there *is* a need for "just highlight the query terms without trying to get excerpts" functionality - something a la Google cache (different colours...mmm, nice).  I've had people ask me for this before, and I know I could use this functionality, too.  Please contrib to contrib/ if you end up working on this.

Otis
--
Simpy -- http://www.simpy.com/ -- Tag.  Search.  Share.


Re: Multiword Highlighting

mark harwood
>> For what it's worth Mark (Miller), there *is* a need for "just
>> highlight the query terms without trying to get excerpts" functionality
>> - something a la Google cache (different colours...mmm, nice).

FWIW, the existing highlighter doesn't *have* to fragment - just pass a
NullFragmenter to the highlighter.
Ideally we'd have one implementation that tackles phrase support and
preserves (optional) support for selecting fragments. I can see that to
achieve this the existing highlighter design would need to change.
Currently the highlighter identifies fragments first (typically using an
implementation which arbitrarily chops text after 'n' words) and then
selects which of these fragments have the highest density of
high-scoring query terms. This logic would need to change to:
1) Use QuerySpansExtractor to identify all the *spans* in the document
2) Use a sliding window to select fragments, taking care to select
fragments that wholly contain spans, rather than selecting only part of
a span.
3) Mark up the hits.
Clearly, for people uninterested in selecting fragments, step 2 can be
skipped.

Cheers
Mark
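
Steps 1 to 3 above can be sketched in plain Java. This is only a guess at how such a design might look, not code from the existing highlighter: the spans are taken as given (standing in for whatever a span extractor returns), the window is measured in characters, candidate windows start at each span start, and FragmentPicker is a made-up name.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: spans are {start, end} character offsets, end exclusive.
final class FragmentPicker {

    // Step 2: slide a fixed-size window over the candidate positions (each span start)
    // and keep the window that wholly contains the most spans.
    static int[] bestWindow(List<int[]> spans, int textLength, int windowSize) {
        int bestStart = 0;
        int bestCount = -1;
        for (int[] candidate : spans) {
            int start = candidate[0];
            int end = Math.min(textLength, start + windowSize);
            int count = 0;
            for (int[] span : spans) {
                // Count a span only if it fits entirely inside the window.
                if (span[0] >= start && span[1] <= end) count++;
            }
            if (count > bestCount) {
                bestCount = count;
                bestStart = start;
            }
        }
        return new int[] {bestStart, Math.min(textLength, bestStart + windowSize)};
    }

    // Step 3: mark up the spans that fall wholly inside the chosen window.
    // Assumes the spans are sorted by start offset and do not overlap.
    static String markUp(String text, List<int[]> spans, int[] window) {
        StringBuilder out = new StringBuilder();
        int pos = window[0];
        for (int[] span : spans) {
            if (span[0] < window[0] || span[1] > window[1]) continue;
            out.append(text, pos, span[0]).append("<b>").append(text, span[0], span[1]).append("</b>");
            pos = span[1];
        }
        return out.append(text, pos, window[1]).toString();
    }

    public static void main(String[] args) {
        String text = "aaa HIT bbb ccc ddd HIT eee HIT fff";
        List<int[]> spans = Arrays.asList(new int[] {4, 7}, new int[] {20, 23}, new int[] {28, 31});
        int[] window = bestWindow(spans, text.length(), 12);
        System.out.println(markUp(text, spans, window));
    }
}
```

To skip step 2 and highlight the whole text, markUp can be called with a window of {0, text.length()}.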


       
       
               

Re: Multiword Highlighting

Mark Miller-3
I do use the NullFragmenter now. I have no interest in the fragments at
the moment, just in showing hits on the source document. It would be
great if I could just show the real hits though. The span approach seems
to work fine for me. I have even tested the highlighting using my
sentence and paragraph proximity search queries from my query parser.
These use a modified NotSpan (I call it WithinSpan) within an unbound
NearSpan. I did a few queries that combine that structure with wildcard
and boolean queries...everything appeared to work grand -- I got all the
correct highlights. I just have to combine the highlights (spans) and
refine my code. That color comment Otis made is something I am
interested in as well: it would be great to have the words found in a
single SpanQuery be the same color, or a similar shade.

- Mark


Re: Multiword Highlighting

Mark Miller-3
In reply to this post by Otis Gospodnetic-2
I have been away from this for a week, but my interest has started
building again. The whole spans implementation seems to work great for
finding the actual hits but there is a somewhat annoying limitation:
because I am using Spans it seems I can only either highlight the entire
found span or just the first and last token of the found span. First and
last token works great for any span involving two query tokens (the only
type I am concerned with at the moment), but a 3 word span would not
have the middle word highlighted (unless you highlight the whole darn
span). Other than that, the implementation is pretty darn simple and
seems to work well. It wouldn't be too hard to set the option of
complete span highlighting or first and last token.
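
The two options described above can be sketched over a token array. The assumption is that a span is known only by its first and last token positions (inclusive), which is why the middle word of a three-word match is lost in the endpoints mode. SpanMarkup and Mode are hypothetical names, not part of any Lucene package.

```java
// Illustrative only: a span is given as inclusive token positions {first, last}.
final class SpanMarkup {

    enum Mode { FULL_SPAN, ENDPOINTS }

    static String highlight(String[] tokens, int first, int last, Mode mode) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) out.append(' ');
            boolean inside = i >= first && i <= last;   // anywhere within the span
            boolean endpoint = i == first || i == last; // only the boundary tokens
            boolean mark = (mode == Mode.FULL_SPAN) ? inside : endpoint;
            out.append(mark ? "<b>" + tokens[i] + "</b>" : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "brown", "fox", "jumps"};
        System.out.println(highlight(tokens, 1, 3, Mode.FULL_SPAN));
        System.out.println(highlight(tokens, 1, 3, Mode.ENDPOINTS));
    }
}
```

FULL_SPAN also marks any non-query words that happen to sit between the endpoints, which is the trade-off between the two modes.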

Still interested in considering this for Contrib? Perhaps you want to
wait for someone to merge the idea with the current Contrib highlighter
(add fragments) as Mark H. suggested in his last email on the subject.
Or there just may not be much interest -- the other recent highlighters
haven't really gone anywhere that I have seen (though I don't think they
attempted 'actual' hit highlighting).

If there is interest, suggested package name?


Re: Multiword Highlighting

mark harwood
In reply to this post by Anne Conger
Hi Mark,
Have you looked at the returned spans from any other potential problem scenarios (other than the 3 word one you suggest) e.g. complex nested "SpanOr" or "SpanNot" logic?

>>Or there just may not be much interest

There's certainly interest on my part in seeing this merged with the existing highlighter (to include the option of fragmenting). Unfortunately available time can be an issue for me.
Can you attach your code to a new Jira entry so I can have a play?
I imagine if I do combine it with the existing Highlighter it will break the existing API, so I would probably have to create a new SpansBasedHighlighter.

Cheers,
Mark


----- Original Message ----
From: Mark Miller <[hidden email]>
To: [hidden email]
Sent: Friday, 2 February, 2007 3:58:01 PM
Subject: Re: Multiword Highlighting

I have been away from this for a week, but my interest has started
building again. The whole spans implementation seems to work great for
finding the actual hits but there is a somewhat annoying limitation:
because I am using Spans it seems I can only either highlight the entire
found span or just the first and last token of the found span. First and
last token works great for any span involving two query tokens (the only
type I am concerned with at the moment), but a 3 word span would not
have the middle word highlighted (unless you highlight the whole darn
span). Other than that, the implementation is pretty darn simple and
seems to work well. It wouldn't be too hard to set the option of
complete span highlighting or first and last token.

Still interested in considering this for Contrib? Perhaps you want to
wait for someone to merge the idea with the current Contrib highlighter
(add fragments) as Mark H. suggested in his last email on the subject.
Or there just may not be much interest -- the other recent highlighters
haven't really gone anywhere that I have seen (though I don't think they
attempted 'actual' hit highlighting).

If there is interest, suggested package name?

Otis Gospodnetic wrote:
> For what it's worth Mark (Miller), there *is* a need for "just highlight the query terms without trying to get excerpts" functionality - something a la Google cache (different colours...mmm, nice).  I've had people ask me for this before, and I know I could use this functionality, too.  Please contrib to contrib/ if you end up working on this.
>
> Otis
> --
> Simpy -- http://www.simpy.com/ -- Tag.  Search.  Share.
>
>  


Re: Multiword Highlighting

Mark Miller-3


mark harwood wrote:
> Hi Mark,
> Have you looked at the returned spans from any other potential problem scenarios (other than the 3 word one you suggest) e.g. complex nested "SpanOr" or "SpanNot" logic?
>  
Nothing super intense, but I have looked at some semi-complex nesting and
it all looks great if you use the full span highlighting. Highlighting
only the first and last word of the span works great if you're limited to
word-to-word proximity searching. (In my parser <G>, it also works great for
my sentence and paragraph proximity searching, though I had to add the
option of hiding my index marker tokens from the output.)

Perhaps you know of something that I haven't run into that may not
highlight correctly ?
> Can you attach your code to a new Jira entry so I can have a play?
>
>  
I certainly will.

- Mark


Re: NO_NORMS and TOKENIZED?

Nadav Har'El
In reply to this post by Otis Gospodnetic-2
On Fri, Jan 26, 2007, Otis Gospodnetic wrote about "Re: NO_NORMS and TOKENIZED?":
> Funny, I was looking to do the same thing the other day and gave up thinking it wasn't possible, not being aware of setOmitNorms().  Yeah, a javadoc patch would be welcome.
>
> Otis

Before I go ahead and post a javadoc patch, I want to question again the
wisdom of this whole situation:

Currently, most of a Field's parameters must be defined during its
construction. There is no method to change whether this Field object is to
be stored, to be compressed, to be indexed, to be tokenized - all these
things MUST be defined during the field's construction. So it is very
strange, and completely unexpected (at least to me and to Otis), that just
the "omitNorms" parameter can be changed after construction, with a
"setOmitNorms" method - and not only can it be set after construction, in
some cases it must be set after construction, because the constructor doesn't
allow you to set it if you want an analyzer...

So perhaps changing the code, not just the javadoc, would be better?
One way to do it while keeping backward compatibility is to add something
like TOKENIZED_NO_NORMS to Field.Index.

> >...
> > I hadn't added a Field.Index option at all, and Doug suggested
> > NO_NORMS, probably because it's mostly harmless to new users who might
> > disable length normalization without realizing the implications.

If we also had a "TOKENIZED_NO_NORMS", why would new users accidentally
use it? I guess the javadoc of this parameter could also warn against its
use (something like "not recommended for general use", or whatever)?

--
Nadav Har'El                        |    Thursday, Feb 15 2007, 27 Shevat 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |How long a minute depends on what side of
http://nadav.harel.org.il           |the bathroom door you're on.


Re: Multiword Highlighting

Erick Erickson
In reply to this post by Mark Miller-3
I hope you're all following this old thread, because I've just run into
something I don't quite know what to do about with the SpansExtractor code
that I shamelessly stole.

Let's say my text is "a b c d e f g h" and my query is "a AND z". The
implementation I stole for SpansExtractor (mentioned several times in this
thread) returns a span for "a" which doesn't preserve the sense of the
query. The root of the problem is that by the time it gets down to
getSpansFromTermQuery, the sense of "AND" is lost and I get a span for the "a"
in the query.

The rest of the kinds of spans don't seem to have the same issue. OR should
return the "a" in the example above. Any phrase queries that come through
work fine. In fact, our application requires that we have an implied
proximity, mostly anyway, so I haven't had to deal with this until now.....

One way, it seems to me, to handle this would be to transform the query
above into a span query with a limit of 10,000, where 10,000 is a magic
number that I'm confident is OK in my application because of the
PositionIncrementGaps I set up during indexing.
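A minimal sketch of that idea, assuming it amounts to an unordered "near" match with a very large slop. It works on a plain token array instead of a Lucene index, and the class and method names (UnorderedNearSketch, positions, near) are hypothetical. In Lucene itself the corresponding query would presumably be a SpanNearQuery over two SpanTermQuery clauses with slop 10000 and inOrder set to false. The sketch pairs up every occurrence of the two terms, which is simpler than what a real Spans implementation does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "a AND z" rewritten as a wide unordered near-query.
final class UnorderedNearSketch {
    private UnorderedNearSketch() {}

    /** Token positions at which the given term occurs. */
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    /**
     * Returns {start, end} pairs (end exclusive) for every pair of occurrences
     * of the two terms, in either order, with at most slop tokens between them.
     * If either term is missing, the result is empty.
     */
    static List<int[]> near(String[] tokens, String first, String second, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int p : positions(tokens, first)) {
            for (int q : positions(tokens, second)) {
                if (p == q) {
                    continue; // same token, only possible when first equals second
                }
                int gap = Math.abs(p - q) - 1; // tokens strictly between the two
                if (gap <= slop) {
                    spans.add(new int[] {Math.min(p, q), Math.max(p, q) + 1});
                }
            }
        }
        return spans;
    }
}
```

With the text "a b c d e f g h", near(tokens, "a", "z", 10000) returns no spans because "z" never occurs, so nothing would be highlighted for "a AND z". The cost of this approach is that the whole range between the two terms becomes one span.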

Is there a more elegant way of doing this? Or am I missing the boat
entirely? Or did I mess up when I stole the code?

Or, and this would be the easiest for me at least, has this work already
been done and all I really need to do is get a different implementation of
SpansExtractor <G>?

Thanks
Erick



Re: NO_NORMS and TOKENIZED?

Yonik Seeley-2
In reply to this post by Nadav Har'El
I originally added it without an Index param at all.
I can't say I'm a fan of the way Field currently does things, and I
didn't want everyone to pay the price for yet more options.

Look at the code for the Field constructor:

  public Field(String name, String value, Store store, Index index,
               TermVector termVector) {
    if (name == null)
      throw new NullPointerException("name cannot be null");
    if (value == null)
      throw new NullPointerException("value cannot be null");
    if (name.length() == 0 && value.length() == 0)
      throw new IllegalArgumentException("name and value cannot both be empty");
    if (index == Index.NO && store == Store.NO)
      throw new IllegalArgumentException("it doesn't make sense to have a field that "
         + "is neither indexed nor stored");
    if (index == Index.NO && termVector != TermVector.NO)
      throw new IllegalArgumentException("cannot store term vector information "
         + "for a field that is not indexed");

    this.name = name.intern();        // field names are interned
    this.fieldsData = value;

    if (store == Store.YES){
      this.isStored = true;
      this.isCompressed = false;
    }
    else if (store == Store.COMPRESS) {
      this.isStored = true;
      this.isCompressed = true;
    }
    else if (store == Store.NO){
      this.isStored = false;
      this.isCompressed = false;
    }
    else
      throw new IllegalArgumentException("unknown store parameter " + store);

    if (index == Index.NO) {
      this.isIndexed = false;
      this.isTokenized = false;
    } else if (index == Index.TOKENIZED) {
      this.isIndexed = true;
      this.isTokenized = true;
    } else if (index == Index.UN_TOKENIZED) {
      this.isIndexed = true;
      this.isTokenized = false;
    } else if (index == Index.NO_NORMS) {
      this.isIndexed = true;
      this.isTokenized = false;
      this.omitNorms = true;
    } else {
      throw new IllegalArgumentException("unknown index parameter " + index);
    }

    this.isBinary = false;

    setStoreTermVector(termVector);
  }

  protected void setStoreTermVector(Field.TermVector termVector) {
    if (termVector == Field.TermVector.NO) {
      this.storeTermVector = false;
      this.storePositionWithTermVector = false;
      this.storeOffsetWithTermVector = false;
    }
    else if (termVector == Field.TermVector.YES) {
      this.storeTermVector = true;
      this.storePositionWithTermVector = false;
      this.storeOffsetWithTermVector = false;
    }
    else if (termVector == Field.TermVector.WITH_POSITIONS) {
      this.storeTermVector = true;
      this.storePositionWithTermVector = true;
      this.storeOffsetWithTermVector = false;
    }
    else if (termVector == Field.TermVector.WITH_OFFSETS) {
      this.storeTermVector = true;
      this.storePositionWithTermVector = false;
      this.storeOffsetWithTermVector = true;
    }
    else if (termVector == Field.TermVector.WITH_POSITIONS_OFFSETS) {
      this.storeTermVector = true;
      this.storePositionWithTermVector = true;
      this.storeOffsetWithTermVector = true;
    }
    else {
      throw new IllegalArgumentException("unknown termVector parameter " + termVector);
    }
  }


I simply think this is too high of a price to pay for "type safety".
Everyone shouldn't have to pay a performance penalty for making things
a little "safer".  I'm probably in the minority, hence I never said
anything about it before.

It's also made Solr code worse because it stores these things as
flags, but has to go through the same if-then-else contortions to
construct a Field with the proper parameters (just to have Field go
through the reverse contortions to de-multiplex these options).

Think about what would happen if we added a few more options on
storing term vectors... exponential explosion in those if-then-else
statements.

How would I have handled it?
  With a single Field class, I would probably have used old-fashioned
C-style flags (a bit field).  Nice and extensible (you can add new
flags without adding/changing any APIs, no performance impact to
adding new options, you can pass around all these flags as a unit,
check multiple flags with a single instruction, etc.).
  If we look at inheritance, I'd be tempted to let people subclass and
allow them to pass data directly to the indexer rather than trying to
store it all and enumerate all the possibilities in the single Field
class.
   Reader getBinaryValue()  or even int writeBinaryValue(Writer or IndexOutput)
   TokenStream getTokens()
   boolean isIndexed()
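
A minimal sketch of the bit-field alternative described above, assuming a design that Lucene does not actually have: FieldFlags, FlagField and all of the constant names are hypothetical. It shows how "tokenized, but without norms" becomes just another combination of bits instead of a fifth Field.Index constant.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class FieldFlags {
    static final int STORED     = 1;      // keep the original value in the index
    static final int COMPRESSED = 1 << 1; // compress the stored value
    static final int INDEXED    = 1 << 2; // make the field searchable
    static final int TOKENIZED  = 1 << 3; // run the value through an Analyzer
    static final int OMIT_NORMS = 1 << 4; // do not store length normalization

    private FieldFlags() {}

    /** True if every bit in mask is set in flags (one AND, one compare). */
    static boolean has(int flags, int mask) {
        return (flags & mask) == mask;
    }

    /** Rejects combinations that make no sense; returns flags unchanged otherwise. */
    static int validate(int flags) {
        if (!has(flags, INDEXED) && !has(flags, STORED)) {
            throw new IllegalArgumentException("field is neither indexed nor stored");
        }
        if (has(flags, TOKENIZED) && !has(flags, INDEXED)) {
            throw new IllegalArgumentException("cannot tokenize a field that is not indexed");
        }
        if (has(flags, COMPRESSED) && !has(flags, STORED)) {
            throw new IllegalArgumentException("cannot compress a field that is not stored");
        }
        return flags;
    }
}

// A field that carries all of its options as a single int.
final class FlagField {
    final String name;
    final String value;
    final int flags;

    FlagField(String name, String value, int flags) {
        if (name == null) throw new NullPointerException("name cannot be null");
        if (value == null) throw new NullPointerException("value cannot be null");
        this.name = name;
        this.value = value;
        this.flags = FieldFlags.validate(flags);
    }

    boolean isStored()    { return FieldFlags.has(flags, FieldFlags.STORED); }
    boolean isIndexed()   { return FieldFlags.has(flags, FieldFlags.INDEXED); }
    boolean isTokenized() { return FieldFlags.has(flags, FieldFlags.TOKENIZED); }
    boolean omitsNorms()  { return FieldFlags.has(flags, FieldFlags.OMIT_NORMS); }
}
```

Under these assumptions, an analyzed field without norms would be written as new FlagField("body", text, FieldFlags.INDEXED | FieldFlags.TOKENIZED | FieldFlags.OMIT_NORMS). Adding a new option means adding one constant, not another branch per combination. The trade-off is the one named above: a plain int gives up the compile-time checking that the typed constants provide.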

Sorry for the rant... I guess my short answer is that I don't have an
opinion on adding another type-safe constant TOKENIZED_NO_NORMS
because I don't like the whole scheme.

-Yonik




Re: Multiword Highlighting

Mark Miller-3
In reply to this post by Erick Erickson
Good catch, Erick! I'll have to tackle this as well. Mark H is the
originator of that code so maybe he will chime in, but what I am thinking
is this:

In the getSpansFromBooleanquery, keep track of which clauses are
required. Then, based on whether any Spans are actually returned from
getSpansFromTerm for each required clause, add only the correct spans to
the returned spans. If you get what I mean <g>. I am sure there are
more cases than that to consider, but I think the direction might work.
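
A rough sketch of that direction, assuming single-term clauses only and working on a plain token array, not on Lucene's Spans. Every name here (RequiredClauseSpans, Span, Clause, spansForTerm, spansForBoolean) is hypothetical and is not the SpansExtractor code discussed in this thread. Prohibited clauses and nested queries are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "drop all spans when a required clause has none".
final class RequiredClauseSpans {
    private RequiredClauseSpans() {}

    /** A half-open token range [start, end). */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** One clause of a boolean query: a single term plus whether it is required. */
    static final class Clause {
        final String term;
        final boolean required;
        Clause(String term, boolean required) { this.term = term; this.required = required; }
    }

    /** Stand-in for the per-term step: one single-token span per occurrence. */
    static List<Span> spansForTerm(String[] tokens, String term) {
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                spans.add(new Span(i, i + 1));
            }
        }
        return spans;
    }

    /** Collects spans from all clauses, but returns nothing if a required clause is empty. */
    static List<Span> spansForBoolean(String[] tokens, List<Clause> clauses) {
        List<Span> result = new ArrayList<>();
        for (Clause clause : clauses) {
            List<Span> spans = spansForTerm(tokens, clause.term);
            if (clause.required && spans.isEmpty()) {
                // A required term is missing, so the query cannot match this text.
                return new ArrayList<>();
            }
            result.addAll(spans);
        }
        return result;
    }
}
```

For the text "a b c d e f g h", two required clauses "a" and "z" give an empty result, while the same two terms as optional clauses still give the span for "a", which is the OR behaviour Erick expects.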

If you don't tackle it or can't share I'll be doing it myself.

- Mark

