Which token filter can combine 2 terms into 1?

Which token filter can combine 2 terms into 1?

Xi Shen
Hi,

I am looking for a token filter that can combine 2 terms into 1. For example,
suppose the input has been tokenized by whitespace:

t1 t2 t2a t3

I want a filter that outputs:

t1 t2t2a t3

I know it is a very special case, and I am thinking about developing a filter
of my own. But I cannot figure out which API I should use to look for terms
in a TokenStream.

--
Regards,
David Shen

http://about.me/davidshen
https://twitter.com/#!/davidshen84

Re: Which token filter can combine 2 terms into 1?

Danil ŢORIN
The easiest way would be to pre-process your input and join those 2 tokens
before splitting on whitespace.

Given the limited context I might be missing some details, but it is still
worth a shot.
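
A minimal sketch of this pre-processing idea, assuming the pair is known in advance (here the literal tokens `t2` and `t2a` from the question) and is separated only by whitespace. The class and method names are made up for illustration, and nothing here is Lucene API; `tokenize` is only a stand-in for a whitespace tokenizer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PairPreprocessor {
    // Hypothetical pair from the question: "t2" directly followed by "t2a".
    // (?<!\S) and (?!\S) require whitespace (or start/end of input) around
    // the pair, so "xt2 t2a" and "t2 t2ab" are left alone.
    private static final Pattern PAIR = Pattern.compile("(?<!\\S)t2\\s+t2a(?!\\S)");

    /** Joins the pair in the raw text, before any tokenizer sees it. */
    static String joinPair(String text) {
        return PAIR.matcher(text).replaceAll("t2t2a");
    }

    /** Stand-in for a whitespace tokenizer. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize(joinPair("t1 t2 t2a t3"))); // pair is joined
        System.out.println(tokenize(joinPair("t1 t2 t3 t2a"))); // not adjacent: unchanged
    }
}
```

This only works if the text can be rewritten before tokenization; if later analysis steps split the joined term again, the join has to happen inside the filter chain instead.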


Re: Which token filter can combine 2 terms into 1?

Xi Shen
I have to run whitespace tokenization and word delimiting on the input
first. I tried many combinations, and it seems inevitable that the term
will be split into two :(

I think developing my own filter is the only solution...but I just cannot
find a guide to help me understand what I need to do to implement a
TokenFilter.



Re: Which token filter can combine 2 terms into 1?

Alan Woodward
Have a look at ShingleFilter:  http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
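
For orientation, a shingle is a token built from adjacent tokens. The sketch below imitates only the two-token case on a plain list, to show the kind of output to expect. It is not Lucene's ShingleFilter, whose actual behavior (shingle size, separator, whether the single tokens are also emitted) is configurable and is described in the linked Javadoc; the class name `BigramShingles` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BigramShingles {
    /** Returns every pair of adjacent tokens, joined by the given separator. */
    static List<String> shingles(List<String> tokens, String separator) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + separator + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [t1t2, t2t2a, t2at3]
        System.out.println(shingles(Arrays.asList("t1", "t2", "t2a", "t3"), ""));
    }
}
```

Note that every adjacent pair is combined, not just one chosen pair.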


Re: Which token filter can combine 2 terms into 1?

Xi Shen
Unfortunately, no...I am not combining every two terms into one. I am
combining a specific pair.

E.g. the token stream: t1 t2 t2a t3
should be rewritten into: t1 t2t2a t3

But the token stream: t1 t2 t3 t2a
should not be rewritten, because it is already correct.
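
The rule described here can be written down on a plain list of tokens, as a sketch. It assumes the pair is the literal `t2`, `t2a` from the example and that only directly adjacent tokens count. The class name `PairMerger` is made up; a real Lucene TokenFilter would have to do the same one-token look-ahead inside `incrementToken()` rather than on a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PairMerger {
    /**
     * Merges `first` directly followed by `second` into one token.
     * Every other token passes through unchanged.
     */
    static List<String> merge(List<String> tokens, String first, String second) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean pairHere = i + 1 < tokens.size()
                    && tokens.get(i).equals(first)
                    && tokens.get(i + 1).equals(second);
            if (pairHere) {
                out.add(first + second); // emit the combined term
                i += 2;                  // consume both tokens
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList("t1", "t2", "t2a", "t3"), "t2", "t2a")); // [t1, t2t2a, t3]
        System.out.println(merge(Arrays.asList("t1", "t2", "t3", "t2a"), "t2", "t2a")); // [t1, t2, t3, t2a]
    }
}
```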



Re: Which token filter can combine 2 terms into 1?

Erick Erickson
If it's a fixed list and not excessively long, would synonyms work?

But if there's some kind of logic you need to apply, I don't think you're
going to find anything out of the box. The problem is that by the time a
token filter gets called, the terms are already split up, so you'll
probably have to write a custom filter that manages that logic.

Best
Erick



Re: Which token filter can combine 2 terms into 1?

Jack Krupansky-2
And to be more specific, most query parsers will have already separated the
terms and will call the analyzer with only one term at a time, so no term
recombination is possible for those parsed terms, at query time.

-- Jack Krupansky


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Which token filter can combine 2 terms into 1?

Tom-2
On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky <[hidden email]> wrote:

> And to be more specific, most query parsers will have already separated
> the terms and will call the analyzer with only one term at a time, so no
> term recombination is possible for those parsed terms, at query time.
>
Most analyzers will do that, yes. But if Xi writes his own analyzer with
his own combiner filter, then he should also use it for query generation
and thus get the desired combinations / snippets there as well.

Xi, here is the recipe:
- SnippetFilter extends TokenFilter.
- SnippetFilter needs access to your lexicon: a data structure to store
your snippets. In the general case this is a tree, and going along a branch
will tell you whether a valid snippet has been built or whether the snippet
could be longer. (Example: "internal revenue" can be one snippet but,
depending on the next token, a larger snippet of "internal revenue service"
could be built.)
- The logic of SnippetFilter.incrementToken() goes something like this: you
need a loop which retrieves tokens from the input variable until the input
is empty. You store each retrieved token in one or more variables x in
SnippetFilter. As long as you have a potential match against your lexicon,
you can continue in this loop. Once you realize that there is something
within x which cannot possibly become a (longer) snippet, break out of the
loop and allow the consumer to retrieve it.
- Make sure your analyzer inserts SnippetFilter at the correct spot in the
filter chain.
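
The lexicon-tree matching in this recipe can be sketched without Lucene, on a plain list of tokens. This is an illustration under assumptions, not the SnippetFilter itself: the class name `SnippetCombiner`, its methods, and the choice of a space as the joining character are all hypothetical. A real TokenFilter would pull tokens one at a time from `input` inside `incrementToken()` and buffer the ones it has read ahead, whereas this sketch sees the whole list at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SnippetCombiner {
    /** One node of the lexicon tree; each edge is labelled with a token. */
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean endsSnippet; // true if the path from the root to here is a complete snippet
    }

    private final Node root = new Node();

    /** Adds a snippet such as "internal revenue service" to the lexicon. */
    void addSnippet(String... tokens) {
        Node node = root;
        for (String token : tokens) {
            node = node.children.computeIfAbsent(token, k -> new Node());
        }
        node.endsSnippet = true;
    }

    /** Replaces the longest lexicon match at each position with one combined token. */
    List<String> combine(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int matchEnd = -1; // exclusive end of the longest complete snippet starting at i
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j));
                if (node == null) break;                 // no snippet continues with this token
                if (node.endsSnippet) matchEnd = j + 1;  // remember it, but keep looking for a longer one
            }
            if (matchEnd > 0) {
                out.add(String.join(" ", tokens.subList(i, matchEnd)));
                i = matchEnd;
            } else {
                out.add(tokens.get(i)); // not the start of any snippet: pass through
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SnippetCombiner combiner = new SnippetCombiner();
        combiner.addSnippet("internal", "revenue");
        combiner.addSnippet("internal", "revenue", "service");
        System.out.println(combiner.combine(Arrays.asList("the", "internal", "revenue", "service", "said")));
        System.out.println(combiner.combine(Arrays.asList("internal", "revenue", "code")));
    }
}
```

With both snippets in the lexicon, the first call yields "internal revenue service" as one token and the second falls back to the shorter "internal revenue", which is the longer-or-shorter decision the recipe describes.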

Cheers
FiveMileTom






Re: Which token filter can combine 2 terms into 1?

Chris Hostetter-3
In reply to this post by Xi Shen
: Unfortunately, no...I am not combine every two term into one. I am
: combining a specific pair.

I'm confused ... you've already said that you expect you will need a
custom filter because your use case is very special -- and you haven't
given us many details about exactly when/why/how you want a filter to
decide to combine two tokens, so no one can make a guess as to whether any
existing filter fits your use case exactly -- but Alan did point out an
example of a filter you could look at as a guide for how to go about
combining tokens in a filter, and your response to that was that it isn't
exactly what you are looking for.

I think you either need to give us more info about exactly what you
are looking for, or you need to look closer at the code for ShingleFilter
and ask more specific questions about the parts you don't understand in
the quest to implement your own custom filter.

: > Have a look at ShingleFilter:
: > http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html

: > > I think developing my own filter is the only resolution...but I just
: > cannot
: > > find a guide to help me understand what I need to do to implement a
: > > TokenFilter.


-Hoss


Re: Which token filter can combine 2 terms into 1?

Jack Krupansky-2
In reply to this post by Tom-2
You still have the query parser's parsing before analysis to deal with, no
matter what magic you code in your analyzer.

-- Jack Krupansky


Re: Which token filter can combine 2 terms into 1?

Tom-2
On Fri, Dec 21, 2012 at 2:44 PM, Jack Krupansky <[hidden email]> wrote:

> You still have the query parser's parsing before analysis to deal with, no
> matter what magic you code in your analyzer.
>

Not quite.
"query parser's parsing" comes first, you are correct on that. But it is
irrelevant for splitting field values into search terms, because this part
of the whole process is done by an analyzer. Therefore, if you make sure
the correct analyzer is used, then the parsing and splitting into
individual search terms will be done by this analyzer, not by the query
parser.

Try it: implement an analyzer with the SnippetFilter described earlier in
this thread. Start Luke and make sure this analyzer is selected in
"Analyzer to use for query parsing". In the search expression, type in any
length of text, for example:

body:"word1 word2 word3"

and you will get the possibly combined Terms.

For example, if one snippet in your SnippetFilter is "word2 word3", you
will get:

Term 0: field=body text=word1
Term 1: field=body text=word2 word3

In this case, word2 and word3 will NOT be split.
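
The difference between the two positions in this exchange can be shown with a toy model. This is not Lucene's query parser: `analyze` is a hypothetical stand-in for an analyzer containing the combining filter, and `analyzePerTerm` models the claim made in this thread that unquoted text is split on whitespace first and each piece is analyzed on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class QueryAnalysisDemo {
    /** Stand-in analyzer: whitespace split, then join "word2 word3" into one term. */
    static List<String> analyze(String text) {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            boolean pairHere = i + 1 < tokens.size()
                    && tokens.get(i).equals("word2")
                    && tokens.get(i + 1).equals("word3");
            if (pairHere) {
                out.add("word2 word3");
                i++; // skip the token that was merged
            } else {
                out.add(tokens.get(i));
            }
        }
        return out;
    }

    /** Models a parser that splits on whitespace and analyzes each piece alone. */
    static List<String> analyzePerTerm(String queryText) {
        List<String> out = new ArrayList<>();
        for (String piece : queryText.trim().split("\\s+")) {
            out.addAll(analyze(piece)); // the analyzer never sees two tokens together
        }
        return out;
    }

    public static void main(String[] args) {
        // Whole phrase handed to the analyzer, as with body:"word1 word2 word3"
        System.out.println(analyze("word1 word2 word3"));        // [word1, word2 word3]
        // Same words, but split before analysis
        System.out.println(analyzePerTerm("word1 word2 word3")); // [word1, word2, word3]
    }
}
```

In this model the combined term only appears when the whole quoted phrase reaches the analyzer in one call.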



>
> -- Jack Krupansky
>
> -----Original Message----- From: Tom
> Sent: Friday, December 21, 2012 2:24 PM
> To: [hidden email]
>
> Subject: Re: Which token filter can combine 2 terms into 1?
>
> On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky <[hidden email]>*
> *wrote:
>
>  And to be more specific, most query parsers will have already separated
>> the terms and will call the analyzer with only one term at a time, so no
>> term recombination is possible for those parsed terms, at query time.
>>
>>  Most analyzers will do that, yes. But if Xi writes his own analyzer with
> his own combiner filter, then he should also use this for query generation
> and thus get the desired combinations / snippets there as well.
>
> Xi, here is the recipe:
> - SnippetFilter extends TokenFilter
> -SnippetFilter  needs access to your lexicon: a data structure to store
> your snippets. In the general case this is a tree, and going along a branch
> will tell you whenever a valid snipped has been built or if the snipped
> could be longer. (Example: "internal revenue" can be one snippet but,
> depending on the next token, a larger snipped of "internal revenue service"
> could be built.)
> - Logic of the SnippetFilter.incrementToken() goes something like this: You
> need a loop which retrieves tokens from the input variable until the input
> is empty. You store each retrieved token in a variable(s) x in
> SnippetFilter . As long as you have a potential match against your lexicon,
> you can continue in this loop. Once you realize that there is something
> within x which can not possibly become a (longer) snippet, break out of the
> loop and allow the consumer to retrieve it.
> - make sure your analyzer inserts SnippetFilter at the correct spot in the
> filter chain.
>
> Cheers
> FiveMileTom
>
>
>
>
>
>
>> -- Jack Krupansky
>> -----Original Message----- From: Erick Erickson
>> Sent: Friday, December 21, 2012 8:27 AM
>> To: java-user
>> Subject: Re: Which token filter can combine 2 terms into 1?
>>
>>
>> If it's a fixed list and not excessively long, would synonyms work?
>>
>> But if theres some kind of logic you need to apply, I don't think you're
>> going to find anything OOB.
>> The problem is that by the time a token filter gets called, they are
>> already split up, you'll probably
>> have to write a custom filter that manages that logic.
>>
>> Best
>> Erick
>>
>>
>> On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen <[hidden email]> wrote:
>>
>>  Unfortunately, no...I am not combine every two term into one. I am
>>
>>> combining a specific pair.
>>>
>>> E.g. the Token Stream: t1 t2 t2a t3
>>> should be rewritten into t1 t2t2a t3
>>>
>>> But the TS: t1 t2 t3 t2a
>>> should not be rewritten, and it is already correct
>>>
>>>
>>> On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <
>>> alan.woodward@romseysoftware.****co.uk <alan.woodward@romseysoftware.**
>>> co.uk <[hidden email]>>>
>>>
>>> wrote:
>>>
>>> > Have a look at ShingleFilter:
>>> >
>>> http://lucene.apache.org/core/****3_6_0/api/all/org/apache/**<http://lucene.apache.org/core/**3_6_0/api/all/org/apache/**>
>>> lucene/analysis/shingle/****ShingleFilter.html<http://**
>>> lucene.apache.org/core/3_6_0/**api/all/org/apache/lucene/**
>>> analysis/shingle/**ShingleFilter.html<http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html>
>>> >
>>>
>>> >
>>> > On 21 Dec 2012, at 08:42, Xi Shen wrote:
>>> >
>>> > > I have to use the white space and word delimiter to process the input
>>> > > first. I tried many combination, and it seems to me that it is
>>> inevitable
>>> > > the term will be split into two :(
>>> > >
>>> > > I think developing my own filter is the only resolution...but I just
>>> > cannot
>>> > > find a guide to help me understand what I need to do to implement a
>>> > > TokenFilter.
>>> > >
>>> > >
>>> > > On Fri, Dec 21, 2012 at 4:03 PM, Danil ŢORIN <[hidden email]>
>>> wrote:
>>> > >
>>> > >> Easiest way would be to pre-process your input and join those 2 > >>
>>> tokens
>>> > >> before splitting them by white space.
>>> > >>
>>> > >> But from given context I might miss some details...still worth a >
>>> >> shot.
>>> > >>
>>> > >> On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen <[hidden email]>
>>> wrote:
>>> > >>
>>> > >>> Hi,
>>> > >>>
>>> > >>> I am looking for a token filter that can combine 2 terms into 1? >
>>> >>> E.g.
>>> > >>>
>>> > >>> the input has been tokenized by white space:
>>> > >>>
>>> > >>> t1 t2 t2a t3
>>> > >>>
>>> > >>> I want a filter that output:
>>> > >>>
>>> > >>> t1 t2t2a t3
>>> > >>>
>>> > >>> I know it is a very special case, and I am thinking about develop a
>>> > >> filter
>>> > >>> of my own. But I cannot figure out which API I should use to look >
>>> >>> for
>>> > >> terms
>>> > >>> in a Token Stream.

Re: Which token filter can combine 2 terms into 1?

Jack Krupansky-2
Ah! You're quoting full phrases. You weren't clear about that originally.
Thanks for the clarification.

-- Jack Krupansky

-----Original Message-----
From: Tom
Sent: Wednesday, December 26, 2012 5:54 PM
To: [hidden email]
Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 2:44 PM, Jack Krupansky <[hidden email]> wrote:

> You still have the query parser's parsing before analysis to deal with, no
> matter what magic you code in your analyzer.
>

Not quite.
The query parser's parsing does come first; you are correct on that. But it
is irrelevant for splitting field values into search terms, because that
part of the process is done by an analyzer. Therefore, if you make sure the
correct analyzer is used, the splitting of a field value into individual
search terms will be done by this analyzer, not by the query parser.

Try it: implement an analyzer with the SnippetFilter below. Start Luke and
make sure this analyzer is selected in "Analyzer to use for query parsing".
In the search expression, type in any length of text, for example:

body:"word1 word2 word3"

and you will get the possibly combined Terms.

For example, let's say one snippet in your SnippetFilter is "word2 word3";
you will get

Term 0: field=body text=word1
Term 1: field=body text=word2 word3

In this case, word2 and word3 will NOT be split.
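
For illustration only, the combining behaviour described here can be sketched in plain Java without any Lucene classes. This is a minimal sketch under stated assumptions, not the SnippetFilter itself: the class name SnippetLexicon and its methods add and combine are hypothetical, the lexicon is held as a small tree keyed by token, and the whole token sequence is assumed to be available as a list, whereas a real TokenFilter would buffer tokens inside incrementToken().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the snippet lexicon; not a Lucene class.
class SnippetLexicon {
    // One tree node per token along a snippet.
    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean endsSnippet; // true if the path from the root to here is a full snippet
    }

    private final Node root = new Node();

    // Register a snippet given as whitespace-separated tokens.
    void add(String snippet) {
        Node node = root;
        for (String token : snippet.trim().split("\\s+")) {
            node = node.next.computeIfAbsent(token, k -> new Node());
        }
        node.endsSnippet = true;
    }

    // Emit the longest snippet starting at each position as one term;
    // tokens that start no snippet pass through unchanged.
    List<String> combine(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Node node = root;
            int matchEnd = -1; // exclusive end of the longest snippet found from i
            for (int j = i; j < tokens.size(); j++) {
                node = node.next.get(tokens.get(j));
                if (node == null) {
                    break; // no snippet continues with this token
                }
                if (node.endsSnippet) {
                    matchEnd = j + 1; // remember it, but keep looking for a longer one
                }
            }
            if (matchEnd > i) {
                out.add(String.join(" ", tokens.subList(i, matchEnd)));
                i = matchEnd;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

With "word2 word3" registered, combining the tokens word1, word2, word3 yields the two terms "word1" and "word2 word3", matching the Luke output above. Registering both "internal revenue" and "internal revenue service" makes the longer snippet win whenever the next token allows it.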



>
> -- Jack Krupansky
>
> -----Original Message----- From: Tom
> Sent: Friday, December 21, 2012 2:24 PM
> To: [hidden email]
>
> Subject: Re: Which token filter can combine 2 terms into 1?
>
> On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky <[hidden email]> wrote:
>
>> And to be more specific, most query parsers will have already separated
>> the terms and will call the analyzer with only one term at a time, so no
>> term recombination is possible for those parsed terms, at query time.
>>
> Most analyzers will do that, yes. But if Xi writes his own analyzer with
> his own combiner filter, then he should also use this for query generation
> and thus get the desired combinations / snippets there as well.
>
> Xi, here is the recipe:
> - SnippetFilter extends TokenFilter.
> - SnippetFilter needs access to your lexicon: a data structure to store
> your snippets. In the general case this is a tree, and going along a branch
> will tell you whenever a valid snippet has been built or if the snippet
> could be longer. (Example: "internal revenue" can be one snippet but,
> depending on the next token, a larger snippet of "internal revenue service"
> could be built.)
> - Logic of SnippetFilter.incrementToken() goes something like this: you
> need a loop which retrieves tokens from the input variable until the input
> is empty. You store each retrieved token in a variable(s) x in
> SnippetFilter. As long as you have a potential match against your lexicon,
> you can continue in this loop. Once you realize that there is something
> within x which cannot possibly become a (longer) snippet, break out of the
> loop and allow the consumer to retrieve it.
> - Make sure your analyzer inserts SnippetFilter at the correct spot in the
> filter chain.
>
> Cheers
> FiveMileTom
>
>
>
>
>
>
>> -- Jack Krupansky
>> -----Original Message----- From: Erick Erickson
>> Sent: Friday, December 21, 2012 8:27 AM
>> To: java-user
>> Subject: Re: Which token filter can combine 2 terms into 1?
>>
>>
>> If it's a fixed list and not excessively long, would synonyms work?
>>
>> But if there's some kind of logic you need to apply, I don't think you're
>> going to find anything OOB. The problem is that by the time a token filter
>> gets called, the tokens are already split up; you'll probably have to
>> write a custom filter that manages that logic.
>>
>> Best
>> Erick
>>
>>
>> On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen <[hidden email]> wrote:
>>
>>> Unfortunately, no...I am not combining every two terms into one. I am
>>> combining a specific pair.
>>>
>>> E.g. the Token Stream: t1 t2 t2a t3
>>> should be rewritten into t1 t2t2a t3
>>>
>>> But the TS: t1 t2 t3 t2a
>>> should not be rewritten, and it is already correct
>>>
>>>
>>> On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <[hidden email]> wrote:
>>>
>>> > Have a look at ShingleFilter:
>>> > http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
>>> >



Re: Which token filter can combine 2 terms into 1?

Lance Norskog-2
In reply to this post by Xi Shen
How do you choose t2 and t2a? If you have a full inventory of these
pairs, you can make these multi-word synonyms and use the Synonym filter
to combine them.
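
As a rough illustration of the "full inventory of pairs" idea, here is a minimal plain-Java sketch. It does not use the Synonym filter or any other Lucene API: the class name PairInventory and its methods addPair and rewrite are hypothetical, and the token stream is assumed to be available as a simple list. It only shows the rewriting rule, namely that a listed pair is combined when its two tokens are adjacent and left alone otherwise.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical inventory of token pairs to combine; not a Lucene class.
class PairInventory {
    // first token -> (second token -> combined term)
    private final Map<String, Map<String, String>> pairs = new HashMap<>();

    void addPair(String first, String second, String combined) {
        pairs.computeIfAbsent(first, k -> new HashMap<>()).put(second, combined);
    }

    // Replace each adjacent listed pair with its combined term.
    List<String> rewrite(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Map<String, String> seconds = pairs.get(tokens.get(i));
            String combined = (seconds != null && i + 1 < tokens.size())
                    ? seconds.get(tokens.get(i + 1))
                    : null;
            if (combined != null) {
                out.add(combined);
                i++; // skip the second token of the pair
            } else {
                out.add(tokens.get(i));
            }
        }
        return out;
    }
}
```

With the single pair (t2, t2a) mapped to t2t2a, the stream t1 t2 t2a t3 becomes t1 t2t2a t3, while t1 t2 t3 t2a is left unchanged because t2 and t2a are not adjacent.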

On 12/20/2012 11:50 PM, Xi Shen wrote:

> Hi,
>
> I am looking for a token filter that can combine 2 terms into 1? E.g.
>
> the input has been tokenized by white space:
>
> t1 t2 t2a t3
>
> I want a filter that output:
>
> t1 t2t2a t3
>
> I know it is a very special case, and I am thinking about develop a filter
> of my own. But I cannot figure out which API I should use to look for terms
> in a Token Stream.
>

