hypens


jpowers
Hello,

If a user searches for "b-trunk", I would like them to find "b-trunk"
(with the hyphen). I would also like someone searching for "b trunk" to
find "b-trunk".

On the other hand, if someone searches for 12412, I would like them to
find 12412-235, 12412-121, 12412-etc., and someone who types in
12412-235 directly should get a good result list: the one item alone
would be best, but a larger list with that item on top is good too.

For now I am using the StandardAnalyzer. I search all fields for what
the user gives me in quotes, and also for the same thing without quotes.
When I print the final query, the quoted half seems to have the hyphens
stripped out, but the unquoted half does not, so this lets me find what
I want. But each search phrase now appears twice in the final query. It
seems to work fine, but it seems pretty inelegant, unelegant even.

It seems like I can neither simply strip out the hyphens nor simply keep
them. I am storing the name as Keyword and everything else as Text. I
thought that would matter, but a description, keyword, or other field may
contain something like "this also relates to 23523-235", so if someone
searches for 23523 I would want this in the list, and if they search
for 23523-235 I would still want it. So I don't know whether this is
solvable through the type of field I use to index it. Or do I have to
store each field twice, with different analyzers? That seems just as
clumsy as my double-search solution.

Any thoughts?



Re: hypens

Karl Wettin-3

On 17 Apr 2006, at 18:59, John Powers wrote:

> Hello,
>
> If I have a user search for "b-trunk"  I would like them to be able to
> find "b-trunk" (with hyphen).   I would also like someone searching for
> "b trunk" to also find "b-trunk".

If you don't care about spans, make a filter that rebuilds the token
at index time. It's a bit quick and dirty, but I do things like this
in some cases without any major problems.

The code below builds [btrunk] [b-trunk] [b] [trunk].

I would not recommend doing this without thinking through the consequences.



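import java.io.IOException;
import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class LowASCIIDashWordFilter extends TokenFilter {

    private static final Pattern p = Pattern.compile("(\\w+)-(\\w+)");

    /** Tokens derived from a dashed word that are still waiting to be returned. */
    private final LinkedList<Token> buf = new LinkedList<Token>();

    /** Construct a token stream filtering the given input. */
    public LowASCIIDashWordFilter(TokenStream input) {
        super(input);
    }

    /** Returns the next token in the stream, or null at EOS. */
    public Token next() throws IOException {
        // Drain tokens queued by a previous dashed word first.
        if (!buf.isEmpty()) {
            return buf.removeFirst();
        }

        Token next = input.next();
        if (next == null) {
            return null;
        }

        Matcher m = p.matcher(next.termText());
        if (!m.matches()) {
            // Not a dashed word: pass the token through unchanged.
            return next;
        }

        // Matcher offsets are relative to the token text, so shift them
        // by the token's start offset in the original input.
        int base = next.startOffset();
        buf.add(next); // the original token, e.g. [b-trunk]
        buf.add(new Token(m.group(1), base + m.start(1), base + m.end(1), "left dashword"));
        buf.add(new Token(m.group(2), base + m.start(2), base + m.end(2), "right dashword"));
        return new Token(m.group(1) + m.group(2), base + m.start(1), base + m.end(2), "composite dashword");
    }

}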



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: hypens

Ramana Jelda
In reply to this post by jpowers
Hi,
I would use index and search analyzers in this case.
"b-trunk" is analyzed and indexed as b, btrunk, trunk.
The search term "b-trunk" is analyzed by the search analyzer as "btrunk"
and searched, so you will find the result.

Similarly, 12412-235, 12412-121, 12412-etc. are indexed as
12412, 12412235, 235, and so on.
So a search for the term 12412 will obviously find them.


Good luck,
Jelda
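
The index-time versus search-time split described above can be sketched in plain Java, without Lucene, to check that it covers the cases from the original question. This is only an illustration under stated assumptions: the class `HyphenIndexSketch`, its methods, and the in-memory map standing in for an index are all hypothetical, and a real solution would put the same logic in custom Lucene analyzers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for an index with separate index and search analysis. */
public class HyphenIndexSketch {

    /** Index-time analysis: "b-trunk" becomes b, btrunk, trunk. */
    public static Set<String> indexTerms(String text) {
        Set<String> terms = new TreeSet<String>();
        for (String word : text.toLowerCase().split("\\s+")) {
            for (String part : word.split("-")) {
                if (!part.isEmpty()) terms.add(part);
            }
            String joined = word.replace("-", "");
            if (!joined.isEmpty()) terms.add(joined);
        }
        return terms;
    }

    /** Search-time analysis: hyphens are only stripped, so "b-trunk" becomes btrunk. */
    public static List<String> queryTerms(String query) {
        List<String> terms = new ArrayList<String>();
        for (String word : query.toLowerCase().split("\\s+")) {
            String joined = word.replace("-", "");
            if (!joined.isEmpty()) terms.add(joined);
        }
        return terms;
    }

    /** term -> ids of the documents containing it */
    private final Map<String, Set<Integer>> postings = new TreeMap<String, Set<Integer>>();

    public void add(int docId, String text) {
        for (String term : indexTerms(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    /** Returns the documents that contain every query term. */
    public Set<Integer> search(String query) {
        Set<Integer> hits = null;
        for (String term : queryTerms(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<Integer>());
            if (hits == null) {
                hits = new TreeSet<Integer>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        HyphenIndexSketch idx = new HyphenIndexSketch();
        idx.add(1, "b-trunk");
        idx.add(2, "this also relates to 23523-235");
        idx.add(3, "12412-235");
        idx.add(4, "12412-121");

        System.out.println(idx.search("b-trunk"));   // [1]
        System.out.println(idx.search("b trunk"));   // [1]
        System.out.println(idx.search("12412"));     // [3, 4]
        System.out.println(idx.search("12412-235")); // [3]
        System.out.println(idx.search("23523"));     // [2]
    }
}
```

The sketch requires every query term to be present, which is a simplification; it also ignores ranking and term positions, so "b trunk" would match a document containing "trunk b" as well.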



RE: hypens

jpowers
In reply to this post by jpowers
What do you mean by "use index and search analyzers"? Don't you always
have to pass in an analyzer? I am using the StandardAnalyzer in both
cases.

Which analyzer are you recommending I use for this?

-----Original Message-----
From: Ramana Jelda [mailto:[hidden email]]
Sent: Tuesday, April 18, 2006 3:45 AM
To: [hidden email]
Subject: RE: hypens

Hi,
I would use index and search analyzers in this case.
"b-trunk" is analyzed and indexed as b, btrunk, trunk.
The search term "b-trunk" is analyzed by the search analyzer as "btrunk"
and searched, so you will find the result.

Similarly, 12412-235, 12412-121, 12412-etc. are indexed as
12412, 12412235, 235, and so on.
So a search for the term 12412 will obviously find them.


Good luck,
Jelda



RE: hypens

Ramana Jelda
I mean using separate analyzers for indexing and searching.

I would not use any of the standard analyzers provided by Lucene, but
rather implement a custom analyzer, which is not so difficult.


Jelda

> -----Original Message-----
> From: John Powers [mailto:[hidden email]]
> Sent: Tuesday, April 18, 2006 4:53 PM
> To: [hidden email]
> Subject: RE: hypens
>
> What do you mean by "use index and search analyzers"? Don't you always
> have to pass in an analyzer? I am using the StandardAnalyzer in both
> cases.
>
> Which analyzer are you recommending I use for this?

Re: hypens

Yonik Seeley
In reply to this post by jpowers
On 4/18/06, John Powers <[hidden email]> wrote:
> What do you mean by "use index and search analyzers"? Don't you always
> have to pass in an analyzer? I am using the StandardAnalyzer in both
> cases.

I think he means a different analyzer for search than is used for
indexing.  It can make sense in certain cases.

Solr has a WordDelimiterFilter that handles hyphen (and many other)
cases like this.
It can make wi-fi match a query of wifi or "wi fi" or "WiFi".  Solr
also allows easy specification of different analyzers for index vs
query time.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
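
To make the wi-fi example concrete, here is a rough plain-Java sketch of the word-delimiter idea: split a token on non-alphanumeric characters and on lower-to-upper case changes, then index the parts together with their concatenation. It is not Solr's WordDelimiterFilter code and does not reproduce its options; `WordDelimiterSketch`, `parts`, `expand`, and `matches` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of word-delimiter style splitting. */
public class WordDelimiterSketch {

    /** Splits on non-alphanumerics and lower-to-upper case changes; lowercases the parts. */
    public static List<String> parts(String token) {
        List<String> parts = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean caseChange = i > 0
                    && Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(c);
            if ((!Character.isLetterOrDigit(c) || caseChange) && cur.length() > 0) {
                parts.add(cur.toString());
                cur.setLength(0);
            }
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            }
        }
        if (cur.length() > 0) {
            parts.add(cur.toString());
        }
        return parts;
    }

    /** Index-time expansion: the parts, plus their concatenation when there are several. */
    public static List<String> expand(String token) {
        List<String> out = parts(token);
        if (out.size() > 1) {
            out.add(String.join("", out));
        }
        return out;
    }

    /** True if the query yields at least one term and all of its terms were indexed for docToken. */
    public static boolean matches(String docToken, String query) {
        List<String> indexed = expand(docToken);
        int seen = 0;
        for (String word : query.trim().split("\\s+")) {
            for (String term : parts(word)) {
                seen++;
                if (!indexed.contains(term)) {
                    return false;
                }
            }
        }
        return seen > 0;
    }

    public static void main(String[] args) {
        System.out.println(expand("wi-fi"));            // [wi, fi, wifi]
        System.out.println(matches("wi-fi", "wifi"));   // true
        System.out.println(matches("wi-fi", "wi fi"));  // true
        System.out.println(matches("wi-fi", "WiFi"));   // true
    }
}
```

The sketch treats a match as plain term containment and ignores positions, so it says nothing about how phrase queries would behave.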
