
any analyzer will keep punctuation?

Yonghui Zhao
Lucene's standard analyzer removes almost all punctuation.
In some cases we want to keep some of it. For example, in music
search, a singer name or album name can consist of punctuation.

Is there an analyzer that lets us customize which punctuation is removed?

Re: any analyzer will keep punctuation?

Ahmet Arslan
Hi,

Whitespace analyser/tokenizer for example.

Ahmet



On Monday, March 6, 2017 10:21 AM, Yonghui Zhao <[hidden email]> wrote:
Lucene's standard analyzer removes almost all punctuation.
In some cases we want to keep some of it. For example, in music
search, a singer name or album name can consist of punctuation.

Is there an analyzer that lets us customize which punctuation is removed?

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: any analyzer will keep punctuation?

Yonghui Zhao
Yes, the whitespace analyzer keeps punctuation, but it only breaks words on
whitespace.


I didn’t explain my requirement clearly.

I want an analyzer that works like the standard analyzer but keeps certain
configured punctuation.

2017-03-06 18:03 GMT+08:00 Ahmet Arslan <[hidden email]>:

> Hi,
>
> Whitespace analyser/tokenizer for example.
>
> Ahmet

Re: any analyzer will keep punctuation?

Michael McCandless-2
You could use ICUTokenizer and make a custom RuleBasedBreakIterator .rbbi
file to control precisely when splitting should happen, but that language
is complex to configure ;)

Another option is to maybe make a CharFilter ahead of StandardTokenizer
that tries to rewrite the punctuation you want to keep into something that
StandardTokenizer would not split on.
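
As a rough illustration of the CharFilter idea, here is a plain-Java sketch of the rewrite step only. It is not a Lucene CharFilter: the class name PunctuationRewriter and the letters-only placeholders such as xxbangxx are made up for this example, chosen on the assumption that a run of plain letters is something StandardTokenizer does not split. In Lucene itself, a MappingCharFilter built from a NormalizeCharMap would be the natural home for such a mapping, and the same rewrite would have to be applied at both index and query time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java model of the rewrite step; hypothetical, not a Lucene class. */
final class PunctuationRewriter {
    // Punctuation to keep, mapped to a letters-only placeholder (assumed unique in the data).
    private final Map<Character, String> placeholders = new LinkedHashMap<>();

    /** Registers one punctuation character that should survive tokenization. */
    PunctuationRewriter keep(char punctuation, String placeholder) {
        placeholders.put(punctuation, placeholder);
        return this;
    }

    /** Replaces every registered punctuation character with its placeholder. */
    String rewrite(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String placeholder = placeholders.get(c);
            if (placeholder != null) {
                out.append(placeholder);
            } else {
                out.append(c); // everything else is left for the tokenizer to handle
            }
        }
        return out.toString();
    }
}
```

With '!' and '$' registered, "P!nk" becomes "Pxxbangxxnk" and "Ke$ha" becomes "Kexxdollarxxha", while the comma and space between them are left alone. The obvious trade-off is that a placeholder can collide with real text, so it should be a string that cannot occur in your data.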

Mike McCandless

http://blog.mikemccandless.com

On Mon, Mar 6, 2017 at 5:22 AM, Yonghui Zhao <[hidden email]> wrote:

> Yes, the whitespace analyzer keeps punctuation, but it only breaks words on
> whitespace.
>
>
> I didn’t explain my requirement clearly.
>
> I want an analyzer that works like the standard analyzer but keeps certain
> configured punctuation.

Re: any analyzer will keep punctuation?

Ahmet Arslan
In reply to this post by Yonghui Zhao
Hi Zhao,

A whitespace tokenizer followed by a customised word delimiter filter factory would be a solution.
Please see the types attribute of the word delimiter filter for customising how characters are classified.
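
To make the effect of the types attribute more concrete, here is a simplified plain-Java model of the idea. It is only a sketch under stated assumptions: TypeTableSplitter and treatAsLetter are hypothetical names, and the real word delimiter filter does more than this (for example, it can also split on case changes and letter-to-digit transitions). The point is just that a character reclassified as a letter stops acting as a delimiter.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of a word delimiter step with a custom type table; hypothetical. */
final class TypeTableSplitter {
    // Characters reclassified as letters, so they no longer split a token.
    private final Set<Character> treatAsLetter = new HashSet<>();

    TypeTableSplitter treatAsLetter(char c) {
        treatAsLetter.add(c);
        return this;
    }

    private boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || treatAsLetter.contains(c);
    }

    /** Splits one whitespace-delimited token on every character that is not a word character. */
    List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (isWordChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                parts.add(current.toString()); // delimiter reached: close the current part
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```

With the default table, the token "P!nk-live," splits into "P", "nk" and "live". After treatAsLetter('!'), the same token splits into "P!nk" and "live".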

ahmet




Re: any analyzer will keep punctuation?

Ralph Soika
In reply to this post by Yonghui Zhao
What you can do is add a custom search field with the singer name
to the document to be indexed:

     doc.add(new StringField("singername", myValue, Store.NO));

Then you query your index like this:

    String myquery = "(singername:\"" + searchphrase + "\") OR (" +
searchphrase + ")";

In this case the value of the singername field is indexed as a single
token and is not analyzed by the standard analyzer.
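
A search phrase made of punctuation will often contain characters that the classic query syntax treats as operators, so the string concatenation above probably needs escaping. The following is a minimal plain-Java sketch of that step, assuming the classic query parser syntax; SingerQueryBuilder is a hypothetical helper, and its escape method only mirrors what Lucene's own QueryParser.escape is documented to do. It may also be worth checking how the parser's analyzer treats the singername clause, since query text is analyzed too; a keyword analyzer for that field or a directly constructed TermQuery are possible ways around it.

```java
/** Hypothetical helper that builds the two-clause query string with escaping. */
final class SingerQueryBuilder {
    // Characters the classic query syntax treats as special.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every special character with a backslash. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Exact match on the singername field, OR a normal search on the default field. */
    static String build(String searchPhrase) {
        String escaped = escape(searchPhrase);
        return "(singername:\"" + escaped + "\") OR (" + escaped + ")";
    }
}
```

For the phrase P!nk this yields (singername:"P\!nk") OR (P\!nk).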



Re: any analyzer will keep punctuation?

Yonghui Zhao
In reply to this post by Ahmet Arslan
Hi Ahmet,

Thanks for your reply, but I didn't quite get your idea.
I want an analyzer like the standard analyzer but with customizable
punctuation handling.
I think one way is to build a custom analyzer around a custom tokenizer
modeled on StandardTokenizer.
For that tokenizer I would have to rewrite StandardTokenizerImpl, which
seems a little complicated.
I don't understand how "a customised word delimiter filter factory" works
together with a tokenizer.


2017-03-06 22:26 GMT+08:00 Ahmet Arslan <[hidden email]>:

> Hi Zhao,
>
> A whitespace tokenizer followed by a customised word delimiter filter
> factory would be a solution.
> Please see the types attribute of the word delimiter filter for customising
> how characters are classified.
>
> ahmet

Re: Re: any analyzer will keep punctuation?

380382856@qq.com
I think Ahmet is right. The whitespace tokenizer will split the document into tokens, and then a custom filter can delete the punctuation you want to remove. Implementing a custom filter is not very difficult.
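
A sketch of what such a filter might do, modeled on plain strings rather than on a Lucene token stream. The class name KeepListPunctuationFilter and its keep-list parameter are assumptions made for this example; a real filter would extend Lucene's TokenFilter and work on the term attribute instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-string model of "whitespace tokenizer + custom punctuation filter"; hypothetical. */
final class KeepListPunctuationFilter {
    private final String keep; // punctuation characters that must survive

    KeepListPunctuationFilter(String keep) {
        this.keep = keep;
    }

    /** Splits on whitespace, then strips every non-letter, non-digit character not on the keep list. */
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String cleaned = strip(raw);
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned); // tokens that were pure unwanted punctuation are dropped
            }
        }
        return tokens;
    }

    private String strip(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c) || keep.indexOf(c) >= 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With a keep list of "!$", the text "P!nk - Try, (live)" gives the tokens P!nk, Try and live. Unlike the standard analyzer, this model never splits inside a whitespace-delimited token, so "rock-n-roll" would become "rocknroll" rather than three tokens.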




Re: any analyzer will keep punctuation?

Ahmet Arslan
In reply to this post by Yonghui Zhao
Hi,

Please see wdftypes.txt in the source tree for an example.
It is passed to the word delimiter filter factory through its types argument.
Also see the hashtag example: https://issues.apache.org/jira/browse/SOLR-2059
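
For orientation, the sketch below shows the kind of rules such a types file holds and how they could be read into a table. It assumes the "character => TYPE" line format with "#" comment lines that the example file uses, with a backslash to escape a literal "#". WdfTypeRules is a hypothetical plain-Java parser written for this illustration, not the factory's own code, and it does not validate the type names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for "character => TYPE" rule lines. */
final class WdfTypeRules {

    /** Parses rule lines into a character-to-type table, skipping blank and comment lines. */
    static Map<Character, String> parse(List<String> lines) {
        Map<Character, String> types = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line or comment
            }
            int arrow = trimmed.indexOf("=>");
            if (arrow < 0) {
                throw new IllegalArgumentException("Not a rule: " + line);
            }
            String left = trimmed.substring(0, arrow).trim();
            String type = trimmed.substring(arrow + 2).trim();
            types.put(unescape(left), type);
        }
        return types;
    }

    /** Accepts a single character, a backslash-escaped character, or a \\uXXXX escape. */
    private static char unescape(String left) {
        if (left.length() == 1) {
            return left.charAt(0);
        }
        if (left.length() == 6 && left.startsWith("\\u")) {
            return (char) Integer.parseInt(left.substring(2), 16);
        }
        if (left.length() == 2 && left.charAt(0) == '\\') {
            return left.charAt(1);
        }
        throw new IllegalArgumentException("Unsupported character spec: " + left);
    }
}
```

Feeding it the lines "\# => ALPHA" and "@ => ALPHA" produces a table in which both '#' and '@' are classified as ALPHA, which is the hashtag-style setup: tokens such as #lucene and @apache are then no longer split at the leading symbol.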

Ahmet



On Wednesday, March 8, 2017 6:22 AM, Yonghui Zhao <[hidden email]> wrote:
Hi Ahmet,

Thanks for your reply, but I didn't quite get your idea.
I want an analyzer like the standard analyzer but with customizable
punctuation handling.
I think one way is to build a custom analyzer around a custom tokenizer
modeled on StandardTokenizer.
For that tokenizer I would have to rewrite StandardTokenizerImpl, which
seems a little complicated.
I don't understand how "a customised word delimiter filter factory" works
together with a tokenizer.


