Solr - Remove specific punctuation marks

Daisy
Hi;

I am working with apache-solr-3.6.0 on a Windows machine. I would like to remove all punctuation marks before indexing, except the colon and the full stop.

I tried:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[\p{Punct}&&[^\.^\:]]" replacement="" replace="all"/>
  </analyzer>
</fieldType>

But it didn't work. Any ideas?

RE: Solr - Remove specific punctuation marks

steve_rowe
Hi Daisy,

I can't see anything wrong with the regex or the XML syntax.

One possibility: if it's Arabic you're matching against, you may want to add ARABIC FULL STOP U+06D4 to the set you subtract from \p{Punct}.

If you give an example of your input and your expected output, I might be able to help more.

Steve

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Solr - Remove specific punctuation marks

Daisy
Yes, I am trying to index an Arabic document. The problem is that the && in the regex isn't understood in the Solr schema, and it gives a 500 error.
Here is an example:

input:

هذا مثال: للتوضيح (مثال علي علامات الترقيم) انتهي.

I also tried the regex:  pattern="([\(\)\}\{\,[^.:\s+\S+]])"
but it failed to remove the brackets from the text above; when I searched for a bracket, I still got a result.

RE: Solr - Remove specific punctuation marks

Markus Jelsma-2


 
 
-----Original message-----
> From:Daisy <[hidden email]>
> Sent: Mon 24-Sep-2012 15:09
> To: [hidden email]
> Subject: RE: Solr - Remove specific punctuation marks
>
> Yes, I am trying to index an Arabic document. The problem is that the &&
> in the regex isn't understood in the Solr schema, and it gives a 500
> error.

The config is XML. Try encoding the ampersand as &amp;
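
A minimal sketch of what the escaped attribute might look like, assuming the rest of the field type stays as in the original post. Inside an XML attribute each literal & has to be written as &amp;, so the regex intersection operator && becomes &amp;&amp;. The excluded set is written here as [^.:] rather than [^\.^\:], on the assumption that the second ^ was not meant literally: inside a Java character class, a ^ that is not the first character is an ordinary character, so the original pattern would also have kept ^ in the text.

```xml
<!-- Sketch only: same filter as in the original post, with the ampersands escaped for XML. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="[\p{Punct}&amp;&amp;[^.:]]"
        replacement="" replace="all"/>
```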


RE: Solr - Remove specific punctuation marks

Daisy
I tried &amp; and it solved the 500 error. But searches still find punctuation marks,
even though the parsed query doesn't contain the punctuation mark:

<str name="rawquerystring">"{"</str>
<str name="querystring">"{"</str>
<str name="parsedquery">text:</str>
<str name="parsedquery_toString">text:</str>

but numFound is still 1:

<result name="response" numFound="1" start="0">

and the highlighting shows the punctuation mark:
<em>{</em>
The steps I followed:
1- Edit the schema
2- Restart the server
3- Delete the file
4- Index the file

Re: Solr - Remove specific punctuation marks

Jack Krupansky-2
1. Which query parser are you using?
2. I see the following comment in the Java 6 doc for regex "\p{Punct}":
"POSIX character classes (US-ASCII only)", so if any of the punctuation is
some higher Unicode character code, it won't be matched/removed.
3. It seems very odd that the parsed query has empty terms - normally the
query parsers will ignore terms that analyze to zero tokens. Maybe your "{"
is not an ASCII left brace code and is (apparently) unprintable in the
parsed query. Or, maybe there is some encoding problem in the analyzer.
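
If the text does contain punctuation outside US-ASCII (for example ARABIC COMMA, U+060C), one option is a Unicode category instead of the POSIX class. The following is a sketch, not something tested in this thread: \p{P} is Java's Unicode punctuation category, and \p{S} (symbols) is added on the assumption that characters such as $, +, < and | should also be removed, since \p{Punct} covers them but \p{P} does not. To keep ARABIC FULL STOP as well, \u06D4 could be added to the excluded set.

```xml
<!-- Sketch only: Unicode-aware variant; the colon and full stop are kept. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="[\p{P}\p{S}&amp;&amp;[^.:]]"
        replacement="" replace="all"/>
```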

-- Jack Krupansky


Re: Solr - Remove specific punctuation marks

Jack Krupansky-2
I tried it, and PatternReplaceFilterFactory (PRFF) is indeed generating an empty token. I don't know how
Lucene will index or query an empty term. I mean, what it "should" do. In
any case, it is best to avoid them.

You should be using a "charFilter" to simply filter raw characters before
tokenizing. So, try:

<charFilter class="solr.PatternReplaceCharFilterFactory"/>

It has the same pattern and replacement attributes.
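
As a sketch, the field type from the original post might be rewritten like this, assuming the goal is still to strip ASCII punctuation other than the colon and full stop. The charFilter element goes before the tokenizer, and the && of the regex is escaped as &amp;&amp; because the schema is XML.

```xml
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs on the raw character stream, before the tokenizer sees it. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[\p{Punct}&amp;&amp;[^.:]]"
                replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because the characters are removed before tokenizing, a stand-alone "(" should leave only whitespace behind, so the whitespace tokenizer has nothing to emit for it.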

-- Jack Krupansky


Re: Solr - Remove specific punctuation marks

Daisy
In reply to this post by Jack Krupansky-2
How can I tell which query parser I am using?
Here is the part of my schema that I am using:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\()" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<field name="text" type="text_ar" indexed="true" stored="true" termVectors="true" multiValued="true"/>

As shown, even when I tried to remove only "(", the same thing happened with the parsed query and with numFound.

Re: Solr - Remove specific punctuation marks

Daisy
In reply to this post by Jack Krupansky-2
Thanks. It finally works using:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\()" replacement="" replace="all"/>

I wonder what the reason for that is, and what the difference is between the filter and the charFilter?

Re: Solr - Remove specific punctuation marks

Jonathan Rochkind
In reply to this post by Jack Krupansky-2
When I do things like this and want to avoid empty tokens, even though
previous analysis might produce some, I just throw one of these at the
end of my analysis chain:

         <!-- get rid of empty string tokens. max is required, although
              we don't really care. -->
         <filter class="solr.LengthFilterFactory" min="1" max="9999"/>

A charfilter to filter raw characters can certainly still result in an
empty token, if an initial token was composed solely of chars you wanted
to filter out!  In which case you probably want the token to be deleted
entirely, not still there as an empty token. The above length filter is
one way to do that, although it unfortunately requires specifying a 'max'
even though I didn't actually want to filter on the high end. Oh well.
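
A sketch of where the length filter might sit when the token-level PatternReplaceFilterFactory is kept, assuming the field type from the original post (with the ampersands escaped for XML):

```xml
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- May turn a token such as "(" into an empty string. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[\p{Punct}&amp;&amp;[^.:]]"
            replacement="" replace="all"/>
    <!-- Last in the chain: keeps only tokens of 1 to 9999 characters, so empty ones are dropped. -->
    <filter class="solr.LengthFilterFactory" min="1" max="9999"/>
  </analyzer>
</fieldType>
```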



Re: Solr - Remove specific punctuation marks

Walter Underwood
In reply to this post by Jack Krupansky-2
I've had problems with empty tokens. You can remove those with this as a step in the analyzer chain.

        <filter class="solr.LengthFilterFactory" min="1" max="1024"/>

wunder


Re: Solr - Remove specific punctuation marks

Daisy
Using "solr.LengthFilterFactory" worked great and also solved the problem I had with PatternReplaceFilter, so now I have two solutions. Thanks, all, for helping me. One thing I would still like to know: what is the difference between PatternReplaceFilter and PatternReplaceCharFilter?

Re: Solr - Remove specific punctuation marks

Shawn Heisey-4
On 9/24/2012 11:37 AM, Daisy wrote:
> One thing I would still like to know: what is the difference between
> PatternReplaceFilter and PatternReplaceCharFilter?

The CharFilter version gets applied before anything else, including the
Tokenizer.  The Filter version gets applied in the order specified in
the schema file.  I would imagine that if you are allowed to specify
multiple CharFilter entries (which I have never tested), they would be
applied in the order they occur, all of them before the Tokenizer.
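
A sketch of that ordering, with patterns and filters chosen only for illustration. Solr does accept more than one charFilter in an analyzer; they run in the order listed, then the tokenizer, then the token filters in the order listed.

```xml
<analyzer>
  <!-- 1. First char filter: removes "(" from the raw text. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\(" replacement=""/>
  <!-- 2. Second char filter: sees the output of the first and removes ")". -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\)" replacement=""/>
  <!-- 3. Tokenizer: splits the filtered text on whitespace. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- 4. Token filters: applied to each token, top to bottom. -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.LengthFilterFactory" min="1" max="9999"/>
</analyzer>
```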

Thanks,
Shawn