[jira] [Created] (SOLR-13242) RegexReplaceProcessorFactory not making accurate replacement


JIRA jira@apache.org
Edwin Yeo Zheng Lin created SOLR-13242:
------------------------------------------

             Summary: RegexReplaceProcessorFactory not making accurate replacement
                 Key: SOLR-13242
                 URL: https://issues.apache.org/jira/browse/SOLR-13242
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 7.6
            Reporter: Edwin Yeo Zheng Lin


We are using the RegexReplaceProcessorFactory with the following configuration:

 

 <processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\s*\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
 </processor>

 

The regex pattern (\s*\n){2,} works as expected on [regex101.com|http://regex101.com/]: every run of two or more consecutive \n (with any whitespace around them) is replaced by exactly two <br>.

However, in Solr there are cases (Examples 2 and 3 below) that end up with four <br> in a row. This should not happen, since the pattern is meant to replace a run with two <br> regardless of how many \n it contains.
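
For comparison outside Solr, the pattern can be run through plain java.util.regex. This is a minimal sketch, not Solr's code: it assumes the processor compiles the pattern with default flags and treats the replacement as a literal string, and the names RegexCheck and collapse are made up for illustration. The input is the Example 2 string from this report, typed with ordinary spaces and \n only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the pattern; not part of Solr.
final class RegexCheck {
    // Same pattern as in the configuration, compiled with default flags (assumption).
    private static final Pattern BLANK_LINES = Pattern.compile("(\\s*\\n){2,}");

    // Replaces every run matched by the pattern with a literal "<br><br>".
    static String collapse(String content) {
        return BLANK_LINES.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 from the report, using only ASCII spaces and line feeds.
        String example2 =
                "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore";
        System.out.println(collapse(example2));
    }
}
```

With ordinary spaces and line feeds, each whitespace run is matched as a single unit, so this input gets one <br><br> per run. If the indexed content looks different, the text reaching the processor may not be identical to the string shown in the report.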

 

 

Example 1: A passage where the above regex pattern works correctly

*Original content in EML file:*  

Dear Sir, 

 

I am terminating 

*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating

*Index content:*     Dear Sir,  <br><br>I am terminating 

 

Example 2: A passage where the above regex pattern only partially works (instead of 2 <br>, there are 4 <br>)

*Original content in EML file:*    

_exalted_

_Psalm 89:17_

 

3 Choa Chu Kang Avenue 4    

*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore

*Index content:* exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore

 

Example 3: Another passage where the above regex pattern only partially works (instead of 2 <br>, there are 4 <br>)

*Original content in EML file:*    

[http://www.concordpri.moe.edu.sg/]

 

 

 

 

On Tue, Dec 18, 2018 at 10:07 AM    

*Original content:* [http://www.concordpri.moe.edu.sg/]   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM 

*Index content:* [http://www.concordpri.moe.edu.sg/]   <br><br>  <br><br>On Tue, Dec 18, 2018 at 10:07 AM
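
The spaces left between the two <br><br> pairs in Examples 2 and 3 would be consistent with some character in the run not being matched by \s. One hypothesis, which this report does not confirm, is that the EML text contains a whitespace-like character outside Java's default \s class, such as a no-break space (U+00A0). A minimal sketch for checking that, with made-up names (WhitespaceDump, codePoints) and a made-up input:

```java
import java.util.stream.Collectors;

// Hypothetical diagnostic helper; not part of Solr.
final class WhitespaceDump {
    // Renders every code point of the input as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Made-up input: two blank-line runs separated by two no-break spaces.
        String run = " \n\n\u00A0\u00A0\n\n";
        System.out.println(codePoints(run));
        // With default flags, \s does not match U+00A0, so the pattern
        // matches twice and the no-break spaces stay between the results.
        System.out.println(run.replaceAll("(\\s*\\n){2,}", "<br><br>"));
    }
}
```

Dumping the code points of the real field value in the same way would show whether the characters between the line breaks are plain spaces (U+0020) or something else.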


