RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo
Hi,

I am trying to use the RegexReplaceProcessorFactory to replace any run of two
or more \n, with any number of spaces between them (e.g. \n\n, \n \n,
\n \n  \n \n), with two <br>.

The following regex pattern works when I test it on regex101.com, but it does
not work when I put it inside the RegexReplaceProcessorFactory as below:

<updateRequestProcessorChain name="removeCode">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <str name="pattern">"(\\n\s*){2,}"</str>
    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  </processor>
</updateRequestProcessorChain>

To explain my regex pattern: \s* matches any whitespace that follows a \n,
and {2,} requires two or more occurrences of that group (a \n plus its
trailing whitespace).

Please let me know what is wrong and how I should do it.

I am using Solr 7.6.0.

Regards,
Edwin
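
A minimal Java sketch of how java.util.regex (assumed here to be the engine behind RegexReplaceProcessorFactory) would read the pattern text from the configuration above. The class name and sample strings are made up for illustration. Inside the <str> element the double quotes are ordinary characters, and \\n means a literal backslash followed by the letter n, so the quoted form looks for the two-character sequence \n between quotes rather than for real line breaks.

```java
import java.util.regex.Pattern;

// Illustration only: compares the quoted, double-backslash pattern with the plain one.
class QuotedPatternDemo {
    // Pattern text exactly as written inside <str name="pattern">...</str>,
    // including the surrounding double quotes and the doubled backslash.
    static final String QUOTED = "\"(\\\\n\\s*){2,}\"";
    // Same pattern without the quotes and with a single backslash before n.
    static final String PLAIN = "(\\n\\s*){2,}";

    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Text containing real line-feed characters.
        String realNewlines = "Dear Sir,\n \n\nI am terminating";
        // Literal characters: quote, backslash, n, space, backslash, n, quote.
        String literalBackslashN = "\"\\n \\n\"";

        System.out.println(found(QUOTED, realNewlines));      // false
        System.out.println(found(QUOTED, literalBackslashN)); // true
        System.out.println(found(PLAIN, realNewlines));       // true
    }
}
```

On regex101 the same text may be interpreted differently depending on the flavour and on how the pattern is pasted, which could explain why a pattern that passes there behaves differently in the config file.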

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

paul.dodd
You don’t say what happens, only that it is not working. I assume nothing is replaced? Perhaps the pattern should be the following?

   <str name="pattern">"(\n\s*){2,}"</str>




Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo
Hi Paul,

Thanks for your reply.

When I use this pattern:
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n+\s*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

It works for some passages within the same content but not for others.
Please see below for one passage where it works and two where it only
partially works:

Example 1: A passage where the above regex pattern works correctly
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *    Dear Sir,  <br><br>I am terminating

Example 2: A passage where the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A passage where the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM

We would appreciate your help in working out what is wrong.

Thank you.

Regards,
Edwin
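
To separate the regex from Solr itself, the pattern can be run in plain Java. The sketch below is illustrative only: the class and method names are invented, and it assumes the whitespace in Examples 1 and 2 consists of nothing but ordinary spaces (U+0020) and line feeds (U+000A). Under that assumption each run collapses to a single <br><br>, which is what regex101 shows, so the sketch does not reproduce the extra <br> reported from Solr.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: applies (\n+\s*){2,} to the sample passages outside Solr.
class CollapseDemo {
    // Replaces every match of the regex with a literal <br><br>.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        String regex = "(\\n+\\s*){2,}";
        // Assumption: only plain spaces and line feeds between the words.
        String example1 = "Dear Sir,  \n\n \n \n\n I am terminating";
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa";

        // prints: Dear Sir,  <br><br>I am terminating
        System.out.println(collapse(regex, example1));
        // prints: exalted  <br><br>Psalm 89:17   <br><br>3 Choa
        System.out.println(collapse(regex, example2));
    }
}
```

If the real field value produces a different result with the same pattern, that would suggest the stored text differs from what is shown in the examples.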


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

paul.dodd
To avoid «\n+\s*» matching too many \n and then failing on the {2,} part, you could try

<str name="pattern">(\n\s*){2,}</str>

If you also want to match CRLF, then

<str name="pattern">(\r?\n\s*){2,}</str>
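
A small Java sketch of what the optional \r changes, assuming Windows-style CRLF line endings in the input. The class name and the sample text are hypothetical. Because \s already matches \r, the pattern without \r? still finds the run, but it starts at the first \n and leaves the leading \r behind.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: compares the two suggested patterns on CRLF input.
class CrlfDemo {
    // Replaces every match of the regex with a literal <br><br>.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        String crlfText = "line one\r\n\r\nline two";

        String lfOnly = collapse("(\\n\\s*){2,}", crlfText);
        String crlfAware = collapse("(\\r?\\n\\s*){2,}", crlfText);

        // Show any surviving carriage return as a visible \r.
        System.out.println(lfOnly.replace("\r", "\\r"));    // line one\r<br><br>line two
        System.out.println(crlfAware.replace("\r", "\\r")); // line one<br><br>line two
    }
}
```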






Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo
Hi Paul,

We have tried the suggested regex pattern as follows:
<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\n\s*){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

But we still have exactly the same problem with Examples 1, 2 and 3 as in my
previous message.

Any further suggestions?

Thank you.

Regards,
Edwin


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

paul.dodd
Hi Edwin,

  1.  Sorry, the pattern was wrong: the whitespace should precede the \n, i.e. <str name="pattern">(\s*\n){2,}</str>
  2.  Perhaps the data contains other (non-printing) characters besides \n?
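
The effect of moving \s* in front of the \n can be sketched in plain Java. This is an illustration with invented names, and it again assumes the sample contains only ordinary spaces and line feeds. Both orderings collapse each run to one <br><br>; they differ only in whether the spaces before or after the run are consumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: (\n\s*){2,} versus (\s*\n){2,} on the same passage.
class WhitespaceOrderDemo {
    // Replaces every match of the regex with a literal <br><br>.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Assumption: only plain spaces and line feeds between the words.
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa";

        // Whitespace after each \n is consumed, spaces before the run are kept.
        // prints: exalted  <br><br>Psalm 89:17   <br><br>3 Choa
        System.out.println(collapse("(\\n\\s*){2,}", example2));

        // Whitespace before each \n is consumed, spaces after the run are kept.
        // prints: exalted<br><br>   Psalm 89:17<br><br>  3 Choa
        System.out.println(collapse("(\\s*\\n){2,}", example2));
    }
}
```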




Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo
Hi Paul,

We have tried it with the whitespace preceding the \n, i.e. with the
following configuration:

<processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(\s*\n){2,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
</processor>

However, we get exactly the same results as in the earlier Examples 1, 2
and 3.

As for your point 2, that the data may contain other (non-printing)
characters besides \n: we have found no non-printing characters. It is just
a line break followed by a space. You can refer to the original content in
the same examples below.


Example 1: A passage where the above regex pattern works correctly
*Original content in EML file:*
Dear Sir,


I am terminating
*Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
*Index content: *    Dear Sir,  <br><br>I am terminating

Example 2: A passage where the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

*exalted*

*Psalm 89:17*


3 Choa Chu Kang Avenue 4
*Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
Chu Kang Avenue 4, Singapore
*Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
Chu Kang Avenue 4, Singapore

Example 3: A passage where the above regex pattern only partially works
(as you can see, instead of 2 <br>, there are 4 <br>)
*Original content in EML file:*

http://www.concordpri.moe.edu.sg/








On Tue, Dec 18, 2018 at 10:07 AM
*Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n
\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018
at 10:07 AM
*Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
Tue, Dec 18, 2018 at 10:07 AM


We would appreciate any other ideas or suggestions that you may have.

Thank you.

Regards,
Edwin
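
One way to double-check for non-printing characters is to dump the code points of the raw field value instead of inspecting it by eye. The helper below is a hypothetical sketch, not part of Solr: it echoes printable ASCII as-is and writes everything else as [U+XXXX]. It also shows that, with Java's default settings, \s does not match a no-break space (U+00A0), which looks identical to an ordinary space. Whether the data in this thread contains such characters is not known.

```java
import java.util.regex.Pattern;

// Illustration only: makes invisible characters in a string visible.
class WhitespaceInspector {
    // Keeps printable ASCII (including the plain space) and renders every
    // other code point as [U+XXXX].
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp >= 0x20 && cp <= 0x7E) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("[U+%04X]", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: a plain space, a no-break space, then a line feed.
        String sample = "17 \u00A0\n3";

        System.out.println(describe(sample));                 // 17 [U+00A0][U+000A]3
        System.out.println(Pattern.matches("\\s", " "));      // true
        System.out.println(Pattern.matches("\\s", "\u00A0")); // false
    }
}
```

Running describe on the exact string that reaches the update chain would confirm whether the gaps between the \n characters are really plain spaces.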


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo
Hi Paul,

Regarding the regex (\n\s*){2,} that we are using: when we try it on
https://regex101.com/, it gives the correct result for all the examples
(i.e. each of them ends up with only <br><br>, and not more than that,
unlike what we get from Solr in our earlier examples).

Could this be a bug in Solr?

Regards,
Edwin

On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <[hidden email]>
wrote:

> Hi Paul,
>
> We have tried it with the space preceeding the \n i.e. <str
> name="pattern">(\s*\n){2,}</str>, with the following regex pattern:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\s*\n){2,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
> </processor>
>
> However, we are also getting the exact same results as the earlier Example
> 1, 2 and 3.
>
> As for your point 2 on perhaps in the data you have other (non printing)
> characters than \n, we have find that there are no non printing characters.
> It is just next line with a space. You can refer to the original content in
> the same examples below.
>
>
> Example 1: The sentence that the above regex pattern is working correctly
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Index content: *    Dear Sir,  <br><br>I am terminating
>
> Example 2: The sentence that the above regex pattern is partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> Chu Kang Avenue 4, Singapore
> *Index content: *exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa
> Chu Kang Avenue 4, Singapore
>
> Example 3: The sentence that the above regex pattern is partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> http://www.concordpri.moe.edu.sg/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18,
> 2018 at 10:07 AM
> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On
> Tue, Dec 18, 2018 at 10:07 AM
>
>
> Appreciate any other ideas or suggestions that you may have.
>
> Thank you.
>
> Regards,
> Edwin

Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo
Hi,

Should we report this as a bug in Solr?

Regards,
Edwin
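
Before filing a bug, it may help to rule out the regex itself. As far as I know, RegexReplaceProcessorFactory compiles the pattern with java.util.regex, so the behaviour can be checked in a few lines of plain Java outside Solr. The sketch below is only an illustration: the class name NewlineCollapseCheck and the test string containing U+00A0 are assumptions, not something taken from the actual data.

```java
import java.util.regex.Pattern;

class NewlineCollapseCheck {
    // Same pattern as in the processor config: a newline followed by optional
    // whitespace, repeated two or more times.
    static final Pattern RUN = Pattern.compile("(\\n\\s*){2,}");

    static String collapse(String content) {
        return RUN.matcher(content).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // Example 2 from the thread, assuming only real newlines and ASCII spaces.
        String plain = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4";
        System.out.println(collapse(plain));

        // Hypothetical variant: a no-break space (U+00A0) sits inside the run.
        String withNbsp = "Psalm 89:17   \n\n \u00A0 \n\n  3 Choa";
        System.out.println(collapse(withNbsp));
    }
}
```

With only real newlines and ASCII spaces, each run should collapse into a single <br><br>. By default, \s in Java matches only [ \t\n\x0B\f\r], so any other character inside the run (the no-break space here is just a guess) would split it into two matches, which would look like the doubled output in Examples 2 and 3.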


Re: RegexReplaceProcessorFactory pattern to detect multiple \n

Zheng Lin Edwin Yeo
Hi,

For your information, this issue also occurs in Solr 7.7.0.

Regards,
Edwin
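
Since the behaviour is the same across versions, it might be worth inspecting the raw field value for characters that look blank but are not covered by Java's default \s. A minimal sketch, assuming the content is available as a Java String; WhitespaceInspector is a hypothetical helper, not part of Solr:

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceInspector {
    // Lists every character that looks blank (space separator, control or
    // format character) but that Java's default \s class would not match.
    static List<String> unmatchedWhitespace(String content) {
        List<String> found = new ArrayList<>();
        content.codePoints().forEach(cp -> {
            // Java's default \s is exactly [ \t\n\x0B\f\r].
            boolean javaDefaultWs = cp == ' ' || cp == '\t' || cp == '\n'
                    || cp == 0x0B || cp == '\f' || cp == '\r';
            boolean looksBlank = Character.isSpaceChar(cp)
                    || Character.isISOControl(cp)
                    || Character.getType(cp) == Character.FORMAT;
            if (looksBlank && !javaDefaultWs) {
                found.add(String.format("U+%04X", cp));
            }
        });
        return found;
    }
}
```

If this reports something such as U+00A0, one option would be to prefix the pattern with (?U) so that \s follows the Unicode White_Space property. That is a suggestion to test, not a confirmed fix.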
