Profanity

Profanity

Sadiki Latty
Hey

I would like to find a solution to flag profanity at index time. Ideally, it would function like stopwords, in the sense that I could have a predefined list that is read and, if a token is on the list, the document is 'flagged' in a different field. Does anyone know of an existing solution (other than building my own)? If none exists and I end up building my own, would I be doing this in the update processor (UpdateRequestProcessor) phase? I am still fairly new to Solr, but from what I've read, that seems to be the best place to look.


Thanks,

Sid

Re: Profanity

John Blythe-2
You could use the keep-words functionality (solr.KeepWordFilterFactory). Have a
field whose analyzer keeps only profanity; you can then query that field to
tell documents where it is empty from documents where it contains profane text.
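
A minimal sketch of the keep-words idea, written in plain Python rather than Solr configuration, so it only illustrates the behaviour. In Solr itself this would be a field type whose analyzer includes solr.KeepWordFilterFactory pointed at a word-list file. The word list, the tokenizer regex, and the function name `keep_words` are assumptions made for illustration.

```python
import re

# Hypothetical keep-list; in Solr this would live in a keep-words file.
KEEP_WORDS = {"damn", "merde"}

def keep_words(text, keep=KEEP_WORDS):
    """Return only the tokens that appear in the keep-list (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t in keep]

clean = keep_words("A perfectly polite sentence.")
flagged = keep_words("Well, damn. Merde alors!")

# An empty result means nothing survived the filter, i.e. no listed profanity.
print(clean)    # []
print(flagged)  # ['damn', 'merde']
```

One caveat, as far as I understand Solr's analysis chain: a filter like this changes the indexed terms, not the stored value, so "does the field have data" is best checked with a query against the indexed terms rather than by looking at the stored text.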

--
John Blythe

On Mon, Jan 8, 2018 at 3:12 PM, Sadiki Latty <[hidden email]> wrote:

> Hey
>
> I would like to find a solution to flag profanity at index time.
> Ideally, it would function like stopwords, in the sense that I could
> have a predefined list that is read and, if a token is on the list, the
> document is 'flagged' in a different field. Does anyone know of an
> existing solution (other than building my own)? If none exists and I end
> up building my own, would I be doing this in the update processor
> (UpdateRequestProcessor) phase? I am still fairly new to Solr, but from
> what I've read, that seems to be the best place to look.
>
>
> Thanks,
>
> Sid
>

RE: Profanity

Markus Jelsma-2
In reply to this post by Sadiki Latty
Yes, an UpdateRequestProcessor is the API to implement for these sorts of requirements. In the URP you have access to a SolrInputDocument object that carries the input data. You can inspect the fields, and add, remove or modify fields if you want, or discard the input altogether.

So, check your text input field for 'profanity' and set another boolean field according to whether it matches. Whether you use a list of words, an SVM, or another machine learning algorithm to detect profanity is up to you.
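
The check-and-flag step described above can be sketched as follows. This is a plain-Python stand-in, not Solr code: a real processor would be a Java class extending UpdateRequestProcessor and overriding processAdd, working on the incoming document rather than a dict. The word list, the field names `text` and `has_profanity`, and the function name `process_add` are all assumptions for illustration.

```python
import re

# Hypothetical word list and field names, chosen only for this example.
PROFANITY = {"damn", "merde"}
TEXT_FIELD = "text"
FLAG_FIELD = "has_profanity"

def process_add(doc):
    """Inspect the text field and set a boolean flag field on the document.

    Mirrors what a processor's add hook would do: read a field from the
    incoming document, decide, and write another field before indexing.
    """
    text = doc.get(TEXT_FIELD) or ""
    tokens = re.findall(r"\w+", text.lower())
    doc[FLAG_FIELD] = any(t in PROFANITY for t in tokens)
    return doc

print(process_add({"id": "1", "text": "Damn, that is slow."}))
print(process_add({"id": "2", "text": "All good here."}))
```

The detection itself is isolated in one expression, so swapping the word list for a classifier would not change the surrounding shape.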

Cheers,
Markus
 

RE: Profanity

Davis, Daniel (NIH/NLM) [C]
Fun topic. It raises the same complicated issues as normal search:

Multilingual support? Is "Merde" profanity everywhere, or just in French?
Multi-word synonyms? Does "God Damn" become "goddamn", or do you treat "Damn" and "God damn" the same because you drop "God"?
Is "Merde Alors" the same as "Merde", or is that multi-word synonyms again?

-----Original Message-----
From: Markus Jelsma [mailto:[hidden email]]
Sent: Monday, January 8, 2018 4:42 PM
To: [hidden email]
Subject: RE: Profanity

Yes, an UpdateRequestProcessor is the API to implement for these sorts of requirements. In the URP you have access to a SolrInputDocument object that carries the input data. You can inspect the fields, and add, remove or modify fields if you want, or discard the input altogether.

So, check your text input field for 'profanity' and set another boolean field according to whether it matches. Whether you use a list of words, an SVM, or another machine learning algorithm to detect profanity is up to you.

Cheers,
Markus
 

RE: Profanity

Markus Jelsma-2
In reply to this post by Sadiki Latty
Indeed, hence the small suggestion to use ML for this instead of a dumb set of terms, which is useless in almost any real solution. We have had very good results with SVMs for text processing, although in the end it depends on your input data and the care taken in selecting edge cases.

Regards,
Markus
 
-----Original message-----

> From:Davis, Daniel (NIH/NLM) [C] <[hidden email]>
> Sent: Monday 8th January 2018 23:12
> To: [hidden email]
> Subject: RE: Profanity
>
> Fun topic. It raises the same complicated issues as normal search:
>
> Multilingual support? Is "Merde" profanity everywhere, or just in French?
> Multi-word synonyms? Does "God Damn" become "goddamn", or do you treat "Damn" and "God damn" the same because you drop "God"?
> Is "Merde Alors" the same as "Merde", or is that multi-word synonyms again?

Re: Profanity

Sadiki Latty
Thanks a lot, guys. Multilingual support will also be a hurdle, to be honest. The data will come from only two languages, French and English, but that could still prove challenging, so “merde” will be making the list. This requirement is itself an edge case for my project, so ML may be overkill, which is why I was thinking of a list. The data being inserted comes from sources that we have “control” over; this requirement is simply for the worst-case scenario in which we miss something. We might also want to allow this profanity, which is why we need to flag it rather than strip it altogether.
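
A rough sketch of how a bilingual list could flag a document without stripping anything, while also covering the multi-word variants raised earlier in the thread. It is plain Python for illustration only; the per-language lists, the output field `profanity_langs`, and the helper names `squash` and `flag` are assumptions, not part of Solr.

```python
import re

# Hypothetical per-language lists; multi-word entries are stored as phrases.
LISTS = {
    "en": {"damn", "god damn"},
    "fr": {"merde", "merde alors"},
}

def squash(s):
    """Lowercase and drop everything except letters and digits, so that
    'God Damn', 'god-damn' and 'goddamn' all become 'goddamn'."""
    return re.sub(r"[\W_]+", "", s.lower())

# Map each squashed term to the set of languages whose list contains it.
LOOKUP = {}
for lang, terms in LISTS.items():
    for term in terms:
        LOOKUP.setdefault(squash(term), set()).add(lang)

def flag(doc, field="text", max_words=2):
    """Add 'profanity_langs' to doc; the original text is left untouched."""
    words = re.findall(r"[^\W_]+", doc.get(field, "").lower())
    langs = set()
    # Try every run of 1..max_words adjacent words against the lookup.
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            langs |= LOOKUP.get("".join(words[i:i + n]), set())
    doc["profanity_langs"] = sorted(langs)
    return doc

doc = flag({"id": "7", "text": "Merde alors, this is god-damn slow"})
print(doc["profanity_langs"])  # ['en', 'fr']
print(doc["text"])             # Merde alors, this is god-damn slow
```

Joining adjacent words is a blunt instrument and can produce accidental matches, so the lists would need reviewing against real data.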

This provides me with great direction.


> On Jan 8, 2018, at 5:17 PM, Markus Jelsma <[hidden email]> wrote:
>
> Indeed, hence the small suggestion to use ML for this instead of a dumb set of terms, which is useless in almost any real solution. We have had very good results with SVMs for text processing, although in the end it depends on your input data and the care taken in selecting edge cases.
>
> Regards,
> Markus

Re: Profanity

Sadiki Latty
In reply to this post by John Blythe-2
Thanks for the feedback John,

This is a genius idea if I don’t want to create my own processor. I could simply check that field for data for my reports. Either the field will have data or it won’t.

Thanks

Sid


> On Jan 8, 2018, at 4:38 PM, John Blythe <[hidden email]> wrote:
>
> You could use the keep-words functionality (solr.KeepWordFilterFactory). Have a
> field whose analyzer keeps only profanity; you can then query that field to
> tell documents where it is empty from documents where it contains profane text.
>
> --
> John Blythe

Re: Profanity

John Blythe-2
Gladly. Good luck!

On Mon, Jan 8, 2018 at 8:27 PM Sadiki Latty <[hidden email]> wrote:

> Thanks for the feedback John,
>
> This is a genius idea if I don’t want to create my own processor. I could
> simply check that field for data for my reports. Either the field will have
> data or it won’t.
>
> Thanks
>
> Sid
>
--
John Blythe