Inappropriate content detection

Inappropriate content detection

mackuby
I am trying to figure out whether or not Lucene is an appropriate solution
for a problem that our site faces. Our site allows users to post their
opinions on various topics. Due to government legislation around the world,
our management would like us to scan each user's post against various
keywords that would indicate inappropriate content in the user's posting.
We are looking for racial slurs, profanity and attacks against sexual
orientation. Each user's posting is generally not more than a few
paragraphs.

I would like to analyze each user's post for various words and expressions
before publishing the post to the DB. I am reading through the Lucene in
Action book and it looks as if I cannot analyze a string without first
indexing it. If this is true, will indexing each post be a performance hit to
the site? I was wondering if someone could shed some light on the best way
to tackle this problem with Lucene, or with another API if doing so makes
more sense.

Thanks,

Jeff

Re: Inappropriate content detection

gekkokid
Hi, what scale is this website? Millions of posts or under?

Wouldn't it be easier to use a Bayesian algorithm to scan each new post
before it is posted, to detect whether it is acceptable or not? Just a quick
idea off the top of my head.



_gk

----- Original Message -----
From: "Jeff Thorne" <[hidden email]>
To: <[hidden email]>
Sent: Monday, February 06, 2006 3:56 AM
Subject: Inappropriate content detection


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Inappropriate content detection

Daniel Noll-3
In reply to this post by mackuby
Jeff Thorne wrote:
> I am trying to figure out whether or not Lucene is an appropriate solution
> for a problem that our site faces.
<cut>
> I would like to analyze each user's post for various words and expressions
> before publishing the post to the DB. I am reading through the Lucene in
> Action book and it looks as if I cannot analyze a string without first
> indexing it. If this is true, will indexing each post be a performance hit to
> the site? I was wondering if someone could shed some light on the best way
> to tackle this problem with Lucene, or with another API if doing so makes
> more sense.

You can definitely use Lucene's analyser classes without indexing.  Our
own application does this when it needs to do things like highlighting
text on the screen.

The idea would be you'd have a bunch of terms which are considered
nasty, and then every new document would get analysed, and you would
look through the terms returned from the analyser for the suspicious ones.

But no, it certainly isn't something that Lucene as a whole is very good
at solving.  Lucene is fast for executing a single query against
multiple documents, but what you really need is something fast for
executing multiple queries against a single document.
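
As a rough illustration of the approach described above (analyse the new post, then look through the resulting terms for suspicious ones), here is a minimal plain-Java sketch. It makes several assumptions: the class name WatchListScanner and its methods are hypothetical, the watch-list words are harmless placeholders, and a simple regex split stands in for a real Lucene analyser so that the example has no Lucene dependency. It only handles single words; multi-word expressions would need extra work.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: flags posts containing terms from a fixed watch list. */
final class WatchListScanner {
    private final Set<String> watchList;

    /** Terms are expected in lower case, matching what tokenize() produces. */
    WatchListScanner(Set<String> lowerCaseTerms) {
        this.watchList = new HashSet<>(lowerCaseTerms);
    }

    /** Crude stand-in for an analyser: lower-case, then split on non-letters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns each watch-list term found in the post, once, in order of first appearance. */
    List<String> findFlagged(String post) {
        List<String> hits = new ArrayList<>();
        for (String token : tokenize(post)) {
            if (watchList.contains(token) && !hits.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

Because each token is checked with one hash-set lookup, the work per post grows with the length of the post rather than with the size of the watch list, which is the "many queries against one document" shape mentioned above. A post would be held back for review whenever findFlagged returns a non-empty list.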

Daniel


--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

Re: Inappropriate content detection

jrodenburg
In reply to this post by mackuby
You can generate a token stream for a block of text without having to index
it. Take a look at the highlighter code; it does this very thing.

RE: Inappropriate content detection

mackuby
In reply to this post by gekkokid
The site will have a million+ posts. I am not familiar with Bayesian
algorithms. Is there an off-the-shelf API that can provide this type of
capability? As for performance, would Bayesian be the way to go over Lucene?

Thanks for the help,
Jeff

-----Original Message-----
From: gekkokid [mailto:[hidden email]]
Sent: Sunday, February 05, 2006 8:40 PM
To: [hidden email]
Subject: Re: Inappropriate content detection

RE: Inappropriate content detection

Gwyn Carwardine
The good bit about Bayesian is that it continuously learns.

The downside is that you have to teach it.

Not quite as simple as a list of rude words.

There's an open source Bayesian mail filter called spambayes
(http://spambayes.sourceforge.net) which may lead you to interesting places.
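
To make the "it learns, but you have to teach it" point concrete, here is a small sketch of a two-class naive Bayes scorer in plain Java. It is a textbook version with add-one smoothing, written for illustration only: the class name TinyBayes, its methods, and the training sentences in the usage note are all assumptions, and it is not meant to reproduce how SpamBayes or any other library mentioned in this thread scores text.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical two-class naive Bayes text scorer with add-one smoothing. */
final class TinyBayes {
    private final Map<String, Integer> badCounts = new HashMap<>();
    private final Map<String, Integer> okCounts = new HashMap<>();
    private int badTokens, okTokens, badDocs, okDocs;

    /** Lower-case and split on non-letters; may contain empty strings, which callers skip. */
    private static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
    }

    /** Teach the classifier one labelled example. */
    void train(String text, boolean inappropriate) {
        Map<String, Integer> counts = inappropriate ? badCounts : okCounts;
        for (String t : tokens(text)) {
            if (t.isEmpty()) continue;
            counts.merge(t, 1, Integer::sum);
            if (inappropriate) badTokens++; else okTokens++;
        }
        if (inappropriate) badDocs++; else okDocs++;
    }

    /** Log-odds that the text is inappropriate; positive means "more likely bad". */
    double logOdds(String text) {
        // Vocabulary size, plus one slot for unseen words so denominators are never zero.
        Set<String> vocab = new HashSet<>(badCounts.keySet());
        vocab.addAll(okCounts.keySet());
        double v = vocab.size() + 1.0;

        // Smoothed prior from the number of training examples in each class.
        double score = Math.log((badDocs + 1.0) / (okDocs + 1.0));
        for (String t : tokens(text)) {
            if (t.isEmpty()) continue;
            double pBad = (badCounts.getOrDefault(t, 0) + 1.0) / (badTokens + v);
            double pOk = (okCounts.getOrDefault(t, 0) + 1.0) / (okTokens + v);
            score += Math.log(pBad / pOk);
        }
        return score;
    }
}
```

In use, moderators would call train(post, true) or train(post, false) on posts they have already judged, and new posts with a logOdds above some chosen threshold would be held for review. The quality of the result depends entirely on how much labelled text it has been given, which is the trade-off against a fixed word list.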

-Gwyn

-----Original Message-----
From: Jeff Thorne [mailto:[hidden email]]
Sent: 06 February 2006 13:30
To: [hidden email]
Subject: RE: Inappropriate content detection

Re: Inappropriate content detection

Jason Polites-2
There is also an open source Java anti-spam API which does a Bayesian scan of
email content (plus other stuff).

You could retrofit it to work with raw text.

www.jasen.org

(Get the latest HEAD from CVS, as the current release is a bit old. A new
version is imminent.)

----- Original Message -----
From: "Gwyn Carwardine" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, February 07, 2006 12:58 AM
Subject: RE: Inappropriate content detection


Re: Inappropriate content detection

Daniel Noll-3
Jason Polites wrote:
> There is also an open source Java anti-spam API which does a Bayesian
> scan of email content (plus other stuff).
>
> You could retrofit it to work with raw text.

There is also Classifier4J, which is geared more toward pure
classification (it comes with a Bayesian classifier, but others can be
implemented). Perhaps it's better than retrofitting something more
powerful, perhaps not.

http://classifier4j.sourceforge.net/

Daniel
