Filter spam URLs

Ned Rockson-3
I've been searching the forums for a bit to see if anyone is in the
process of producing a spam-filter heuristic for URLs. I assume that
most spam can't be caught deterministically, but after a crawl of ~50M
URLs, there are a bunch that are obviously spam because their URLs are
simply nonsensical (e.g. 01118273.domain.com, which I would
automatically filter out). Is anyone currently working on this, or has
there been any effort in the past? Also, does anyone know of any
literature published about this? A quick Google search netted only
email spam filters using naive Bayes.

Re: Filter spam URLs

Andrzej Białecki-2
Ned Rockson wrote:
> I've been searching the forums for a bit to see if anyone is in the
> process of producing a spam-filter heuristic for URLs. I assume that
> most spam can't be caught deterministically, but after a crawl of ~50M
> URLs, there are a bunch that are obviously spam because their URLs are
> simply nonsensical (e.g. 01118273.domain.com, which I would
> automatically filter out). Is anyone currently working on this, or has
> there been any effort in the past? Also, does anyone know of any
> literature published about this? A quick Google search netted only
> email spam filters using naive Bayes.

If you have an ACM Digital Library subscription, it is a good source
for published papers on the subject. The same goes for CiteSeer,
although the papers there tend to be older.

Apart from that, I sometimes use the heuristic that all-numeric (or
mostly numeric) URL components indicate spam links, e.g. www.12345.com
or www.example.com/12345/index.html. This is easy to implement as a
URLFilter, alongside other simple checks (e.g. maximum URL length,
maximum number of path levels, presence of special characters, an
abundance of sections that don't look like plain text, ...).
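
To give an idea, here is a minimal sketch of that kind of check in
plain Java. The class name, the length/depth thresholds, and the
"mostly numeric" cutoff are all just assumptions for the example; to
use it in a crawl it would still need to be wrapped in a URLFilter
plugin.

import java.net.MalformedURLException;
import java.net.URL;

/**
 * Illustrative heuristics for "nonsensical" URLs: mostly-numeric host
 * or path components, excessive length, too many path levels. The
 * thresholds are made up for this example; tune them on real crawl data.
 */
public class SpamUrlHeuristics {

  private static final int MAX_URL_LENGTH = 256;  // assumed cutoff
  private static final int MAX_PATH_LEVELS = 8;   // assumed cutoff

  /** Returns true if the URL looks spammy and should be filtered out. */
  public static boolean looksSpammy(String urlString) {
    if (urlString.length() > MAX_URL_LENGTH) {
      return true;
    }
    URL url;
    try {
      url = new URL(urlString);
    } catch (MalformedURLException e) {
      return true; // unparseable URLs are rejected outright
    }
    // Host components, e.g. "01118273" in http://01118273.domain.com/
    for (String part : url.getHost().split("\\.")) {
      if (mostlyNumeric(part)) {
        return true;
      }
    }
    // Path components, e.g. http://www.example.com/12345/index.html
    String[] pathParts = url.getPath().split("/");
    if (pathParts.length > MAX_PATH_LEVELS) {
      return true;
    }
    for (String part : pathParts) {
      if (mostlyNumeric(part)) {
        return true;
      }
    }
    return false;
  }

  /** True if more than half the characters of a non-trivial component are digits. */
  private static boolean mostlyNumeric(String component) {
    if (component.length() < 4) {
      return false; // short tokens like "v2" are usually harmless
    }
    int digits = 0;
    for (char c : component.toCharArray()) {
      if (Character.isDigit(c)) {
        digits++;
      }
    }
    return digits * 2 > component.length();
  }
}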

Other techniques depend on link-graph analysis. Especially interesting
is to collect per-host, per-domain, or per-subdomain link statistics,
both outgoing and incoming. This requires writing a relatively simple
map-reduce job to aggregate the results per host from an existing
linkdb.
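
As a rough sketch of what such a job could look like: the example
below assumes the linkdb has already been dumped to text with one
"targetUrl<TAB>sourceUrl" pair per line (reading the linkdb's
SequenceFiles directly would use Nutch's own Inlinks records instead),
and all class names are invented for the illustration.

import java.io.IOException;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Counts incoming links per host from a text dump of the linkdb,
 * one "targetUrl<TAB>sourceUrl" pair per line.
 */
public class HostInlinkStats {

  public static class HostMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text host = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) return;
      try {
        // Aggregate by the host of the link target; use the source
        // host instead to get per-host outlink statistics.
        host.set(new URL(fields[0]).getHost());
        context.write(host, ONE);
      } catch (Exception e) {
        // skip malformed URLs
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "host inlink stats");
    job.setJarByClass(HostInlinkStats.class);
    job.setMapperClass(HostMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}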

You can use the results in many interesting ways. Here's a broad
overview of some strategies: you could detect dense link communities,
which may indicate spammy reciprocal linking, or detect domains with an
abundance of links to/from known spam sites, etc. This data could then
be used on the fly (in a URLFilter), or to flag existing URLs in the
crawldb as spammy. Then you could implement a scoring filter that uses
such flags to carry around a "spam score", in order to poison the
scores of pages linked from known spam pages.
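
To make the last idea concrete, here is a toy illustration of the
score-poisoning step (not the actual scoring-filter API; the penalty
constant and the damping formula are invented for the example):

/**
 * Toy illustration of "spam score" propagation: a page's adjusted score
 * is its base score damped by how many of its inlinks come from pages
 * already flagged as spam. All names and the formula are made up; in a
 * real setup this logic would live in a scoring filter plugin.
 */
public class SpamScoreDemo {

  /** Penalty per spammy inlink, capped so the score never goes negative. */
  private static final float SPAM_PENALTY = 0.2f;

  static float adjustedScore(float baseScore, int spamInlinks, int totalInlinks) {
    if (totalInlinks == 0) return baseScore;
    float spamFraction = (float) spamInlinks / totalInlinks;
    float damping = Math.max(0.0f, 1.0f - SPAM_PENALTY * spamInlinks * spamFraction);
    return baseScore * damping;
  }

  public static void main(String[] args) {
    // A page linked only from known spam hosts gets heavily demoted...
    System.out.println(adjustedScore(1.0f, 10, 10));  // 0.0
    // ...while a page with one incidental spam inlink is barely touched.
    System.out.println(adjustedScore(1.0f, 1, 50));   // ~0.996
  }
}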

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com