Link Farms


Link Farms

Rod Taylor-2
We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.

I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.

Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?

--
Rod Taylor <[hidden email]>


Re: Link Farms

kkrugler
>We've managed to dig ourselves into a couple of link farms with tens of
>thousands of sub-domains.
>
>I didn't notice until they blocked our DNS requests and the Nutch error
>rates shot way up.
>
>Are there any methods for detecting these things (more than 100
>sub-domains) or a master list somewhere that we can filter?

I've read a paper on detecting link farms, but from what I remember,
it wasn't a slam-dunk to implement.

So far we've relied on manually detecting these, and then pruning the
results from the crawldb and adding them to the regex-urlfilter file.
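
For illustration, the added filter lines end up looking something like
this (example-farm.com and spam-network.net are made-up placeholders,
not domains we actually blocked):

# regex-urlfilter.txt additions: drop the farm and all of its sub-domains
-^http://([a-z0-9-]+\.)*example-farm\.com/
-^http://([a-z0-9-]+\.)*spam-network\.net/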

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: Link Farms

waterwheel
I don't think it's a slam dunk either; even Google doesn't do a great
job of detecting these.  I think a lot of it is still done manually.

I think you'd have to look at detecting closed or mostly closed
networks, since a link farm is relatively clustered from a link
perspective.  As noted, that's not easy to implement, which is why
people working in SEO still use this technique to game the search
engines.
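
As a very rough sketch of that idea (my own heuristic, nothing that
exists in Nutch; the class name and thresholds are made up for
illustration), you could treat each registered domain as the candidate
cluster and flag domains where the crawl has seen many sub-domains and
the outlinks almost never leave the domain:

import java.util.*;

/** Crude link-farm heuristic: many sub-domains + mostly-internal links. */
public class FarmHeuristic {

    /** outlinks: source host -> target hosts seen in the crawl. */
    public static Set<String> flag(Map<String, Set<String>> outlinks,
                                   int minSubdomains, double minInternalRatio) {
        Map<String, Set<String>> subsByDomain = new HashMap<>();
        Map<String, int[]> counts = new HashMap<>(); // {internal, total}
        for (Map.Entry<String, Set<String>> e : outlinks.entrySet()) {
            String domain = registeredDomain(e.getKey());
            subsByDomain.computeIfAbsent(domain, d -> new HashSet<>()).add(e.getKey());
            int[] c = counts.computeIfAbsent(domain, d -> new int[2]);
            for (String target : e.getValue()) {
                if (registeredDomain(target).equals(domain)) c[0]++;
                c[1]++;
            }
        }
        Set<String> flagged = new HashSet<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int[] c = e.getValue();
            if (subsByDomain.get(e.getKey()).size() >= minSubdomains
                    && c[1] > 0 && (double) c[0] / c[1] >= minInternalRatio) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }

    // Naive "registered domain" = last two labels; real code would need a
    // public-suffix list to get things like .co.uk right.
    private static String registeredDomain(String host) {
        String[] p = host.split("\\.");
        return p.length <= 2 ? host : p[p.length - 2] + "." + p[p.length - 1];
    }
}

Something like flag(links, 100, 0.95) would catch the
tens-of-thousands-of-sub-domains case Rod describes, though big
legitimate hosts (blog providers, university sites, etc.) can trip it
too, so treat the output as a candidate list, not a verdict.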

Besides, trying to pin this stuff down gets complicated fast.  I spoke
to someone who was complaining about managing 400+ web-hosting accounts.
It's tough to nail folks going to that level.




Ken Krugler wrote:

>> We've managed to dig ourselves into a couple of link farms with tens of
>> thousands of sub-domains.
>>
>> I didn't notice until they blocked our DNS requests and the Nutch error
>> rates shot way up.
>>
>> Are there any methods for detecting these things (more than 100
>> sub-domains) or a master list somewhere that we can filter?
>
>
> I've read a paper on detecting link farms, but from what I remember,
> it wasn't a slam-dunk to implement.
>
> So far we've relied on manually detecting these, and then pruning the
> results from the crawldb and adding them to the regex-urlfilter file.
>
> -- Ken


Re: Link Farms

Stefan Groschupf-2
Hi,

Is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the crawlDB,
what do people think about locality-sensitive hashing (LSH)?

http://citeseer.ist.psu.edu/haveliwala00scalable.html

As far as I understand the paper, we could implement the hashing so
that it treats 'similar' pages (differing by just one word) as one.
In my experience of link farms, the pages are identical except for one
number, word, date, or something like that.
In such a case, LSH could be an interesting way to solve the problem.
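
To make that concrete, here is a toy bottom-k shingle sketch (a simple
relative of min-wise hashing, not the exact algorithm from the paper;
the class and method names are made up): changing one word in a page
only disturbs a handful of shingles, so near-identical farm pages end
up with the same, or heavily overlapping, signatures.

import java.util.*;

/** Bottom-k sketch over word shingles -- a toy stand-in for real LSH. */
public class ShingleSketch {

    /** Returns the k smallest shingle hashes; near-duplicates share most of them. */
    public static long[] sketch(String text, int shingleSize, int k) {
        String[] words = text.toLowerCase().split("\\s+");
        SortedSet<Long> hashes = new TreeSet<>();
        for (int i = 0; i + shingleSize <= words.length; i++) {
            long h = 1469598103934665603L;              // FNV-1a style mix
            for (int j = i; j < i + shingleSize; j++) {
                h = (h ^ words[j].hashCode()) * 1099511628211L;
            }
            hashes.add(h);
        }
        long[] sig = new long[Math.min(k, hashes.size())];
        Iterator<Long> it = hashes.iterator();
        for (int i = 0; i < sig.length; i++) sig[i] = it.next();
        return sig;
    }

    /** Fraction of sketch entries two pages have in common (0..1). */
    public static double overlap(long[] a, long[] b) {
        Set<Long> sa = new HashSet<>();
        for (long x : a) sa.add(x);
        int hits = 0;
        for (long x : b) if (sa.contains(x)) hits++;
        return a.length == 0 ? 0.0 : (double) hits / Math.max(a.length, b.length);
    }
}

Two farm pages differing by a single word would share nearly all of,
say, 50 sketch values, so an overlap threshold around 0.9 seems
plausible for 'treat as one page'.  Whether a multi-value sketch fits
the crawlDB's custom-hash slot directly, I'm not sure.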

Any thoughts?

Stefan


On 07.03.2006, at 22:38, Ken Krugler wrote:

>> We've managed to dig ourselves into a couple of link farms with tens
>> of thousands of sub-domains.
>>
>> I didn't notice until they blocked our DNS requests and the Nutch
>> error rates shot way up.
>>
>> Are there any methods for detecting these things (more than 100
>> sub-domains) or a master list somewhere that we can filter?
>
> I've read a paper on detecting link farms, but from what I remember,
> it wasn't a slam-dunk to implement.
>
> So far we've relied on manually detecting these, and then pruning the
> results from the crawldb and adding them to the regex-urlfilter file.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: Link Farms

kangas
Hi folks,

Offhand, I'm not aware of any slam-dunk solution to link farms  
either. One thing that could help mitigate the problem is a pre-built  
blacklist of some sort. For example:

http://www.squidguard.org/blacklist/

That one is really meant for blocking user access to porn, known
warez providers, etc., but it may have some value for you.

Another source of link farms is parked-domain providers. Many of
these can be identified by their DNS server names. Some of the top
offenders (AFAIK) include:
- dns(\d+).name-services.com
- ns(\d+).directnic.com
- ns(\d+).itsyourdomain.com
- park(\d+).secureserver.net
- ns.buydomains.com
- this-domain-for-sale.com

A reasonable first pass at this list can be achieved by getting the
Verisign COM zone file, counting domains per DNS server, and then
checking the top 100 or so. (That's what I did, anyway! :)
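
For a quick illustration, here is how one might check a domain's NS
records against those patterns with plain JNDI (a sketch only, nothing
Nutch-specific; the class name is made up and the pattern list is just
the one above):

import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import java.util.*;
import java.util.regex.Pattern;

/** Flags domains whose name servers match known parked-domain providers. */
public class ParkedDomainCheck {

    private static final List<Pattern> PARKED_NS = Arrays.asList(
            Pattern.compile("dns\\d+\\.name-services\\.com"),
            Pattern.compile("ns\\d+\\.directnic\\.com"),
            Pattern.compile("ns\\d+\\.itsyourdomain\\.com"),
            Pattern.compile("park\\d+\\.secureserver\\.net"),
            Pattern.compile("ns\\.buydomains\\.com"));

    public static boolean looksParked(String domain) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "com.sun.jndi.dns.DnsContextFactory");
        env.put(Context.PROVIDER_URL, "dns:");
        DirContext ctx = new InitialDirContext(env);
        Attribute ns = ctx.getAttributes(domain, new String[] {"NS"}).get("NS");
        if (ns == null) return false;
        NamingEnumeration<?> servers = ns.getAll();
        while (servers.hasMore()) {
            // NS values usually carry a trailing dot, e.g. "park12.secureserver.net."
            String server = servers.next().toString().toLowerCase();
            for (Pattern p : PARKED_NS) {
                if (p.matcher(server).find()) return true;
            }
        }
        return false;
    }
}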

Rod, does that help you? Or are you hitting a different type of link
farm?

--Matt

On Mar 7, 2006, at 5:13 PM, Stefan Groschupf wrote:

> Hi,
>
> Is the content of the pages 'mostly' identical?
> Since we can now provide custom hash implementations to the crawlDB,
> what do people think about locality-sensitive hashing (LSH)?
>
> http://citeseer.ist.psu.edu/haveliwala00scalable.html
>
> As far as I understand the paper, we could implement the hashing so
> that it treats 'similar' pages (differing by just one word) as one.
> In my experience of link farms, the pages are identical except for one
> number, word, date, or something like that.
> In such a case, LSH could be an interesting way to solve the problem.
>
> Any thoughts?
>
> Stefan
>
>
> On 07.03.2006, at 22:38, Ken Krugler wrote:
>
>>> We've managed to dig ourselves into a couple of link farms with
>>> tens of thousands of sub-domains.
>>>
>>> I didn't notice until they blocked our DNS requests and the Nutch
>>> error rates shot way up.
>>>
>>> Are there any methods for detecting these things (more than 100
>>> sub-domains) or a master list somewhere that we can filter?
>>
>> I've read a paper on detecting link farms, but from what I remember,
>> it wasn't a slam-dunk to implement.
>>
>> So far we've relied on manually detecting these, and then pruning the
>> results from the crawldb and adding them to the regex-urlfilter file.
>>
>> -- Ken
>
>

--
Matt Kangas / [hidden email]