Blacklisting TLDs

Michael Coffey
I want to blacklist certain top-level domains for a very large web crawl. I tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem to work.

My domainblacklist-urlfilter.txt contains lines like the following.

cn
jp
line.me
albooked.com
booked.co.il


The TLDs do not get blocked, but the other listed domains do get blocked.

I suppose I could compose regexes, but that is tricky to do accurately because I don't want to block URLs that happen to have ".cn" or ".jp" in the middle of them.

Would I need to change the source code of DomainBlacklistUrlFilter, or is there an easier solution?

Re: Blacklisting TLDs

Sebastian Nagel-2
Hi Michael,

on the Common Crawl Nutch fork there is a plugin "urlfilter-fast" which does this, see

https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java

It uses exactly this concept of "domain", i.e., a suffix of host name parts. You would have to write
your rules as

Domain cn
 DenyPath .*

Domain line.me
 DenyPath .*
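
To illustrate what "a suffix of host name parts" means, here is a minimal sketch. It is not the actual FastURLFilter code; the class name DomainSuffixMatcher and its method are made up for this example, and the denied domains are simply kept in a set.

```java
import java.util.Locale;
import java.util.Set;

/** Sketch only: a host is denied if it, or any suffix of its dot-separated parts, is listed. */
final class DomainSuffixMatcher {

  private final Set<String> denied;

  DomainSuffixMatcher(Set<String> denied) {
    this.denied = denied;
  }

  /** For "news.example.cn" this checks "news.example.cn", then "example.cn", then "cn". */
  boolean isDenied(String host) {
    String candidate = host.toLowerCase(Locale.ROOT);
    while (true) {
      if (denied.contains(candidate)) {
        return true;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return false; // no shorter suffix left to try
      }
      candidate = candidate.substring(dot + 1); // drop the leftmost host name part
    }
  }

  public static void main(String[] args) {
    DomainSuffixMatcher m = new DomainSuffixMatcher(Set.of("cn", "line.me"));
    System.out.println(m.isDenied("news.example.cn")); // true: "cn" is a suffix of the host parts
    System.out.println(m.isDenied("cn.example.com"));  // false: "cn" is only in the middle
    System.out.println(m.isDenied("timeline.me"));     // false: only whole parts are compared
  }
}
```

Because only whole host name parts are compared, a ".cn" in the middle of a host name or in the path never triggers the rule.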


The name "fast" is not really informative. It's usually faster than regex-urlfilter for two reasons:
- regex rules are defined per host or "domain", so only the rules of the matching host or domain
  have to be checked
- matching regex patterns can be expensive on long strings. By limiting the match to the path only,
  the strings get somewhat shorter
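
The two points can be sketched roughly as follows. Again, this is only an illustration under assumptions, not the plugin's real code: PerDomainPathFilter is a hypothetical name, the rules sit in a plain map, the path is taken from java.net.URI (which throws on malformed URLs), and Matcher.find() is my choice for the example.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Sketch only: look up rules by host/domain first, then run regexes on the path alone. */
final class PerDomainPathFilter {

  private final Map<String, List<Pattern>> denyPathsByDomain;

  PerDomainPathFilter(Map<String, List<Pattern>> denyPathsByDomain) {
    this.denyPathsByDomain = denyPathsByDomain;
  }

  boolean isDenied(String url) {
    URI uri = URI.create(url); // assumption: the URL is well-formed
    String host = uri.getHost();
    if (host == null) {
      return false;
    }
    String path = uri.getRawPath();
    if (path == null || path.isEmpty()) {
      path = "/";
    }
    String candidate = host.toLowerCase(Locale.ROOT);
    while (true) {
      // Cheap hash lookup: only the rules of this host or domain are ever tried.
      List<Pattern> rules = denyPathsByDomain.get(candidate);
      if (rules != null) {
        for (Pattern p : rules) {
          if (p.matcher(path).find()) { // regex sees the short path, not the full URL
            return true;
          }
        }
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return false;
      }
      candidate = candidate.substring(dot + 1);
    }
  }

  public static void main(String[] args) {
    PerDomainPathFilter f = new PerDomainPathFilter(Map.of(
        "cn", List.of(Pattern.compile(".*")),
        "example.com", List.of(Pattern.compile("^/private/"))));
    System.out.println(f.isDenied("http://news.example.cn/a/b"));            // true
    System.out.println(f.isDenied("http://www.example.com/private/x"));      // true
    System.out.println(f.isDenied("http://www.example.com/public/cn.html")); // false
  }
}
```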

I plan to push this url filter to the main branch of Nutch. Currently, the filter can hold
several hundred thousand denied domains. I've also used it with 2 million, but then at least 2 GB
of memory is recommended for the Nutch tasks. I hope to scale it up by replacing the hash
that holds the domains with a trie or automaton.
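
Just to make the trie idea concrete, one possible shape is a trie keyed on host name parts in reverse order (TLD first). This is purely illustrative and not code that exists in Nutch; DomainTrie is a hypothetical name, and a node with a HashMap per level as shown here would not by itself save memory, so a real replacement would need a more compact node layout.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch only: trie over host name parts, walked from the TLD towards the subdomains. */
final class DomainTrie {

  private static final class Node {
    final Map<String, Node> children = new HashMap<>();
    boolean terminal; // true if a denied domain ends at this node
  }

  private final Node root = new Node();

  void add(String domain) {
    String[] parts = domain.toLowerCase(Locale.ROOT).split("\\.");
    Node node = root;
    for (int i = parts.length - 1; i >= 0; i--) {
      node = node.children.computeIfAbsent(parts[i], k -> new Node());
    }
    node.terminal = true;
  }

  /** True if the host equals a listed domain or ends with one (whole parts only). */
  boolean matches(String host) {
    String[] parts = host.toLowerCase(Locale.ROOT).split("\\.");
    Node node = root;
    for (int i = parts.length - 1; i >= 0; i--) {
      node = node.children.get(parts[i]);
      if (node == null) {
        return false;
      }
      if (node.terminal) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    DomainTrie trie = new DomainTrie();
    trie.add("cn");
    trie.add("line.me");
    trie.add("booked.co.il");
    System.out.println(trie.matches("shop.example.cn"));  // true
    System.out.println(trie.matches("www.booked.co.il")); // true
    System.out.println(trie.matches("timeline.me"));      // false
  }
}
```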


Best,
Sebastian




On 06/14/2018 07:46 PM, Michael Coffey wrote:

> I want to blacklist certain top-level domains for a very large web crawl. I tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem to work.
>
> My domainblacklist-urlfilter.txt contains lines like the following.
>
> cn
> jp
> line.me
> albooked.com
> booked.co.il
>
>
> The TLDs do not get blocked, but the other listed domains do get blocked.
>
> I suppose I could compose regexes, but that is tricky to do accurately because I don't want to block URLs that happen to have ".cn" or ".jp" in the middle of them.
>
> Would I need to change the source code of DomainBlacklistUrlFilter, or is there an easier solution?
>