It uses exactly this concept of a "domain", i.e., a suffix of the host name parts. You would have to write
your rules as:
The name "fast" is not really informative. It's usually faster than regex-urlfilter for two reasons:
- regex rules are applied per host or "domain"
- matching regex patterns can be expensive on long strings; by limiting the match to the path only,
the strings get somewhat shorter
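The per-host lookup idea can be sketched in a few lines. This is a hypothetical illustration, not the actual filter code: a denied host is found by checking each dot-suffix of its labels against a hash set, so the cost per URL is bounded by the number of labels, not by the number of rules.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only (class name and methods are made up for illustration):
// deny a host if any suffix of its dot-separated labels is in the set.
// "www.example.cn" is checked as "www.example.cn", "example.cn", "cn".
public class SuffixDenyFilter {
    private final Set<String> denied = new HashSet<>();

    public void deny(String domain) {
        denied.add(domain);
    }

    public boolean isDenied(String host) {
        String suffix = host;
        while (true) {
            if (denied.contains(suffix)) {
                return true;            // host or a parent domain is denied
            }
            int dot = suffix.indexOf('.');
            if (dot < 0) {
                return false;           // no suffix left to strip
            }
            suffix = suffix.substring(dot + 1);
        }
    }
}
```

With `deny("cn")` loaded, `isDenied("www.example.cn")` is true while `isDenied("nutch.apache.org")` is false, and no regex ever touches the full URL string.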
I plan to push this URL filter to the main branch of Nutch. Currently, the filter can hold
several hundred thousand denied domains. I've also used it with 2 million, but then at least 2 GB
of memory is recommended for the Nutch tasks. I hope to scale it up by replacing the hash map
that holds the domains with a trie or automaton.
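The trie idea mentioned above could look roughly like this (again a hypothetical sketch, not Nutch code): store denied domains as a trie keyed by host labels in reverse order, so shared suffixes such as "cn" are stored once instead of once per entry, which can reduce memory for millions of domains.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a reversed-label trie for domain suffix matching.
// add("example.cn") creates the path cn -> example; lookup walks the
// host's labels right to left and stops at the first terminal node.
public class DomainTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal;  // a denied domain ends at this node
    }

    private final Node root = new Node();

    public void add(String domain) {
        String[] labels = domain.split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(labels[i], k -> new Node());
        }
        node.terminal = true;
    }

    // True if the host itself or any parent domain was added.
    public boolean matches(String host) {
        String[] labels = host.split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.get(labels[i]);
            if (node == null) {
                return false;   // diverged from every denied domain
            }
            if (node.terminal) {
                return true;    // matched a denied suffix
            }
        }
        return false;
    }
}
```

A further step down the same road would be a finite automaton over the reversed host string, trading build time for an even more compact structure.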
On 06/14/2018 07:46 PM, Michael Coffey wrote:
> I want to blacklist certain top-level domains for a very large web crawl. I tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem to work.
> My domainblacklist-urlfilter.txt contains lines like the following.
> The TLDs do not get blocked, but the other listed domains do get blocked.
> I suppose I could compose regexes, but that is tricky to do accurately because I don't want to block urls that happen to have ".cn" or ".jp" in the middle of them.
> Would I need to change the source code of DomainBlacklistUrlFilter, or is there an easier solution?