How do I block/ban a specific domain name or a tld?


opsec
I've added the rule below to my conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt, yet when I start a crawl this domain is heavily spidered. I would like to remove it from my search results entirely and prevent it from being crawled in the future, and possibly do the same for all *.int TLDs. How can I accomplish this?

-^http://([a-z0-9]*\.)*who.int/

Thanks for your time and any assistance,

-Warren
Re: How do I block/ban a specific domain name or a tld?

reinhard
opsec wrote:
> I've added this to my conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt
> yet when I start a crawl this domain is heavily spidered. I would like to
> remove it from my search results entirely and prevent it from being crawled
> in the future and possibly all *.int tlds, how can i accomplish this?
>
> -^http://([a-z0-9]*\.)*who.int/
>  
Why not:

-^http://[^/]*\.int/
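
To check these patterns outside Nutch, here is a minimal Python sketch that runs the two expressions from this thread against a few sample URLs. It assumes the filter performs an unanchored search (so only the explicit "^" anchors the match); that is an assumption, not something confirmed in this thread. The sample URLs and the names WHO_INT, ANY_INT and matches are made up for illustration.

```python
import re

# Patterns from this thread, without the leading "-" that marks an exclude rule.
WHO_INT = re.compile(r"^http://([a-z0-9]*\.)*who.int/")
ANY_INT = re.compile(r"^http://[^/]*\.int/")


def matches(pattern, url):
    """Return True if the pattern is found in the URL (unanchored search)."""
    return pattern.search(url) is not None


# Hypothetical sample URLs, chosen to show what each pattern does and does not hit.
SAMPLES = [
    "http://www.who.int/en/",
    "http://www.itu.int/home/",
    "http://www.example.com/page.int/",
    "https://www.who.int/",
    "http://www.who.int:8080/",
]

if __name__ == "__main__":
    for url in SAMPLES:
        print(url, matches(WHO_INT, url), matches(ANY_INT, url))
```

Neither pattern matches https:// URLs or a host followed directly by a port number, so those cases would need rules of their own.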




Re: How do I block/ban a specific domain name or a tld?

opsec
Hello,

Thanks for the reply, but this doesn't seem to work either. I removed the crawl dir, added the regex you posted, removed the one I had in regex-urlfilter.txt and crawl-urlfilter.txt, and restarted the crawl. My crawls spend about 90% of their time on who.int. I have no idea how to keep this domain, or all .int domains, from being crawled. Do I have the regex in the wrong conf file?

Thanks,

-Warren

reinhard schwab wrote:
> why not
>
> -^http://[^/]*\.int/
Re: How do I block/ban a specific domain name or a tld?

reinhard
Hello,

The first matching rule wins.
Maybe an earlier rule in your file already matches these URLs.
Can you send me your filter files by private mail?
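
To illustrate what "first matching rule wins" means in practice, here is a rough Python sketch of such a filter. It is not Nutch's code: parse_rules and accepts are hypothetical helpers, the two rule sets are invented examples, and dropping URLs that match no rule is an assumption.

```python
import re


def parse_rules(text):
    """Parse lines of the form '+regex' or '-regex'; skip blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is dropped


# Hypothetical file where an earlier accept-all rule shadows the exclude rule.
SHADOWED = r"""
+.
-^http://[^/]*\.int/
"""

# Same rules with the exclude rule moved ahead of the accept-all rule.
FIXED = r"""
-^http://[^/]*\.int/
+.
"""

if __name__ == "__main__":
    url = "http://www.who.int/"
    print(accepts(parse_rules(SHADOWED), url))
    print(accepts(parse_rules(FIXED), url))
```

With the SHADOWED ordering the exclude rule is never reached, which is one way a correct regex can appear to have no effect.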

regards
reinhard

opsec wrote:

> Hello,
>
>  Thanks for the reply, but this doesn't seem to work either. I removed the
> crawl dir, added the regex you posted, removed the one I had in
> regex-urlfilter.txt and crawl-urlfilter.txt and restarted the crawl. My
> crawls spend about 90% of their time on who.int .. I have no idea how to
> remove this domain or all .int domains from being crawled. Do I have the
> regex in the wrong conf file?
>
> Thanks,
>
> -Warren

Re: How do I block/ban a specific domain name or a tld?

Subhojit Roy
In reply to this post by opsec
Try:

1. To prevent crawling of URLs matching who.int, add the rule to the file that corresponds to how you run the crawl:

 a) If you are using the "bin/nutch crawl" command, add the following
line to conf/crawl-urlfilter.txt:

           -^http://([a-z0-9]*\.)*who.int/

 b) If you are using the individual commands (generate, fetch, update), add
the following line to conf/regex-urlfilter.txt:

         -^http://([a-z0-9]*\.)*who.int/

2. To delete URLs matching who.int from the index, you can prune them
with the following command:

    bin/nutch org.apache.nutch.tools.PruneIndexTool indexdir/index -queries query.txt -output output.txt -showfields url

     where query.txt contains the query +url:who +url:int
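
Since the first matching rule wins (as noted earlier in this thread), where the line goes in the file matters. The following is a small Python sketch, not part of Nutch, that places an exclude rule ahead of the first accept ("+") rule in the text of a filter file. The helper name add_exclude_rule and the sample file contents are hypothetical.

```python
def add_exclude_rule(filter_text, rule):
    """Return filter_text with `rule` placed before the first '+' rule.

    Any existing copy of the rule is removed first, so a copy that sits
    below an accept rule does not stay shadowed.
    """
    if not rule.startswith("-"):
        raise ValueError("exclude rules start with '-'")
    lines = [line for line in filter_text.splitlines() if line.strip() != rule]
    for i, line in enumerate(lines):
        if line.startswith("+"):
            lines.insert(i, rule)
            break
    else:
        # No accept rule found: append the exclude rule at the end.
        lines.append(rule)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical filter file contents, for illustration only.
    sample = "# skip images\n-\\.(gif|jpg)$\n+.\n"
    print(add_exclude_rule(sample, "-^http://([a-z0-9]*\\.)*who.int/"), end="")
```

Reading the file, passing its text through this helper, and writing it back is left out on purpose; back up the original filter file before trying anything like this.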

thanks,
-sroy

On Thu, Nov 12, 2009 at 12:16 AM, opsec <[hidden email]> wrote:

>
> Hello,
>
>  Thanks for the reply, but this doesn't seem to work either. I removed the
> crawl dir, added the regex you posted, removed the one I had in
> regex-urlfilter.txt and crawl-urlfilter.txt and restarted the crawl. My
> crawls spend about 90% of their time on who.int .. I have no idea how to
> remove this domain or all .int domains from being crawled. Do I have the
> regex in the wrong conf file?
>
> Thanks,
>
> -Warren


--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: [hidden email]
http://www.profound.in
Re: How do I block/ban a specific domain name or a tld?

Subhojit Roy-2
Sorry...

The regular expression should be:

  -^http://([a-z0-9]*\.)*who.int/

I made an error in the previous email. I wonder whether Gmail is altering
the characters in the sent emails...

-sroy

On Wed, Nov 25, 2009 at 12:00 PM, Subhojit Roy <[hidden email]> wrote:

> Try:
>
> 1. In order to prevent crawling of URLs with pattern who.int, you can add
> the following in following files:
>
>  a) if you are the using "bin/nutch crawl" command then add the following
> line inside conf/crawl-urlfilter.txt
>
>           -^http://([a-z0-9]*\.)*who.int/
>
>  b) if you are using individual commands (generate,fetch,update) then add
> the following line in conf/regex-urlfilter.txt
>
>         -^http://([a-z0-9]*\.)*who.int/
>
> 2. If you would like to delete URL's with the pattern who.int from the index
> you could prune those URLs from the index by using the following command:
>
>    bin/nutch org.apache.nutch.tools.PruneIndexTool indexdir/index -queries query.txt -output output.txt -showfields url
>
>     where you can add +url:who +url:int in the query.txt file
>
> thanks,
> -sroy



--
Subhojit Roy
ProFound Technologies,
(Search Solutions based on Lucene & Nutch)
Pune,
India.