WhiteListBlackList


WhiteListBlackList

Murat Ali Bayir
Hi, I have a problem when I am using black/white list URL filtering. I have two directories for filtering,
called NegativeURLS and PositiveURLS.

*****************************************************************************************
in NegativeURLS, I have
www.hurriyet.com.tr

in PositiveURLS, I have
www.milliyet.com.tr

*****************************************************************************************
In the input directory for the crawl operation, I have
www.hurriyet.com.tr
www.milliyet.com.tr

I run the following commands from the shell.

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black

Then I run inject, generate, and fetch. After that I run the following:

$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/

Finally I run GenericReader and print the output; it contains the URLs that are in the blacklist.
What can be the problem?


Re: WhiteListBlackList

Marko Bauhardt-2

Am 22.05.2006 um 13:50 schrieb Murat Ali Bayir:

> Hi, I have a problem when I am using black/white list URL filtering.
> I have two directories for filtering,
> called NegativeURLS and PositiveURLS.
>
> *****************************************************************************************
> in NegativeURLS, I have
> www.hurriyet.com.tr
>
> in PositiveURLS, I have www.milliyet.com.tr
>
> *****************************************************************************************
> In the input directory for the crawl operation, I have
> www.hurriyet.com.tr
> www.milliyet.com.tr
>
> I run the following commands from the shell.
>
> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
>
> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
>
> Then I run inject, generate, and fetch. After that I run the following:
> $ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
>
> Finally I run GenericReader and print the output; it contains the
> URLs that are in the blacklist.
> What can be the problem?

The black/white list works only in the update process (BWUpdateDb),
not during fetching or generating. Only the whitelisted URLs will be
updated into the crawldb.

Is only www.hurriyet.com.tr in your crawldb, or are there other HTML
pages from this host? And what is the status of these URLs
(STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?

Marko
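The update-time semantics described above can be illustrated with a small standalone sketch. This is plain Python for illustration only, not Nutch's actual implementation; the host-matching rule and the example URLs are assumptions:

```python
from urllib.parse import urlparse

def host(url):
    # Bare host names like "www.example.com" parse as a path,
    # so prepend a scheme before extracting the netloc.
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).netloc

def bw_update(candidate_urls, whitelist, blacklist):
    # Mirror the described semantics: only URLs whose host is on the
    # whitelist and not on the blacklist make it into the crawldb
    # during the update step; fetching/generating is unaffected.
    white = {host(u) for u in whitelist}
    black = {host(u) for u in blacklist}
    return [u for u in candidate_urls
            if host(u) in white and host(u) not in black]

fetched = ["http://www.milliyet.com.tr/", "http://www.hurriyet.com.tr/"]
kept = bw_update(fetched, ["www.milliyet.com.tr"], ["www.hurriyet.com.tr"])
# only the whitelisted host survives the update
```

Note that the sketch compares exact host strings, so a whitelist entry of www.milliyet.com.tr would not match a crawldb entry for milliyet.com.tr; that kind of host mismatch is one thing worth ruling out.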


Re: WhiteListBlackList

Murat Ali Bayir
Marko Bauhardt wrote:

>
> Am 22.05.2006 um 13:50 schrieb Murat Ali Bayir:
>
>> Hi, I have a problem when I am using black/white list URL filtering.
>> I have two directories for filtering,
>> called NegativeURLS and PositiveURLS.
>>
>> *****************************************************************************************
>> in NegativeURLS, I have
>> www.hurriyet.com.tr
>>
>> in PositiveURLS, I have www.milliyet.com.tr
>>
>> *****************************************************************************************
>> In the input directory for the crawl operation, I have
>> www.hurriyet.com.tr
>> www.milliyet.com.tr
>>
>> I run the following commands from the shell.
>>
>> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
>>
>> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
>>
>> Then I run inject, generate, and fetch. After that I run the following:
>> $ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
>>
>> Finally I run GenericReader and print the output; it contains the
>> URLs that are in the blacklist.
>> What can be the problem?
>
>
> The black/white list works only in the update process (BWUpdateDb),
> not during fetching or generating. Only the whitelisted URLs will be
> updated into the crawldb.
>
> Is only www.hurriyet.com.tr in your crawldb, or are there other HTML
> pages from this host? And what is the status of these URLs
> (STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?
>
> Marko

The crawldb contains the following:

http://hurriyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null

http://milliyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null


Both of them are DB_unfetched.

The positive URL is http://milliyet.com.tr;
it is in ~/URL/PositiveURLS/Positive.txt.

The negative URL is http://hurriyet.com.tr;
it is in ~/URL/NegativeURLS/Negative.txt.

I run the following inject commands:

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black

After the fetch command with the parsing option, I run the following:

$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/


Any suggestion for the two DB_unfetched entries? I expect one of them to be fetched.


Run-Time Error

Murat Ali Bayir
In reply to this post by Murat Ali Bayir
Hi everybody, I am running Nutch 0.8 under Windows using Eclipse,
and I got the following error. I added the conf directory to my
classpath and changed nutch-site.xml to add the regex-url filter
there. What can be the reason for the following error?

java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
        at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:130)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)





Changing db data

Bogdan Kecman
Hi,
I'm writing a small utility to amend the data in the Nutch database. I managed
to read the Nutch database, and I can also delete a document from it, but
is there a way to change the value of a field in the Nutch db?

If you can just point me in the right direction: I have spent a lot of time
reading the Lucene and Nutch APIs. I can create a db from scratch and add
data, but I cannot change anything. Any ideas?

Thanks in advance
Bogdan
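One common pattern with write-once stores like the map files Nutch builds on (an assumption about the setup here, since the thread does not say which db files are involved): rather than mutating a record in place, you stream the whole db into a new copy and rewrite the one field along the way. A plain-Python sketch of that copy-and-rewrite idea, with an illustrative record layout, not the Nutch API:

```python
def rewrite_field(db, url, field, value):
    # Build a new db rather than mutating the old one: copy every
    # record, replacing the one field for the matching url. The old
    # db stays untouched, mirroring a write-once store.
    new_db = {}
    for key, record in db.items():
        record = dict(record)  # shallow copy so the original is preserved
        if key == url:
            record[field] = value
        new_db[key] = record
    return new_db

db = {"http://example.org/": {"status": "DB_unfetched", "score": 1.0}}
updated = rewrite_field(db, "http://example.org/", "status", "DB_fetched")
```

The same shape applies to the real files: open a reader on the existing db, a writer on a new directory, copy entries while patching the target one, then swap the directories.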


Re: Run-Time Error

Thomas Delnoij-3
In reply to this post by Murat Ali Bayir
Did you add the plugins directory to your classpath, and does it
contain all of your plugins?

Rgrds, Thomas

On 5/23/06, Murat Ali Bayir <[hidden email]> wrote:

> Hi everybody, I am running Nutch 0.8 under Windows using Eclipse,
> and I got the following error. I added the conf directory to my
> classpath and changed nutch-site.xml to add the regex-url filter
> there. What can be the reason for the following error?
>
> java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
>         at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
>         at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
>         at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>         at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:130)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)

Re: Run-Time Error

Dennis Kubes
On the launcher, under classpath, you will need to add the directory above
plugins. Make sure this is on the Eclipse launcher, though; setting it
on the project won't help.
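A related thing worth checking: Nutch locates its plugins through the plugin.folders property, whose default value is the relative path "plugins", which may not resolve against Eclipse's working directory. One sketch of a workaround is to override it in nutch-site.xml with an absolute path (the path below is illustrative; adjust it to your checkout):

```xml
<property>
  <name>plugin.folders</name>
  <!-- Absolute path to the plugins directory of the Nutch checkout.
       Illustrative value; replace with your own workspace path. -->
  <value>C:/workspace/nutch-0.8/plugins</value>
</property>
```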

TDLN wrote:

> Did you add the plugins directory to your classpath and does it
> contain all of your plugins?
>
> Rgrds, Thomas
>
> On 5/23/06, Murat Ali Bayir <[hidden email]> wrote:
>> Hi everybody, I am running Nutch 0.8 under Windows using Eclipse,
>> and I got the following error. I added the conf directory to my
>> classpath and changed nutch-site.xml to add the regex-url filter
>> there. What can be the reason for the following error?
>>
>> java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
>>         at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
>>         at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
>>         at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>>         at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
>> Exception in thread "main" java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:130)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)