Link db (traversal + modification)


Link db (traversal + modification)

Otis Gospodnetic-2-2
Hi,

What's the best way to traverse the graph of all fetched pages and optionally modify it (e.g. remove a page because you know it's spam)?
I looked at various Nutch classes, and only LinksDbReader looks like it lets you iterate through all links (and for each link get its inlinks).  Is this right?

But how would one go about modifying the links db?
Perhaps I should be asking about where/how the links db is stored on disk, and whether one should just access and modify that data directly on disk?

Thanks,
Otis



Re: Link db (traversal + modification)

Stefan Groschupf-2
Hi Otis,

The link graph lives in the linkdb.
I suggest writing a small map reduce tool that reads the existing
linkDb, filters out the pages you want to remove, and writes the result
back to disk.
This will be just a couple lines of code.
The hadoop package comes with some nice map reduce examples.

Stefan
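Stefan's suggestion amounts to a map step that copies every linkdb entry except the ones you want gone, then writes the result to a fresh directory. A minimal, Hadoop-free sketch of that filtering logic (the class, method, and URL names here are illustrative, not part of Nutch's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the linkdb: URL -> set of inlink URLs.
// In a real Nutch tool this logic would sit in a Mapper that reads the
// existing linkDb and writes the filtered copy back to disk.
public class LinkDbFilterSketch {

    // Drop spam pages both as entries and as inlinks of surviving pages.
    static Map<String, Set<String>> filter(Map<String, Set<String>> linkDb,
                                           Set<String> spam) {
        Map<String, Set<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : linkDb.entrySet()) {
            if (spam.contains(e.getKey())) continue;   // skip the spam page itself
            Set<String> inlinks = new TreeSet<>(e.getValue());
            inlinks.removeAll(spam);                   // drop inlinks from spam pages
            out.put(e.getKey(), inlinks);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> db = new LinkedHashMap<>();
        db.put("http://good.example/", Set.of("http://spam.example/", "http://ok.example/"));
        db.put("http://spam.example/", Set.of("http://good.example/"));

        Map<String, Set<String>> clean = filter(db, Set.of("http://spam.example/"));
        System.out.println(clean.keySet());                    // [http://good.example/]
        System.out.println(clean.get("http://good.example/")); // [http://ok.example/]
    }
}
```

The same per-entry decision is all the real MapReduce job needs; the framework handles reading, sharding, and writing the new linkdb directory.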




Re: [Nutch-general] Link db (traversal + modification)

Otis Gospodnetic-2-2
Thanks Stefan.
So one has to iterate and re-write the whole graph, and there is no way to just modify it on the fly by, for example, removing specific links/pages?

Thanks,
Otis

_______________________________________________
Nutch-general mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/nutch-general




Re: [Nutch-general] Link db (traversal + modification)

Honda-Search Administrator
Otis,

Check out the purge tool (bin/nutch purge).

It's easy to remove URLs individually or based on regular expressions, but
you'll need to learn Lucene query syntax to do it.

It will remove certain pages from the index, but won't exclude them from
being recrawled the next time around.  For that you'll need to change the
filters in your conf directory.
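For the recrawl side, the filter files in Nutch's conf directory (crawl-urlfilter.txt for the crawl tool, regex-urlfilter.txt otherwise) take one regex rule per line, first match wins: a leading '-' excludes, '+' includes. To keep a known spam host out of future crawls, something like this (the host name is made up for illustration):

```
# Exclude a known spam host from future crawls
-^http://([a-z0-9]*\.)*spam-site\.example/
# Accept everything else
+.
```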



Re: [Nutch-general] Link db (traversal + modification)

Stefan Groschupf-2
In reply to this post by Otis Gospodnetic-2-2
Hi,

The hadoop IO system is read-only, so you cannot update a file in place.
However, I'm sure you can hack the linkdb creation code and add the
URL filter that is already used for the crawldb.
Maybe this is already in the code; if not, it would be a good addition,
since it would keep spam links from affecting the ranking.

Stefan
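Stefan's idea, sketched: apply a crawldb-style regex URL filter while (re)building the linkdb, so spam URLs never enter the link graph. A hand-rolled stand-in for that filter (the rules and method are illustrative; this is not Nutch's URLFilter API, though it mirrors its convention of returning null to drop a URL):

```java
import java.util.List;
import java.util.regex.Pattern;

// Minimal imitation of a crawldb-style URL filter: rules are tried in
// order, first match wins; 'accept' rules pass the URL through,
// reject rules drop it (like the +/- lines in regex-urlfilter.txt).
public class UrlFilterSketch {
    record Rule(boolean accept, Pattern pattern) {}

    static final List<Rule> RULES = List.of(
        new Rule(false, Pattern.compile("^https?://([^/]*\\.)?spam\\.example/")),
        new Rule(true,  Pattern.compile("."))   // default: accept everything else
    );

    // Returns null for filtered-out URLs, mirroring how Nutch URL
    // filters signal that a URL should be dropped.
    static String filter(String url) {
        for (Rule r : RULES) {
            if (r.pattern().matcher(url).find()) {
                return r.accept() ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://spam.example/page"));  // null
        System.out.println(filter("http://good.example/page"));  // http://good.example/page
    }
}
```

Plugged into the linkdb build, a filter like this would run on every URL before the entry is written, so filtered pages never contribute inlinks to the ranking.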



Re: [Nutch-general] Link db (traversal + modification)

kevin pang
In reply to this post by Honda-Search Administrator
Hi,

I ran Nutch using this command:
$ ./nutch crawl urlfq2.txt -dir fqjoke2 -depth 20 -threads 10 >& fq2.log
During the crawl, the following exception occurred:

060708 182413 status: segment 20060708181314, 471 pages, 69 errors,
5655871 bytes, 657469 ms
060708 182413 status: 0.7163836 pages/s, 67.20696 kb/s, 12008.219 bytes/page
060708 182414 Updating D:\cygwin\home\nutch-0.7.2\bin\fqjoke2\db
Exception in thread "main" java.io.IOException: Impossible condition:
directories D:\cygwin\home\nutch-0.7.2\bin\fqjoke2\db\webdb.old and
D:\cygwin\home\nutch-0.7.2\bin\fqjoke2\db\webdb cannot exist simultaneously
    at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1484)
    at org.apache.nutch.db.WebDBWriter.<init>(WebDBWriter.java:1457)
    at
org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:360)
    at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)



Why did this happen? Is there a solution? Many thanks!

Re: [Nutch-general] Link db (traversal + modification)

Honda-Search Administrator
Looks like you (or something else) cancelled a previous crawl in the middle.

Delete the D:\cygwin\home\nutch-0.7.2\bin\fqjoke2\db\webdb.old directory and
recrawl.  You should be fine.
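The recovery is just removing the stale directory left behind by the interrupted update. The commands below simulate the broken state and the fix in a temp directory (the real path in this thread was D:\cygwin\home\nutch-0.7.2\bin\fqjoke2\db; adjust to your own crawl dir):

```shell
# Recreate the stuck state the exception describes, then apply the fix.
DB_DIR="$(mktemp -d)/fqjoke2/db"
mkdir -p "$DB_DIR/webdb" "$DB_DIR/webdb.old"  # an interrupted update leaves both
rm -rf "$DB_DIR/webdb.old"                    # the fix: drop the stale copy
ls "$DB_DIR"                                  # only webdb should remain
```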




Re: [Nutch-general] Link db (traversal + modification)

kevin pang
Thanks!
But if I just delete the webdb.old directory and recrawl, the following
exception occurs:
run java in J:\Java\jdk1.5.0_06
060709 094218 parsing
file:/D:/cygwin/home/nutch-0.7.2/conf/nutch-default.xml
060709 094218 parsing file:/D:/cygwin/home/nutch-0.7.2/conf/crawl-tool.xml
060709 094218 parsing file:/D:/cygwin/home/nutch-0.7.2/conf/nutch-site.xml
060709 094218 No FS indicated, using default:local
Exception in thread "main" java.lang.RuntimeException: fqjoke2 already
exists.
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:121)


It's very strange! All the crawl jobs that worked before now fail,
always with the above exception.

Regards!




Re: Link db (traversal + modification)

kevin pang
In reply to this post by Otis Gospodnetic-2-2
Hi,

Can Nutch run as two processes (instances?) with two different
crawl-urlfilter files at the same time? I want to run the crawl jobs
simultaneously on my PC. If yes, how?

Thanks!