Looking to count links with Nutch

Kevin MacDonald-3
Hello,

I have what I hope is a very simple task that I would like to accomplish
using Nutch. Given a URL I need to produce a list of links that come off
that page. Once I have that list I will be counting the number of links that
remain on that domain and the number of links that lead off the domain. At
the moment I don't need to crawl those links, just enumerate them.

I'm assuming I can do this by writing a simple plugin. Could anyone give me
some hints as to which type of plugin I need to write? I would also like to
configure Nutch so that only this one operation is performed as quickly as
possible, i.e. I would like to switch off as many other types of processing
and parsing as possible.

Any suggestions are much appreciated.

Thanks

Kevin

Re: Looking to count links with Nutch

kevin chen-6
A very simple way to do this is to dump the segments and use a shell to
grep outlinks.
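
As a rough illustration of the "dump and grep" idea, here is a minimal Python sketch that pulls outlink targets out of a segment dump that has already been written to text. It assumes the dump lists each link on a line shaped like "outlink: toUrl: <url> anchor: <text>"; that line shape, the sample text, and the function name extract_outlinks are assumptions for illustration, so check them against the dump your Nutch version actually produces.

```python
import re

# Assumed line shape in the text dump (verify against your own dump):
#   "  outlink: toUrl: <url> anchor: <anchor text>"
OUTLINK_RE = re.compile(r"outlink:\s+toUrl:\s+(\S+)")


def extract_outlinks(dump_text):
    """Return every outlink target URL found in the dump text, in order."""
    return OUTLINK_RE.findall(dump_text)


# Hypothetical excerpt of a dump, used only to exercise the function.
sample = """\
Recno:: 0
URL:: http://example.com/

ParseData::
Outlinks: 2
  outlink: toUrl: http://example.com/about anchor: About
  outlink: toUrl: http://other.org/ anchor: Other
"""

for url in extract_outlinks(sample):
    print(url)
```

For a real dump, read the file contents and pass them to extract_outlinks in place of the sample string.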

On Fri, 2008-09-05 at 16:07 -0700, Kevin MacDonald wrote:

> Hello,
>
> I have what I hope is a very simple task that I would like to accomplish
> using Nutch. Given a URL I need to produce a list of links that come off
> that page. Once I have that list I will be counting the number of links that
> remain on that domain and the number of links that lead off the domain. At
> the moment I don't need to crawl those links, just enumerate them.
>
> I'm assuming I can do this by writing a simple plugin. Could anyone give me
> some hints as to which type of plugin I need to write? I would also like to
> configure Nutch so that only this one operation is performed as quickly as
> possible, i.e. I would like to switch off as many other types of processing
> and parsing as possible.
>
> Any suggestions are much appreciated.
>
> Thanks
>
> Kevin


Re: Looking to count links with Nutch

Kevin MacDonald-3
Could you elaborate on what it means to dump the segments? I'm new to Nutch.


On Sat, Sep 6, 2008 at 8:19 AM, kevin chen <[hidden email]> wrote:

> A very simple way to do this is to dump the segments and use a shell to
> grep outlinks.

Re: Looking to count links with Nutch

Dennis Kubes-2
Take a look at NUTCH-635:

https://issues.apache.org/jira/browse/NUTCH-635

I don't think the webgraph does exactly what you are looking for. The
NodeDb does have a count for inlinks and outlinks, but it ignores
internal links. Still, it will give you an idea of how to pull out
the outlinks from segments and do the counts you want.
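
The counting step itself can be sketched independently of Nutch. The following is a minimal Python example, assuming the outlinks are already available as absolute URLs and that "same domain" means an exact host-name match (so www.example.com and example.com would count as different). The function name count_links and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def count_links(page_url, outlinks):
    """Count outlinks that stay on the page's host versus those that leave it.

    Returns a tuple (internal, external). Hosts are compared exactly,
    ignoring case; a link without a host is counted as external.
    """
    page_host = urlparse(page_url).hostname
    internal = 0
    external = 0
    for link in outlinks:
        if urlparse(link).hostname == page_host:
            internal += 1
        else:
            external += 1
    return internal, external


# Hypothetical input, used only to exercise the function.
links = [
    "http://example.com/about",
    "http://other.org/",
    "http://EXAMPLE.com/contact",
]
internal, external = count_links("http://example.com/", links)
print("internal:", internal, "external:", external)
```

If sub-domains should count as the same domain, the comparison would need to be replaced by one on the registered domain, which is not covered here.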

Dennis

Kevin MacDonald wrote:

> Could you elaborate on what it means to dump the segments? I'm new to Nutch.

Re: Looking to count links with Nutch

Kevin MacDonald-3
Pardon my ignorance. Is NUTCH-635 an add-on or plugin to Nutch of some kind?
I see a note that it is a link analysis tool. Do I need to download it from
somewhere or are these existing command line tools that come with Nutch 0.9?
Perhaps there is some core documentation that I have missed. I followed the
tutorial to install and run Nutch and have successfully completed that, and
I've done some reading on plugins and extension points, but that's as far as
I have gotten.

Thanks again for the help.

Kevin

On Sat, Sep 6, 2008 at 7:13 PM, Dennis Kubes <[hidden email]> wrote:

> Take a look at NUTCH-635:
>
> https://issues.apache.org/jira/browse/NUTCH-635
>
> I don't think the webgraph does exactly what you are looking for. The
> NodeDb does have a count for inlinks and outlinks, but it ignores
> internal links. Still, it will give you an idea of how to pull out the
> outlinks from segments and do the counts you want.
>
> Dennis

Re: Looking to count links with Nutch

Dennis Kubes-2
Ah, sorry. It is a patch in JIRA, the bug tracking tool that we use for
Nutch. You would need to download the patch and apply it to a local
copy of the Nutch trunk, the most recent source code from Subversion.
That patch contains a number of tools, including a WebGraph tool. The
package it should create is org.apache.nutch.scoring.webgraph, and the
WebGraph tool is under that package. Let me know if this doesn't make
sense.

Dennis

Kevin MacDonald wrote:

> Pardon my ignorance. Is NUTCH-635 an add-on or plugin to Nutch of some kind?
> I see a note that it is a link analysis tool. Do I need to download it from
> somewhere or are these existing command line tools that come with Nutch 0.9?
> Perhaps there is some core documentation that I have missed. I followed the
> tutorial to install and run Nutch and have successfully completed that, and
> I've done some reading on plugins and extension points, but that's as far as
> I have gotten.
>
> Thanks again for the help.
>
> Kevin

Re: Looking to count links with Nutch

Kevin MacDonald-3
I see that there are numerous patches under NUTCH-635. Would I apply only
the latest one, or all of them in order?

Kevin

On Sat, Sep 6, 2008 at 9:43 PM, Dennis Kubes <[hidden email]> wrote:

> Ah sorry.  It is a patch in JIRA, the bug tracking tool that we use for
> Nutch.  You would need to download the patch and apply it to a local version
> of the Nutch Trunk, the most recent source code from subversion.  That patch
> contains a number of tools including a WebGraph tool.  The package it should
> create is org.apache.nutch.scoring.webgraph.  The WebGraph tool is under
> that package.  Let me know if this doesn't make sense.
>
> Dennis

Re: Looking to count links with Nutch

Dennis Kubes-2
Just the last one; I believe it is 8.

Dennis

Kevin MacDonald wrote:

> I see that there are numerous patches under NUTCH-635. Would I apply only
> the latest one, or all of them in order?
>
> Kevin