Adding Level to Website Parse Data

Adding Level to Website Parse Data

Dennis Kubes
I am trying to modify Nutch to add a level to the website parse data.
By level I mean link depth within a site: if you start parsing a website
at its homepage, that page is level one.  Any links in the same site from
the homepage are level two, links from those pages are level three,
and so on.  I am only counting links within the same site.
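
To make the definition concrete, here is a minimal sketch of the level numbering as a breadth-first walk over a link graph that is already in memory. It does not use any Nutch classes. The class name LinkLevels, the outlinks map, and the "same host means same site" rule are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper, not part of Nutch.
class LinkLevels {

    /** Assumption: two URLs belong to the same site when their hosts match. */
    static boolean sameSite(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }

    /**
     * Breadth-first walk from the homepage. The homepage is level 1 and each
     * same-site page gets the level of the first page that links to it, plus one.
     * Off-site links are skipped and never receive a level.
     */
    static Map<String, Integer> assignLevels(String homepage,
                                             Map<String, List<String>> outlinks) {
        Map<String, Integer> levels = new LinkedHashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        levels.put(homepage, 1);
        queue.add(homepage);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            int next = levels.get(page) + 1;
            for (String link : outlinks.getOrDefault(page, List.of())) {
                if (sameSite(homepage, link) && !levels.containsKey(link)) {
                    levels.put(link, next);
                    queue.add(link);
                }
            }
        }
        return levels;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "http://example.com/", List.of("http://example.com/a", "http://other.org/"),
            "http://example.com/a", List.of("http://example.com/b", "http://example.com/"));
        System.out.println(assignLevels("http://example.com/", graph));
    }
}
```

Because the walk is breadth-first, a page reachable by several paths keeps the smallest level, which matches the "links from the homepage are level two" reading.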

How would I go about modifying Nutch to handle this?  I was thinking
that I would have to modify Fetcher to do this, adding the level to the
parse metadata.  What I am not getting is how I would obtain the link
level in the first place.  I suspect I would have to modify something in
the generator, but I don't know what.
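
One possible answer to the "initial level" question, sketched under assumptions: give each seed URL level 1 when it enters the crawl, store the level alongside the URL, and let every fetched page hand level + 1 to its same-site outlinks. The code below uses plain maps as a stand-in for per-URL metadata. The key name "link.level" and the class LevelPropagation are hypothetical, and nothing here is Nutch's actual API.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; plain maps stand in for per-URL crawl metadata.
class LevelPropagation {

    static final String LEVEL_KEY = "link.level"; // assumed metadata key

    /** Metadata attached to a seed URL: the homepage starts at level 1. */
    static Map<String, String> seedMetadata() {
        Map<String, String> meta = new HashMap<>();
        meta.put(LEVEL_KEY, "1");
        return meta;
    }

    /**
     * Called once per fetched page. Reads the page's own level from the store
     * and records level + 1 for each same-host outlink, keeping the lowest
     * level when a link has already been seen.
     */
    static void propagate(String pageUrl, List<String> outlinks,
                          Map<String, Map<String, String>> store) {
        Map<String, String> pageMeta = store.get(pageUrl);
        if (pageMeta == null || !pageMeta.containsKey(LEVEL_KEY)) {
            return; // level of this page is unknown, nothing to pass on
        }
        int childLevel = Integer.parseInt(pageMeta.get(LEVEL_KEY)) + 1;
        String pageHost = URI.create(pageUrl).getHost();
        for (String link : outlinks) {
            String linkHost = URI.create(link).getHost();
            if (pageHost == null || !pageHost.equalsIgnoreCase(linkHost)) {
                continue; // only links within the same site are counted
            }
            Map<String, String> meta = store.computeIfAbsent(link, k -> new HashMap<>());
            String known = meta.get(LEVEL_KEY);
            if (known == null || childLevel < Integer.parseInt(known)) {
                meta.put(LEVEL_KEY, Integer.toString(childLevel));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> store = new HashMap<>();
        store.put("http://example.com/", seedMetadata());
        propagate("http://example.com/",
                  List.of("http://example.com/a", "http://other.org/"), store);
        System.out.println(store);
    }
}
```

With this shape the level never has to be computed from scratch: it is fixed for the seeds and derived for everything else, one fetch round at a time. Where such a step would actually hook into Nutch is left open here.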

Dennis
Re: Adding Level to Website Parse Data

sudhendra seshachala
Dennis,
  I am in the same dilemma as you are.
  Here are my thoughts.
   
  1. I am planning to write a plugin to do it, where the plugin can be adapted to each site's site map and levels.
  2. The Fetcher itself can be modified, but then merging in the latest fixes and enhancements contributed by the community will be very hard.
  3. Another way is to write a prefetcher that fetches all the URLs from a site and writes them to a file. The Nutch crawler can then be triggered to crawl the prefetched URLs. Any unnecessary URLs found within the prefetched pages will have to be ignored. I am still looking for a way to do this.
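
As a rough illustration of option 3, the sketch below filters a list of already discovered URLs down to one site, drops anything matching an exclusion pattern, and writes the result one URL per line, which is the plain-text form a seed list usually takes. The class SeedListWriter, its method names, and the exclusion patterns are hypothetical. How the URLs get discovered in the first place is not covered.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical prefetcher output step, not part of Nutch.
class SeedListWriter {

    /** Keeps URLs on siteHost that match none of the exclusion patterns; removes duplicates. */
    static List<String> selectUrls(String siteHost, List<String> discovered,
                                   List<Pattern> excludes) {
        Set<String> kept = new LinkedHashSet<>();
        for (String url : discovered) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip strings that are not valid URIs
            }
            if (host == null || !host.equalsIgnoreCase(siteHost)) {
                continue; // off-site link
            }
            boolean excluded = excludes.stream().anyMatch(p -> p.matcher(url).find());
            if (!excluded) {
                kept.add(url);
            }
        }
        return new ArrayList<>(kept);
    }

    /** Writes one URL per line (UTF-8). */
    static void writeSeedFile(Path file, List<String> urls) throws IOException {
        Files.write(file, urls);
    }

    public static void main(String[] args) throws IOException {
        List<String> urls = selectUrls("example.com",
            List.of("http://example.com/", "http://example.com/login",
                    "http://other.org/", "http://example.com/"),
            List.of(Pattern.compile("/login")));
        Path out = Files.createTempFile("seed", ".txt");
        writeSeedFile(out, urls);
        System.out.println("Wrote " + urls.size() + " URL(s) to " + out);
    }
}
```

Filtering before the crawl only covers the seed list. Unwanted links found later inside the prefetched pages would still need a separate filter during the crawl itself.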
   
  Please share your thoughts.
  Thanks
   
 




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


               