Maintaining source url data (father) during runtime


Tranquil
Hello Nutch developers,

Is there a way to know at runtime where the URL currently being fetched
originated from?
Let me give an example.

Let's say that the site 'www.site.com' was in the initial url.txt file that
Nutch started to crawl from.
The first page to fetch is 'www.site.com/index.htm'. In that page there are
links to:

   - www.site.com/news/top.exe
   - www.site.com/help.html
   - www.page.com/home/about.html (inside www.page.com/home/about.html,
   there is a link to www.cnn.com/news.htm)

I want to know if there is a way to know during fetching that each one of
these files originated from www.site.com/index.htm.

[1] e.g. www.site.com/index.htm --> www.page.com/home/about.html
[2] e.g. www.site.com/index.htm --> www.cnn.com/news.html

I know that Nutch can build the linkdb database, which contains inlink
data, but this is done only after the Nutch crawl is finished.
Is there a way to know this information during the fetching process?

thanks,


--
Eyal Edri
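
The bookkeeping asked for above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a simple in-memory frontier instead of Nutch's crawl database, and the class ProvenanceFrontier and all of its methods are hypothetical names, not Nutch classes. The idea is that the parent is recorded at the moment an outlink is discovered, so it is already known when the outlink is later fetched.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a crawl frontier; this is NOT a Nutch class.
class ProvenanceFrontier {
    // Maps each known URL to the page it was first discovered on.
    // Seeds map to null because they have no parent.
    private final Map<String, String> parentOf = new HashMap<>();
    private final Deque<String> queue = new ArrayDeque<>();

    void addSeed(String url) {
        if (!parentOf.containsKey(url)) {
            parentOf.put(url, null);
            queue.add(url);
        }
    }

    // Called while the parent page is being processed, i.e. during the crawl.
    void addOutlink(String parentUrl, String outlink) {
        if (!parentOf.containsKey(outlink)) { // the first discovery wins
            parentOf.put(outlink, parentUrl);
            queue.add(outlink);
        }
    }

    // Next URL to fetch, or null when the queue is empty.
    String next() {
        return queue.poll();
    }

    String parent(String url) {
        return parentOf.get(url);
    }

    // Walks the parent links back to the seed and returns [seed, ..., url].
    List<String> pathFromSeed(String url) {
        Deque<String> path = new ArrayDeque<>();
        for (String u = url; u != null; u = parentOf.get(u)) {
            path.addFirst(u);
        }
        return new ArrayList<>(path);
    }

    public static void main(String[] args) {
        ProvenanceFrontier f = new ProvenanceFrontier();
        f.addSeed("www.site.com/index.htm");

        String index = f.next(); // "fetch" the seed page
        f.addOutlink(index, "www.site.com/news/top.exe");
        f.addOutlink(index, "www.site.com/help.html");
        f.addOutlink(index, "www.page.com/home/about.html");
        f.addOutlink("www.page.com/home/about.html", "www.cnn.com/news.htm");

        // prints: www.site.com/index.htm --> www.page.com/home/about.html --> www.cnn.com/news.htm
        System.out.println(String.join(" --> ", f.pathFromSeed("www.cnn.com/news.htm")));
    }
}
```

Keeping only the first parent per URL keeps the chain free of cycles. A URL reachable from several pages would need a list of parents instead, which is closer to what the linkdb stores.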

Re: Maintaining source url data (father) during runtime

Tranquil
Hi,

I will try addressing the issue from a different angle.
Where in the code can I add information to each URL about its source
(father) URL?

I mean, when a URL is being parsed and all its outlinks are registered,
is there a way to add a new field to the segment that tells each
outlink who its father was?

I hope that makes sense.

Eyal.



On Nov 25, 2007 1:34 PM, eyal edri <[hidden email]> wrote:

> Hello Nutch developers,
>
> Is there a way to know at runtime where the URL currently being fetched
> originated from?




--
Eyal Edri

Re: Maintaining source url data (father) during runtime

jian chen
Hi, Eyal,

I think you could change the HTML parsing part and record the parent
URL in the metadata of the parse data. I am not that familiar with
the latest Nutch code, so this is a guess at best.

Cheers,

Jian

On Nov 26, 2007 1:48 AM, eyal edri <[hidden email]> wrote:

> Hi,
>
> I will try addressing the issue from a different angle.
> Where in the code can I add information to each URL about its source
> (father) URL?
>
> I mean, when a URL is being parsed and all its outlinks are registered,
> is there a way to add a new field to the segment that tells each
> outlink who its father was?
>
> I hope that makes sense.
>
> Eyal.

Re: Maintaining source url data (father) during runtime

Dennis Kubes-2
In reply to this post by Tranquil
The best place to put this is in parsedata.metadata.  You would need to
parse out the data yourself.  I believe an HTML filter would work best
for this.

Dennis
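
A rough sketch of the filter idea, written against plain Java stand-ins rather than the real Nutch plugin API. Everything named here (ParentUrlFilterSketch, the filter method, the keys "source.parent" and "source.seed") is hypothetical and chosen only for illustration. It also assumes that metadata attached to an outlink travels with that URL until the URL itself is parsed, which this thread does not confirm for Nutch. In a real plugin the write would presumably go through the page's parse metadata (something along the lines of parse.getData().getParseMeta().set(key, value) inside an HtmlParseFilter), but treat that as an assumption to check against the Javadoc of your Nutch version.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; none of these types or keys come from Nutch.
class ParentUrlFilterSketch {
    static final String PARENT_KEY = "source.parent"; // page that linked here
    static final String SEED_KEY = "source.seed";     // seed the chain began at

    // Runs once per parsed page. 'inherited' is the metadata that arrived
    // with this URL (empty for a seed); 'parseMeta' plays the role of the
    // page's parse metadata. Returns the metadata to attach to each outlink.
    static Map<String, Map<String, String>> filter(
            String pageUrl,
            Map<String, String> inherited,
            List<String> outlinks,
            Map<String, String> parseMeta) {
        // A seed carries no inherited seed entry, so it is its own origin.
        String seed = inherited.getOrDefault(SEED_KEY, pageUrl);
        parseMeta.put(SEED_KEY, seed);
        if (inherited.containsKey(PARENT_KEY)) {
            parseMeta.put(PARENT_KEY, inherited.get(PARENT_KEY));
        }

        Map<String, Map<String, String>> perOutlink = new LinkedHashMap<>();
        for (String outlink : outlinks) {
            Map<String, String> meta = new HashMap<>();
            meta.put(PARENT_KEY, pageUrl); // this page is the outlink's father
            meta.put(SEED_KEY, seed);      // the seed is passed down unchanged
            perOutlink.put(outlink, meta);
        }
        return perOutlink;
    }

    public static void main(String[] args) {
        // Parse the seed page, then the page it links to.
        Map<String, Map<String, String>> fromIndex = filter(
                "www.site.com/index.htm",
                new HashMap<>(),
                Arrays.asList("www.page.com/home/about.html"),
                new HashMap<>());
        Map<String, Map<String, String>> fromAbout = filter(
                "www.page.com/home/about.html",
                fromIndex.get("www.page.com/home/about.html"),
                Arrays.asList("www.cnn.com/news.htm"),
                new HashMap<>());

        Map<String, String> cnnMeta = fromAbout.get("www.cnn.com/news.htm");
        System.out.println(cnnMeta.get(PARENT_KEY)); // www.page.com/home/about.html
        System.out.println(cnnMeta.get(SEED_KEY));   // www.site.com/index.htm
    }
}
```

Passing the seed down unchanged means the original starting page is available for every URL without walking the whole chain, while the parent entry always names the immediate father.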


Re: Maintaining source url data (father) during runtime

Tranquil
Hey,

Thanks for the info. Can you help me with the line of code that writes to
the metadata, and tell me where (in which Java class) I should inject this code?

Thanks

On Nov 26, 2007 8:20 PM, Dennis Kubes <[hidden email]> wrote:

> The best place to put this is in parsedata.metadata.  You would need to
> parse out the data yourself.  I believe an HTML filter would work best
> for this.
>
> Dennis



--
Eyal Edri