[Fwd: Re: get CrawlDatum]

Uroš Gruber-2
A while ago I posted this on the dev list but got no reply. I wonder
whether this is the right approach and whether I should continue working
on this feature. Do you think this idea would help Nutch, or is it a dead
end that you have already discussed?

regards

Uros

Andrzej Bialecki wrote:

> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand the code flow, the best place would be in Fetcher [262],
>> but I'm not sure that datum holds info about the URL being fetched.
>
> On the input to the fetcher you get a URL and a CrawlDatum (originally
> coming from the crawldb). Check for example how the segment name is
> passed around in metadata, you can use the same method.
>
Hi,

I made a draft patch, but I still see some problems. I know the code
needs to be cleaned up and tested. Right now I don't know what hop number
to assign to external URLs; for internal links it works great.

Here is the whole idea behind these changes.

Injected URLs always get hop 0. During fetching/updating/generating, the
hop value is incremented by 1 (I still have no idea what to do with
external links). Then I can add a config value such as max_hop to stop
the fetcher and generator from creating more URLs.

This way it's possible to limit the crawl vertically (by depth).
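
The hop rules described above could be sketched roughly as follows. This is only an illustration, not the draft patch: the class name, the method names, and the MAX_HOP value of 3 are all made up for the example, and it does not touch any Nutch classes.

```java
public class HopCounterSketch {
    // Hypothetical limit; the post only mentions a config value called max_hop.
    static final int MAX_HOP = 3;

    /** Hop value given to manually injected URLs. */
    static int injectedHop() {
        return 0;
    }

    /** Hop value for a link found on a page whose own hop is parentHop. */
    static int outlinkHop(int parentHop) {
        return parentHop + 1;
    }

    /** True if a URL with this hop value may still be generated and fetched. */
    static boolean withinLimit(int hop) {
        return hop <= MAX_HOP;
    }

    public static void main(String[] args) {
        // Follow a chain of internal links from an injected URL until the limit is hit.
        int hop = injectedHop();
        while (withinLimit(hop)) {
            System.out.println("would fetch a URL at hop " + hop);
            hop = outlinkHop(hop);
        }
        System.out.println("stopped: hop " + hop + " exceeds max_hop " + MAX_HOP);
    }
}
```

Whether the limit should be inclusive (hop <= max_hop) or exclusive is an open choice; the sketch assumes inclusive.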

Comments are welcome.



Re: [Fwd: Re: get CrawlDatum]

Andrzej Białecki-2
Uroš Gruber wrote:
> I made a draft patch, but I still see some problems. I know the code
> needs to be cleaned up and tested. Right now I don't know what hop
> number to assign to external URLs; for internal links it works great.

(The patch changes CrawlDatum itself; I think it would be better to put
the hop counter in CrawlDatum.metaData.)
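
As a rough illustration of keeping the counter in metadata rather than in a new field, the sketch below uses a plain java.util.Map as a stand-in for CrawlDatum.metaData. The real metadata type and its accessors are not shown here, and the key name "hop" and the default of 0 for a missing value are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class HopMetadataSketch {
    // Hypothetical metadata key; not taken from Nutch.
    static final String HOP_KEY = "hop";

    /** Reads the hop counter; a missing entry is treated as hop 0 (an assumption). */
    static int getHop(Map<String, String> metaData) {
        String value = metaData.get(HOP_KEY);
        return value == null ? 0 : Integer.parseInt(value);
    }

    /** Stores the hop counter as a string value under HOP_KEY. */
    static void setHop(Map<String, String> metaData, int hop) {
        metaData.put(HOP_KEY, Integer.toString(hop));
    }

    public static void main(String[] args) {
        // Stand-in for the metadata of a page that was just fetched.
        Map<String, String> parent = new HashMap<>();
        setHop(parent, 0);

        // Metadata for a link found on that page: one hop further away.
        Map<String, String> child = new HashMap<>();
        setHop(child, getHop(parent) + 1);

        System.out.println("parent hop = " + getHop(parent));
        System.out.println("child hop  = " + getHop(child));
    }
}
```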

>
> Here is the whole idea behind these changes.
>
> Injected URLs always get hop 0. During fetching/updating/generating, the
> hop value is incremented by 1 (I still have no idea what to do with
> external links). Then I can add a config value such as max_hop to stop
> the fetcher and generator from creating more URLs.
>
> This way it's possible to limit the crawl vertically (by depth).
>
> Comments are welcome.

Well, it really depends on what you want to do when you encounter an
external link. Do you want to restart the counter, i.e. crawl the new
site at full depth up to max_hop? Then set hop=0. Do you want to
terminate the crawl at that link? Then set hop=max_hop.
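
A small sketch of these two options, under some assumptions: "external" is taken to mean that the link's host differs from the host of the page it was found on, and the names ExternalLinkPolicySketch, Policy, isExternal and outlinkHop are invented for the example. None of this is Nutch API.

```java
import java.net.URI;

public class ExternalLinkPolicySketch {
    /** The two choices discussed above for links that leave the current site. */
    enum Policy { RESTART, TERMINATE }

    /** Assumed definition of "external": the two URLs have different hosts. */
    static boolean isExternal(String fromUrl, String toUrl) {
        String from = URI.create(fromUrl).getHost();
        String to = URI.create(toUrl).getHost();
        return from == null || to == null || !from.equalsIgnoreCase(to);
    }

    /** Hop value for an outlink, given the hop of the page it was found on. */
    static int outlinkHop(int parentHop, boolean external, Policy policy, int maxHop) {
        if (!external) {
            return parentHop + 1;   // internal link: one hop deeper
        }
        // RESTART crawls the new site at full depth; TERMINATE stops at this link.
        return policy == Policy.RESTART ? 0 : maxHop;
    }

    public static void main(String[] args) {
        int maxHop = 3;             // hypothetical max_hop setting
        String page = "http://foo.com/index.html";
        String link = "http://bar.com/hop/hop/index.html";
        boolean external = isExternal(page, link);

        System.out.println("external: " + external);
        System.out.println("restart:   hop = " + outlinkHop(2, external, Policy.RESTART, maxHop));
        System.out.println("terminate: hop = " + outlinkHop(2, external, Policy.TERMINATE, maxHop));
    }
}
```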

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [Fwd: Re: get CrawlDatum]

Uroš Gruber-2
Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> I made a draft patch, but I still see some problems. I know the code
>> needs to be cleaned up and tested. Right now I don't know what hop
>> number to assign to external URLs; for internal links it works great.
>
> (The patch changes CrawlDatum itself; I think it would be better to
> put the hop counter in CrawlDatum.metaData.)
>
I can try to do it with metaData.

>>
>> Here is the whole idea behind these changes.
>>
>> Injected URLs always get hop 0. During fetching/updating/generating,
>> the hop value is incremented by 1 (I still have no idea what to do
>> with external links). Then I can add a config value such as max_hop
>> to stop the fetcher and generator from creating more URLs.
>>
>> This way it's possible to limit the crawl vertically (by depth).
>>
>> Comments are welcome.
>
> Well, it really depends on what you want to do when you encounter an
> external link. Do you want to restart the counter, i.e. crawl the new
> site at full depth up to max_hop? Then set hop=0. Do you want to
> terminate the crawl at that link? Then set hop=max_hop.
>
I talked with a friend about this, and here is what we came up with.
Let's say manually injected URLs are good, have been checked by a human,
and are probably where you want to start from, so setting hop to 0 at
injection is OK. While crawling we have some sort of filtering by host
(regexp etc.). We need not worry about URLs on hosts that are not in our
list, so their hop can be set to whatever, maybe to max_hop.

But here is a scenario: we add foo.com and bar.com by injection. After
crawling, we find on foo.com a link to bar.com/hop/hop/index.html.
We can set that URL's hop to 0 or to max_hop, because we can update it
once we find the URL on the bar.com site itself.

I think the hop check needs to be done while updating, so we don't
end up with a bunch of URLs whose hop is greater than max_hop.
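
To make the foo.com/bar.com scenario and the update-time check concrete, here is a minimal sketch. It assumes one possible merge rule that the thread does not settle on: when a URL is seen again, keep the smaller hop value. The crawl db is stood in for by a plain Map from URL to hop, and HopUpdateSketch and update are invented names.

```java
import java.util.HashMap;
import java.util.Map;

public class HopUpdateSketch {
    /**
     * Records a discovered URL during the update step.
     * URLs beyond maxHop are dropped; for known URLs the smaller hop wins (assumed rule).
     */
    static void update(Map<String, Integer> db, String url, int hop, int maxHop) {
        if (hop > maxHop) {
            return;                       // never store hop values above max_hop
        }
        db.merge(url, hop, Integer::min); // keep the shortest known distance
    }

    public static void main(String[] args) {
        int maxHop = 3;                   // hypothetical max_hop setting
        Map<String, Integer> db = new HashMap<>();
        String deep = "http://bar.com/hop/hop/index.html";

        // Injected URLs start at hop 0.
        update(db, "http://foo.com/", 0, maxHop);
        update(db, "http://bar.com/", 0, maxHop);

        // Found as an external link on foo.com: provisionally set to max_hop.
        update(db, deep, maxHop, maxHop);
        System.out.println("after foo.com crawl: " + db.get(deep));

        // Later found again while crawling bar.com itself, two hops from its root.
        update(db, deep, 2, maxHop);
        System.out.println("after bar.com crawl: " + db.get(deep));

        // A URL one hop beyond the limit is not added at all.
        update(db, "http://bar.com/too/deep.html", maxHop + 1, maxHop);
        System.out.println("too deep stored: " + db.containsKey("http://bar.com/too/deep.html"));
    }
}
```

With this rule, the provisional value given to the external link only matters until the same URL is reached from within its own site.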

I will try to make a decent patch for review, and if others have any
ideas, please comment.

regards

Uros