bug or feature

4 messages

bug or feature

Uroš Gruber-2
Hi,

I've made some changes to CrawlDbReader to read from the fetchlist
produced by the generate command. At first I thought my script had a
problem, because some URLs from inject were missing. Then I tested with
only 6 URLs: I manually checked the file produced by inject against the
one produced by generate, and generate put only 3 URLs in the fetchlist.

I don't quite understand this. As far as I understand the generate
command, it collects URLs from the crawldb, sorts them by score, and
puts them in the crawl_generate directory.
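That understanding of generate's selection step can be sketched roughly
as follows. This is a minimal illustration only, not Nutch's actual
Generator code; all class and variable names here are made up:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch: take URL -> score entries from a "crawldb",
// sort by score descending, and keep the top N for the fetchlist.
public class SelectTopByScore {
    static List<String> select(Map<String, Float> scores, int topN) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Float>comparingByValue(Comparator.reverseOrder()))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Float> crawldb = new LinkedHashMap<>();
        crawldb.put("http://a.example/", 1.0f);
        crawldb.put("http://b.example/", 2.5f);
        crawldb.put("http://c.example/", 0.3f);
        System.out.println(select(crawldb, 2)); // [http://b.example/, http://a.example/]
    }
}
```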

regards,

Uros

Re: bug or feature

Andrzej Białecki-2
Uroš Gruber wrote:

> Hi,
>
> I've made some changes to CrawlDbReader to read from the fetchlist
> produced by the generate command. At first I thought my script had a
> problem, because some URLs from inject were missing. Then I tested
> with only 6 URLs: I manually checked the file produced by inject
> against the one produced by generate, and generate put only 3 URLs
> in the fetchlist.
>
> I don't quite understand this. As far as I understand the generate
> command, it collects URLs from the crawldb, sorts them by score, and
> puts them in the crawl_generate directory.

Are you running in local mode, or in map-reduce mode with several
tasktrackers? What is the number of reduce tasks in this "generate" job?
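The reducer count matters because each key is hashed into one of N
partitions, and each partition writes its own part-NNNNN output file;
inspecting a single part file then shows only a subset of the URLs. A
minimal illustration of Hadoop-style hash partitioning (not Nutch code):

```java
// Each key (URL) is assigned to one of numReduceTasks partitions by
// hashing; with 2 reducers, roughly half the URLs land in part-00000
// and half in part-00001.
public class HashPartitionDemo {
    static int partition(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] urls = {"http://a.example/", "http://b.example/", "http://c.example/"};
        for (String u : urls) {
            System.out.println(u + " -> part-0000" + partition(u, 2));
        }
    }
}
```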

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: bug or feature

Uroš Gruber-2
Andrzej Bialecki wrote:

>
> Are you running in local mode, or in map-reduce mode with several
> tasktrackers? What is the number of reduce tasks in this "generate" job?
>
I'm running in local mode, with mapred.reduce.tasks at its default (1)
and 2 map tasks.

regards

Uros

Re: bug or feature

Uroš Gruber-2
Uroš Gruber wrote:

> I'm running in local mode, with mapred.reduce.tasks at its default (1)
> and 2 map tasks.
>
Debugging through the map and reduce jobs (Generator$Selector [line:
147] - reduce, Generator$Selector [line: 99] - map) looks OK, and it
collects all URLs from the CrawlDB. I can't figure out why data is lost
when it is moved from /tmp to crawl/segments/***/crawl_generate.

If anyone could point me in the right direction where to look.
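One place URLs can silently disappear in generate is a per-reducer cap:
if I recall correctly, Generator's Selector enforces a topN-derived
limit during reduce and stops emitting once it is reached. This may or
may not be the cause here; a minimal sketch of that pattern, with
illustrative names only:

```java
import java.util.*;

// Sketch of a per-reducer cap: once 'limit' entries have been emitted,
// the remaining URLs are dropped, silently shrinking the fetchlist.
// This mimics the pattern, not Nutch's actual Selector code.
public class CappedReduce {
    static List<String> reduce(List<String> urls, long limit) {
        List<String> emitted = new ArrayList<>();
        for (String url : urls) {
            if (emitted.size() >= limit) break; // remaining URLs are lost
            emitted.add(url);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> six = List.of("u1", "u2", "u3", "u4", "u5", "u6");
        System.out.println(reduce(six, 3)); // [u1, u2, u3] - only 3 of 6 survive
    }
}
```

If that is what is happening, a topN (or equivalent limit) of 3 would
explain 6 injected URLs turning into a 3-URL fetchlist.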

regards

Uros