Standalone vs distributed Nutch

Standalone vs distributed Nutch

brainstorm-2-2
Hi !

I've been running Nutch for a while on a 4-node cluster, and I'm quite
disappointed with my results... I'm quite sure that I'm doing
something wrong, but I've re-read and tested tons of related
documentation to no avail :_(

The problem is that crawling in a single-node setup is actually more
efficient than using clustered Nutch+Hadoop. For instance, given the
same URL input set:

standalone Nutch+Hadoop install (single node): dumped parsed_text is
425 MB after 2 days.
4-node cluster: dumped parsed_text is 55 MB after 2 days :_/

I'm attaching my {hadoop|nutch}-site.xml files... if you are able to
pinpoint the problem, that would be really useful to me. What really
annoys me is how long some of the tasks take: crawldb takes 3+ hours
on the cluster, while in standalone it was a matter of minutes :/

More details:

/state/partition1/hdfs is present on all nodes with actual data on it:

[hadoop@cluster ~]$ cluster-fork du -hs /state/partition1/hdfs
compute-0-1:
197M /state/partition1/hdfs
compute-0-2:
156M /state/partition1/hdfs
compute-0-3:
288M /state/partition1/hdfs

Nutch+Hadoop trunk is checked out in /home/hadoop and exported via NFS
to all nodes (note that DFS lives on separate *local* space under
/state..., which is not exported).

Thanks in advance

Attachments: hadoop-site.xml (3K), nutch-site.xml (3K)

Re: Standalone vs distributed Nutch

brainstorm-2-2
/state/partition1/hdfs/{mapred|temp} is also being created
automatically on DFS with each new crawl... is that OK? Seems weird to
me :/

On Thu, Jul 17, 2008 at 5:44 PM, brainstorm <[hidden email]> wrote:

> /state/partition1/hdfs is present on all nodes with actual data on it:
>
> [hadoop@cluster ~]$ cluster-fork du -hs /state/partition1/hdfs
> compute-0-1:
> 197M    /state/partition1/hdfs
> compute-0-2:
> 156M    /state/partition1/hdfs
> compute-0-3:
> 288M    /state/partition1/hdfs

Re: Standalone vs distributed Nutch

brainstorm-2-2
Sorry, it's actually:

/state/partition1/hdfs/hadoop/mapred/system (not accessible):

org.apache.hadoop.fs.permission.AccessControlException: Permission
denied: user=webuser, access=READ_EXECUTE,
inode="system":hadoop:supergroup:rwx-wx-wx

and:

/state/partition1/hdfs/hadoop/mapred/temp/inject-temp-1098577375/part-00000
(168.92 MB in size)

Is it OK for Nutch+Hadoop to generate temp files on DFS? I thought
that temp files should be generated on each individual node's *local*
filesystem :/

Do I have a wrong directive in my hadoop-site.xml causing this?

Thanks in advance !
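
One way to approach the question above is to list which of the relevant
directories the site file actually overrides. The sketch below is only
an illustration: the property names (hadoop.tmp.dir, mapred.system.dir,
mapred.temp.dir, mapred.local.dir) are assumed from Hadoop's default
configuration of that period, and SAMPLE is a hypothetical file, not the
attached hadoop-site.xml. As far as I know, mapred.system.dir and
mapred.temp.dir are shared directories that default to paths under
hadoop.tmp.dir, so temp data showing up on DFS may well be expected, but
treat that as an assumption to verify against your Hadoop version.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for illustration, NOT the attached hadoop-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/state/partition1/hdfs/${user.name}</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/state/partition1/mapred/local</value>
  </property>
</configuration>
"""

# Property names assumed from Hadoop's default configuration of that era.
WATCHED = (
    "hadoop.tmp.dir",
    "mapred.system.dir",
    "mapred.temp.dir",
    "mapred.local.dir",
)

NOT_SET = "(not set, default applies)"


def site_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop site file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def report(xml_text):
    """Show which of the watched directories the site file overrides."""
    props = site_properties(xml_text)
    return {name: props.get(name, NOT_SET) for name in WATCHED}


for name, value in report(SAMPLE).items():
    print(f"{name:20s} {value}")
```

Anything reported as not set falls back to whatever the default
configuration of the installed Hadoop version says.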

On Thu, Jul 17, 2008 at 6:05 PM, brainstorm <[hidden email]> wrote:

> /state/partition1/hdfs/{mapred|temp} is also being created
> automatically each new crawl on DFS... is it ok ? Seems weird to me :/

Re: Standalone vs distributed Nutch

brainstorm-2-2
Problem solved!

The cause was static /etc/hosts files with wrong IP addresses for the
frontend node :-! Sorry for crossposting, but this is what I sent to
the Hadoop mailing list in response to a related email:

Got this problem too, and fixed it just 5 minutes ago... the nodes had
wrong IP entries for the frontend, and that was slowing down the
reduce phase *a lot*... in numbers:

Wrong hosts file, wordcount example: 3 hrs 45 min 41 sec (4 minutes
of map, the rest reduce)
Right hosts file, wordcount example: 6 min 26 sec
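
For scale, the two wordcount timings can be turned into a ratio. This is
just arithmetic on the numbers quoted above; the variable names are made
up for the example.

```python
from datetime import timedelta

# Wordcount timings reported above.
wrong_hosts = timedelta(hours=3, minutes=45, seconds=41)
right_hosts = timedelta(minutes=6, seconds=26)

# Dividing one timedelta by another yields a plain float.
slowdown = wrong_hosts / right_hosts

print(f"{wrong_hosts.total_seconds():.0f} s vs "
      f"{right_hosts.total_seconds():.0f} s "
      f"-> about {slowdown:.0f}x slower")
# prints "13541 s vs 386 s -> about 35x slower"
```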

Moral of the story: AVOID static hosts files, always use DNS.

PS: Static hosts files were replicated by rocksclusters to all compute
nodes at install (kickstart) time, but were not refreshed afterwards
by "rocks sync dns" or "rocks sync config".
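
A stale hosts file like this can be caught with a small check run on
each node. The sketch below compares hosts-file text against a mapping
of names to the addresses you expect; the function names, host names and
IP addresses are all hypothetical, and the expected mapping has to come
from a source you trust (for example the frontend's own records). It
deliberately does not call the system resolver, because on a typical
node the resolver consults /etc/hosts first and would just echo the
stale entry.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # blank or comment-only line
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # keep the first entry for a name
    return table


def mismatches(hosts_text, expected):
    """Return {name: (ip_in_hosts_file, expected_ip)} for wrong entries.

    A name missing from the hosts file is reported with None as its
    hosts-file address.
    """
    table = parse_hosts(hosts_text)
    bad = {}
    for name, want in expected.items():
        found = table.get(name)
        if found != want:
            bad[name] = (found, want)
    return bad


# Hypothetical example data; substitute the real file and addresses.
sample_hosts = """
127.0.0.1     localhost
10.1.1.1      frontend.local frontend   # stale entry
10.1.255.253  compute-0-1.local compute-0-1
"""
expected = {
    "frontend": "10.1.1.254",
    "compute-0-1": "10.1.255.253",
    "compute-0-2": "10.1.255.252",
}

for name, (found, want) in mismatches(sample_hosts, expected).items():
    print(f"{name}: hosts file says {found}, expected {want}")
```

On a real node the text would come from reading /etc/hosts, and the
same check could be run on every compute node with cluster-fork.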

On Thu, Jul 17, 2008 at 6:37 PM, brainstorm <[hidden email]> wrote:

> sorry it's actually:
>
> /state/partition1/hdfs/hadoop/mapred/system (not accessible):
>
> org.apache.hadoop.fs.permission.AccessControlException: Permission
> denied: user=webuser, access=READ_EXECUTE,
> inode="system":hadoop:supergroup:rwx-wx-wx
>
> and:
>
> /state/partition1/hdfs/hadoop/mapred/temp/inject-temp-1098577375/part-00000
> (168.92 MB in size)
>
> Is it ok for nutch+hadoop to generate temp files on DFS ? I thought
> that temp files should be generated on each individual *local* node
> filesystem :/
>
> Do I have a wrong directive on my hadoop-site.xml causing this ?
>
> Thanks in advance !