Re: [Nutch Wiki] Update of "OverviewDeploymentConfigs" by PaulBaclace

Stefan Groschupf-2

On 11.11.2005 at 11:48, Apache Wiki wrote:

> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Nutch Wiki"  
> for change notification.
>
> The following page has been changed by PaulBaclace:
> http://wiki.apache.org/nutch/OverviewDeploymentConfigs
>
> New page:
> == Overview of Deployment Configurations in Nutch 0.8 ==
> (11/2005 Paul Baclace)
>
> This page describes a range of deployment configurations, the
> assumptions involved, and the relevant property settings.  The
> primary focus is on a few canonical deployment scenarios and
> surrounding issues.  Relevant properties are described, but a
> complete description of all properties is not attempted here.
>
> The process startup sequence is also described to show how the
> deployments differ.
>
> Flexibility of assumptions is noted with MUST (rigid) or SHOULD  
> (highly recommended, but could be different for the adventurous).
>
> === Configuration File Overview ===
>
> When building Nutch, the conf directory has 2 important property  
> files that are put into the classpath for lookup at runtime:
>
>  * ''nutch-default.xml'': the place for universal defaults as set by
> the Nutch developers.
>  * ''nutch-site.xml'': the highest-priority properties, which
> override all others.
>
> The Java System Properties are ''not'' consulted for Nutch
> properties, so -D style commandline overriding is strongly
> discouraged.  However, System Properties are still used to look up
> standard (non-Nutch) properties.
>
> The bin/nutch sh script places $NUTCH_HOME/conf at the beginning of  
> the classpath so that the xml property files can be found.
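
A minimal sketch of the classpath ordering described above, not the actual bin/nutch script. The default NUTCH_HOME value and the lib/ directory layout are assumptions made for illustration.

```shell
# Sketch only: the NUTCH_HOME default and the lib/ layout are assumptions.
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch}"

# conf goes first so nutch-default.xml and nutch-site.xml are found
# before anything that might be bundled inside a jar.
CLASSPATH="$NUTCH_HOME/conf"

for jar in "$NUTCH_HOME"/lib/*.jar; do
  # Skip the unexpanded pattern when lib/ is empty or missing.
  [ -e "$jar" ] || continue
  CLASSPATH="$CLASSPATH:$jar"
done

echo "$CLASSPATH"
```

Because the conf directory is the first classpath entry, editing the XML files there takes effect without rebuilding any jar.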
>
> === Nutch Shell Scripts ===
>
> A meta-assumption here is that the sh scripts in the nutch bin  
> directory are used to start and control the ensemble of processes  
> across many machines.
>
> The Nutch shell scripts are simple and elegant, and they form a call
> hierarchy, starting at the top level:
>  1. start-all.sh or stop-all.sh - start or stop the whole ensemble.
>  2. nutch-daemons.sh - run a Nutch command on all slave hosts.
>  3. slaves.sh - run a shell command on all slave hosts.
>  4. nutch-daemon.sh - run a Nutch command as a daemon with a
> start|stop argument, like a regular Unix/Linux /etc/rc.local script;
> the process pid is stored during start and used during stop.  Runs
> rsync at start.
>  5. nutch - run a Nutch command using the JVM.
>
> Depending upon the context of use, any level of these scripts can  
> be handy on the command line.
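
To make the "run a shell command on all slave hosts" level concrete, here is a rough sketch of such a loop. It is not the real slaves.sh: run_on_slaves and REMOTE_SHELL are hypothetical names, and only the ~/.slaves default and the NUTCH_SLAVES override come from the page.

```shell
# run_on_slaves is a hypothetical helper, not the real slaves.sh.
# REMOTE_SHELL defaults to ssh; set it to "echo" for a dry run.
run_on_slaves() {
  hostlist="${NUTCH_SLAVES:-$HOME/.slaves}"
  remote="${REMOTE_SHELL:-ssh}"
  while IFS= read -r host; do
    # Ignore empty lines in the host list.
    [ -n "$host" ] || continue
    # Run in the background so slow hosts do not hold up the rest;
    # stdin is redirected so ssh cannot swallow the host list.
    $remote "$host" "$@" </dev/null &
  done < "$hostlist"
  # Return only after every host has finished.
  wait
}
```

For example, `REMOTE_SHELL=echo run_on_slaves mkdir -p /some/dir` prints one line per host instead of contacting it, which is a cheap way to check the host list first.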
>
> === Configuration Assumptions ===
>
> For simplicity of configuration, filenames you pass to commands  
> SHOULD be pathnames that work on all hosts. When working with just  
> a few hosts, this seems to be a limitation, but it obviously makes  
> a lot of sense when hundreds or thousands of machines are involved.
>
>  1. Property settings are meant to be the same across hosts; they
> SHOULD NOT be customized per host (they are not even settable on
> the commandline, so per-process settings are discouraged).
>  2. Filenames and paths are meant to be the same across hosts
> (SHOULD).
>  3. Some file paths are agnostic about NDFS versus the local
> filesystem and are interpreted according to which kind of
> filesystem is in use.
>  4. Each machine, including the master, SHOULD have Nutch installed
> in the same filesystem path.
>  5. The ndfs.data.dir and mapred.local.dir properties list
> comma-separated directories.  Only those that exist are used, so
> not all machines are required to have exactly the same devices.
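
The "only those that exist are used" rule for ndfs.data.dir and mapred.local.dir can be illustrated in shell. This only mimics the described behaviour; existing_dirs is a hypothetical helper, and the real filtering happens inside Nutch, not in a script.

```shell
# existing_dirs is a hypothetical helper: split a comma-separated list
# and print only the directories that exist on this host.
existing_dirs() {
  old_ifs=$IFS
  IFS=,
  # $1 is left unquoted on purpose so it is split on commas.
  for dir in $1; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
    fi
  done
  IFS=$old_ifs
}
```

Calling `existing_dirs "/disk1/ndfs,/disk2/ndfs"` on a host with only one of the two disks prints just that one path, which is why the same property value can be shared by machines with different devices.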
>
> === System Assumptions ===
>
>   1. The env var NUTCH_MASTER is set to the hostname of the master  
> machine.
>   2. The slave nodes are defined by putting a list of hostnames, one
> per line, in ~/.slaves  (alternatively, use NUTCH_SLAVES to refer
> to a different file).
>   3. A cluster of machines is managed from a master machine,
> without a firewall between any of the machines (MUST, for
> simplicity).  Many tcp/ip ports are used.
>   4. The master machine MUST have a no-password login (ssh) to all
> the slave machines, using the same username.
>   5. Set environment variables in ~/.ssh/environment, since ssh
> does not source your .bash_profile.  These include JAVA_HOME,
> NUTCH_LOG_DIR, NUTCH_SLAVES and NUTCH_MASTER.
>   6. Make sure that your NUTCH_LOG_DIR and the directories named in
> ndfs.data.dir exist on all slaves.  This can be done most easily
> with bin/slaves.sh.
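
As an illustration of item 5, the sketch below prints the contents of a ~/.ssh/environment file. write_ssh_environment is a hypothetical helper and every value passed to it is a placeholder. Note that sshd reads this file only when PermitUserEnvironment is enabled in sshd_config.

```shell
# write_ssh_environment is a hypothetical helper; all values passed in
# are placeholders. The file format is one NAME=value per line,
# without "export".
write_ssh_environment() {
  cat <<EOF
JAVA_HOME=$1
NUTCH_LOG_DIR=$2
NUTCH_SLAVES=$3
NUTCH_MASTER=$4
EOF
}
```

Redirecting the output, for example `write_ssh_environment /path/to/jdk /path/to/logs "$HOME/.slaves" "$master" > ~/.ssh/environment`, produces the file; it has to be present on every slave, since it is read on the machine being logged into.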
>
> === Deployment Startup Sequences ===
>
>  A. Cluster deployment with too many machines to customize
> (probably more than 4; 1000 machines should be possible):
>
>   1. bin/slaves.sh rsync-command is used as needed to update jars
> and conf files from master.
>   2. The ensemble starts by running bin/start-all.sh on the master.
>   3. start-all.sh uses bin/nutch-daemons.sh to run one datanode
> process on each slave (in the background without waiting; one
> daemon thread is started per comma-separated storage device, and
> non-existent storage devices in the list are ignored).
>   4. start-all.sh runs one namenode and one jobtracker on the master.
>   5. start-all.sh uses bin/nutch-daemons.sh to run one tasktracker
> process on each slave (in the background without waiting).
>
>
>  B. Cluster of a few machines:
>   1. ''Add more details here''
>
>  C. One developer debugging on one machine:
>   1. ''Add more details here''
>
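
The startup order in scenario A can be sketched as a dry run. This is not the real start-all.sh: start_all and RUN are hypothetical names, and the "start <daemon>" argument order is an assumption based on the start|stop argument mentioned in the script list.

```shell
# Sketch of the order described in scenario A, not the real start-all.sh.
# The "start <daemon>" argument order is an assumption.
# RUN defaults to "echo", so this only prints the commands.
start_all() {
  run="${RUN:-echo}"
  bin="${NUTCH_HOME:-.}/bin"
  $run "$bin/nutch-daemons.sh" start datanode     # one per slave
  $run "$bin/nutch-daemon.sh" start namenode      # master only
  $run "$bin/nutch-daemon.sh" start jobtracker    # master only
  $run "$bin/nutch-daemons.sh" start tasktracker  # one per slave
}
```

The plural script fans out to the slaves while the singular one acts on the local machine, so the listing makes it easy to see which processes end up where.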

Stefan Groschupf-2
Oops, sorry...
Paul, you may want to mention that these scripts require ssh in a
version higher than 3.8.
A great page!

Stefan

On 11.11.2005 at 13:45, Stefan Groschupf wrote:

>
> On 11.11.2005 at 11:48, Apache Wiki wrote:

Doug Cutting-2
In reply to this post by Stefan Groschupf-2
Great stuff, Paul!

A few minor corrections.

Apache Wiki wrote:
>   1. The env var NUTCH_MASTER is set to the hostname of the master machine.

This is optional.  The alternative is to mount a common home directory
with NFS, as many clusters do, and keep the Nutch software there.

Also, NUTCH_MASTER is an rsync path, so it should be set to something of
the form host:/path/to/nutch, e.g., "foo.bar.com:/home/$USER/src/nutch".
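
Since NUTCH_MASTER is an rsync path, the sync step might look roughly like the sketch below. sync_from_master and RUN are hypothetical names, and the rsync flags are an assumption rather than a copy of what nutch-daemon.sh actually runs.

```shell
# sync_from_master is a hypothetical helper. The rsync flags are an
# assumption, not copied from nutch-daemon.sh.
# RUN defaults to "echo", so the command is printed, not executed.
sync_from_master() {
  run="${RUN:-echo}"
  # NUTCH_MASTER unset or empty: nothing to sync (for example, when
  # the Nutch install lives on a shared NFS home directory).
  [ -n "${NUTCH_MASTER:-}" ] || return 0
  $run rsync -a -e ssh --delete "$NUTCH_MASTER/" "${NUTCH_HOME:-$HOME/nutch}"
}
```

With NUTCH_MASTER set to a host:/path value the function prints the rsync command it would run; with it unset the function does nothing, which matches the optional nature of the variable.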

>   2. The slave nodes are defined by putting a list of hostnames, one per line, in ~/.slaves  (alternatively, use NUTCH_SLAVES to refer to a different file).

This location can be altered with the environment variable NUTCH_SLAVES.

Thanks for writing this.

Doug