create and run a nutch crawler using aws emr on a schedule

srinir
Hi Nutch users,

I am trying to run a Nutch crawler periodically on a schedule (like a cron
job). I am running my Nutch setup on AWS EMR to avoid setting up and
maintaining infrastructure. I would like to export the crawled output to S3
(the seed file is already stored in S3) and then terminate the EMR
cluster, as my Nutch job will not run for more than half a day (at least for
now).

Here is my question:

How can I automate creating the AWS EMR cluster with Nutch installed and my
configurations (both EMR and Nutch) applied, and also terminate the cluster
once Nutch finishes?

Here are some ideas I can think of, based purely on my reading; I have not
tried any of them yet.

- write a script using AWS CLI commands to create the EMR cluster, run
the Nutch job, and terminate the cluster once it's done
- use CloudFormation to create the EMR cluster with the necessary application
(Nutch in this case)
- use AWS Data Pipeline to create a schedule and a pipeline for this flow (I
don't know whether Data Pipeline can achieve what I want)
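
The first idea could be sketched roughly as follows. This is only an illustration, not a tested setup: it builds the argument list for `aws emr create-cluster` with `--auto-terminate`, so the cluster shuts down once its steps finish. The function name, the step and bootstrap-action names, the default instance settings, and any script or bucket paths passed in are assumptions made for the example. It also assumes an EMR release where `command-runner.jar` is available and a bootstrap script (not shown) that installs Nutch and its configuration.

```python
import json


def build_create_cluster_command(name, release_label, bootstrap_script_uri,
                                 crawl_command, log_uri,
                                 instance_type="m3.xlarge", instance_count=3):
    """Return the argv for `aws emr create-cluster` for a one-shot crawl cluster.

    crawl_command is the command (as a list of strings) that the single EMR
    step runs on the master node, e.g. a hypothetical wrapper script that a
    bootstrap action has installed.
    """
    # The step is passed as JSON, which avoids quoting problems with the
    # CLI shorthand syntax when arguments contain commas or spaces.
    steps = [{
        "Type": "CUSTOM_JAR",
        "Name": "NutchCrawl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "Jar": "command-runner.jar",
        "Args": list(crawl_command),
    }]
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Hadoop",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--log-uri", log_uri,
        # Hypothetical bootstrap script that installs Nutch and its config.
        "--bootstrap-actions",
        "Path={},Name=InstallNutch".format(bootstrap_script_uri),
        "--steps", json.dumps(steps),
        # Terminate the cluster once all steps have completed.
        "--auto-terminate",
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from a cron job on any machine with the AWS CLI configured; the CLI prints the new cluster ID on success.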

I would be curious to hear how others have approached a similar requirement.

Thanks
Srini

Re: create and run a nutch crawler using aws emr on a schedule

Sebastian Nagel
Hi,

> I would like to export the crawled output to s3
> (already have the seed file stored in s3)

Please, also have a look at
  https://issues.apache.org/jira/browse/NUTCH-2281
(would be great to have a second test for the patch / pull request)

At first glance, all three approaches seem feasible.
Personally, I only have experience with shell scripting
and AWS CLI commands to launch the cluster. It's quite
flexible, but sometimes cumbersome.

Best,
Sebastian

On 01/26/2017 03:09 AM, Srinivasan Ramaswamy wrote:

> [...]


Re: create and run a nutch crawler using aws emr on a schedule

srinir
Thanks Sebastian. It's interesting to note that you have a patch to write
directly to S3. I will check it out. I am curious how you approached shutting
down the EMR cluster for Nutch. Did you do that in the shell script by
checking the exit status of the crawl command?
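
For illustration, acting on the exit status of the crawl command might look something like the sketch below. It is not anything described in this thread: the function names are made up for the example, the crawl command is whatever actually starts the crawl, and the EMR shutdown uses boto3's `terminate_job_flows`, which assumes boto3 is installed and credentials are configured.

```python
import subprocess


def run_then_terminate(crawl_cmd, terminate):
    """Run crawl_cmd, always call terminate() afterwards, return the exit status.

    crawl_cmd is a list of strings; terminate is a zero-argument callable.
    The cluster is shut down even if the crawl fails, so a broken run does
    not leave instances running (and billing).
    """
    try:
        status = subprocess.run(crawl_cmd).returncode
    finally:
        terminate()
    return status


def terminate_emr_cluster(cluster_id):
    """Ask EMR to terminate the given cluster (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3
    boto3.client("emr").terminate_job_flows(JobFlowIds=[cluster_id])
```

A caller could pass `lambda: terminate_emr_cluster(cluster_id)` as `terminate` and use the returned status to decide whether to raise an alert or retry on the next schedule.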

Will CloudFormation make my job easier, or will it lack the flexibility
of a shell script? Has anyone tried that approach?

Thanks
Srini




On Thu, Jan 26, 2017 at 1:58 AM, Sebastian Nagel <[hidden email]> wrote:

> [...]

Re: create and run a nutch crawler using aws emr on a schedule

Sebastian Nagel
Hi Srini,

> I will check it out.
Thanks, would like to see whether it works.

> I am curious how you approached shutting
> down the emr cluster for nutch
I'm running Nutch on Cloudera CDH. When the crawl is done
(which is manually checked), a script terminates all EC2
instances of the cluster (they are identified by a tag).
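
A tag-based shutdown of this kind could be sketched as below. This is not the actual script mentioned above, only an assumption of how one might look with boto3: the function names and the tag key/value are hypothetical, while `describe_instances` (with `tag:` and `instance-state-name` filters) and `terminate_instances` are the standard EC2 calls.

```python
def instance_ids_from_response(response):
    """Extract instance IDs from one EC2 describe_instances response page."""
    return [
        instance["InstanceId"]
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def terminate_tagged_instances(tag_key, tag_value, ec2=None):
    """Terminate all non-terminated EC2 instances carrying the given tag.

    Returns the list of instance IDs that were asked to terminate.
    """
    if ec2 is None:
        import boto3  # needs boto3 and configured AWS credentials
        ec2 = boto3.client("ec2")
    filters = [
        {"Name": "tag:" + tag_key, "Values": [tag_value]},
        {"Name": "instance-state-name",
         "Values": ["pending", "running", "stopping", "stopped"]},
    ]
    ids = []
    # Paginate in case the cluster has more instances than fit in one page.
    for page in ec2.get_paginator("describe_instances").paginate(Filters=filters):
        ids.extend(instance_ids_from_response(page))
    if ids:
        ec2.terminate_instances(InstanceIds=ids)
    return ids
```

Passing the client in as `ec2` makes it easy to try the logic against a stub before pointing it at a real account.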

Best,
Sebastian

On 01/26/2017 07:16 PM, Srinivasan Ramaswamy wrote:

> [...]