webapp for Nutch deploy mode

Gajanan Watkar
Hi all,
I am using Nutch 2.3.1 with HBase 1.2.3 as the storage backend on top of a
Hadoop 2.5.2 cluster in *deploy mode*, with crawled data being indexed to
Solr 6.5.1.
I want to use the *webapp* for creating, controlling and monitoring crawl jobs
in deploy mode.

With the Hadoop cluster, HBase and nutchserver started, I tried to launch a
crawl job through the webapp interface, and InjectorJob failed.
This happened because the seed directory was created on the local filesystem.
I fixed it by copying the seed directory to the same path on HDFS, by editing
the *createSeedFile* method in
*org.apache.nutch.api.resources.SeedResource.java*.

public String createSeedFile(SeedList seedList) {
    if (seedList == null) {
      throw new WebApplicationException(Response.status(Status.BAD_REQUEST)
          .entity("Seed list cannot be empty!").build());
    }
    File seedFile = createSeedFile();
    BufferedWriter writer = getWriter(seedFile);

    Collection<SeedUrl> seedUrls = seedList.getSeedUrls();
    if (CollectionUtils.isNotEmpty(seedUrls)) {
      for (SeedUrl seedUrl : seedUrls) {
        writeUrl(writer, seedUrl);
      }
    }


* //method to copy seed directory to HDFS: Gajanan*
*    copyDataToHDFS(seedFile);*

    return seedFile.getParent();
  }
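
The body of copyDataToHDFS is not shown in this thread, and the snippet above does not show the writer being flushed or closed before the copy. If the helper does not take care of that, the copied seed file could be empty or truncated. Below is a minimal sketch of the ordering only. It uses java.nio.file as a stand-in for the HDFS copy, and the class, method and file names are hypothetical rather than taken from Nutch.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Nutch.
class SeedCopySketch {

  /** Writes one URL per line; try-with-resources closes (and flushes) the writer. */
  static Path writeSeedFile(Path seedDir, List<String> urls) throws IOException {
    Files.createDirectories(seedDir);
    Path seedFile = seedDir.resolve("seed-urls.txt"); // hypothetical file name
    try (BufferedWriter writer =
        Files.newBufferedWriter(seedFile, StandardCharsets.UTF_8)) {
      for (String url : urls) {
        writer.write(url);
        writer.newLine();
      }
    }
    return seedFile;
  }

  /** Local stand-in for the HDFS copy; call it only after the writer is closed. */
  static Path copySeedFile(Path seedFile, Path targetDir) throws IOException {
    Files.createDirectories(targetDir);
    Path target = targetDir.resolve(seedFile.getFileName());
    return Files.copy(seedFile, target, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path work = Files.createTempDirectory("seed-sketch");
    Path seedFile =
        writeSeedFile(work.resolve("local"), Arrays.asList("http://example.org/"));
    Path copied = copySeedFile(seedFile, work.resolve("remote"));
    System.out.println(Files.readAllLines(copied));
  }
}
```

In a real deploy-mode setup the second step would go through the Hadoop FileSystem API instead of java.nio.file; the point here is only that the write completes before the copy starts.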

Then I was able to get up to the index phase, where it complained that the
*solr.server.url* Java property was not set.
*I set JAVA_TOOL_OPTIONS to include the -Dsolr.server.url property.*

*The crawl job is still failing with:*
18/10/11 10:07:03 ERROR impl.RemoteCommandExecutor: Remote command failed
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.nutch.webui.client.impl.RemoteCommandExecutor.executeRemoteJob(RemoteCommandExecutor.java:61)
    at org.apache.nutch.webui.client.impl.CrawlingCycle.executeCrawlCycle(CrawlingCycle.java:58)
    at org.apache.nutch.webui.service.impl.CrawlServiceImpl.startCrawl(CrawlServiceImpl.java:69)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
    at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:97)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I tried to change the default timeout in
*org.apache.nutch.webui.client.impl.RemoteCommandExecutor*:

private static final int *DEFAULT_TIMEOUT_SEC = 300;*  // Can be increased if required
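
The stack trace shows the TimeoutException coming out of FutureTask.get, which is what a timed wait on a job does when the limit expires. One way to avoid a hard-coded limit would be to read it from a system property. The sketch below only illustrates that idea: the property name webui.remote.timeout.sec and the class are hypothetical and are not existing Nutch settings or code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not the actual RemoteCommandExecutor.
class ConfigurableTimeoutSketch {

  // Hypothetical property name, e.g. -Dwebui.remote.timeout.sec=3600
  static final String TIMEOUT_PROPERTY = "webui.remote.timeout.sec";
  static final int DEFAULT_TIMEOUT_SEC = 300;

  /** Returns the configured timeout, falling back to the default if unset or not positive. */
  static int resolveTimeoutSec() {
    int value = Integer.getInteger(TIMEOUT_PROPERTY, DEFAULT_TIMEOUT_SEC);
    return value > 0 ? value : DEFAULT_TIMEOUT_SEC;
  }

  /** Runs the job on a worker thread and waits at most the given time for its result. */
  static <T> T runWithTimeout(Callable<T> job, long timeout, TimeUnit unit)
      throws InterruptedException, ExecutionException, TimeoutException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<T> future = executor.submit(job);
      return future.get(timeout, unit); // throws TimeoutException when the limit expires
    } finally {
      executor.shutdownNow(); // interrupts the job if it is still running
    }
  }

  public static void main(String[] args) throws Exception {
    int timeoutSec = resolveTimeoutSec();
    String result = runWithTimeout(() -> "done", timeoutSec, TimeUnit.SECONDS);
    System.out.println(result + " (limit " + timeoutSec + "s)");
  }
}
```

With something along these lines, a long-running crawl step could be given more time at startup without recompiling the webapp.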

*Summary:*
*In all this, what I am wondering about is:*
*1. No webpage table is being created in HBase for the crawl ID.*
*2. How, in that case, does the crawl get up to the index phase?*

*Finally, the actual question:*

*How do I get my crawl jobs running in deploy mode using the Nutch webapp?
What else do I need to do? Am I missing something very basic?*

Re: webapp for Nutch deploy mode

lewis john mcgibbney-2
Hi Gajanan,
Response inline

On 2018/10/12 07:40:50, Gajanan Watkar <[hidden email]> wrote:

> Hi all,
> I am using Nutch 2.3.1 with HBase 1.2.3 as the storage backend on top of a
> Hadoop 2.5.2 cluster in *deploy mode*, with crawled data being indexed to
> Solr 6.5.1.
> I want to use the *webapp* for creating, controlling and monitoring crawl jobs
> in deploy mode.
>
> With the Hadoop cluster, HBase and nutchserver started, I tried to launch a
> crawl job through the webapp interface, and InjectorJob failed.
> This happened because the seed directory was created on the local filesystem.
> I fixed it by copying the seed directory to the same path on HDFS, by editing
> the *createSeedFile* method in
> *org.apache.nutch.api.resources.SeedResource.java*.
>
> public String createSeedFile(SeedList seedList) {
>     if (seedList == null) {
>       throw new WebApplicationException(Response.status(Status.BAD_REQUEST)
>           .entity("Seed list cannot be empty!").build());
>     }
>     File seedFile = createSeedFile();
>     BufferedWriter writer = getWriter(seedFile);
>
>     Collection<SeedUrl> seedUrls = seedList.getSeedUrls();
>     if (CollectionUtils.isNotEmpty(seedUrls)) {
>       for (SeedUrl seedUrl : seedUrls) {
>         writeUrl(writer, seedUrl);
>       }
>     }
>
>
> * //method to copy seed directory to HDFS: Gajanan*
> *    copyDataToHDFS(seedFile);*
>
>     return seedFile.getParent();
>   }

I was aware of this some time ago and never found the time to fix it. I just checked JIRA as well and there is no ticket for this task; however, I am certain that it has been discussed on this mailing list previously.
Anyway, can you please create an issue in JIRA, label it as affecting 2.x, tag it with both "REST_api" and "web gui", and submit this as a pull request? It would be a huge help.

>
> Then I was able to get up to the index phase, where it complained that the
> *solr.server.url* Java property was not set.
> *I set JAVA_TOOL_OPTIONS to include the -Dsolr.server.url property.*
>
> *The crawl job is still failing with:*
> 18/10/11 10:07:03 ERROR impl.RemoteCommandExecutor: Remote command failed
> java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.nutch.webui.client.impl.RemoteCommandExecutor.executeRemoteJob(RemoteCommandExecutor.java:61)
>     at org.apache.nutch.webui.client.impl.CrawlingCycle.executeCrawlCycle(CrawlingCycle.java:58)
>     at org.apache.nutch.webui.service.impl.CrawlServiceImpl.startCrawl(CrawlServiceImpl.java:69)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
>     at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
>     at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
>     at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:97)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
> I tried to change the default timeout in
> *org.apache.nutch.webui.client.impl.RemoteCommandExecutor*:
>
> private static final int *DEFAULT_TIMEOUT_SEC = 300;*  // Can be increased if required

There are also various issues about this in JIRA. Can you please check them out and let me know if you can find the correct one? Maybe the following: https://issues.apache.org/jira/browse/NUTCH-2313
>
> *Summary:*
> *In all this, what I am wondering about is:*
> *1. No webpage table is being created in HBase for the crawl ID.*

Again, please check JIRA for this information. There may already be something logged which will indicate what is wrong.

> *2. How, in that case, does the crawl get up to the index phase?*

It shouldn't!

>
> *Finally, the actual question:*
>
> *How do I get my crawl jobs running in deploy mode using the Nutch webapp?
> What else do I need to do? Am I missing something very basic?*

As far as I can remember, this functionality has not been baked in... or else it may have been baked in, but only in the 2.x code in Git. Please check out the code from Git and try it there... your results may differ.

Lewis

Re: webapp for Nutch deploy mode

Gajanan Watkar
Thanks, Lewis.
As per your suggestion, I have created an issue
<https://issues.apache.org/jira/browse/NUTCH-2664> in JIRA for creating the
seed directory on HDFS when Nutch is running in deploy mode.
As far as DEFAULT_TIMEOUT_SEC in
org.apache.nutch.webui.client.impl.RemoteCommandExecutor is concerned, I
could not understand why it is hard-coded, as jobs may take a variable amount
of time depending upon the setup and scale of crawling.
And in the case of the webapp, the webpage table not being created in HBase is
still a mystery to me.
I will come back with further findings, and any fixes I can make, once
I find time to delve into this issue more deeply.

-Gajanan


On Fri, Oct 19, 2018 at 12:54 AM Lewis John McGibbney <[hidden email]>
wrote:
