Nutch 2.4 with selenium

Nutch 2.4 with selenium

Gajalakshmi G
Hi all,

I am trying to crawl a dynamic web page using Nutch 2.4 with Selenium 3.6.0 and Firefox 79. I get the error below in the injector job itself.

java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.BufferedReader.<init>(BufferedReader.java:101)
    at java.io.BufferedReader.<init>(BufferedReader.java:116)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:199)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:171)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:62)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:113)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Please guide me on resolving this issue.



Thanks & Regards,

Gajalakshmi.G

Assistant Consultant

Tata Consultancy Services
Mailto: [hidden email]<https://mail.tcs.com/owa/redir.aspx?C=15cf4bf65eff4bdab465e0a2dd682f11&URL=mailto%3agajalakshmi.g%40tcs.com>
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you



Re: Nutch 2.4 with selenium

Shashanka Balakuntala
Hi Gajalakshmi,

The NPE can be thrown when the rules file is not found on disk. Check that
the file conf/regex-urlfilter.txt exists relative to your current working
directory.
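
The top frames of the reported trace (Reader.<init> called from BufferedReader.<init>) are what the JDK produces when a null reader is wrapped in a BufferedReader. The sketch below reproduces only that JDK behaviour. It assumes, without confirmation from the Nutch source, that the filter received a null reader because its rules file could not be resolved. The class NullReaderDemo and the method describe are hypothetical names used for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class: not part of Nutch or Hadoop.
final class NullReaderDemo {

    /** Wraps the given reader in a BufferedReader and reports what happened. */
    static String describe(Reader rules) {
        try (BufferedReader in = new BufferedReader(rules)) {
            return "ok";
        } catch (NullPointerException e) {
            // With a null reader, the exception is thrown inside the
            // java.io.Reader constructor, before any line is read.
            StackTraceElement top = e.getStackTrace()[0];
            return "NPE in " + top.getClassName() + "." + top.getMethodName();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: a lookup for a missing rules file yields null.
        System.out.println(describe(null));
        // A reader over some rule text wraps without error.
        System.out.println(describe(new StringReader("+.")));
    }
}
```

If the first call reports the Reader constructor as the failing frame, the trace above is consistent with a rules file that was not found where the filter looked for it.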


*Regards*
  Shashanka Balakuntala Srinivasa



On Wed, Oct 7, 2020 at 2:09 PM Gajalakshmi G <[hidden email]> wrote:

> I am trying to crawl dynamic webpage using Nutch 2.4 with Selenium 3.6.0
> with Firefox version 79. I am getting the below error in injector job
> itself.
> [...]

Re: Nutch 2.4 with selenium

Gajalakshmi G
Hi,

Thanks for the response. The 'conf/regex-urlfilter.txt' file is present in the current working directory.
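
A file that exists under the working directory is not necessarily visible to a lookup that goes through the classpath. The following is a minimal sketch for checking both locations. It assumes (this thread does not confirm it) that the rules file may be resolved by name as a classpath resource rather than as a path relative to the working directory. RulesFileLocator and its methods are hypothetical helper names.

```java
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: not part of Nutch or Hadoop.
final class RulesFileLocator {

    /** True if the path, resolved against the working directory, is readable. */
    static boolean readableOnDisk(String relativePath) {
        Path resolved = Paths.get(relativePath).toAbsolutePath();
        return Files.isReadable(resolved);
    }

    /** URL of the named classpath resource, or null if it is not on the classpath. */
    static URL findOnClasspath(String resourceName) {
        return RulesFileLocator.class.getClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        String name = "regex-urlfilter.txt";
        System.out.println("working directory: " + System.getProperty("user.dir"));
        System.out.println("conf/" + name + " readable on disk: "
                + readableOnDisk("conf/" + name));
        System.out.println(name + " on classpath: " + findOnClasspath(name));
    }
}
```

Running it with the conf directory on the classpath and again without it shows whether the two checks can disagree in a given setup.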

Please guide me, or share useful links, on standalone Nutch crawling with Selenium.



Thanks & Regards,

Gajalakshmi.G

Assistant Consultant

Tata Consultancy Services
Mailto: [hidden email]<https://mail.tcs.com/owa/redir.aspx?C=15cf4bf65eff4bdab465e0a2dd682f11&URL=mailto%3agajalakshmi.g%40tcs.com>

________________________________
From: Shashanka Balakuntala <[hidden email]>
Sent: Wednesday, October 7, 2020 5:49 PM
To: [hidden email] <[hidden email]>
Subject: Re: Nutch 2.4 with selenium

Hi Gajalakshmi,

The NPE can be thrown because of the file not found on the disk. So in the
working directory/current directory check if you have the file
conf/regex-urlfilter.txt



Re: Nutch 2.4 with selenium

Sebastian Nagel-2
Hi,

> Nutch 2.4 with selenium

Nutch 2.4 does not include any plugin for Selenium. In addition, 2.4 is, for now,
the last release on the 2.x branch, which is no longer maintained. You should use
1.x (1.17 is the most recent release).

> standalone nutch crawling with selenium.

For 1.x there's a good README on how to set up protocol-selenium:
  https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md

In general, the tutorial is the recommended way to start:
  https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
Please try to get it running without Selenium first. It is important to understand
how Nutch works before you start with the clearly more complex Selenium-based crawling.

Best,
Sebastian

On 10/7/20 2:49 PM, Gajalakshmi G wrote:

> Hi,
>
> Thanks for the response, the 'conf/regex-urlfilter.txt' file was available inside the current working directory.
>
> Please guide me or share me useful links on standalone  nutch crawling with selenium.