Nutch(plugins) and R


Nutch(plugins) and R

Semyon Semyonov
Hello,

I'm looking for a way to use R in Nutch, particularly in the HTML parser, though usage in other parts could be interesting as well. For each parsed document I would like to run a script and feed the results back to the system, e.g. topic detection for the document.
 
NB: I'm not looking for a way of scaling R to Hadoop or HDFS, like Microsoft R Server. That approach uses Hadoop as an execution engine after the crawling process; in other words, first the computationally intensive full crawl, then another computationally intensive R/Hadoop process.
 
Instead, I'm looking for a way of calling R scripts directly from the Java code of map or reduce jobs. Any ideas how to do this? One option is Rserve (a binary R server), but I'm looking for alternatives so I can compare efficiency.
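For comparison, the simplest alternative to Rserve is to shell out to the Rscript binary from the map or reduce task. The sketch below only builds the command line; the script name detect_topic.R is hypothetical, and actually launching it assumes R is installed on every node (Rserve's Java client, RConnection, avoids that per-document process startup cost, which is exactly the efficiency trade-off to measure).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RscriptCall {

    // Build the command line for invoking an R script via Rscript.
    // "detect_topic.R" is a hypothetical script name; we assume the
    // Rscript binary is on the PATH of every Hadoop node.
    static List<String> buildCommand(String script, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add("Rscript");
        cmd.add(script);
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("detect_topic.R", "parsed_text.txt");
        System.out.println(String.join(" ", cmd));
        // Actually running it (requires R installed on the node):
        // Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        // then read the topic from p.getInputStream().
    }
}
```

The process-per-document cost is what makes this approach slow at crawl scale; it is mainly useful as a baseline against Rserve.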

Semyon.
RE: Nutch(plugins) and R

Markus Jelsma
Hello - there have been no responses and I don't know what R is, but you are interested in HTML parsing, specifically topic detection, so here are my thoughts.

We have done topic detection in our custom HTML parser, but in Nutch terms you would do it in a ParseFilter implementation. Get the extracted text - a problem in its own right - and feed it, together with annotated data, into a model builder. Then use the produced model in the ParseFilter to get the topic.

In our case we used Mallet, and it produced decent results, although we needed a lot of code to facilitate the whole thing and to keep results stable between model iterations.

If R has a Java interface, the ParseFilter is the place to be because there you can feed the text into the model, and get the topic back.
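As a rough illustration of that shape - reduced to plain Java, since the real Nutch interface (HtmlParseFilter in Nutch 1.x) needs the Nutch classes on the classpath - a ParseFilter-style hook boils down to "extracted text in, topic out". The keyword lookup below is a stub standing in for the real model call (Mallet, an R binding, or an external service):

```java
import java.util.Locale;

// Simplified stand-in for a Nutch ParseFilter: the real interface also
// receives the parse result and DOM, but the topic-detection step only
// needs the extracted text.
public class TopicFilter {

    // Stubbed topic model; a real implementation would query Mallet,
    // an R binding, or an HTTP service here.
    static String inferTopic(String text) {
        String t = text.toLowerCase(Locale.ROOT);
        if (t.contains("crawl") || t.contains("nutch")) return "web-crawling";
        if (t.contains("regression") || t.contains("model")) return "statistics";
        return "unknown";
    }

    // The ParseFilter hook: in Nutch this result would be written into
    // the document's parse metadata for indexing.
    static String filter(String extractedText) {
        return inferTopic(extractedText);
    }

    public static void main(String[] args) {
        System.out.println(filter("Nutch crawls the web and parses HTML"));
        // → web-crawling
    }
}
```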

If R is not Java-callable, I would - and have done so before - build a simple HTTP daemon around it and call it over HTTP. It breaks the Hadoop principle of bringing code to the data, but rules can be broken. On the other hand, topic models are usually very large due to the size of the vocabulary, so not shipping the model with the code each time has its benefits too.
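A minimal sketch of the client side of that HTTP approach, assuming a hypothetical /topic endpoint (on the R side, something like the plumber package could serve the model); only the request construction runs here, since sending it requires the daemon to be up:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TopicHttpClient {

    // Build a POST request carrying the extracted text. The endpoint is
    // an assumption for illustration; any HTTP wrapper around the R
    // model would do.
    static HttpRequest buildRequest(String endpoint, String text) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(text))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("http://localhost:8000/topic",
                                       "extracted page text");
        System.out.println(req.method() + " " + req.uri());
        // Sending it (requires the daemon to be running):
        // HttpClient.newHttpClient()
        //     .send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Keeping the daemon on each node (called via localhost) recovers some of the code-to-data locality that a central service gives up.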

Regards,
M.
 
Re: RE: Nutch(plugins) and R

Semyon Semyonov
Thanks for the suggestion; it's nice to get some insight into topic detection.

Mallet is probably the most efficient option for a specific algorithm, but R's biggest advantage is its huge coverage of mathematical, data science, and machine learning algorithms (not to mention how easy it is to develop analytical solutions there). I plan to experiment with integrating R and Nutch and to bring back some suggestions in the future.

About R:
R is one of the most popular data science languages (the second is Python).
It has a list of pre-implemented packages for NLP, for example https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
or for deep learning (I'm not sure why you would need it, but you can feel on top of the hype): https://www.r-bloggers.com/deep-learning-in-r-2/
 
