Nutch vs Lucidworks Fusion


Jorge Luis Betancourt González
Hi, the new Fusion product from Lucidworks advertises “advanced filesystem and web crawlers”. Has anyone had time to check this out, and how does it compare to the current and future plans for Nutch? I’m just curious; I personally haven’t been able to download and test the product, so I would appreciate your comments on this topic.

Regards,

"Mi selfie por los 5" contest. Details at http://justiciaparaloscinco.wordpress.com

Re: Nutch vs Lucidworks Fusion

Bayu Widyasanyata
I haven't used Fusion yet, but I have already played with Lucidworks 2.8.
The native embedded crawler for Lucidworks is Aperture [0].
IMHO Nutch is better than Aperture in terms of stability, speed and
features.

[0] http://sourceforge.net/projects/aperture/

--
wassalam,
[bayu]

Re: Nutch vs Lucidworks Fusion

lewis john mcgibbney
In reply to this post by Jorge Luis Betancourt González
Hi Folks,

On Thu, Oct 2, 2014 at 4:01 PM, <[hidden email]> wrote:

>
> Hi, the new Fusion product from Lucidworks advertises “advanced filesystem
> and web crawlers”. Has anyone had time to check this out, and how does it
> compare to the current and future plans for Nutch?


I am always disappointed (but never surprised) when people go and make
their own crawlers, then run them on 'Hadoop'.
Nutch is THE native Hadoop application... why people go and write their own
is utterly beyond me. Maybe they like MATLAB too much or something ;) ...
or maybe modern Fortran.

I do not speak on behalf of the Nutch PMC, however what I will say is this.
I know that there are many CIOs and CTOs as well as many engineers on this
list, and I know they are watching this thread. Nutch is a different product
now than it was 1.5 years ago. The work that has been done is unparalleled
in the Python community, and I make this statement boldly. From what I have
seen, Nutch is the most comprehensive (if a bit challenging w.r.t.
configuration) product out there for crawling. There are a number of issues
to be addressed in Jira. We know this. But this still does not change my
opinion of the software.

I have been corrected before for making such statements, however
my justification is as follows:

* There is a HUGE difference between crawling and scraping.
* There is a huge difference between leveraging Apache Tika within the
Nutch framework for metadata augmentation of URLs and merely scraping
(see the sketch after this list).
* There is a HUGE benefit to be obtained by utilising the Nutch community...
which is sh*t hot in comparison to ~2-3 years ago. The same community has
also ensured that Nutch has been making regular releases for a number of
years now.
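
To make the Tika point concrete, here is a minimal standalone sketch of
metadata extraction with Tika in plain Java; it is illustrative only, not
Nutch's actual parse-tika plugin, and the input file name is made up:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaMetadataSketch {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
            Metadata metadata = new Metadata();
            // "fetched-page.html" stands in for content a crawler just fetched
            try (InputStream in = Files.newInputStream(Paths.get("fetched-page.html"))) {
                parser.parse(in, handler, metadata);
            }
            for (String name : metadata.names()) {      // detected metadata fields
                System.out.println(name + " = " + metadata.get(name));
            }
            System.out.println(handler.toString());     // extracted plain text
        }
    }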



> I’m just curious; I personally haven’t been able to download and test the
> product, so I would appreciate your comments on this topic.
>
>
Hopefully the above conveys my take on things. If Lucidworks have some magic
sauce then great. Hopefully they consider bringing some of it back into
Nutch rather than writing some Perl or Python scripts. I would never expect
this to happen, however I am utterly depressed at how often I see this
happening.
Many software projects are failures.
Nutch is not. It is a decade old.
Nutch is a success.

 hth
Lewis

Re: Nutch vs Lucidworks Fusion

Julien Nioche-4
Adding Andrzej to this thread. As most of you know, Andrzej was the Nutch
PMC chair prior to me and a huge contributor to Nutch over the years. He
also works for Lucid.
Andrzej: would you mind telling us a bit about LW's crawler and why you
went for Aperture? Am I right in thinking that this has to do with the fact
that you needed to be able to pilot the crawl via a REST-like service?

Julien

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch vs Lucidworks Fusion

Jorge Luis Betancourt González
In reply to this post by lewis john mcgibbney
Hi Lewis, this was also my impression when I found out that Lucidworks hadn't used Nutch for their product. I also think that Nutch is great, and I initially thought that Fusion used Nutch, but as I haven't had the opportunity to test the product or work with it, I posed the question mainly to satisfy my curiosity, to get a peek under the hood, and to hear opinions from those who have tested Fusion.

In terms of development I think that Nutch is more mature than Aperture, and I definitely thought that if Lucidworks used it then it would mean a lot of work contributed back into the Nutch source code, making a better product in the end and perhaps helping to fix some weak points of Nutch (e.g. the ability to control the crawl via a REST API, notwithstanding the pending JIRA on the Fjodor work).

So in the end, the decision to use Aperture instead of Nutch really struck me.

Regards,


Re: Nutch vs Lucidworks Fusion

Mattmann, Chris A (3010)
In reply to this post by Julien Nioche-4
Thanks for following up on this, Julien, and heya AB :)

We have an action through the DARPA Memex and XDATA work and other
funding to expand the REST services framework for Nutch.
We hope to have something patched and up on JIRA soon.
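
For a flavour of what that might look like once it lands, here is a
hypothetical sketch of kicking off a crawl step over HTTP from Java; the
endpoint path and JSON payload below are illustrative assumptions, not the
final Nutch REST API:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class NutchRestSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint on a locally running Nutch server.
            URL url = new URL("http://localhost:8081/job/create");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);

            // Hypothetical payload: ask the server to run an inject step.
            String payload = "{\"type\":\"INJECT\",\"confId\":\"default\","
                           + "\"args\":{\"url_dir\":\"urls\"}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }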

Of course, I would love to hear more in this discussion.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Nutch vs Lucidworks Fusion

David Arthur
Hello, all. My name is David Arthur and I work with AB on Fusion over here at Lucidworks. Just wanted to point out that we are _not_ using Aperture anymore. We were not super happy with it and have been looking to replace it for some time. We now have a home-grown crawler which is included with Fusion and is significantly more performant than Aperture. I'm sure AB will chime in when the sun rises in his timezone ;)

Cheers
-David

Re: Nutch vs Lucidworks Fusion

Grant Ingersoll-2
I'd also add a few things from my view at Lucidworks. Nutch is a very good crawler. When we truly need large-scale distributed crawls at Lucid, we use Nutch, and all the rest of the features of Fusion (recommendation engine, signal capture, pipelines, etc.) work just fine with it.

As for our crawler (we call it Anda):  

1. It is geared towards small to medium-large crawls. We don't rely on Hadoop (which is significant overhead for most people who just want to crawl a few hundred thousand to a few million pages, in my experience; YMMV), and it isn't a distributed crawler, although it can execute multiple parallel crawls across a large number of nodes.

2. We've made it highly configurable, yet simple to start (just an ID and a start URL).

3. The foundation of our web crawler is actually the basis for several of our crawlers: filesystems, Dropbox, Box, Jive (coming next release) and many more. One of our goals in rewriting the crawler was to have it support a wide variety of crawl activities and make it easy to implement those crawlers. Given the rate at which we are able to add new data sources, I would say we have achieved that (especially compared to our old crawlers).

As for why we wrote our own: to be honest, it started as a side project by one of our team members because he wanted to write a crawler and was never satisfied with what was currently available. After a year or so of that, we started to use it in select commercial engagements where Aperture (our old crawler) was failing and where Nutch was too "heavy" (in reality, again, it was Hadoop that was too heavy), for lack of a better word. From there, it has taken on numerous tasks for us and solved many very complex crawls for highly demanding customers due to its configurability and speed.
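
To make the "not distributed, but multiple parallel crawls" distinction
concrete, here is a toy sketch of a single-process crawler feeding several
fetch threads from one queue; it is purely illustrative and has nothing to
do with Anda's actual code or API:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class ToyParallelCrawler {
        public static void main(String[] args) throws Exception {
            BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
            frontier.add("http://example.com/");                // the single start URL

            ExecutorService pool = Executors.newFixedThreadPool(4); // parallel fetchers
            for (int i = 0; i < 4; i++) {
                pool.submit(() -> {
                    String url;
                    // drain the frontier; a real crawler would fetch each URL,
                    // parse it, respect robots.txt, and enqueue discovered links
                    while ((url = frontier.poll(2, TimeUnit.SECONDS)) != null) {
                        System.out.println(Thread.currentThread().getName()
                                + " fetched " + url);
                    }
                    return null;
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }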


--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com






Re: Nutch vs Lucidworks Fusion

Mattmann, Chris A (3010)
Thanks for the info, Grant. Hope to see more info about the
crawler at some point, and maybe even an ASF Fusion crawler some
day (you guys already contribute a ton to open source, so maybe it
will happen anyway).

Lots of good stuff going on lately in Nutch, Tika, Solr, OODT, your
stuff, Common Crawl - tons of good crawling work all around.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Nutch vs Lucidworks Fusion

Andrzej Białecki-2
In reply to this post by Julien Nioche-4

On 03 Oct 2014, at 12:44, Julien Nioche <[hidden email]> wrote:

> Adding Andrzej to this thread. As most of you know, Andrzej was the Nutch PMC chair prior to me and a huge contributor to Nutch over the years. He also works for Lucid.
> Andrzej: would you mind telling us a bit about LW's crawler and why you went for Aperture? Am I right in thinking that this has to do with the fact that you needed to be able to pilot the crawl via a REST-like service?
>


Hi Julien, and the Nutch community,

It’s been a while. :)

First, let me clarify a few issues:

* indeed I now work for Lucidworks and I’m involved in the design and implementation of the connectors framework in the Lucidworks Fusion product.

* the connectors framework in Fusion allows us to integrate wildly different third-party modules, e.g. we have connectors based on GCM, Hadoop map-reduce, databases, local files, remote filesystems, repositories, etc. In fact, it’s relatively straightforward to integrate Nutch with this framework, and we actually provide docs on how to do this, so nothing stops you from using Nutch if it fits the bill.

* this framework provides a uniform REST API to control the processing pipeline for documents collected by connectors, and in most cases to manage the crawlers’ configurations and processes. Only the first part is in place for the integration with Nutch, i.e. configuration and jobs have to be managed externally, and only the processing and content enrichment is controlled by Lucidworks Fusion. If we get a business case that requires a tighter integration, I’m sure we will be happy to do it.

* the previous generation of Lucidworks products (called “LucidWorks Search”, LWS for short) used Aperture as a Web crawler. This was a legacy integration and, while it worked fine for what it was originally intended to do, it definitely had some painful limitations, not to mention the fact that the Aperture project is no longer active.

* the current version of the product DOES NOT use Aperture for web crawling. It uses a web- and file-crawler implementation created in-house; it re-uses some code from crawler-commons, with some insignificant modifications (see the sketch after this list).

* our content processing framework uses many Open Source tools (among them Tika, OpenNLP, Drools, of course Solr, and many others), on top of which we’ve built a powerful system for content enrichment, event processing and data analytics.
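
For readers who don't know crawler-commons: the kind of reusable piece it
provides is illustrated by the sketch below, which uses its robots.txt
parser; the agent name and URLs are made up for illustration:

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsSketch {
        public static void main(String[] args) {
            // Pretend we already fetched http://example.com/robots.txt
            byte[] robotsTxt = "User-agent: *\nDisallow: /private/\n"
                    .getBytes(StandardCharsets.UTF_8);

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt", robotsTxt,
                    "text/plain", "my-crawler");   // "my-crawler" is a made-up agent

            System.out.println(rules.isAllowed("http://example.com/index.html")); // true
            System.out.println(rules.isAllowed("http://example.com/private/x"));  // false
        }
    }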

So, those are the facts. Now, let’s move on to opinions ;)

There are many different use cases for web/file crawling and many different scalability and content processing requirements. So far the target audience for Lucidworks Fusion required small- to medium-scale web crawls, but with sophisticated content processing, extensive controls over the crawling frontier (handling sessions for depth-first crawls, cookies, form logins, etc) and easy management / control of the process over REST / UI. In many cases also the effort to set up and operate a Hadoop cluster was deemed too high or irrelevant to the core business. And in reality, as you know, there are workload sizes for which Hadoop is a total overkill and the roundtrip for processing is in the order of several minutes instead of seconds.

For these reasons we wanted to provide a web crawler that is self-contained, lean, doesn’t require Hadoop, and scales well enough from small to mid-size workloads without Hadoop’s overhead - and at the same time to provide an easy way to integrate a high-scale crawler like Nutch for customers that need it; for such customers we DO recommend Nutch as the best high-scale crawler. :)

So, in my opinion Lucidworks Fusion satisfies these goals, and provides a reasonable tradeoff between ease of use, scalability, rich content processing and ease of integration. Don’t take my word for it - download a copy and try it yourself!

To Lewis:

> Hopefully the above conveys my take on things. If Lucidworks have some magic
> sauce then great. Hopefully they consider bringing some of it back into
> Nutch rather than writing some Perl or Python scripts. I would never expect
> this to happen, however I am utterly depressed at how often I see this
> happening.

Lucidworks is a Java/Clojure shop, the connectors framework and the web crawler are written in Java - no Perl or Python in sight ;) Our magic sauce is in enterprise integration and rich content processing pipelines, not so much in base web crawling.

So, that’s my contribution to this discussion … I hope this answered some questions. Feel free to ask if you need more information.

--
Best regards,
Andrzej Bialecki <[hidden email]>

--=# http://www.lucidworks.com #=--


Re: Nutch vs Lucidworks Fusion

Julien Nioche-4
Thanks for the explanations Andrzej and Grant!
Great to hear that you are using stuff from crawler-commons.

Julien


--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch vs Lucidworks Fusion

Mattmann, Chris A (3010)
In reply to this post by Andrzej Białecki-2
Thanks AB.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







RE: Nutch vs Lucidworks Fusion

Markus Jelsma-2
In reply to this post by Jorge Luis Betancourt González
Hi Andrzej - how are you dealing with text extraction and other relevant items, such as article dates and accompanying images? And what about other metadata, such as the author of an article or the rating some pasta recipe got? Also, must clients (or your consultants) implement site-specific URL filters to avoid those dreadful spider traps, or do you resolve traps automatically? If so, how?

Looking forward :)

Cheers,
Markus
 
 

RE: Nutch vs Lucidworks Fusion

Markus Jelsma-2
Hi - anything on this? These are interesting topics, so I am curious :)

Cheers,
Markus

 
 

Re: Nutch vs Lucidworks Fusion

Talat Uyarer
Hi Markus,

We used the OpenNLP named-entity extraction tool. It is basic but very
useful if you have a good model.
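
For anyone curious, this is roughly how the OpenNLP name finder is driven
from Java; a minimal sketch that assumes one of the published pre-trained
models ("en-ner-person.bin") and a made-up token array:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class NerSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained person-name model (downloaded separately).
            try (InputStream modelIn = Files.newInputStream(Paths.get("en-ner-person.bin"))) {
                NameFinderME finder = new NameFinderME(new TokenNameFinderModel(modelIn));

                String[] tokens = {"Andrzej", "Bialecki", "works", "for", "Lucidworks", "."};
                Span[] names = finder.find(tokens);
                for (Span s : names) {
                    String[] slice = Arrays.copyOfRange(tokens, s.getStart(), s.getEnd());
                    System.out.println(s.getType() + ": " + String.join(" ", slice));
                }
                finder.clearAdaptiveData(); // reset learned context between documents
            }
        }
    }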



--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: Nutch vs Lucidworks Fusion

Andrzej Białecki-2
In reply to this post by Markus Jelsma-2

On 13 Oct 2014, at 23:03, Markus Jelsma <[hidden email]> wrote:

> Hi - anything on this? These are interesting topics, so I am curious :)

Hi,

Sorry, I was away for a few days (visiting Athens, which is a lovely city at this time of the year… :) )

We use Tika plus a few customised ContentHandlers and parsers to solve a few corner cases, and to extract text or XML plus metadata recursively.

Linked items are noted as such, but processed independently.

We use a processing pipeline consisting of many stages, among others named-entity recognizers, regex extractors and transformers, Drools, etc. This pipeline is fully customizable and scriptable.

We don’t do anything specific yet to avoid spider traps, so yeah, it’s up to the filters to handle them as best as possible...
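
For context, this is the kind of filter involved. Nutch's default
regex-urlfilter.txt ships a rule along these lines to break loops where a
path segment keeps repeating; the exact file varies by version, and the
second rule is a made-up site-specific example:

    # skip URLs with a slash-delimited segment that repeats 3+ times,
    # to break infinite loops (spider traps) like /a/b/a/b/a/b/...
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # hypothetical site-specific rule: a faceted-navigation parameter
    # that keeps re-appending itself
    -.*[?&]sort=.*[?&]sort=

    # accept anything else
    +.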

>
> Cheers,
> Markus
>
>
>
> -----Original message-----
>> From:Markus Jelsma <[hidden email]>
>> Sent: Thursday 9th October 2014 0:46
>> To: [hidden email]; [hidden email]
>> Subject: RE: Nutch vs Lucidworks Fusion
>>
>> Hi Andrzej - how are you dealing with text extraction and other relevant items such as article date and accompanying images? And what about other metadata such as the author of the article or the rating some pasta recipe got? Also, must clients (or your consultants) implement site-specific URL filters to avoid those dreadful spider traps, or do you automatically resolve traps? If so, how?
>>
>> Looking forward :)
>>
>> Cheers,
>> Markus
>>
>>
>> -----Original message-----
>>> From:Andrzej Białecki <[hidden email]>
>>> Sent: Monday 6th October 2014 15:47
>>> To: [hidden email]
>>> Subject: Re: Nutch vs Lucidworks Fusion
>>>
>>> On 03 Oct 2014, at 12:44, Julien Nioche <[hidden email]> wrote:
>>>
>>>> Adding Andrzej to this thread. As most of you know, Andrzej was the Nutch PMC chair prior to me and a huge contributor to Nutch over the years. He also works for Lucid.
>>>> Andrzej: would you mind telling us a bit about LW's crawler and why you went for Aperture? Am I right in thinking that this has to do with the fact that you needed to be able to pilot the crawl via a REST-like service?
>>>>
>>>
>>> Hi Julien, and the Nutch community,
>>>
>>> It’s been a while. :)
>>>
>>> First, let me clarify a few issues:
>>>
>>> * indeed I now work for Lucidworks and I’m involved in the design and implementation of the connectors framework in the Lucidworks Fusion product.
>>>
>>> * the connectors framework in Fusion allows us to integrate wildly different third-party modules, e.g. we have connectors based on GCM, Hadoop MapReduce, databases, local files, remote filesystems, repositories, etc. In fact, it’s relatively straightforward to integrate Nutch with this framework, and we actually provide docs on how to do this, so nothing stops you from using Nutch if it fits the bill.
>>>
>>> * this framework provides a uniform REST API to control the processing pipeline for documents collected by connectors, and in most cases to manage the crawlers’ configurations and processes. Only the first part is in place for the integration with Nutch, i.e. configuration and jobs have to be managed externally, and only the processing and content enrichment is controlled by Lucidworks Fusion. If we get a business case that requires a tighter integration, I’m sure we will be happy to do it.
>>>
>>> * the previous generation of Lucidworks products (called “LucidWorks Search”, or LWS for short) used Aperture as a Web crawler. This was a legacy integration and, while it worked fine for what it was originally intended to do, it definitely had some painful limitations, not to mention the fact that the Aperture project is no longer active.
>>>
>>> * the current version of the product DOES NOT use Aperture for web crawling. It uses a web- and file-crawler implementation created in-house - it re-uses some code from crawler-commons, with some insignificant modifications.
>>>
>>> * our content processing framework uses many Open Source tools (among them Tika, OpenNLP, Drools, of course Solr, and many others), on top of which we’ve built a powerful system for content enrichment, event processing and data analytics.
>>>
>>> So, those are the facts. Now, let’s move on to opinions ;)
>>>
>>> There are many different use cases for web/file crawling, and many different scalability and content processing requirements. So far the target audience for Lucidworks Fusion has required small- to medium-scale web crawls, but with sophisticated content processing, extensive control over the crawling frontier (handling sessions for depth-first crawls, cookies, form logins, etc.) and easy management and control of the process over REST / the UI. In many cases, too, the effort to set up and operate a Hadoop cluster was deemed too high, or irrelevant to the core business. And in reality, as you know, there are workload sizes for which Hadoop is total overkill and the roundtrip for processing is on the order of several minutes instead of seconds.
>>>
>>> For these reasons we wanted to provide a web crawler that is self-contained and lean, doesn’t require Hadoop, scales well enough from small to mid-size workloads without Hadoop’s overhead, and at the same time offers an easy way to integrate a high-scale crawler like Nutch for customers that need it - and for such customers we DO recommend Nutch as the best high-scale crawler. :)
>>>
>>> So, in my opinion Lucidworks Fusion satisfies these goals, and provides a reasonable tradeoff between ease of use, scalability, rich content processing and ease of integration. Don’t take my word for it - download a copy and try it yourself!
>>>
>>> To Lewis:
>>>
>>>> Hopefully the above is my take on things. If LucidWorks have some magic
>>>> sauce then great. Hopefully they consider bringing some of it back into
>>>> Nutch rather than writing some Perl or Python scripts. I would never expect
>>>> this to happen, however I am utterly depressed at how often I see this
>>>> happening.
>>>
>>> Lucidworks is a Java/Clojure shop, the connectors framework and the web crawler are written in Java - no Perl or Python in sight ;) Our magic sauce is in enterprise integration and rich content processing pipelines, not so much in base web crawling.
>>>
>>> So, that’s my contribution to this discussion … I hope this answered some of the questions. Feel free to ask if you need more information.
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <[hidden email]>
>>>
>>> --=# http://www.lucidworks.com #=--
>>>
>>>
>>

---
Best regards,

Andrzej Bialecki
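
As a pointer for readers: the crawler-commons reuse mentioned in the quoted message refers to a small library of shared crawler utilities. A minimal sketch of its robots.txt parser is below; the URL, robots.txt content and agent name are made up, but SimpleRobotRulesParser and BaseRobotRules are the library's published API:

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    import java.nio.charset.StandardCharsets;

    public class RobotsCheck {
        public static void main(String[] args) {
            byte[] robotsTxt = ("User-agent: *\n"
                              + "Disallow: /private/\n").getBytes(StandardCharsets.UTF_8);
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt", robotsTxt,
                    "text/plain", "mycrawler");
            // Check individual URLs against the parsed rules.
            System.out.println(rules.isAllowed("http://example.com/private/x")); // false
            System.out.println(rules.isAllowed("http://example.com/public/y"));  // true
        }
    }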


Re: Nutch vs Lucidworks Fusion

Mattmann, Chris A (3010)
Thanks Andrzej. We have been doing some awesome stuff with Tika
lately (OCR, GDAL and other things), and I'm glad to hear you guys are
integrating with that. If you have any good stuff (like NER, etc.),
it would be appreciated if you pushed it upstream, and we'd be happy
to collaborate on it. We are funded on DARPA Memex, and a number of us
are working on that project to expand Nutch, Tika and Solr.
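
As an illustration of the kind of NER stage under discussion, here is a minimal OpenNLP name-finder sketch; the model path and example tokens are assumptions (OpenNLP publishes pretrained models such as en-ner-person.bin):

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class NerExample {
        public static void main(String[] args) throws Exception {
            try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
                TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
                NameFinderME finder = new NameFinderME(model);
                String[] tokens = {"Andrzej", "works", "for", "Lucidworks", "."};
                Span[] names = finder.find(tokens);  // spans over token indices
                for (String name : Span.spansToStrings(names, tokens)) {
                    System.out.println("PERSON: " + name);
                }
                finder.clearAdaptiveData();          // reset between documents
            }
        }
    }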

CC'ing dev lists for Nutch and Tika for awareness.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
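
On the OCR side mentioned above, Tika can route image content through Tesseract. A minimal sketch follows; the file name is made up, and the tesseract binary must be installed and on the PATH for Tika to pick it up:

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.sax.BodyContentHandler;

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class OcrExample {
        public static void main(String[] args) throws Exception {
            TesseractOCRConfig ocr = new TesseractOCRConfig();
            ocr.setLanguage("eng");                  // OCR language
            ParseContext context = new ParseContext();
            context.set(TesseractOCRConfig.class, ocr);
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get("scan.png"))) {
                new AutoDetectParser().parse(in, handler, metadata, context);
            }
            System.out.println(handler);             // OCR'ed text
        }
    }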





