Best Practices for open source pipeline/connectors


Best Practices for open source pipeline/connectors

Dan Davis-2
I'm trying to research best practices for open source pipelines/connectors on
behalf of my organization. Since we need web crawls, file system crawls, and
database imports, it seems to me that ManifoldCF might be the best fit.

Has anyone combined ManifoldCF with Solr UpdateRequestProcessors or
DataImportHandler? It would be nice to decide in ManifoldCF which request
handler should receive a document or id; barring that, you could post some
fields, including a URL, and have DataImportHandler handle it - DIH already
supports scripts, whereas ManifoldCF may not at this time.
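For illustration, the fallback idea above - posting some fields including a URL - might be sketched as assembling a Solr JSON update body like the following. The field names, core name, and endpoint here are assumptions for the sketch, not taken from any actual schema:

```python
import json

def build_update_payload(doc_id, url, fields):
    """Assemble a Solr JSON update body for a single document.

    The field names (id, url, plus whatever a connector extracts)
    are hypothetical; they must match fields defined in your schema.
    """
    doc = {"id": doc_id, "url": url}
    doc.update(fields)
    return json.dumps([doc])

# A connector could POST this body to http://localhost:8983/solr/<core>/update
# with Content-Type: application/json, then issue a commit.
payload = build_update_payload("doc-1", "http://example.org/page.html",
                               {"title": "Example page"})
```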

Suggestions and ideas?

Thanks,

Dan

Re: Best Practices for open source pipeline/connectors

Alexandre Rafalovitch
And, just to get the stupid question out of the way, you prefer to pay
in developer integration time rather than in purchase/maintenance
fees?

Because, otherwise, I would look at LucidWorks' commercial offering
first, even if just to have a comparison.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853



Re: Best Practices for open source pipeline/connectors

Dan Davis-3
We are looking at LucidWorks, but we also want to see what we can do on our
own, so we can evaluate LucidWorks' value-add relative to other products.


Re: Best Practices for open source pipeline/connectors

"Jürgen Wagner (DVT)"
Hello Dan,
ManifoldCF is a connector framework, not a processing framework. You may therefore try your own lightweight connectors (which usually are not rocket science, and may take less time to write than it takes to configure a super-generic connector of some sort), any connector out there (including Nutch and others), or even commercial offerings from some companies. That alone, however, won't make you very happy - my guess. The key to really creating value out of data dragged into a search platform is the processing pipeline. Depending on the scale of the data and the amount of processing you need to do, you may take a simplistic approach with just some more or less configurable Java components massaging your data until it can be sent to Solr (without using Tika or any other processing in Solr), or you can employ frameworks like Apache Spark to heavily transform and enrich data before feeding it into Solr.
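A "lightweight connector" of the kind described above really can be small. As a minimal sketch, assuming documents are plain-text files and the downstream pipeline accepts (id, body) pairs - all names here are illustrative:

```python
import os

def crawl_filesystem(root):
    """Walk a directory tree and yield (path, contents) pairs.

    Binary-file detection, incremental crawling, and error handling are
    deliberately omitted; a production connector would need all three.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                yield path, f.read()
```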

I prefer a clear separation between connectors, processing, indexing/querying, and front-end visualization/interaction. Only the indexing/querying task do I grant to Solr (or naked Lucene, or Elasticsearch). Each of these task types has entirely different scaling requirements and computing/networking properties, so you definitely don't want them to depend on each other too much. When addressing the needs of several customers, one may even need to swap out one component or another for whatever a customer prefers or needs.
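That separation can be sketched as three stages that only ever see an iterator of records, so any one of them can be scaled or replaced without touching the others. The stage names and the in-memory sink standing in for Solr are assumptions for the sketch:

```python
def connect(source):
    """Connector stage: pull raw records from a source (here, a list)."""
    yield from source

def process(records):
    """Processing stage: enrich/normalize each record independently."""
    for rec in records:
        yield {**rec, "title_norm": rec.get("title", "").strip().lower()}

def index(docs, sink):
    """Indexing stage: hand finished documents to the search engine
    (stubbed here as an in-memory list standing in for Solr)."""
    for doc in docs:
        sink.append(doc)
```

Because each stage consumes and produces plain iterables, moving one stage to its own process or machine only changes the transport between them, not the stages themselves.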

So, my answer is YES. But we've also tried Nutch, our own specialized crawlers, and a number of elaborate connectors for special customer applications. In any case, the output of a connector won't go into Solr directly. It will go into processing, and from there into Solr. I suspect that connectors won't be the challenge in your project. Solr requires a bit of tuning and tweaking, but you'll be fine eventually. Document processing will be the fun part; once you start scaling the zoo of components, this will become evident :-)

What is the volume and influx rate in your scenario?

Best regards,
--Jürgen



--

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: [hidden email], URL: www.devoteam.de


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071



Fwd: Best Practices for open source pipeline/connectors

Dan Davis-2
The volume and influx rate in my scenario are very modest. Our largest
collection under our existing indexing software is about 20 million objects,
the second largest is about 5 million, and more typical collections are in the
tens of thousands. Aside from the 20-million-object corpus, we re-index and
replicate nightly.

Note that I am not responsible for any specific operation, only for advising
my organization on which way to go. My organization wants to understand how
much "programming" will be involved in using Solr rather than higher-level
tools. I have to acknowledge that our current solution involves less
"programming", even as I urge them not to think of programming as a bad
thing ;) From my perspective, "programming" - that is, keeping configuration
files in a git repository (with internal comments and commit messages) - is
much, much more productive than using form-based configuration software. So my
organization's needs and mine may be different...
