Can this be achieved? (Was: document support for file system crawling)


Eivind Hasle Amundsen
First: Please pardon the cross-post to solr-user for reference. I hope
to continue this thread in solr-dev. Please answer to solr-dev.

> 1) more documentation (and possibly some locking configuration options) on
> how you can use Solr to access an index generated by the Nutch crawler (I
> think Thorsten has already done this) or by Compass, or any other system
> that builds a Lucene index.

Thorsten Scherler? Is this code available anywhere? Sounds very
interesting to me. Maybe someone could elaborate on the differences
between the indexes created by Nutch/Solr/Compass/etc., or point me in
the direction of an answer?

> 2) "contrib" code that runs as its own process to crawl documents and
> send them to a Solr server. (maybe it parses them, or maybe it relies on
> the next item...)

Do you know FAST? It uses a step-by-step approach ("pipeline") in which
all of these tasks are done. Much of it is tuned in an easy web tool.

The point I'm trying to make is that contrib code is nice, but a
"complete package" with these possibilities could broaden Solr's appeal
somewhat.

> 3) Stock "update" plugins that can each read raw input streams of some
> widely used file formats (PDF, RDF, HTML, XML of any schema) and have
> configuration options telling them what fields in the schema each
> part of their document type should go in.

Exactly, this sounds more like it. But if similar input streams can be
handled by Nutch, what's the point in using Solr at all? The HTTP APIs?
In other words, both Nutch and Solr seem to have functionality that
enterprises would want. But neither gives you the "total solution".

Don't get me wrong, I don't want to bloat the products, even though it
would be nice to have a crossover solution which is easy to set up.

The architecture could look something like this:

Connector -> Parser -> DocProc -> (via schema) -> Index

Possible connectors: JDBC, filesystem, crawler, manual feed
Possible parsers: PDF, whatever

Both connectors, parsers AND the document processors would be plugins.
The DocProcs would typically be adjusted for each enterprise's needs, so
that it fits with their schema.xml.
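To make the sketch concrete, the pipeline above could look something like the following set of plugin interfaces. This is purely illustrative pseudocode made runnable; none of these names correspond to an existing Solr or Nutch API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical plugin interfaces for the sketched architecture:
// Connector -> Parser -> DocProc -> (via schema) -> Index.
// The names are illustrative only, not an existing Solr/Nutch API.
interface Connector { Iterable<byte[]> fetch(); }                 // JDBC, filesystem, crawler, manual feed
interface Parser { String parse(byte[] raw); }                    // PDF, HTML, whatever
interface DocProc { Map<String, String> toFields(String text); }  // maps text onto schema.xml fields

public class Pipeline {
    // Run each raw document through the three pluggable stages and
    // collect the schema-ready field maps for indexing.
    public static List<Map<String, String>> run(Connector c, Parser p, DocProc d) {
        List<Map<String, String>> docs = new ArrayList<Map<String, String>>();
        for (byte[] raw : c.fetch()) {
            docs.add(d.toFields(p.parse(raw)));
        }
        return docs;
    }
}
```

Each stage is independently replaceable, which is the point of the proposal: an enterprise swaps in its own DocProc to match its schema.xml without touching the connectors or parsers.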

Problem is: I haven't worked enough with Solr, Nutch, Lucene, etc. to
really know all possibilities and limitations. But I do believe that the
outlined architecture would be flexible and answer many needs. So the
question is:

What is Solr missing? Could parts of Nutch be used in Solr to achieve
this? How? Have I misunderstood completely? :)

Eivind

Merging Results from Multiple Solr Instances

Sangraal Aiken-2
I have three instances of Solr on a single machine that I would like  
to query as if they were a single instance.

I was wondering if there's a facility, or if anyone has any  
recommendations, for searching across multiple instances with a  
single query, or merging the results of multiple instances into one  
result set.

-STA



Re: Can this be achieved? (Was: document support for file system crawling)

Thorsten Scherler-3
In reply to this post by Eivind Hasle Amundsen
On Tue, 2007-01-16 at 16:28 +0100, Eivind Hasle Amundsen wrote:
> First: Please pardon the cross-post to solr-user for reference. I hope
> to continue this thread in solr-dev. Please answer to solr-dev.
>
> > 1) more documentation (and possibly some locking configuration options) on
> > how you can use Solr to access an index generated by the Nutch crawler (I
> > think Thorsten has already done this) or by Compass, or any other system
> > that builds a Lucene index.
>
> Thorsten Scherler?

Hmm, I did the exact opposite. Let me explain my use case. I am
working on a part of a portal, http://andaluciajunta.es. The new version
of http://andaluciajunta.es/BOJA is this part. The current version is
based on a proprietary CMS in a dynamic environment.

The new development uses Apache Forrest to generate static HTML. Now,
coming to Solr/Nutch: at
http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html you can
find the current search engine for the BOJA. This will be changed to a
Solr-powered solution.

As I said, I am only doing one part of the portal, and the main portal
has a search engine as well: http://andaluciajunta.es/aj-sea-.html This
search engine will be based on Nutch in the next version. The special
characteristic is that this main portal search engine has to search
against the Solr-based BOJA index, meaning Nutch will have to search the
Solr index and not vice versa.

What I did before we decided to go with Solr is a simple test. I copied
my Solr index into a Nutch instance and dispatched a couple of queries.
The only thing you need is to keep your Solr schema as close as
possible to the one Nutch uses. For example, Nutch uses "content",
"url" and "title" as default fields when returning the search result. If
you do not have these fields in your Solr schema then Nutch will return
null.

> Is this code available anywhere?

As stated above, it is a couple of lines in the Solr schema:
<field name="title" type="string" stored="true" />
<field name="content" type="text" indexed="true" stored="true" />
<field name="url" type="string" stored="true" />

Then you just need to point your Nutch instance to this index for
searching.

The same is true (I guess) for Solr searching a Nutch index. You could
use Nutch to update the index, point Solr to the index, and it should
work (if you have defined all fields in the schema).
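For the first direction (pointing the Nutch search webapp at an existing index), Nutch builds of this era read the index location from the searcher.dir property in nutch-site.xml. A minimal sketch, assuming the Lucene index sits in an index/ subdirectory under that path (the path itself is an example):

```xml
<!-- nutch-site.xml fragment: tell the Nutch searcher where to find an
     existing Lucene index (it looks for an "index/" subdirectory here). -->
<property>
  <name>searcher.dir</name>
  <value>/path/to/crawl</value>
</property>
```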

> Sounds very
> interesting to me. Maybe someone could elaborate on the differences
> between the indexes created by Nutch/Solr/Compass/etc., or point me in
> the direction of an answer?
>

I am far from being an expert, but actually the only real difference I
see is the usage of field names. All the indexes could be searched with a
raw Lucene component (if they are based on the same Lucene version).

> > 2) "contrib" code that runs as its own process to crawl documents and
> > send them to a Solr server. (maybe it parses them, or maybe it relies on
> > the next item...)
>
> Do you know FAST? It uses a step-by-step approach ("pipeline") in which
> all of these tasks are done. Much of it is tuned in an easy web tool.
>
> The point I'm trying to make is that contrib code is nice, but a
> "complete package" with these possibilities could broaden Solr's appeal
> somewhat.

Hmm, I think like Hoss on this: why would we want to do the same work as
Nutch? If you need a crawler, why not use the one from Nutch and change
some lines? I actually use Forrest as a crawler when I generate the new
sites, which then pushes the content to the Solr server via a plugin
I developed:
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/
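For reference, what such a plugin ultimately sends is Solr's XML update message, POSTed to the server's /update handler (followed by a <commit/>). The URL, field names, and values below are examples matching the schema fields shown above:

```xml
<!-- POST this to e.g. http://localhost:8983/solr/update,
     then POST <commit/> to make the documents searchable. -->
<add>
  <doc>
    <field name="url">http://example.com/page.html</field>
    <field name="title">Example page</field>
    <field name="content">Body text extracted during the Forrest build.</field>
  </doc>
</add>
```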

>
> > 3) Stock "update" plugins that can each read raw input streams of some
> > widely used file formats (PDF, RDF, HTML, XML of any schema) and have
> > configuration options telling them what fields in the schema each
> > part of their document type should go in.
>
> Exactly, this sounds more like it. But if similar input streams can be
> handled by Nutch, what's the point in using Solr at all? The HTTP APIs?
> In other words, both Nutch and Solr seem to have functionality that
> enterprises would want. But neither gives you the "total solution".
>

Not sure. I am using Solr because I did not have to develop three
different Nutch plugins to make it work. Further, I have occasional
updates where I push a certain set of documents to the server, so there
is no need for a crawler.

> Don't get me wrong, I don't want to bloat the products, even though it
> would be nice to have a crossover solution which is easy to set up.
>
> The architecture could look something like this:
>
> Connector -> Parser -> DocProc -> (via schema) -> Index
>
> Possible connectors: JDBC, filesystem, crawler, manual feed
> Possible parsers: PDF, whatever

> Both connectors, parsers AND the document processors would be plugins.
> The DocProcs would typically be adjusted for each enterprise's needs, so
> that it fits with their schema.xml.
>
> Problem is: I haven't worked enough with Solr, Nutch, Lucene, etc. to
> really know all possibilities and limitations. But I do believe that the
> outlined architecture would be flexible and answer many needs.

Not sure.

> So the
> question is:
>
> What is Solr missing? Could parts of Nutch be used in Solr to achieve
> this? How? Have I misunderstood completely? :)
>

Solr and Nutch are both based on Lucene, meaning you COULD reuse nearly
everything. From my own experience (I am an Apache Lenya and Apache
Forrest committer, both based on Cocoon) I must say that interproject
collaboration is very hard to achieve.

Anyway, if you need a crawler but want to use Solr, look at the crawling
code of Nutch and write a standalone crawler that updates the Solr
index.

salu2

> Eivind


Re: Merging Results from Multiple Solr Instances

Chris Hostetter-3
In reply to this post by Sangraal Aiken-2

1) please don't reply to another thread with a new subject that is
unrelated ... it makes following threads in mail readers and mailing list
archives difficult.

2) the mailing list archives have some recent discussion on this you
should look at...

http://www.nabble.com/forum/Search.jtp?forum=14479&local=y&query=merging
http://www.nabble.com/forum/Search.jtp?forum=14479&local=y&query=multiple+indices

3) this question really belongs on solr-user, as it is about using Solr --
not the underlying Solr development. if you have followup questions after
reading the archives, please post your questions there.

: I have three instances of Solr on a single machine that I would like
: to query as if they were a single instance.
:
: I was wondering if there's a facility, or if anyone has any
: recommendations, for searching across multiple instances with a
: single query, or merging the results of multiple instances into one
: result set.



-Hoss


Re: Can this be achieved? (Was: document support for file system crawling)

Eivind Hasle Amundsen
In reply to this post by Thorsten Scherler-3
> (...) http://andaluciajunta.es/aj-sea-.html This
> search engine will be based on Nutch in the next version. The special
> characteristic is that this main portal search engine has to search
> against the Solr-based BOJA index, meaning Nutch will have to search the
> Solr index and not vice versa.

Looks interesting, too bad I don't understand the language :) But I do
get the idea.

> <field name="title" type="string" stored="true" />
> <field name="content" type="text" indexed="true" stored="true" />
> <field name="url" type="string" stored="true" />

This is valuable info to a newbie like me. Thanks a lot! It also makes
me think "why didn't they make Nutch more general", but I guess they
wanted consistency (and it's probably configurable in Nutch, hidden
somewhere, anyway).

> Hmm, I think like Hoss on this: why would we want to do the same work as
> Nutch? If you need a crawler, why not use the one from Nutch and change
> some lines? I actually use Forrest as a crawler when I generate the new
> sites, which then pushes the content to the Solr server via a plugin
> I developed:
> http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

Nice one. I didn't know about Forrest, so thanks for the advice. My
"needs" are actually not related to a certain site or application at
all. I am here for pure interest in Lucene/Solr/Nutch/etc, and the
search field in general (enterprise in particular). Think of my needs as
more of R&D, if you'd like.

Ultimately I hope to be able to contribute, but don't know where to
start (and how much time/resources I have).

> Not sure. I am using Solr because I did not have to develop three
> different Nutch plugins to make it work. Further, I have occasional
> updates where I push a certain set of documents to the server, so there
> is no need for a crawler.

My suggestion is independent of how often docs are indexed. Everything
should be possible - manual feed, crawler, filesystem surveillance,
database transaction reports - as long as this is kept separate; the
limit lies in one's imagination.

>> Problem is; I haven't worked enough with Solr, Nutch, Lucene etc. to
>> really know all possibilities and limitations. But I do believe that the
>> outlined architecture would be flexible and answer many needs.
>
> Not sure.

Well, I am thinking about a way to address the same market as some
commercial vendors. They should not and may not be copied, so don't get
me wrong. But I do know something about this market, or at least I like
to think so.

> (...) I must say that interproject
> collaboration is very hard to achieve.

I take your word for it :) I guess one way is to just code/create the
damn thing, not talk about it like I do now. *dreaming*

> Anyway, if you need a crawler but want to use Solr, look at the crawling
> code of Nutch and write a standalone crawler that updates the Solr
> index.

Will do! Thanks for a full and wise reply.

Eivind

Re: Can this be achieved? (Was: document support for file system crawling)

Chris Hostetter-3
In reply to this post by Eivind Hasle Amundsen

: > 2) "contrib" code that runs as its own process to crawl documents and
: > send them to a Solr server. (maybe it parses them, or maybe it relies on
: > the next item...)
:
: Do you know FAST? It uses a step-by-step approach ("pipeline") in which
: all of these tasks are done. Much of it is tuned in an easy web tool.
:
: The point I'm trying to make is that contrib code is nice, but a
: "complete package" with these possibilities could broaden Solr's appeal
: somewhat.

in my limited experience, commercial applications tend to be "all in one
solutions" not so much because it really adds value that they are "all in
one" but because it helps with vendor lock in -- companies tend to want
to give you a single monolithic product, because if they gave you lots of
little products that tried to do just one thing very well, you might
decide that one of their little products is crap, and write your own
replacement for just that piece using a great open-source library you
found .. and then you might realize that this *other* open-source library
would make it really easy for you to replace this other little piece of
their system and would be a lot more efficient ... etc.  the point being
that once they've got you using a monolithic application, it's a lot
harder to stop using the whole thing all at once than it would be for you
to stop using 1 of N mini-applications they provide.

open source projects on the other hand, tend to work well when they are
composed of lots of little pieces -- because little pieces are easier to
work on when you have a finite number of developers working in their spare
time, because each developer can work on a few pieces at a time, and those
pieces can be reviewed/used by other people even if the system as a whole
isn't finished.

I ramble about this to try and explain why Solr may not be what you would
consider a "complete package" at the moment .... and why it may never
reach the state you think would make it a "complete package" ... because
there are a lot of people out there who don't need it to be -- it would be
hard to be a full blown GUI configurable, web crawling, document
detecting, customizable schema based application and still allow for
people to use small pieces of it.

To put it another way: it's a lot easier for people to put reusable
components with clean APIs together in interesting ways than it is for
people to extract reusable components with clean APIs from a monolithic
application.

: Exactly, this sounds more like it. But if similar input streams can be
: handled by Nutch, what's the point in using Solr at all? The HTTP APIs?
: In other words, both Nutch and Solr seem to have functionality that
: enterprises would want. But neither gives you the "total solution".

if what you care about is extracting text from arbitrary documents, that's
what Nutch does well -- it doesn't worry about trying to extract
complex structure from the documents, so it can parse/index lots of
document formats into the same index.  Solr's goal is to let *you* define
the index format, but that requires you defining what data goes into which
fields as well, and that makes generic reusable document crawlers/parsers
harder to get "right" in a way that can work for anyone.





-Hoss


Re: Can this be achieved? (Was: document support for file system crawling)

Bertrand Delacretaz
On 1/17/07, Chris Hostetter <[hidden email]> wrote:

> ...To put it another way: it's a lot easier for people to put reusable
> components with clean APIs together in interesting ways than it is for
> people to extract reusable components with clean APIs from a monolithic
> application....

Very much +1 on this, the beauty of Solr is that it does *one* thing
very well, it's important to keep it this way IMHO.

-Bertrand

Re: Can this be achieved? (Was: document support for file system crawling)

Eivind Hasle Amundsen
In reply to this post by Chris Hostetter-3
> (...) the point being
> that once they've got you using a monolithic application, it's a lot
> harder to stop using the whole thing all at once than it would be for you
> to stop using 1 of N mini-applications they provide.

Well, FAST is composed of many small, modular products that can be
replaced by other (open source) parts. It is not monolithic. The first
time you install it, it might appear to be just one giant beast. However
it is not.

> I ramble about this to try and explain why Solr may not be what you would
> consider a "complete package" at the moment .... and why it may never
> reach the state you think would make it a "complete package" ... because
> there are a lot of people out there who don't need it to be -- it would be
> hard to be a full blown GUI configurable, web crawling, document
> detecting, customizable schema based application and still allow for
> people to use small pieces of it.

I am not arguing against this. I think my point didn't get through, then.

Compare this to Linux distributions. People still use them, right? What
about an "enterprise search distro"? That is exactly what some
commercial vendors offer, only far less elegant than anything containing
Lucene et al would probably be.

> To put it another way: it's a lot easier for people to put reusable
> components with clean APIs together in interesting ways than it is for
> people to extract reusable components with clean APIs from a monolithic
> application.

Yes, I agree completely, and the strength is exactly what you say - they
focus on doing a small thing very well. I believe this fact would make
such a "search distribution" even more appealing.

Another approach to achieve the same goal though, could be just to
bundle a customized system as a virtual machine snapshot, but that could
very well be too limiting.

Eivind

Re: Can this be achieved? (Was: document support for file system crawling)

Zaheed Haque
On 1/17/07, Eivind Hasle Amundsen <[hidden email]> wrote:

> > (...) the point being
> > that once they've got you using a monolithic application, it's a lot
> > harder to stop using the whole thing all at once than it would be for you
> > to stop using 1 of N mini-applications they provide.
>
> Well, FAST is composed of many small, modular products that can be
> replaced by other (open source) parts. It is not monolithic. The first
> time you install it, it might appear to be just one giant beast. However
> it is not.
>
> > I ramble about this to try and explain why Solr may not be what you would
> > consider a "complete package" at the moment .... and why it may never
> > reach the state you think would make it a "complete package" ... because
> > there are a lot of people out there who don't need it to be -- it would be
> > hard to be a full blown GUI configurable, web crawling, document
> > detecting, customizable schema based application and still allow for
> > people to use small pieces of it.
>
> I am not arguing on this. I think my point didn't get through, then.
>
> Compare this to Linux distributions. People still use them, right? What
> about an "enterprise search distro"? That is exactly what some
> commercial vendors offer, only far less elegant than anything containing
> Lucene et al would probably be.
>
> > To put it another way: it's a lot easier for people to put reusable
> > components with clean APIs together in interesting ways than it is for
> > people to extract reusable components with clean APIs from a monolithic
> > application.
>
> Yes, I agree completely, and the strength is exactly what you say - they
> focus on doing a small thing very well. I believe this fact would make
> such a "search distribution" even more appealing.

I am not sure I follow. An enterprise search distro? Anyway, any
enterprise interested in having a serious search solution (i.e. buy
FAST, Autonomy, or go open source with Lucene) will want a custom
solution, i.e. pick and choose the modules/features they need/want and
then let an integrator/consultancy firm/IT department do the actual
implementation. So a search distribution as pointed out is somewhat
meaningless if customization is important.

Now there are organizations that will want a black-box solution,
i.e. Google Mini or SearchBlox or the new IBM/Yahoo/Lucene search
solution (sorry, I can't remember the name). These are pre-packaged
solutions and low-cost alternatives, in some cases free, that offer no
customization, and I am 100% sure those organizations do not even want
customization.

So having the possibility to pick and choose and make a custom
solution from Lucene, Solr, Nutch, Hadoop is super perfect. You can do
more cool things than if all of these were bundled.

Just some thoughts.
Cheers
Zaheed

Re: Can this be achieved? (Was: document support for file system crawling)

Eivind Hasle Amundsen
> (...) any enterprise interested in having a serious search solution
> (i.e. buy FAST, Autonomy, or go open source with Lucene) will want a
> custom solution (...) then let an integrator/consultancy firm/IT
> department do the actual implementation. So a search distribution as
> pointed out is somewhat meaningless if customization is important.

I'm talking about creating something that works much more easily out of
the box, and that can be customized as much as now - at the same time.

Of course serious search solutions would be completely customized,
always. And there are "out of the box" solutions (Google Appliance
etc.). But is there no market for a middle ground here?

> Now there are organizations that will want a black-box solution,
> i.e. Google Mini or SearchBlox or the new IBM/Yahoo/Lucene search
> solution (sorry, I can't remember the name). These are pre-packaged
> solutions and low-cost alternatives, in some cases free, that offer no
> customization, and I am 100% sure those organizations do not even want
> customization.

Which ones are free? Are there any FLOSS alternatives to these black box
solutions? (IANAL, but the Apache license is more like LGPL than GPL,
right?)

> So having the possibility to pick and choose and make a custom
> solution from Lucene, Solr, Nutch, Hadoop is super perfect. You can do
> more cool things than if all of these were bundled.

What I am really talking about, is this: There is a growing market for
simple search solutions that can work out of the box, and that can still
be customized. Something that:
- organizations can use on their network, out of the box
- on their intraweb, out of the box, just give credentials
- can handle user access out of the box (LDAP/NIS/AD)
- is FLOSS(!)
- can be fully customized, if desired
- modularized for even more customization if needed

Sure, one can argue, as you have done so far, that they could just
compose their own solution completely... But then we are falling
outside the market again - which I hypothesize exists.

I am not looking to change Solr in that direction. But take a look at
Solr. Or Nutch. They are already built on Lucene and many other
projects. Why/not build something on top of this? Something more/else?

Thanks for all the feedback :) Please keep it coming.

Re: Can this be achieved? (Was: document support for file system crawling)

Mike Klaas
On 1/17/07, Eivind Hasle Amundsen <[hidden email]> wrote:

> What I am really talking about, is this: There is a growing market for
> simple search solutions that can work out of the box, and that can still
> be customized. Something that:
> - organizations can use on their network, out of the box
> - on their intraweb, out of the box, just give credentials
> - can handle user access out of the box (LDAP/NIS/AD)
> - is FLOSS(!)
> - can be fully customized, if desired
> - modularized for even more customization if needed

<>
> I am not looking to change Solr in that direction. But take a look at
> Solr. Or Nutch. They are already built on Lucene and many other
> projects. Why/not build something on top of this? Something more/else?

I don't think that anyone is arguing that this product shouldn't exist
in the open-source world, just that it shouldn't be part of Solr's
mandate.  It sounds like a cool project (though the closer you get to
"commercial product" the more important support, packaging, marketing,
etc. become--some of which are very difficult to achieve in a purely
open-source setting).

-Mike

Re: Can this be achieved? (Was: document support for file system crawling)

Chris Hostetter-3
: > What I am really talking about, is this: There is a growing market for
: > simple search solutions that can work out of the box, and that can still
: > be customized. Something that:
: > - organizations can use on their network, out of the box

: > I am not looking to change Solr in that direction. But take a look at
: > Solr. Or Nutch. They are already built on Lucene and many other
: > projects. Why/not build something on top of this? Something more/else?
:
: I don't think that anyone is arguing that this product shouldn't exist
: in the open-source world, just that it shouldn't be part of Solr's
: mandate.  It sounds like a cool project (though the closer you get to

Exactly.

Eivind: earlier in this thread, you were talking about having more
crawling features and document parsing features built into Solr, and
I got the impression that you didn't like the idea that they could be
loosely coupled external applications ... but if your interest is in
having an "enterprise search solution" that people can deploy on a box
and have it start working for them, then there is no reason for all of
that code to run in a single JVM using a single code base -- i'm going
to go out on a limb and guess that the Google Appliances run more than a
single process :)

given a collection of loosely coupled pieces, including Solr,
including Nutch, including whatever future document parsing contribs might
be written for either Solr or Nutch ... you could bundle them all together
into an enterprise search system that, when installed, deploys them all,
couples them together, and has a GUI for configuring them ... but that
would be a separate project from Solr -- just as Solr and Nutch are
separate projects from Java-Lucene ... it's all about layers built on top
of layers that allow for reuse.


-Hoss


Re: Can this be achieved? (Was: document support for file system crawling)

Walter Underwood, Netflix
On 1/19/07 10:33 AM, "Chris Hostetter" <[hidden email]> wrote:

> [...] but if your interest is in
> having an "enterprise search solution" that people can deploy on a box
> and have it start working for them, then there is no reason for all of
> that code to run in a single JVM using a single code base -- i'm going
> to go out on a limb and guess that the Google Appliances run more than a
> single process :)

Ultraseek does exactly that and is a single multi-threaded process.
A single process is much easier for the admin. A multi-process solution
is more complicated to start up, monitor, shut down, and upgrade.

There is decent demand for a spidering enterprise search engine.
Look at the Google Appliance, Ultraseek, and IBM OmniFind. The
free IBM OmniFind Yahoo! Edition uses Lucene.

I'd love to see the Ultraseek spider connected to Solr, but that
depends on Autonomy.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Can this be achieved? (Was: document support for file system crawling)

Mike Klaas
On 1/19/07, Walter Underwood <[hidden email]> wrote:

> Ultraseek does exactly that and is a single multi-threaded process.
> A single process is much easier for the admin. A multi-process solution
> is more complicated to start up, monitor, shut down, and upgrade.
>
> There is decent demand for a spidering enterprise search engine.
> Look at the Google Appliance, Ultraseek, and IBM OmniFind. The
> free IBM OmniFind Yahoo! Edition uses Lucene.
>
> I'd love to see the Ultraseek spider connected to Solr, but that
> depends on Autonomy.

You could accomplish this by throwing them together as various webapps
in a single container instance.
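As a sketch of that idea (all paths and context names below are illustrative): with a Jetty 6-style container, two context descriptors dropped into the contexts/ hot-deploy directory would run the Solr and Nutch webapps side by side in one JVM:

```xml
<!-- contexts/solr.xml: deploy the Solr webapp at /solr -->
<Configure class="org.mortbay.jetty.webapp.WebAppContext">
  <Set name="contextPath">/solr</Set>
  <Set name="war">/opt/search/webapps/solr.war</Set>
</Configure>
```

```xml
<!-- contexts/nutch.xml: deploy the Nutch search webapp at /nutch -->
<Configure class="org.mortbay.jetty.webapp.WebAppContext">
  <Set name="contextPath">/nutch</Set>
  <Set name="war">/opt/search/webapps/nutch.war</Set>
</Configure>
```

One container, one admin process to start and monitor, but the two applications stay independently replaceable.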

-Mike