document support for file system crawling


document support for file system crawling

Eivind Hasle Amundsen
Hi,

I want to pick up this old thread from the summer (see below). I do
understand that Solr is intended for more structured data, and that Nutch
is a better fit for cluttered information, particularly content fetched
by crawlers.

However, Solr's ease of setup and flexible schemas make it a viable
alternative for enterprise solutions. Indeed, the very purpose of the
project seems to be to create an enterprise search platform.

In that respect I agree with the original posting that Solr lacks some
of the desired functionality. One can argue that more or less
unstructured data should be structured by the user in a decent
application. However, an easier-to-use, configurable plugin architecture
for filtering and document parsing could make Solr more attractive. I
think many potential users would welcome such additions.

In other words, Solr *could* very well be the right tool for the job in
many cases, provided that there is a configurable "pre-Solr" step that
can be run on content before it actually "turns XML".

A related design question is to what extent this contract should be
expressed between the XML documents themselves and schema.xml, or
whether most of the work should be done in the parser/pre-processing
step (i.e. when producing the XML documents).
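
To make that contract concrete, here is roughly what I have in mind. The
field names are only examples, but the formats are Solr's normal
schema.xml field declarations and update XML:

  <!-- schema.xml: the fields the pre-Solr step is expected to fill -->
  <field name="path"    type="string" indexed="true" stored="true"/>
  <field name="title"   type="text"   indexed="true" stored="true"/>
  <field name="content" type="text"   indexed="true" stored="true"/>

  <!-- update message produced by the pre-Solr step -->
  <add>
    <doc>
      <field name="path">/share/docs/report.pdf</field>
      <field name="title">Quarterly report</field>
      <field name="content">...text extracted from the document...</field>
    </doc>
  </add>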

Your thoughts and feedback are greatly appreciated.

Regards,

Eivind


 >> Browsing through the message thread I tried to find a trail addressing
 >> file system crawls. I want to implement an enterprise search over a
 >> networked filesystem, crawling all sorts of documents, such as HTML,
 >> DOC, PPT and PDF. Nutch provides plugins enabling it to read
 >> proprietary formats. Is there support for the same functionality in
 >> Solr?

 > the text out of these types of documents.  You could borrow the
 > document parsing pieces from Lucene's contrib and Nutch and glue them
 > together into your client that speaks to Solr, or perhaps Solr isn't
 > the right approach for your needs?   It certainly is possible to add
 > these capabilities into Solr, but it would be awkward to have to
 > stream binary data into XML documents such that Solr could parse them
 > on the server side.

Agreed.  Solr's focus is on indexing "Structured Data".  The support for
dynamic fields certainly allows you to deal with complex structured data,
and somewhat heterogeneous structured data -- but it's still structured
data.  If your goal is to do a lot of crawling of disparate physical
documents, extract the text, and build a "path,title,content" index,
then Nutch is probably your best bet.
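
To make the "glue it together into your client" idea above concrete,
here is a minimal sketch of such a client. It assumes PDFBox (2.x) for
text extraction and a Solr instance at the default local update URL; the
field names are examples only and would have to match the schema:

import java.io.File;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Rough sketch: extract text from a PDF and post it to Solr as update XML.
public class PdfToSolr {

    public static void main(String[] args) throws Exception {
        File pdf = new File(args[0]);

        // 1. Extract the plain text (PDFBox here; DOC/PPT need other libraries).
        String text;
        try (PDDocument doc = PDDocument.load(pdf)) {
            text = new PDFTextStripper().getText(doc);
        }

        // 2. Build the Solr update XML; field names must match schema.xml.
        String xml = "<add><doc>"
                + "<field name=\"path\">" + escape(pdf.getAbsolutePath()) + "</field>"
                + "<field name=\"content\">" + escape(text) + "</field>"
                + "</doc></add>";

        // 3. POST it to the Solr update handler.
        HttpURLConnection con = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStreamWriter out =
                new OutputStreamWriter(con.getOutputStream(), "UTF-8")) {
            out.write(xml);
        }
        System.out.println("Solr responded: " + con.getResponseCode());
    }

    // Minimal XML escaping for field values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}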


Re: document support for file system crawling

Chris Hostetter-3

: In that respect I agree with the original posting that Solr lacks some
: of the desired functionality. One can argue that more or less
: unstructured data should be structured by the user in a decent
: application. However, an easier-to-use, configurable plugin architecture
: for filtering and document parsing could make Solr more attractive. I
: think many potential users would welcome such additions.

I don't think you'll get any argument about the benefits of supporting
more plugins to handle updates -- both in terms of how the data is
expressed and how the data is fetched. In fact, you'll find some rather
involved discussions on that very topic going on on the solr-dev list
right now.

The thread you cite was specifically asking about:
  a) crawling a filesystem
  b) detecting document types and indexing text portions accordingly.

I honestly can't imagine either of those things being supported out of
the box by Solr -- there's just no reason for Solr to duplicate what
Nutch already does very well.

What I see as far more likely is the following:

1) More documentation (and possibly some locking configuration options)
on how you can use Solr to access an index generated by the Nutch
crawler (I think Thorsten has already done this), or by Compass, or any
other system that builds a Lucene index.

2) "contrib" code that runs as it's own process to crawl documents and
send them to a Solr server. (mybe it parses them, or maybe it relies on
the next item...)

3) Stock "update" plugins that can each read a raw input stream of some
widely used file format (PDF, RDF, HTML, XML of any schema) and have
configuration options telling them which schema fields each part of
their document type should go into (see the rough sketch after this
list).

4) Easy hooks for people to write their own update plugins for less
widely used file formats.
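
Just to sketch what (3) and (4) could mean in code -- nothing like this
exists in Solr today, and every name below is made up:

// Hypothetical plugin contract only -- not an existing Solr API.
import java.io.InputStream;
import java.util.Map;

/** Turns a raw input stream of some file format into schema fields. */
public interface UpdateParserPlugin {

    /** MIME types this plugin handles, e.g. "application/pdf". */
    String[] supportedContentTypes();

    /**
     * Parse the stream and return field name -> value pairs.  The mapping
     * from document parts (title, body, metadata) to schema field names
     * comes from configuration and is passed in here.
     */
    Map<String, String> parse(InputStream in, Map<String, String> fieldMapping)
            throws Exception;
}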


-Hoss


Can this be achieved? (Was: document support for file system crawling)

Eivind Hasle Amundsen
First: please pardon the cross-post to solr-user, which is for reference
only. I hope to continue this thread on solr-dev, so please reply there.

> 1) More documentation (and possibly some locking configuration options)
> on how you can use Solr to access an index generated by the Nutch
> crawler (I think Thorsten has already done this), or by Compass, or any
> other system that builds a Lucene index.

Thorsten Scherler? Is this code available anywhere? It sounds very
interesting to me. Maybe someone could elaborate on the differences
between the indexes created by Nutch, Solr, Compass, etc., or point me in
the direction of an answer?

> 2) "contrib" code that runs as it's own process to crawl documents and
> send them to a Solr server. (mybe it parses them, or maybe it relies on
> the next item...)

Do you know FAST? It uses a step-by-step approach (a "pipeline") in
which all of these tasks are carried out, and much of it is configured in
an easy-to-use web tool.

The point I'm trying to make is that contrib code is nice, but a
"complete package" with these possibilities could broaden Solr's appeal
somewhat.

> 3) Stock "update" plugins that can each read a raw input stream of some
> widely used file format (PDF, RDF, HTML, XML of any schema) and have
> configuration options telling them which schema fields each part of
> their document type should go into.

Exactly, this sounds more like it. But if similar input streams can be
handled by Nutch, what is the point of using Solr at all? The HTTP APIs?
In other words, both Nutch and Solr seem to have functionality that
enterprises would want, but neither gives you the "total solution".

Don't get me wrong: I don't want to bloat the products, even though it
would be nice to have a crossover solution that is easy to set up.

The architecture could look something like this:

Connector -> Parser -> DocProc -> (via schema) -> Index

Possible connectors: JDBC, filesystem, crawler, manual feed
Possible parsers: PDF, whatever

Connectors, parsers and document processors would all be plugins. The
DocProcs would typically be adjusted to each enterprise's needs, so that
they fit its schema.xml.
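
In code, the plugin contracts for such a pipeline might look roughly
like this (purely illustrative; all names are invented and none of this
exists in Solr or Nutch today):

// Illustrative plugin contracts for the pipeline sketched above.
import java.io.InputStream;
import java.util.Iterator;
import java.util.Map;

/** Fetches raw content: JDBC, filesystem walk, crawler, manual feed... */
interface Connector {
    Iterator<InputStream> fetch();
}

/** Extracts text and metadata from one raw document (PDF, HTML, ...). */
interface Parser {
    Map<String, String> parse(InputStream raw) throws Exception;
}

/** Site-specific step that maps parsed fields onto schema.xml fields. */
interface DocProc {
    Map<String, String> process(Map<String, String> parsedFields);
}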

The problem is that I haven't worked enough with Solr, Nutch, Lucene,
etc. to really know all the possibilities and limitations. But I do
believe that the outlined architecture would be flexible and answer many
needs. So the question is:

What is Solr missing? Could parts of Nutch be used in Solr to achieve
this? How? Have I misunderstood completely? :)

Eivind