Building an enterprise quality search engine using Apache Solr

Building an enterprise quality search engine using Apache Solr

Venky Naganathan
Hello,

Can someone please give me advice on the questions below?

1) I am considering building an enterprise search engine that indexes
different types of documents:
   - Text, Microsoft formats (including Outlook email), PDF, SharePoint,
Wikipedia, etc.
   As I understand it, using Apache Solr, Apache Nutch (for crawling), and
Apache Tika (for document formats), I should be able to implement a crawler
and an indexer/searcher with support for numerous formats. Is this correct?
Do I need any other special packages for SharePoint and Wikipedia?

2) How much development effort, in person-months, is required to
accomplish the above?

3) Does anyone have experience building an enterprise search engine using
Solr? How does the quality of the search results compare to other popular
engines?

Thank you very much for your advice. I can be reached at [hidden email]

-Venky
Re: Building an enterprise quality search engine using Apache Solr

Jack Krupansky-2
Take a look at Apache ManifoldCF for crawling enterprise repositories such
as SharePoint (as well as lighter-weight web crawling and file system
crawling).

http://manifoldcf.apache.org/en_US/index.html

-- Jack Krupansky

-----Original Message-----
From: Venky Naganathan
Sent: Thursday, October 18, 2012 2:21 PM
To: [hidden email]
Subject: Building an enterprise quality search engine using Apache Solr

Re: Building an enterprise quality search engine using Apache Solr

Alexandre Rafalovitch
This is the first time I have heard of this project. It looks interesting,
but is it active?

The integration FAQ seems to be talking about Solr 1.4, which is a bit out
of date.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Fri, Oct 19, 2012 at 12:37 AM, Jack Krupansky
<[hidden email]> wrote:

Re: Building an enterprise quality search engine using Apache Solr

dirk
In reply to this post by Venky Naganathan
Hi,
Your question is not easy to answer. It depends on so many things that there is no standard way to build an enterprise solution, and the time planning depends on just as many.

I can give you some brief notes about our solution, although our target group and data sources differ from yours. I am technically responsible for disco, a research and discovery system at the library of the University of Münster. (Excuse me, I am not on a promotion tour here; I earn no money from such activities :-).) This search engine is based on Lucene and searches about 200 million articles, books, journals and so on, so our data sources differ both in structure and in the way they are delivered.

At the beginning we thought: let's buy a solution in order to avoid most of the in-house development work. So we bought a commercial search engine that runs on a Lucene core, with proprietary business logic that talks to that core. So far so good, or not so good. At that time I was the only person working on the project, and I needed nearly a year and a half of full-time work to deliver most of the features and requirements.

The reason it took so long is not, I hope, a lack of experience. I have worked in this area for nearly 15 years at different companies, always as a J2EE developer. (That is rare today, because every experienced developer wants to work as a "leader" or manager; it sounds better, and project leaders are outsourced less often. But that is another topic.) Other universities (customers) that built a comparable search engine in that environment took as long or longer, so I remain hopeful...

In Germany we say "der Teufel steckt im Detail" (literally: the devil is hidden in the detail). You start work, and in parallel the requirements change, sadly in most cases after development has already built the software basis. For example, we needed a lot of time to fine-tune the ranking and to build a fully automatic mechanism for updating the data sources. And it is one thing to implement the search in development and run a first developer test; it is quite another to make the system fit for 24/7 service and run it in production without problems.
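
To illustrate what ranking fine-tuning can look like in Solr, here is a minimal sketch rather than the setup described above. It assumes Solr's edismax query parser, where the `qf` parameter lists fields with boost weights. The field names (`title`, `author`, `fulltext`), the weights, and the core URL are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_ranked_query(base_url, user_query, field_boosts):
    """Build a Solr select URL that weights matches per field via edismax."""
    # qf takes space-separated "field^boost" entries
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {"q": user_query, "defType": "edismax", "qf": qf, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core URL and weights: a title match counts more than a body match
url = build_ranked_query(
    "http://localhost:8983/solr/collection1",
    "solr search",
    {"title": 5, "author": 2, "fulltext": 1},
)
print(url)
```

In practice the tuning work is in choosing the weights: change them, rerun a fixed set of test queries, and compare the result order.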

We spent most of our time on data pre-processing because of the "shit in, shit out" problem. Work on data quality is expensive, but you get no appreciation for it, because everybody is focused on the search features. This showed us that it is mostly impossible to avoid in-house development completely.
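
As a rough illustration of such a pre-processing step, here is a minimal sketch and not the code used in the system described here. It assumes records arrive as dictionaries with a hypothetical `id` field. It normalizes whitespace and Unicode form, drops empty fields, and discards records that have no id or repeat an id.

```python
import re
import unicodedata


def clean_record(record):
    """Return a normalized copy of a record, or None if it has no usable id."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue
        # Normalize the Unicode form and collapse runs of whitespace
        text = unicodedata.normalize("NFC", str(value))
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned[key] = text
    return cleaned if "id" in cleaned else None


def clean_batch(records):
    """Clean all records and keep only the first record seen for each id."""
    seen = set()
    result = []
    for record in records:
        cleaned = clean_record(record)
        if cleaned is None or cleaned["id"] in seen:
            continue
        seen.add(cleaned["id"])
        result.append(cleaned)
    return result
```

Real data sources usually need source-specific rules on top of this (date formats, identifier schemes, character-set repair), which is where most of the effort goes.
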
The next thing is the user interface. Not every feature that a customer knows from good old database-backed systems is easy to realize in a search engine, because the data structure is more or less flat. So we had to develop one service after another in order to read additional information, in our case, for example, the live holdings information of our library.

In summary, if you want to estimate a concrete duration for building a complete, production-ready enterprise search solution, you should talk to some people who run similar solutions, think through your own requirements in detail, and then multiply your estimate by 2. Then perhaps you will have a realistic estimate.
Dirk
Re: Building an enterprise quality search engine using Apache Solr

iorixxx
In reply to this post by Alexandre Rafalovitch
Hi Alexandre,

Yes, it is active. ManifoldCF 1.0.1 was released yesterday :)
You can index SharePoint 2010 content into Solr 4.0.0.
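
For readers who want to see what pushing a single binary document into Solr looks like without a crawler, here is a minimal sketch. It is not a description of how ManifoldCF works internally. It assumes Solr's extracting request handler (Solr Cell, which runs Tika on the server) is enabled at `/update/extract`, as in the Solr example configuration. The core URL and document id are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(core_url, doc_id, payload, content_type):
    """Prepare a POST that asks Solr to extract and index one raw document."""
    # literal.id sets the document's id field; commit=true makes it searchable at once
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        url=f"{core_url}/update/extract?{params}",
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_document(core_url, doc_id, payload, content_type):
    """Send the document to Solr and return the HTTP status code."""
    request = build_extract_request(core_url, doc_id, payload, content_type)
    with urlopen(request) as response:
        return response.status
```

A call such as `index_document("http://localhost:8983/solr/collection1", "doc-1", pdf_bytes, "application/pdf")` needs a running Solr instance. Committing on every request is convenient for a test but too slow for bulk indexing.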

The end-user documentation and the 'in Action' book are the two main resources.

http://manifoldcf.apache.org/release/release-1.0.1/en_US/end-user-documentation.html

http://www.manning.com/wright/


--- On Fri, 10/19/12, Alexandre Rafalovitch <[hidden email]> wrote:

> From: Alexandre Rafalovitch <[hidden email]>
> Subject: Re: Building an enterprise quality search engine using Apache Solr
> To: [hidden email]
> Date: Friday, October 19, 2012, 7:18 AM