How to crawl when Solr is the search engine?

How to crawl when Solr is the search engine?

Manoharam Reddy
I have just begun using Solr. I see that we have to insert documents
by posting XML to solr/update.

I would like to know how Solr is used as a search engine in
enterprises. How do you crawl your intranet and pass the information
as XML to solr/update? Isn't it going to be slow to put all the
content into the index via HTTP POST requests, each of which needs a
network socket to be opened?

Isn't there any direct way to do the same thing without resorting to HTTP?
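
(For context, the XML update flow described above looks roughly like
this minimal sketch; the host, port, and field names are assumptions
for illustration, not taken from the thread.)

    import urllib.request

    # Post one document to Solr's XML update handler, then commit.
    # Host, port, and field names are illustrative assumptions.
    SOLR_UPDATE = "http://localhost:8983/solr/update"
    HEADERS = {"Content-Type": "text/xml; charset=utf-8"}

    doc = b"""<add>
      <doc>
        <field name="id">page-001</field>
        <field name="title">Intranet home</field>
        <field name="text">Welcome to the intranet.</field>
      </doc>
    </add>"""

    urllib.request.urlopen(
        urllib.request.Request(SOLR_UPDATE, data=doc, headers=HEADERS))
    # Documents become searchable only after a commit.
    urllib.request.urlopen(
        urllib.request.Request(SOLR_UPDATE, data=b"<commit/>", headers=HEADERS))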

Re: How to crawl when Solr is the search engine?

Ian Holsman (Lists)
Hi Manoharam.

We use Nutch to do the crawl, and have used Sami's patch for Nutch
(http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html)
to integrate it with Solr. It works quite well for our needs.

If you are concerned about speed, Solr also has a CSV upload facility
which you might be able to use to load the data instead, but we
haven't found HTTP POST speed to be an issue for us.

Regards
Ian
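
(A minimal sketch of the CSV route Ian mentions, assuming Solr's
/update/csv handler; host and field names are illustrative. One
request can carry many rows, which amortizes the per-request overhead
the original question worries about.)

    import urllib.request

    # Bulk-load documents through Solr's CSV handler; one POST carries
    # many rows. URL and field names are illustrative assumptions.
    CSV_URL = "http://localhost:8983/solr/update/csv?commit=true"

    rows = "id,title,text\npage-001,Intranet home,Welcome to the intranet.\n"
    req = urllib.request.Request(
        CSV_URL, data=rows.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"})
    urllib.request.urlopen(req)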



Re: How to crawl when Solr is the search engine?

Manoharam Reddy
Thanks for your quick response.

This brings me to another question. As far as I know, Nutch can take
care of crawling as well as indexing. Then why go through the hassle
of crawling with Nutch and integrating it with Solr?

Another question I have: Solr provides the search results in XML
format. Are there any ready-made tools to convert them directly into
web pages for visitors to see?


Re: How to crawl when Solr is the search engine?

Ian Holsman (Lists)
Manoharam Reddy wrote:
> Thanks for your quick response.
>
> This brings me to another question. As far as I know, Nutch can take
> care of crawling as well as indexing. Then why go through the hassle
> of crawling with Nutch and integrating it with Solr?

I found Solr's caching and maintenance easier to use than Nutch's. But
that's just me.

>
> Another question I have: Solr provides the search results in XML
> format. Are there any ready-made tools to convert them directly into
> web pages for visitors to see?

Yep, it's called XSLT. Most modern browsers can do the transform on
the client side. Otherwise, there are server-side tools (Cocoon, I
think, does this) to do the transform on the server before sending it
out.

--Ian
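
(A minimal server-side sketch of what Ian describes, using Python's
third-party lxml library rather than Cocoon; the file names are
illustrative assumptions.)

    from lxml import etree  # third-party library, assumed installed

    # Transform a saved Solr XML response into HTML on the server side,
    # the same job Cocoon would do. File names are illustrative.
    transform = etree.XSLT(etree.parse("results-to-html.xsl"))
    response = etree.parse("solr-response.xml")  # e.g. saved from /select
    print(str(transform(response)))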



Re: How to crawl when Solr is the search engine?

Manoharam Reddy
Pardon me if I am taking too much of your time.

It would be really great if you could highlight a few advantages of
Solr's caching and maintenance over Nutch's.

Some musing: I have used Nutch before, and one thing I observed is
that if I delete the crawl folder while Nutch is running, users can
still search and obtain proper results. It seems Nutch caches all the
indexes in memory when it starts. I don't understand how that is
feasible when the size of the crawl is on the order of 10 GB, whereas
you have only a few GB of RAM plus swap.

How is Solr caching better than this?


Re: How to crawl when Solr is the search engine?

Bertrand Delacretaz
In reply to this post by Ian Holsman (Lists)
On 6/7/07, Ian Holsman <[hidden email]> wrote:

> Yep, it's called XSLT. Most modern browsers can do the transform on
> the client side. Otherwise, there are server-side tools (Cocoon, I
> think, does this) to do the transform on the server before sending
> it out....

Solr also does server-side XSLT, see
http://wiki.apache.org/solr/XsltResponseWriter

-Bertrand
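
(A minimal sketch of the XsltResponseWriter route Bertrand points to:
pass wt=xslt and name the stylesheet with tr; the stylesheet must live
in the core's conf/xslt directory. The query and host here are
illustrative assumptions.)

    import urllib.request
    from urllib.parse import urlencode

    # Ask Solr itself to apply conf/xslt/example.xsl to the results.
    # Host and query are illustrative assumptions.
    params = urlencode({"q": "intranet", "wt": "xslt", "tr": "example.xsl"})
    url = "http://localhost:8983/solr/select?" + params
    print(urllib.request.urlopen(url).read().decode("utf-8"))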

Re: How to crawl when Solr is the search engine?

Walter Underwood, Netflix
In reply to this post by Manoharam Reddy
Solr is not designed to be a general enterprise search engine. It is
a back-end search server.

If you are going to crawl your intranet, you will need a good crawler
that is easy to manage, and the ability to parse lots of kinds of
documents. Unfortunately, Solr really doesn't have those.

Commercial solutions aren't very expensive, probably less than the
cost of the time it would take you to put together a worse solution
from open source bits.

Look at Ultraseek (www.ultraseek.com), IBM OmniFind, or one of the
Google Search Appliances. Ultraseek and OmniFind are software
products and have eval downloads. I worked on Ultraseek for years
and it is really easy to install and get going.

Why would posting XML be any slower than the initial crawl over
HTTP? The posting is local, so it should be way faster.

wunder


Re: How to crawl when Solr is the search engine?

Mike Klaas
In reply to this post by Manoharam Reddy
On 7-Jun-07, at 1:04 AM, Manoharam Reddy wrote:

> Some musing: I have used Nutch before, and one thing I observed is
> that if I delete the crawl folder while Nutch is running, users can
> still search and obtain proper results. It seems Nutch caches all
> the indexes in memory when it starts. I don't understand how that is
> feasible when the size of the crawl is on the order of 10 GB,
> whereas you have only a few GB of RAM plus swap.

This is true for Solr as well, because it is an OS feature: if you
delete a file that is held open by a process, it isn't really deleted
at all (check the disk usage stats).
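
(A small sketch of the OS behaviour Mike describes; POSIX-specific,
and the file name is illustrative.)

    import os

    # On POSIX, unlink() removes the directory entry, but the data lives
    # on until the last open handle is closed; only then is disk space
    # reclaimed. This is why searches keep working after the index files
    # are "deleted".
    with open("segment.dat", "w") as w:
        w.write("index data")

    reader = open("segment.dat")   # hold the file open, like a searcher
    os.unlink("segment.dat")       # directory entry gone, data still there
    print(reader.read())           # still prints "index data"
    reader.close()                 # now the blocks are actually freed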

> How is Solr caching better than this?

It is unrelated. Solr can cache certain reusable components of
queries (namely, filters), and provides a fully customizable schema
and arbitrary query execution on it.

-Mike
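
(A minimal sketch of the filter reuse Mike mentions: the document set
matched by an fq filter query is kept in Solr's filterCache, so
repeating the same filter under a different main query is a cache hit.
Host and field names are illustrative assumptions.)

    import urllib.request
    from urllib.parse import urlencode

    # The fq filter is computed once and cached; later queries that
    # reuse it only pay for the main query q.
    def search(q):
        params = urlencode({"q": q, "fq": "site:intranet", "wt": "xml"})
        return urllib.request.urlopen(
            "http://localhost:8983/solr/select?" + params).read()

    search("holiday policy")   # computes and caches the site:intranet filter
    search("expense reports")  # different q, same fq: filter comes from cache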