Building a web based search engine

RE: Building a web based search engine

Markus Jelsma-2
Hello,

We have been building precisely that for over ten years now. The '10,000 foot level overview' is basically:

* forget about Lucene for now, Solr uses it under the hood;
* get Solr, and start it with the schema.xml file that comes with Nutch;
* get Nutch, give it a set of domains or hosts to crawl and some URLs to start the crawl with and point the indexer towards the previously configured Solr;
* put a proxy in front of Solr (we use Nginx), or skip this step if it is just an internal demo (do not expose Solr to the outside world);
* make some basic JS tool that handles input and search result responses.

This was our first web search engine prototype and it was set up in a few days. The chapter "How To Build A Web Based Search Engine With Solr, Lucene and Nutch" just means: set up Solr, and point Nutch towards it, and tell it to start crawling and indexing.

Then there comes an endless list of things to improve: autocomplete, spell checking, query and click log handling and analysis, proper text extraction, etc.

Regards,
Markus

-----Original message-----

> From:Jim Anderson <[hidden email]>
> Sent: Tuesday 2nd June 2020 16:36
> To: [hidden email]
> Subject: Building a web based search engine
>
> Hi,
>
> I have been looking at solr, lucene and nutch websites and tutorials for
> over a week now, experimenting and learning, but also frustrated by the
> fact that I am totally missing the 'how to' do what I want. I see a lot of
> examples of how to use each of the tools, but not how to put them all
> together. I think an 'overview' at the 10,000 foot level is needed. Maybe
> one is available and I have not yet found it. If someone can point me to
> one, please do.
>
> If I am correct that an overview on "How To Build A Web Based Search Engine
> With Solr, Lucene and Nutch" is not available, then I will be willing to
> write an overview and make it available to the Solr community.  I will need
> input, explanation and review of others.
>
> My 2 goals are:
>
> 1) Build a demo web based search engine [Note: I have a very specific
> business need to be able to demonstrate a web application on top of a search
> engine. This demo is intended to show a 'proof of concept' of the web
> application to a small audience.]
>
> 2) Document the process of building the demo and customizing it using the
> java API so that others can more easily build their own web based search
> engine.
>
> Jim Anderson
>

RE: Building a web based search engine

Markus Jelsma-2
Hello, see inline.

Markus
 
-----Original message-----

> From:Jim Anderson <[hidden email]>
> Sent: Tuesday 2nd June 2020 19:59
> To: [hidden email]
> Subject: Re: Building a web based search engine
>
> Hi Markus,
>
> Thanks for your response. I appreciate you giving me the bullet list of
> things to do. I can take that list and work from it and hopefully make
> progress, but I don't think it will get me where I want to be - just a bit
> closer.
>
> You say, "We have been building precisely that for over ten years now". Is
> it in a document? I would like to read it.

No, I haven't written a book about it and don't intend to.

> Some basic things I would like to know that should be documented:
>
> 1) Using nutch as the crawler, how do I run a nutch thread that crawls my
> named URLs.

You don't run it as a thread. Run Nutch as a separate process from the command line, or, once you have to deal with 50+ million records, on Apache Hadoop.

> 2) I will use nutch to visit websites and create documents in solr. How do
> I verify that documents have been created in Solr via nutch?

By searching for them using Solr, or retrieving them by URL, using Solr's simple HTTP API. You can use SolrJ, the Java client, too.
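
As a minimal sketch of the two checks described above, the following Python builds the request URLs for Solr's HTTP select handler: one that counts every indexed document, and one that retrieves a single document by its URL. The host, port, core name ("nutch") and the use of the "id" field to hold the page URL are assumptions for illustration; check them against your own Solr setup and the schema you loaded.

```python
from urllib.parse import urlencode

# Assumptions: Solr on localhost:8983, a core named "nutch", and the page
# URL stored in the "id" field. Adjust these to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/nutch/select"


def count_all_url(select_url=SOLR_SELECT):
    """URL for a query that matches every document but returns no rows."""
    return select_url + "?" + urlencode({"q": "*:*", "rows": 0, "wt": "json"})


def lookup_url(page_url, field="id", select_url=SOLR_SELECT):
    """URL for a query that fetches the document indexed for page_url."""
    # Quote the value as a phrase so ':' and '/' are not parsed as query syntax.
    escaped = page_url.replace("\\", "\\\\").replace('"', '\\"')
    query = f'{field}:"{escaped}"'
    return select_url + "?" + urlencode({"q": query, "rows": 1, "wt": "json"})


if __name__ == "__main__":
    print(count_all_url())
    print(lookup_url("https://example.org/"))
```

Opening either URL in a browser, or with any HTTP client, returns a JSON body; a non-zero "numFound" in the first response suggests that the crawl has indexed something.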

> 3) Solr will store and index the documents. How do I verify the index?

See 2.

> 4) I assume I can run a tomcat server on my host and then provide a
> localhost URI to my web browser. Tomcat will then forward the URI to my
> application. My application will take a query and using a java API it will
> pass the query to Solr. I would like to see an example of a java program
> passing a query to Solr.

See 3, though I would recommend using Solr's HTTP API, which is much easier to deal with.
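
To illustrate the HTTP route, here is a small Python sketch (not SolrJ) that passes a user's free-text query to Solr and decodes the JSON reply. The core name "nutch" and the field names "title" and "content" in `qf` are assumptions; `search` and `http_get` are hypothetical helper names, and the `fetch` parameter exists only so the request can be replaced in tests.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_SELECT = "http://localhost:8983/solr/nutch/select"  # assumed host and core


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def search(user_query, rows=10, start=0, fetch=http_get):
    """Send user_query to Solr and return the decoded JSON response."""
    params = {
        "q": user_query,
        "defType": "edismax",   # query parser suited to free-form user input
        "qf": "title content",  # assumed field names; check your schema
        "rows": rows,
        "start": start,
        "wt": "json",
    }
    return json.loads(fetch(SOLR_SELECT + "?" + urlencode(params)))
```

A web application would call something like `search("apache nutch")` from its request handler and render the result; the same request can be issued from Java, JavaScript or a browser address bar, since it is plain HTTP.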

> 5) Solr will take the query, parse it and then locate appropriate documents
> using the index. Is there a log in Solr showing what queries have been
> parsed?

Yes, see Solr's log directory.

> 6) Solr will pass back the list of documents it has located. I have not
> really looked at this issue yet, but it would be nice to have an example of
> this.

Search for a SolrJ tutorial; they are plentiful. Also check out Solr's own extensive manual; everything you need is there.
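
For readers using the HTTP API instead of SolrJ, the sketch below shows one way to pull the hit count and a title/URL pair per document out of a decoded Solr JSON response. The "title" and "url" field names are assumptions about the schema, `summarize` and `first` are hypothetical helper names, and the sample response is hand-written for illustration, not captured from a real server.

```python
def first(value, default=""):
    """Return the first element of a list-valued field, or the value itself."""
    if isinstance(value, list):
        return value[0] if value else default
    return default if value is None else value


def summarize(solr_response):
    """Return (total_hits, [(title, url), ...]) from a decoded Solr response."""
    body = solr_response["response"]
    hits = [(first(doc.get("title"), "(no title)"), first(doc.get("url")))
            for doc in body["docs"]]
    return body["numFound"], hits


if __name__ == "__main__":
    # Hand-written sample in the shape of Solr's JSON response format.
    sample = {
        "response": {
            "numFound": 2,
            "start": 0,
            "docs": [
                {"title": ["Apache Nutch"], "url": "https://nutch.apache.org/"},
                {"url": "https://example.org/"},
            ],
        }
    }
    total, hits = summarize(sample)
    print(total)
    for title, url in hits:
        print(title, url)
```

Fields can come back as single values or as lists depending on the schema, which is why `first` accepts both.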

> Jim