Simple Faceted Searching out of the box

Chris Hostetter-3

Hey everybody, I just wanted to officially announce that as of the
solr-2006-09-08.zip nightly build, Solr supports some simple Faceted
Searching options right out of the box.

Both the StandardRequestHandler and DisMaxRequestHandler now support some
query params for specifying simple queries to use as facet constraints, or
fields in your index you wish to use as facets - generating a constraint
count for each term in the field.  All of these params can be configured
as "defaults" when registering the RequestHandler in your solrconfig.xml
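As a rough illustration of the parameters described above, here is a minimal Python sketch that builds a select URL asking for facet counts. The host, port, query, and field names are made up for the example; `facet`, `facet.field`, and `facet.query` are the parameter names documented on the SimpleFacetParameters wiki page.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical host, port and field names; only the facet.* parameter
# names come from the SimpleFacetParameters wiki page.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_url(query, facet_fields=(), facet_queries=()):
    """Return a select URL that asks for facet counts alongside the results."""
    params = [("q", query), ("facet", "true")]
    # One facet.field per field whose terms should each get a count.
    params += [("facet.field", f) for f in facet_fields]
    # One facet.query per arbitrary query to count as a constraint.
    params += [("facet.query", fq) for fq in facet_queries]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_url(
    "ipod",
    facet_fields=["category"],
    facet_queries=["price:[0 TO 100]", "price:[100 TO *]"],
)
print(url)
```

The same parameters could instead be set as handler defaults in solrconfig.xml, so that clients only need to send `q`.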

Information on what the new facet parameters are, how to use them, and
what types of results they generate can be found in the wiki...

http://wiki.apache.org/solr/SimpleFacetParameters
http://wiki.apache.org/solr/StandardRequestHandler
http://wiki.apache.org/solr/DisMaxRequestHandler

...as always: feedback, comments, suggestions and general discussion are
strongly encouraged :)


-Hoss


Re: Simple Faceted Searching out of the box

Tim Archambault-2
Hoss,

What is "faceted browsing"? Maybe an example of a site interface that is
using it would be good. Dumb question, I know.



Re: Simple Faceted Searching out of the box

Erik Hatcher

On Sep 9, 2006, at 8:15 AM, Tim Archambault wrote:
> What is "faceted browsing"? Maybe an example of a site interface that is
> using it would be good. Dumb question, I know.

Faceted browsing is like this:  http://shopper.cnet.com/ and
http://www.nines.org/collex

In Collex, the items in the "constrain further" box are the facets.  Clicking on
them adds them to "your constraints".  The idea is to divide the  
documents in the index into distinct buckets (or sets) and show the  
counts of how many results are in each set.
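The bucket-and-count idea can be sketched in a few lines of Python. This is only a toy model with made-up documents and a naive substring search, not how Solr computes facets internally.

```python
from collections import Counter

# Made-up documents; "category" plays the role of the facet field.
docs = [
    {"id": 1, "title": "iPod nano", "category": "mp3 players"},
    {"id": 2, "title": "iPod case", "category": "accessories"},
    {"id": 3, "title": "iPod shuffle", "category": "mp3 players"},
    {"id": 4, "title": "GTA San Andreas", "category": "games"},
]


def search(docs, word):
    """Very naive search: keep documents whose title contains the word."""
    return [d for d in docs if word.lower() in d["title"].lower()]


def facet_counts(results, field):
    """Count how many of the results fall into each bucket of the field."""
    return Counter(d[field] for d in results)


results = search(docs, "ipod")
counts = facet_counts(results, "category")
print(counts.most_common())
```

Clicking a facet then amounts to running the search again with that bucket added as a constraint.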

        Erik


Re: Simple Faceted Searching out of the box

Tim Archambault-2
I need to understand this then. Thanks. I want to use Solr for our newspaper
website and this would be a great way to break out content. Kind of greys
the lines between what is search and what is browsing categories, which is a
great thing actually. Thanks for the help.

Tim



Re: Simple Faceted Searching out of the box

James liu-2
In reply to this post by Tim Archambault-2
Good. Thank you, Hoss.


Re: Simple Faceted Searching out of the box

Erik Hatcher
In reply to this post by Tim Archambault-2

On Sep 9, 2006, at 9:09 AM, Tim Archambault wrote:
> I need to understand this then. Thanks. I want to use Solr for our newspaper
> website and this would be a great way to break out content. Kind of greys
> the lines between what is search and what is browsing categories, which is a
> great thing actually. Thanks for the help.

greys the lines indeed.  there isn't any difference between search  
and browse in my view now.  let's just call it "findability" :)  (by  
the way, "Ambient Findability" is a fantastic book)

        Erik


Re: Simple Faceted Searching out of the box

Tim Archambault-2
For those using PHP to interface with Solr, can you explain to me how your PHP
code interacts with it? Does PHP build a query string manually and request
a URL like this:
http://localhost:8983/solr/select?q=vertical%3Ajobs+accounting&version=2.1&start=0&rows=10&fl=&qt=standard&stylesheet=&indent=on&explainOther=&hl.fl=
and then, using some PHP command to read a web page, parse the
response?

I'm not much of a programmer, but I do know Coldfusion so I'm trying to
apply the PHP principles to CF.
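The round trip asked about here can be sketched as follows, in Python rather than PHP or ColdFusion. The sample response is hand-written and simplified for the example; the exact XML layout depends on the Solr version, and the field names are assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_select_url(base, query, start=0, rows=10):
    """Build the query string by hand, as a thin client would."""
    return base + "?" + urlencode({"q": query, "start": start, "rows": rows})


def parse_docs(xml_text):
    """Pull the hit count and the fields of each <doc> out of a response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        # Each child of <doc> is a typed field element carrying a name attribute.
        docs.append({f.get("name"): f.text for f in doc})
    return int(result.get("numFound")), docs


# Hand-written stand-in for what the server might send back.
SAMPLE_RESPONSE = """<response>
  <result numFound="2" start="0">
    <doc><str name="id">job-1</str><str name="title">Staff Accountant</str></doc>
    <doc><str name="id">job-2</str><str name="title">Accounting Clerk</str></doc>
  </result>
</response>"""

url = build_select_url("http://localhost:8983/solr/select", "vertical:jobs accounting")
# A real client would fetch the page here, for example with
# urllib.request.urlopen(url).read(); the sample text stands in for that.
num_found, docs = parse_docs(SAMPLE_RESPONSE)
print(url)
print(num_found, docs)
```

Any language that can issue an HTTP GET and parse XML can follow the same two steps.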

Thanks for any and all help.

Tim



Re: Simple Faceted Searching out of the box

Chris Hostetter-3
In reply to this post by Tim Archambault-2

: > > What is "faceted browsing"? Maybe an example of a site interface

Whoops! ... sorry about that, I tend to get ahead of myself.

The examples Erik pointed out are very representative, but there are more
subtle ways faceted searching can come into play -- for example, if you
look at these two search results...

   http://shopper-search.cnet.com/search?q=gta
   http://shopper-search.cnet.com/search?q=ipod

...the categories in the left nav change based on what you search on,
because we treat "category" as a facet, and the individual categories as
possible "constraints" ... we don't show the user the exact count of how
many products match in each category but we use that information to
determine the order of the categories (or whether we should include a
category in the list at all)
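A minimal sketch of that idea, assuming the per-category counts are already available as a dictionary: the counts decide the order and whether a category appears at all, but are never shown. The function name, threshold, and sample numbers are invented for illustration.

```python
def categories_for_nav(counts, min_count=1, limit=None):
    """Order categories by how many results match, hiding sparse ones.

    counts maps a category name to the number of matching products.
    The counts themselves are not returned, only the ordered names.
    """
    kept = [(name, n) for name, n in counts.items() if n >= min_count]
    # Highest count first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    names = [name for name, _ in kept]
    return names if limit is None else names[:limit]


# Hypothetical counts for a search on "ipod".
ipod_counts = {"accessories": 40, "mp3 players": 120, "games": 0, "speakers": 40}
nav = categories_for_nav(ipod_counts)
print(nav)
```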

: website and this would be a great way to break out content. Kind of greys
: the lines between what is search and what is browsing categories, which is a
: great thing actually. Thanks for the help.

Even without facets, "browsing" a set of documents is just a search for
"all" documents (or depending on who you talk to: "searching" is just
browsing with a special user entered constraint on the "text" facet)




-Hoss


Re: Simple Faceted Searching out of the box

Kevin Lewandowski
In reply to this post by Chris Hostetter-3
Is it possible that the facets can be based on the contents of an
entire field instead of the terms?

For example say I have a document with this field:
<field name="genre">Hip Hop</field>

A facet query on the genre field returns:
<lst name="genre">
  <int name="hip">1</int>
  <int name="hop">1</int>
</lst>

but I'd like it to return:
<lst name="genre">
  <int name="hip hop">1</int>
</lst>

thanks,
Kevin

Re: Simple Faceted Searching out of the box

Erik Hatcher

On Sep 12, 2006, at 4:15 AM, Kevin Lewandowski wrote:
> Is it possible that the facets can be based on the contents of an
> entire field instead of the terms?

For this, you could use a <copyField> to copy one field into another  
field where one is tokenized and the other is not.  And then return  
facets for the non-tokenized field.
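To see why the non-tokenized copy gives the counts Kevin is after, here is a small Python sketch that counts the same values both ways. The lowercase-and-split "analyzer" is only a rough stand-in for whatever analysis the tokenized field really uses.

```python
from collections import Counter


def tokenized_terms(value):
    """Rough stand-in for an analyzer that lowercases and splits on spaces."""
    return value.lower().split()


def facet_on_tokens(values):
    # Each document counts once per distinct term it contains.
    return Counter(t for v in values for t in set(tokenized_terms(v)))


def facet_on_whole_value(values):
    # The entire field value is treated as a single term, case preserved.
    return Counter(values)


genres = ["Hip Hop", "Hip Hop", "Trip Hop"]
print(facet_on_tokens(genres))
print(facet_on_whole_value(genres))
```

Searching would still go against the tokenized field, while the facet counts come from the untouched copy.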

> For example say I have a document with this field:
> <field name="genre">Hip Hop</field>
>
> A facet query on the genre field returns:
> <lst name="genre">
>  <int name="hip">1</int>
>  <int name="hop">1</int>
> </lst>
>
> but I'd like it to return:
> <lst name="genre">
>  <int name="hip hop">1</int>
> </lst>


I assume you want the same case as the original field though... "Hip  
Hop".

        Erik


Re: Simple Faceted Searching out of the box

Tim Archambault-2
In reply to this post by Chris Hostetter-3
I have a couple of questions from some online newspaper folks who are
interested in Solr and are trying to understand how and why it came to be. I
think inherent in these questions is the underlying theme I hear all the
time and that is "Solr is not a content management system. It's a search
engine."

What I really wonder about CNet is how they manage their content and how
Solr fits into their overall architecture -- is it an add-on? a
purpose-built hammer to handle a specific problem they were having? was it
something they "wanted" ... or instead something they needed to do, despite
preferring something else?

Another question asked of me was "Will Solr ever connect with datasources
directly?"

Thanks in advance for any feedback I can pass along to these folks.

Tim



Re: Simple Faceted Searching out of the box

Yonik Seeley
On 9/22/06, Tim Archambault <[hidden email]> wrote:

> I have a couple of questions from some online newspaper folks who are
> interested in Solr and are trying to understand how and why it came to be. I
> think inherent in these questions is the underlying theme I hear all the
> time and that is "Solr is not a content management system. It's a search
> engine."
>
> What I really wonder about CNet is how they manage their content and how
> Solr fits into their overall architecture -- is it an add-on? a
> purpose-built hammer to handle a specific problem they were having? was it
> something they "wanted" ... or instead something they needed to do, despite
> preferring something else?

Putting on my CNET hat for a little history:

We had a search server... a very thin layer built around a proprietary
search engine, used in a ton of places, for search-box type
functionality and direct generation of dynamic content.

That search engine was being discontinued by the vendor, so a
replacement was needed.  RFPs were put out, and all the commercial
alternatives were examined, but licensing costs for the number of
servers we were talking about were exorbitant.

So we decided to build our own...

The replacement: ATOMICS- a MySQL/Apache hybrid.
http://conferences.oreillynet.com/cs/mysqluc2005/view/e_sess/7066
It works well for many of the search collections we have that don't
need much in the way of full-text search (MySQL does have full-text
capabilities, but nothing like Lucene).

Backup plan: something based on Lucene.
SOLAR really started out as a pure backup plan... just in case ATOMICS
had problems in some areas.  I had joined CNET a week earlier, and the
task of building "something lucene-based" was luckily handed to me as
I didn't have any other responsibilities yet.  Pretty much no
requirements except for the preference of something that spoke
HTTP/XML that could be put behind a load-balancer and scaled.

ATOMICS was pretty much done by the time I started on SOLAR, and was
rapidly deployed across CNET.  SOLAR had a tough time gaining traction
until someone crossed a problem that ATOMICS couldn't easily handle:
faceted browsing.  There was finally something concrete to aim for,
and filter caching, docsets, autowarming, custom query handlers, etc,
were rapidly added to allow the ability to write custom plugins that
could actually do the faceting logic.

The result: http://www.mail-archive.com/java-user@.../msg02645.html

It sounds like Hoss might go into some more details in his ApacheCon session:
http://www.us.apachecon.com/html/sessions.html#FR26

> Another question asked of me was "Will Solr ever connect with datasources
> directly?"

As far as where Solr fits into our architecture, it's a back-end
component in the generation of dynamic content... sort of the same
place that a database would occupy.

I don't know much about content generation in CNET, and specific
content management systems, but a lot of it ends up in databases.
An "indexer" piece normally pulls stuff from one or more databases,
and puts them into a solr master, which is replicated out to solr
searchers (or slaves) that the app-servers generating dynamic content
hit through a load-balancer.

There is a diagram of that from my ApacheCon presentation:
http://people.apache.org/~yonik/ApacheConEU2006/

As far as connecting to datasources directly... I think that being
able to pull content from a database is a good idea, and it's on the
todo list.  What specific other data sources did you have in mind?

-Yonik

Re: Simple Faceted Searching out of the box

Tim Archambault-2
Obvious datasources: MSSQL, MySQL, etc. I'm under the impression that I have
to send an XML request to SOLR for every add, update, delete, etc. in my
database.

I believe there's a way to access MSSQL, MySQL etc. directly with Lucene,
but not sure how to do this with SOLR.

Thanks for all your feedback. While I started out way over my head, Solr is
actually fun to play around with, even for non-programmers or marginal
programmers like myself.


Re: Simple Faceted Searching out of the box

Erik Hatcher

On Sep 22, 2006, at 2:45 PM, Tim Archambault wrote:
> I believe there's a way to access MSSQL, MySQL etc. directly with  
> Lucene,
> but not sure how to do this with SOLR.

Nope.  Lucene is a pure search engine, with no hooks to databases, or  
document parsers, etc.  Lots of folks have built these kinds of  
things on top of Lucene, but the Lucene core is purely the text engine.

How would you envision communicating with Solr with a database in the  
picture?   How would the entire database be initially indexed?  How  
would changes to the database trigger Solr updates?   I'm not quite  
clear on what it would mean for Solr to work with a database directly  
so I'm curious.

        Erik


Re: Simple Faceted Searching out of the box

Tim Archambault-2
Okay, I'll use an example.

A recruitment (jobs) customer goes onto our newspaper website and posts an
online job posting. Upon insert into the database, I need to
generate an XML file to be sent to SOLR to ADD as a record to the search
engine. The same goes for an edit: my database updates the record and then I
have to send an ADD statement to Solr again to commit my change. 2x the
work.

I've been talking with other papers about Solr and I think what bothers many
is that there is a deposit of information in a structured database here
[named A], then we have another set of basically the same data over here
[named B], and they don't understand why they have to manage two different
sets of data [A & B] that are virtually the same thing.  Many foresee a
maintenance nightmare. I've come to the conclusion that there's somewhat of
a disconnect between what a database does and what a search engine does. I
accept that the redundancy is necessary given the very different tasks that
each performs [keep in mind I'm still naive about the programming details here;
I understand it conceptually].

In writing this to you another thought came to mind. Maybe there are
alternative ways to inject records into Solr outside the bounds of the
cygwin and CURL examples I've been using. Maybe that is the question we need
to be asking. What are some alternative ways to populate Solr?

Enough said, it's Friday afternoon.

Have a great weekend.

Tim


Re: Simple Faceted Searching out of the box

Walter Underwood, Netflix
On 9/22/06 12:25 PM, "Tim Archambault" <[hidden email]>
wrote:

> A recruitment (jobs) customer goes onto our website and posts an online job
> posting to our newspaper website. Upon insert into the database, I need to
> generate an XML file to be sent to SOLR to ADD as a record to the search
> engine. The same goes for an edit: my database updates the record and then I
> have to send an ADD statement to Solr again to commit my change. 2x the
> work.
>
> I've been talking with other papers about Solr and I think what bothers many
> is that there is a deposit of information in a structured database here
> [named A], then we have another set of basically the same data over here
> [named B] and they don't understand why they have to manage two different
> sets of data [A & B] that are virtually the same thing.

The work isn't duplicated. Two servers are building two kinds of index,
a transactional record index and a text index. That is two kinds of
work, not a duplication.

Storing the data is the small part of a database or a search engine.
The indexes are the real benefit.

In fact, the data does not have to be stored in Solr. You can return a
database key as the only field, then get the details from the database.
That is  how our current search works -- the search result is a list
of keys in relevance order. Period.
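A minimal sketch of the keys-only pattern described above, using an in-memory SQLite database as a stand-in for the real one. The table, columns, and the list of keys "returned by the search engine" are all invented for the example.

```python
import sqlite3


def fetch_in_relevance_order(conn, keys):
    """Load the rows for the given keys, keeping the search engine's order."""
    if not keys:
        return []
    placeholders = ",".join("?" for _ in keys)
    rows = conn.execute(
        f"SELECT id, title FROM jobs WHERE id IN ({placeholders})", keys
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database returns rows in its own order; re-sort by the key list.
    return [by_id[k] for k in keys if k in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [(1, "Staff Accountant"), (2, "Night Nurse"), (3, "Nurse Practitioner")],
)

# Pretend the search engine returned these keys for the query "nurse".
hits = fetch_in_relevance_order(conn, [3, 2])
print(hits)
```

Keys that no longer exist in the database are silently skipped, which covers the case where the index lags behind a delete.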

wunder
--
Walter Underwood
Search Guru, Netflix


Re: Simple Faceted Searching out of the box

Tim Archambault-2
I'm really confused. I don't mean "store" the data figuratively as in a
lucene/solr command. Storing an ID number in a solr index isn't going to
help a user find "nurse". I think part of this is that some people feel that
databases like MSSQL, MYSQL should be able to provide a quality search
experience, but they just flat out don't. It's a separate utility.

Thanks Walter.


Re: Simple Faceted Searching out of the box

Yonik Seeley-2
In reply to this post by Tim Archambault-2
On 9/22/06, Tim Archambault <[hidden email]> wrote:
> I've been talking with other papers about Solr and I think what bothers many
> is that there is a deposit of information in a structured database here
> [named A], then we have another set of basically the same data over here
> [named B] and they don't understand why they have to manage two different
> sets of data [A & B] that are virtually the same thing.

Yes, I sympathize... if MySQL had a really good full-text search
somehow integrated in it, it would simplify things.  I think it's
probably a lot harder to be both a database and a full-text search
server and do both well.

We thought about closer integration in the past, but MySQL didn't have
triggers or anything, so there was no way to know when something
changed and what changed.  Databases also can't handle things like
Solr's dynamic fields very well.

> In writing this to you another thought came to mind. Maybe there are
> alternative ways to inject records into Solr outside the bounds of the
> cygwin and CURL examples I've been using.

curl is just used as an example.
Hopefully the XML updates are generated programmatically from the
database records and automatically sent to Solr?

That still requires coding on the user's part though, and I would
eventually like to be able to index simple databases with a user
supplied SQL select statement.
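Generating the XML updates from database records might look roughly like this. The sketch uses SQLite and a made-up `jobs` table, and maps column names straight to field names, which is an assumption for the example. The `<add><doc><field name="...">` layout is Solr's XML update format.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_add_xml(conn, select_sql):
    """Run a SELECT and turn every row into a <doc> inside one <add> message."""
    cursor = conn.execute(select_sql)
    columns = [c[0] for c in cursor.description]
    add = ET.Element("add")
    for row in cursor:
        doc = ET.SubElement(add, "doc")
        for name, value in zip(columns, row):
            if value is None:
                continue  # skip NULL columns rather than index empty fields
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER, title TEXT, city TEXT)")
conn.execute("INSERT INTO jobs VALUES (1, 'Night Nurse', 'Bangor')")
conn.execute("INSERT INTO jobs VALUES (2, 'Staff Accountant', NULL)")

payload = rows_to_add_xml(conn, "SELECT id, title, city FROM jobs ORDER BY id")
print(payload)
# The payload would then be POSTed to Solr's update URL, followed by <commit/>.
```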


-Yonik

Re: Simple Faceted Searching out of the box

Walter Underwood, Netflix
In reply to this post by Tim Archambault-2
Sorry, I was not being exact with "store". Lucene has separate
control over whether the value of a field is stored and whether
it is indexed. The term "nurse" might be searchable, but the
only value that is stored in the index for retrieval is the
database key for each matching job.
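The indexed-versus-stored distinction can be illustrated with a toy inverted index in Python. This is a conceptual sketch only, not Lucene's data structures: the text is made searchable and then discarded, and only the database key is kept for retrieval. The results here are sorted by key rather than by relevance.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: text is searchable, only the key is kept."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> keys of matching documents

    def add(self, key, text):
        # The text is tokenized and indexed, then thrown away (not stored).
        for term in text.lower().split():
            self.postings[term].add(key)

    def search(self, term):
        # Only database keys come back; details live in the database.
        return sorted(self.postings.get(term.lower(), set()))


index = TinyIndex()
index.add(101, "Registered Nurse night shift")
index.add(102, "Staff Accountant")
index.add(103, "School nurse")

print(index.search("nurse"))
```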

It seems like text search should be easy to add to a transactional
database, but lots of smart people have tried to make that work
and failed. Maybe it is possible, but neither Oracle nor Microsoft
nor the open source community has been able to make it happen.
The text search in RDBMSs seems to always be slow and lame.

There is one product that does transactional query and text
search: MarkLogic. It does a good job of both, but it is very
XML-centric. It might be a good match, if you are into commercial
software. It is a rather different style of programming than
SQL or Lucene. You write XQuery to define the result XML with
the contents fetched from the database.

wunder (not affiliated with MarkLogic)



Re: Simple Faceted Searching out of the box

Joachim Martin
In reply to this post by Tim Archambault-2
I think you will find that this architecture is quite common.  What commercial
packages provide (remember you are getting this for free!) are the tools for
managing the dynamic export of data out of your database into the full-text
search engine.

Solr provides a very easy way to do this, but yes, you have to do some
programming to automate it.

There are two common ways of doing this: 1) write a component that
periodically checks for new/updated database content and submits it to solr,
or 2) write a trigger in the database that immediately posts to solr (I would
use JMS or some other asynchronous messaging system for this).  I'm sure
there are other solutions.
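The first approach, a component that periodically checks for changes, could be sketched like this. It assumes the table carries a `modified` column that increases with every insert or update; the table, the column, and the `submit` callback are hypothetical names for the example.

```python
import sqlite3


def fetch_changes(conn, since):
    """Return rows modified after the given watermark, oldest first."""
    return conn.execute(
        "SELECT id, title, modified FROM jobs WHERE modified > ? ORDER BY modified",
        (since,),
    ).fetchall()


def poll_once(conn, since, submit):
    """Submit every changed row and return the new watermark."""
    for row in fetch_changes(conn, since):
        submit(row)     # in practice: build an <add> message and POST it
        since = row[2]  # advance only after the row was handed off
    return since


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER, title TEXT, modified INTEGER)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [(1, "Night Nurse", 100), (2, "Staff Accountant", 205), (3, "Editor", 310)],
)

sent = []
watermark = poll_once(conn, since=200, submit=sent.append)
print(watermark, sent)
```

A scheduler would call `poll_once` at whatever interval suits the site and persist the watermark between runs. Deletes need separate handling, since a removed row never shows up in the query.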

When/if MYSQL full text search is as good as solr/lucene, you can cut
out one of the steps.

I could see a component added to solr that did #1 above for you.  MG4j has a
simple loader that takes a SQL query and indexes the result
(JdbcDocumentCollection). For Solr, you'd want to be able to handle
multi-valued fields, which complicates things.

If this architecture bothers technical folks, they either are accustomed to
using very expensive software, or haven't been doing this very long.

Of course, I am trying to figure out a way to make Solr more like a database,
so there you go...

--Joachim

