Best approach for indexing and querying against a multivalue name field like directors or actors?

Daniel Einspanjer
I'm rather new to Solr and somewhat rusty on what little I learned on
Lucene a few years back.

I've got some documents I want to index that have multiple name fields
such as directors or actors. I'm wanting to index them such that
querying for "Jane Doe" would have a higher score for "Jane M. Doe"
than for "John Doe", but I need to make sure that "Jane Doe" wouldn't
match a document with two directors, "Jane Smith" and "John Doe" at
all.

If anyone has done something like this and could suggest some of the
solr filters that might be useful to me, I'd greatly appreciate it.

Daniel
Re: Best approach for indexing and querying against a multivalue name field like directors or actors?

Daniel Einspanjer
I'm sorry, I said something confusing there.
Let me try that last case again.

If you have three documents with a multivalue field named director
(represented here by ; separator)
1. "Jane M. Doe"
2. "Jane Smith"; "John Doe"
3. "John Doe"

And the user searched for director:"Jane Doe", I would ideally like 1.
to have the highest score and 2 and 3 to have nearly equal scores.
The experiments I've done so far have given 2. a score higher than 3.
because the terms Jane and Doe were found in document 2. even though
they were in separate instances of the multivalue field.

I hope this makes my question clearer rather than more confusing. :)

Thanks,
Daniel

On 3/28/07, Daniel Einspanjer <[hidden email]> wrote:
> <snip> but I need to make sure that "Jane Doe" wouldn't
> match a document with two directors, "Jane Smith" and "John Doe" at
> all.

Re: Best approach for indexing and querying against a multivalue name field like directors or actors?

Chris Hostetter-3
You'll want to look into the positionIncrementGap attribute, which can be
specified when defining an analyzer for your field type. It defines the
"logical" gap between tokens from different values of a multi-value field.
If you use a whitespace tokenizer and add the names "Jane Smith" and
"John Doe", you'll get the tokens "Jane", "Smith", ... "John", "Doe" with a
big gap between Smith and John. You can then do phrase queries, and as long
as the slop on your phrase queries is less than the gap you used, you don't
have to worry about false matches on "Jane Doe".
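To make the gap idea concrete, here is a minimal sketch in Python. It is not Solr or Lucene code: the function names are hypothetical, the position arithmetic is a simplified model of what an analyzer does, and the phrase matcher only handles in-order matches (real sloppy phrase queries are more permissive). It only illustrates why a gap larger than the slop prevents matches that span two values.

```python
from itertools import product


def index_positions(values, position_increment_gap):
    """Assign token positions to the values of one multivalued field.

    Simplified model (an assumption, not Lucene's exact arithmetic):
    tokens are split on whitespace and lowercased, and the position
    counter jumps by position_increment_gap between values.
    """
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += position_increment_gap
        for token in value.lower().split():
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def sloppy_phrase_match(positions, phrase, slop):
    """Return True if the phrase terms occur in order with at most
    `slop` extra positions between the first and last term."""
    terms = phrase.lower().split()
    if any(term not in positions for term in terms):
        return False
    for combo in product(*(positions[term] for term in terms)):
        in_order = all(a < b for a, b in zip(combo, combo[1:]))
        extra = (combo[-1] - combo[0]) - (len(terms) - 1)
        if in_order and extra <= slop:
            return True
    return False


# The three example documents from the thread, with a gap of 100.
doc1 = index_positions(["Jane M. Doe"], 100)
doc2 = index_positions(["Jane Smith", "John Doe"], 100)
doc3 = index_positions(["John Doe"], 100)

for name, doc in (("doc1", doc1), ("doc2", doc2), ("doc3", doc3)):
    print(name, sloppy_phrase_match(doc, "Jane Doe", slop=2))
```

In this model, a slop of 2 lets "Jane Doe" match "Jane M. Doe", while the gap of 100 keeps "Jane" and "Doe" in document 2 too far apart to match. With a gap of 0 the same query would match document 2, which is the false match described in the question.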






-Hoss

Snippets of indexed text

Pierre-Yves LANDRON
Hello everybody !

I'm wondering if there is a way to get relevant snippets (the searched terms
in context) of the indexed text in a Solr response to a query, instead of
just the entire indexed field? (More broadly, what are the possibilities for
letting Solr format the answer, e.g. highlighting terms?)

Thanks,
Kind regards,
P-Y Landron


Re: Snippets of indexed text

Thierry Collogne
It is possible. You need to pass highlighting parameters. Look here :

      http://wiki.apache.org/solr/HighlightingParameters

Hope this helps.
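
As a rough illustration of passing highlighting parameters, the sketch below builds a request URL in Python. The parameter names (hl, hl.fl, hl.snippets, hl.fragsize) are the ones used elsewhere in this thread; the base URL, the helper name and the example query are assumptions made only for the example.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, query, fields, snippets=3, fragsize=150):
    """Build a Solr select URL with highlighting switched on.

    base_url is a placeholder; point it at your own Solr instance.
    """
    params = {
        "q": query,
        "hl": "true",                # enable highlighting
        "hl.fl": ",".join(fields),   # fields to generate snippets for
        "hl.snippets": snippets,     # max snippets per field
        "hl.fragsize": fragsize,     # approximate snippet size in characters
    }
    return base_url + "?" + urlencode(params)


# Hypothetical host and query, for illustration only.
url = build_highlight_query(
    "http://localhost:8983/solr/select",
    "texte:moulin",
    ["texte", "desc"],
)
print(url)
```

urlencode takes care of escaping characters such as ":" and "," in the parameter values, so the query does not need to be escaped by hand.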


Re: Snippets of indexed text

Pierre-Yves LANDRON
Hello,

Thanks for the info; it's exactly what I need. I can't manage to make it
work, though. It's strange, because I have the same problem with facets: it
seems that some options are not taken into account...

For example, here is my request to Solr:
q=%28%28titre:moulin%29+OR+%28texte:moulin%29+OR+%28sujet:moulin%29+OR+%28desc:moulin%29%29&version=2.1&start=0&rows=12&fl=*+score&qt=standard&hl=true&hl.fl=texte,desc&hl.snippets=3&hl.fragsize=150

and an extract of the response is :
<doc>
<float name="score">0.0151801035</float>
<str name="PID">bml:8071</str>
<str name="texte">
Les Grands Moulins
Le chemin de la Bouteille n'est pas, comme son nom semblerait l'indiquer, le
chemin préféré des ivrognes. En l'occurrence, c'est plutôt le chemin des
Boulangers ou mieux encore (... cut by me; in fact the whole field is
returned)
</str>
<str name="thumb">http://10.208.0.215:8080/fedora/get/bml:8071/Thumb</str>
<str name="type">page</str>
</doc>

Obviously the hl parameters haven't been taken into account. I have the
same problem with the facet.mincount parameter: facets work fine, but this
parameter is not taken into account for some reason...

Did I do something wrong?

thanks,
kind regards,
p-y





Re: Snippets of indexed text

Thierry Collogne
I can't see anything wrong. But perhaps you are looking at the wrong part of
the response. It is the same with facets.
You need to look further down in the XML response. Here I asked Solr to
highlight the field "content" and I used a facet called type.

This is a sample of an XML response from our application:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">5</int>
 <lst name="params">
  <str name="rows">10</str>
  <str name="start">0</str>

  <str name="facet">true</str>
  <str name="q">stamp AND site:3</str>
  <str name="version">2.2</str>
  <str name="hl.fl">content</str>
  <str name="facet.field">type</str>
  <str name="indent">on</str>

  <str name="hl">true</str>
 </lst>
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <str name="id">col_36863_NL</str>
  <str name="authorisation">ALL</str>
  <str name="content"></str>
  <str name="type">HR</str>
 </doc>
</result>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="type">
    <int name="HR">1</int>
  </lst>
 </lst>
</lst>
<lst name="highlighting">
 <lst name="col_36863_NL">
  <arr name="content">
    <str></str>
  </arr>
 </lst>
</lst>
</response>


If you look at the end you see the following for facets

<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="type">
    <int name="HR">1</int>
  </lst>
 </lst>
</lst>


And this is the part for the highlighted text :

<lst name="highlighting">
 <lst name="col_36863_NL">
  <arr name="content">
    <str></str>
  </arr>
 </lst>
</lst>
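
For anyone reading the response from code rather than by eye, here is a minimal Python sketch that pulls out the two sections shown above. It assumes the response layout from this message; the helper names, the document id "doc1" and the snippet text in the sample are made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up response in the same layout as the one above.
SAMPLE = """<response>
<result name="response" numFound="1" start="0">
 <doc><str name="id">doc1</str></doc>
</result>
<lst name="facet_counts">
 <lst name="facet_fields">
  <lst name="type"><int name="HR">1</int></lst>
 </lst>
</lst>
<lst name="highlighting">
 <lst name="doc1">
  <arr name="content"><str>a sample &lt;em&gt;stamp&lt;/em&gt; snippet</str></arr>
 </lst>
</lst>
</response>"""


def extract_highlighting(xml_text):
    """Return {doc id: {field name: [snippets]}} from the highlighting list."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        fields = {}
        for arr in doc.findall("arr"):
            fields[arr.get("name")] = [s.text or "" for s in arr.findall("str")]
        result[doc.get("name")] = fields
    return result


def extract_facet_fields(xml_text):
    """Return {facet field: {value: count}} from the facet_counts list."""
    root = ET.fromstring(xml_text)
    counts = {}
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    for field in root.findall(path):
        counts[field.get("name")] = {
            item.get("name"): int(item.text) for item in field.findall("int")
        }
    return counts


print(extract_highlighting(SAMPLE))
print(extract_facet_fields(SAMPLE))
```

The point is the same as above: the snippets and facet counts live in their own lists after the result element, keyed by document id and field name, not inside each doc.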

I hope this helps a bit. By the way, if you are using Java, it may be worth
checking out the Java client here:

   http://issues.apache.org/jira/browse/SOLR-20

There is a comment with some code that I added. This code can be added to
the Java client to support highlighting.

If you need any more help, just post and I will try to help.


Re: Snippets of indexed text

Pierre-Yves LANDRON
>And this is the part for the highlighted text :
>
><lst name="highlighting">
><lst name="col_36863_NL">
>  <arr name="content">
>    <str></str>
>  </arr>
></lst>
></lst>
>

Yes, it works just fine! And it's great. :)

Thanks Thierry: you were right, I wasn't looking at the right tag in the
response.
(My problem with the facet parameters is still unresolved, but I will work on
that later.)

The more I use Solr, the more I'm glad I've chosen this way to work
with Lucene.

Kind Regards...
P-Yves Landron

Re: Snippets of indexed text

Thierry Collogne
Glad I could help you.
