Replacing FAST functionality at sesam.no

Glenn-Erik Sandbakken
At sesam.no we want to replace a FAST (fast.no) Query Matching Server
with a Solr index.

The index we are trying to replace is not a regular index, but one specially
configured to perform phrase (and sub-phrase) matches against several
large lists (like an index with only a 'title' field).

I'm not sure of a correct, or logical, name for the behavior we are
after, but it is like a combination of shingles and exact matching.

Some examples should explain it well.

Let's say we have the following list:

> one two three
> one two
> two three
> one
> two
> three
> three two
> two one
> one three
> three one

For the query "one two three", we need hits against, and only against:
> one two three
> one two
> two three
> one
> two
> three

For the query "one two", we need hits against, and only against:
> one two
> one
> two

For the query "one three four" (or "four one three"), we need hits
against, and only against:
> one three
> one
> three

For the query "one two sesam three", we need hits against, and only
against:
> one two
> one
> two
> three
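
The behaviour in the examples above can be sketched outside Solr as follows. This is only an illustration of the desired semantics, not Solr or FAST code, and the names `sub_phrases`, `match` and `ENTRIES` are made up for the sketch. Every contiguous run of words in the query is looked up as a whole entry in the list, so word order and adjacency both matter.

```python
def sub_phrases(query):
    """Return every contiguous word sequence (shingle) in the query."""
    words = query.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


def match(query, entries):
    """Return the list entries that are exactly equal to some sub-phrase."""
    wanted = set(sub_phrases(query))
    return [entry for entry in entries if entry in wanted]


# The example list from the post (hypothetical variable name).
ENTRIES = [
    "one two three", "one two", "two three", "one", "two", "three",
    "three two", "two one", "one three", "three one",
]

if __name__ == "__main__":
    for q in ("one two three", "one two", "one three four",
              "four one three", "one two sesam three"):
        print(q, "->", match(q, ENTRIES))
```

Running it against the five example queries gives the hit lists stated above, which suggests the requirement amounts to "exact match on any query shingle of length 1 to n".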

We have been testing Solr with the ShingleFilter for this, but
without luck.
I am unsure whether the reason is a misconfiguration in schema.xml or that
the ShingleFilter simply doesn't support this type of behavior.
Attached is our current schema.xml
(it is different from when I made this post to the solr-dev mailing list;
the shingle "fieldType" is of class "solr.StrField").
Attached are screenshots of solr/admin/analysis.jsp against this
configuration.

I'd like to know if the ShingleFilter is at all able to do what we
want.
 If it is: how can I configure schema.xml?
 If not: are there any other solutions that we can incorporate
into Solr which will give us this behavior?

If there is no existing solution to this, we will probably end up
writing our own methods for it, extending the ShingleFilter, gladly
contributing to the solr project =)

Thanks for a great product,
Glenn-Erik



Re: Replacing FAST functionality at sesam.no

Otis Gospodnetic-2
The screenshots didn't make it... (some attachments get stripped)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Glenn-Erik Sandbakken <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, August 27, 2008 1:44:53 PM
> Subject: Replacing FAST functionality at sesam.no


Re: Replacing FAST functionality at sesam.no

Svein Parnas-2
In reply to this post by Glenn-Erik Sandbakken

On 27 Aug 2008, at 19:44, Glenn-Erik Sandbakken wrote:

> At sesam.no we want to replace a FAST (fast.no) Query Matching Server
> with a Solr index.
>
> The index we are trying to replace is not a regular index, but specially
> configured to perform phrases (and sub-phrases) matches against several
> large lists (like an index with only a 'title' field).
>
> I'm not sure of a correct, or logical, name for the behavior we are
> after, but it is like a combination between Shingles and exact matching.
>
> Some examples should explain it well.

In order to do this, you can't use the ShingleFilter during indexing,
since a document like "one two three" and a query like "one two four"
will match because they have the shingle "one two" in common.

You will get what you want, I think, if you don't tokenize during
indexing (some normalization will be required if your lists aren't
normalized to begin with) and apply the ShingleFilter only to the
queries.
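
A small sketch of the point made above, in plain Python rather than Solr: with shingles on both sides, the document and the query overlap on "one two", while an untokenized (whole-entry) index only matches query shingles that equal the full entry. The helper names `shingles` and `normalize` are illustrative, and the normalization shown (lower-casing and collapsing whitespace) is just one assumed choice.

```python
def shingles(text):
    """Return the set of all contiguous word sequences in the text."""
    words = text.split()
    return {
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    }


def normalize(text):
    """Assumed normalization: lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())


document = "one two three"
query = "one two four"

# Shingling at index time AND query time: any shared shingle causes a hit.
shared = shingles(document) & shingles(query)

# Untokenized index: the entry is a single term, so only a query shingle
# equal to the whole entry can match.
untokenized_index = {normalize(document)}
exact_hits = untokenized_index & shingles(normalize(query))

if __name__ == "__main__":
    print(sorted(shared))      # the unwanted overlap
    print(sorted(exact_hits))  # no hit, as required
```

Here `shared` contains "one", "two" and "one two", so the document would be returned even though it is not a sub-phrase of the query, while `exact_hits` is empty.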

Svein


Re: Replacing FAST functionality at sesam.no

Glenn-Erik Sandbakken
In reply to this post by Otis Gospodnetic-2
> The screenshots didn't make it... (some attachments get stripped)
I have put the screenshots here:
http://www.glennerik.com/solr/solrshingle1.gif
and here:
http://www.glennerik.com/solr/solrshingle2.gif
I also put the schema.xml here:
http://www.glennerik.com/solr/schema.xml

> This sounds very much like shingles of variable length (1 to
> length(terms in query)).
> Make sure you turn them into phrase queries and combine them with ORs
> and things should work then.
(from your answer on the dev mailing list)
We have always had the solrQueryParser defaultOperator="OR"
(but I have tested with AND just to see the result).
I am not sure what you mean by "turn them into phrase queries"; we
don't know about phrasing during query analysis.

- Glenn-Erik


Re: Replacing FAST functionality at sesam.no

Glenn-Erik Sandbakken
In reply to this post by Svein Parnas-2
>In order to do this, you can't use the ShingleFilter during indexing  
>since a document like "one two three" and a query like "one two four"  
>will match since they have the shingle "one two" in common.
Hello Svein, nice to meet you in this place =)
I have been trying with and without <analyzer type="index">
and also <analyzer type="query">
I have also been trying with and without outputUnigrams="true" for
analyzer type=index and analyzer type=query
And I have been trying with and without outputUnigramIfNoNgram="true"
for analyzer type=index (only)
I am pretty sure I have been trying all possible combinations of
switching all of this on and off.
I have never seen exactly the expected result.

>You will get what you want, I think, if you don't tokenize during  
>indexing (some normalization will be required if your lists aren't  
>normalized to begin with) and apply the ShingleFilter only to the  
>queries.
I also think that this sounds like the most logical configuration,
but such a configuration doesn't give us the expected results.
(Un?=)fortunately I am leaving on a two week vacation in one hour.
I'd love to follow up on this the coming days,
but Mick Semb Wever will be taking over this job for the next two weeks.

- Glenn-Erik Sandbakken


Re: Replacing FAST functionality at sesam.no

michaelsembwever
> but Mick Semb Wever will be taking over this job for the next two weeks.

Back from holidays and taking over where Glenn-Erik left off. I'm very new
to Solr, so please bear with me.

i'll run through our setup from scratch.

Our test list has 9 entries:
 "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl",
"ijkl efgh", "efgh abcd", and "ijkl efgh abcd".

I'm using a trunk build of Solr, and using the example/solr for the solr
home.

Editing schema.xml so as to put these entries in as type="string" and using
defaultOperator="OR" gives the expected exact matching functionality,
provided queries are quoted, e.g. /solr/select/?q="abcd efgh ijkl"

So then i change type="string" to type="shingleString" along with

> <fieldType name="shingleString" class="solr.StrField" positionIncrementGap="100" omitNorms="true" >
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99" />
>       </analyzer>
> </fieldType>

I never get any hits with quoted queries.
Without quotes i only get the unigrams.

I get the same outcomes using:
fieldType@class="solr.TextField" and
in the index analyzer tokenizer@class="solr.KeywordTokenizerFactory".

In fact the ShingleFilter does nothing at all here; commenting the
filter line out leads to exactly the same behaviour.

What am i missing to get shingles actually matching the indexed entries?
  It seems to me that if this were solved it would work without having to use
quoted queries.

I have been using the analysis.jsp tool.
Everything looks good except that quotes are captured into the words and
shingles, e.g.

> term position 1                  2             3
> term text     "abcd              efgh          ijkl"
>               "abcd efgh         efgh ijkl"
>               "abcd efgh ijkl"

This would explain why quoted queries are not working - the
ShingleFilter produces tokens with the " character in them. But here i
would have at least expected a hit against efgh.
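
The effect described above can be reproduced in a few lines of Python. This sketch assumes the text reaches a whitespace-only tokenizer with the quote characters still attached, as the analysis.jsp output suggests; it does not call Solr, and `whitespace_tokens`, `shingles_by_position` and `hits` are hypothetical helper names.

```python
def whitespace_tokens(text):
    """Split on whitespace only; punctuation such as quotes stays attached."""
    return text.split()


def shingles_by_position(tokens):
    """For each start position, list the shingles that begin there."""
    return [
        [" ".join(tokens[start:end]) for end in range(start + 1, len(tokens) + 1)]
        for start in range(len(tokens))
    ]


# The nine test entries from this message.
ENTRIES = {
    "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh", "ijkl",
    "ijkl efgh", "efgh abcd", "ijkl efgh abcd",
}


def hits(columns):
    """Shingles that are exactly equal to an indexed entry."""
    return sorted(s for column in columns for s in column if s in ENTRIES)


quoted = '"abcd efgh ijkl"'
with_quotes = shingles_by_position(whitespace_tokens(quoted))
without_quotes = shingles_by_position(whitespace_tokens(quoted.strip('"')))

if __name__ == "__main__":
    print(with_quotes)           # quotes end up inside the first and last tokens
    print(hits(with_quotes))     # only the middle unigram survives intact
    print(hits(without_quotes))  # all six sub-phrases match once quotes are stripped
```

With the quotes attached, only `efgh` equals an indexed entry, which is consistent with the expectation of at least one hit; stripping the quotes before tokenizing makes all six shingles line up with entries.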

~mck

--
"He who joyfully marches to music in rank and file has already earned my
contempt. He has been given a large brain by mistake, since for him the
spinal cord would suffice." Albert Einstein
| semb.wever.org | sesat.no | sesam.no |


Re: Replacing FAST functionality at sesam.no

michaelsembwever
> So then i change type="string" to type="shingleString" along with
> > [snip]
> >       <analyzer type="query">
> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >         <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99" />
> >       </analyzer>

Debugging ShingleFilter I see that without quotes the shingles
StringBuffer array consists of just the current token.

When the query does have quotes the shingles array fills up with the
expected shingles.
And the Query (in fact a MultiPhraseQuery)
  returned from SolrQueryParser.getFieldQuery()
  looks like

list_entry_shingle:"(abcd abcd efgh abcd efgh ijkl) (efgh efgh ijkl) ijkl"

I'm struggling to make sense of this.
How can the shingles be matched if they aren't quoted?
Why add the parentheses () when the query has default operator OR?

I would be expecting a Query instead like:
abcd "abcd efgh" "abcd efgh ijkl" efgh "efgh ijkl" ijkl

(This with the ShingleFilter disabled does indeed work perfectly).
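
As a stop-gap, the expected query string could be built on the client before it is sent to Solr. The sketch below is plain Python under that assumption; `phrase_query` is a made-up helper, it does no escaping of special query characters, and it relies on the default operator being OR, as in the setup described earlier.

```python
def phrase_query(text):
    """Expand text into its shingles, quoting every multi-word shingle.

    Clauses are ordered by start position, then by length, and are meant
    to be combined by the parser's default OR operator.
    """
    words = text.split()
    clauses = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            shingle = " ".join(words[start:end])
            # Single words need no quotes; longer shingles become phrases.
            clauses.append(shingle if end - start == 1 else f'"{shingle}"')
    return " ".join(clauses)


if __name__ == "__main__":
    print(phrase_query("abcd efgh ijkl"))
```

For the input `abcd efgh ijkl` this produces the same clause list as the query shown above.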

Am i barking up the wrong tree?
Is there a way to get the shingles phrased?

Otis, you mentioned this briefly on your reply on the dev list:
> Make sure you turn them into phrase queries

did you mean here something more than just quoting the original query?

~mck

--
"Claiming Java is easier than C++ is like saying that K2 is shorter than
Everest." Larry O'Brien
| semb.wever.org | sesat.no | sesam.no |


Re: Replacing FAST functionality at sesam.no

Shalin Shekhar Mangar
I'm not very familiar with shingles, but it seems to me that you should have
ShingleFilter at index time and make the query a phrase query?

On Mon, Sep 8, 2008 at 1:00 PM, Mck <[hidden email]> wrote:

> [snip]



--
Regards,
Shalin Shekhar Mangar.

Re: Replacing FAST functionality at sesam.no

michaelsembwever
> I'm not very familiar with shingles, but it seems to me that you should
> have ShingleFilter at index time and make the query a phrase query?

Then the entry "abcd efgh ijkl" would be indexed as
(abcd "abcd efgh" "abcd efgh ijkl" efgh "efgh ijkl" ijkl)

and a subsequent query "abcd" would return this entry.
If this is so then this is not exact matching and not what we are
looking for.

The filter behaviour we are looking for is like:
   (i've included ^$ to denote the exact matching)

Original Query   --> Filtered Query
 abcd            -->  ^abcd$
"abcd efgh"      --> (^abcd$ ^"abcd efgh"$ ^efgh$)
"abcd efgh ijkl" --> (^abcd$ ^"abcd efgh"$ ^"abcd efgh ijkl"$ ^efgh$ ^"efgh ijkl"$ ^ijkl$)


~mck

--
"All stable processes we shall predict. All unstable processes we shall
control." John von Neumann
| semb.wever.org | sesat.no | sesam.no |


Re: Replacing FAST functionality at sesam.no

Otis Gospodnetic-2
In reply to this post by Glenn-Erik Sandbakken
Just glancing over this.  I believe one of the recent shingle contributions over in Lucene contrib/ indeed has the option to add those begin/end marker characters, so if this will solve your exact matching needs, that's the thing to look at.  You'll have to write (and contribute?) a bit of glue to use it in Solr.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Mck <[hidden email]>
> To: [hidden email]
> Sent: Monday, September 8, 2008 4:43:50 AM
> Subject: Re: Replacing FAST functionality at sesam.no