Exact match

OTH
Hello,

What would be the best way to get exact matches (if any) to a query?

E.g.: Let's say the document text is: "united states of america".
Currently, any query containing one or more of the three words "united",
"states", or "america" will match the above document. I would like the
document to match if and only if the query is also
"united states of america" (case-insensitive).

Document field type:  TextField
Index Analyzer: TokenizerChain
Index Tokenizer: StandardTokenizerFactory
Index Token Filters: StopFilterFactory, LowerCaseFilterFactory,
SnowballPorterFilterFactory
The Query Analyzer / Tokenizer / Token Filters are the same as the Index
ones above.

FYI, I'm a relative novice at Solr / Lucene / search.

Much appreciated
Omer

Re: Exact match

David Hastings
If the query is in quotes, it will work. Also, not sure if you've been
following, but get rid of StopFilterFactory and all stopwords, or just make
your stopword file empty. If you need it to work without quotes, add the
quotes to the query post-submission?
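
As a rough illustration of why quoting changes the result, here is a minimal Python sketch of term matching versus phrase matching. It is not Solr code: `analyze`, `term_match`, and `phrase_match` are hypothetical helpers, and the analysis is simplified to lowercasing and word splitting, with no stop words and no stemming.

```python
import re

def analyze(text):
    """Lowercase and split into word tokens.

    A simplified stand-in for a tokenizer plus a lowercase filter;
    stop-word removal and stemming are deliberately left out.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

def term_match(doc, query):
    """Unquoted query (OR semantics): any shared term is enough."""
    doc_terms = set(analyze(doc))
    return any(term in doc_terms for term in analyze(query))

def phrase_match(doc, query):
    """Quoted query: all query terms must appear consecutively, in order."""
    d, q = analyze(doc), analyze(query)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

doc = "united states of america"
print(term_match(doc, "america"))                     # single shared term
print(phrase_match(doc, "america united"))            # wrong order
print(phrase_match(doc, "United States of America"))  # full phrase
```

Note that in this sketch a shorter phrase such as "states of" still matches, because the document only has to contain the phrase.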

On Mon, Dec 2, 2019 at 3:44 PM OTH <[hidden email]> wrote:

> Hello,
>
> What would be the best way to get exact matches (if any) to a query?

Re: Exact match

Emir Arnautović
In reply to this post by OTH
Hi Omer,
From a performance perspective, it is best to index the title as a single token: KeywordTokenizer + LowerCaseFilter.

If you need to query that field in some other way, you can index it differently in another field using copyField.
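
To make the single-token idea concrete, here is a small Python sketch (not Solr code) that mimics the effect of KeywordTokenizer + LowerCaseFilter: the whole field value becomes one lowercased token, so only a query that produces the same token can match. The helper names `keyword_lowercase`, `build_index`, and `search`, and the sample titles, are assumptions made for illustration.

```python
def keyword_lowercase(text):
    """Treat the whole value as one token and lowercase it."""
    return [text.lower()]

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, title in docs.items():
        for token in keyword_lowercase(title):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query):
    """Analyze the query the same way and look up its single token."""
    (token,) = keyword_lowercase(query)
    return sorted(index.get(token, set()))

docs = {1: "United States of America", 2: "United States"}
index = build_index(docs)
print(search(index, "united states of america"))
print(search(index, "united states"))
print(search(index, "states"))
```

Because nothing in this chain normalizes whitespace or punctuation, an extra space in the query would prevent a match. In Solr itself, the query would typically also have to reach the field as one term, for example by quoting it.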

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: Exact match

Erick Erickson
There are two different interpretations of “exact match” going on here; don’t be confused!

Emir’s version is “the text has to match the _entire_ input”. So a field with “a b c d” will NOT match “a b” or “a b c” or “b c”, but only “a b c d”.

David’s version is “the text has to contain some sequence of words that exactly matches my query”, so a field with “a b c d” _would_ match “a b”, “a b c”, “a b c d”, “b c”, “c d”, etc.

Both are entirely valid use cases, depending on what you mean by “exact match”.

Best,
Erick

> On Dec 2, 2019, at 4:38 PM, Emir Arnautović <[hidden email]> wrote:
>
> Hi Omer,
> From performance perspective, it is the best if you index title as a single token: KeywordTokenizer + LowerCaseFilter

Re: Exact match

Paras Lehana
Hi Omer,

If you mean an exact match with the same number of words (Emir's version),
you can also add a marker at the beginning and end of the text in a separate
field such as title_exact. This can be done in your indexing script or using
Pattern Replace. On the query side, you add the same markers. For example,
index "united states" as "exactStart united states exactEnd" and query with
the same string. Obviously, this can cause scoring issues, so only use it if
you want to debug or just retrieve docs.

Just adding to the list of possible ways. *Anyway, I like the Keyword method.*
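
The marker approach can be sketched in Python as follows. This is an illustration only, not Solr code: the marker strings come from the example above, while `analyze`, `with_markers`, `phrase_match`, and `exact_match` are hypothetical helpers, and the analysis is reduced to lowercasing and word splitting.

```python
import re

START, END = "exactStart", "exactEnd"  # marker strings from the example above

def analyze(text):
    """Lowercase and split into word tokens (simplified analysis chain)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def with_markers(text):
    """Wrap a value in the start and end markers."""
    return f"{START} {text} {END}"

def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occur consecutively inside doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

def exact_match(title, query):
    """Phrase match with markers on both the indexed value and the query."""
    return phrase_match(analyze(with_markers(title)),
                        analyze(with_markers(query)))

title = "united states of america"
print(exact_match(title, "United States of America"))
print(exact_match(title, "united states"))
```

Since the markers pin both ends of the phrase, a shorter query such as "united states" no longer matches, even though it would match as a plain phrase.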

On Tue, 3 Dec 2019 at 03:59, Erick Erickson <[hidden email]> wrote:

> There are two different interpretations of “exact match” going on here,
> don’t be confused!

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.


Re: Exact match

Ere Maijala
Hi,

Here's our example of exact match fields:

https://github.com/NatLibFi/finna-solr/blob/master/vufind/biblio/conf/schema.xml#L48

textProper_l requires a partial match from the beginning. textProper_lr
requires a full match. I'm not sure if this works for you, but at least
we have this creative use of PathHierarchyTokenizerFactory that allows
left-anchored search.
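
The linked schema is not reproduced here. As a hedged sketch of the general idea only, the Python below assumes a path-hierarchy style tokenizer that uses a space as its delimiter, so that it emits every left-anchored prefix of the value, and a query that is kept as a single lowercased token. The helpers `path_hierarchy`, `left_anchored_match`, and `full_match` are hypothetical names.

```python
def path_hierarchy(text, delimiter=" "):
    """Emit every left-anchored prefix of the value, split on the delimiter.

    For "a b c" this yields "a", "a b", "a b c", which is how a path
    hierarchy tokenizer treats "/"-separated paths.
    """
    parts = text.lower().split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def left_anchored_match(field_value, query):
    """The query, kept as one token, must equal one of the prefixes."""
    return query.lower() in path_hierarchy(field_value)

def full_match(field_value, query):
    """The query must equal the entire value."""
    return query.lower() == field_value.lower()

title = "United States of America"
print(path_hierarchy(title))
print(left_anchored_match(title, "united states"))
print(left_anchored_match(title, "states of"))
print(full_match(title, "united states"))
```

In this sketch a query matches only if it starts at the first word of the field, which is the left-anchored behaviour described above; the full-match variant accepts only the complete value.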

HTH,
Ere

Paras Lehana wrote on 3.12.2019 at 13:49:

> Hi Omer,

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland