Lucene query with long strings

Aaron Schon
Hi all, I have been playing with Lucene for a while now, but I am stuck on a perplexing issue.

I have an index with a field "Affiliation". Some example values are:

- "Stanford University School of Medicine, Palo Alto, CA USA",
- "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA",
- "School of Medicine, Harvard University, Boston MA",
- "Brigham & Women's, Harvard University School of Medicine, Boston, MA"
- "Harvard University, Cambridge MA"

and so on. The bottom line is that the affiliations are written in multiple ways with no apparent consistency.

I query the index on the Affiliation field with, say, "School of Medicine, Stanford University, Palo Alto, CA" (using QueryParser) to find all Stanford-related documents. I get a lot of false positives, presumably because of the presence of "School of Medicine" and similar common phrases. (Note: I cannot use a PhraseQuery because of the variability in the way the affiliation is written.)

I have tried the following:

1. Used a SpanNearQuery built by splitting the search phrase on whitespace (here I get no results!)
2. Tried boosting (using ^): I split on the commas and gave the last parts, such as "Palo Alto CA", a much higher boost than the initial phrases. Here I still get lots of false positives.

Any suggestions on how to approach this? Is SpanNear the way to go? Any other ideas on why I get 0 results?

Thanks in advance for helping a newbie.

AS




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Lucene query with long strings

iorixxx
> Any suggestions on how to approach this? Is SpanNear the
> way to go? Any other ideas on why I get 0 results?

If I were you, I would start with AND as the default operator, so that 100% of the clauses must match: QueryParser.setDefaultOperator(QueryParser.Operator.AND);

If this query does not return any documents, I would switch to OR as the default operator and ask for documents matching 80% of the optional clauses. If that still returns nothing, I would lower the percentage of optional clauses that must match, let's say down to 50%. This parameter can be set using:

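// parser is an existing QueryParser instance for the Affiliation field
Query q = parser.parse("School of Medicine, Stanford University, Palo Alto, CA");
if (q instanceof BooleanQuery) {
    BooleanQuery bq = (BooleanQuery) q;
    bq.setMinimumNumberShouldMatch((int) (0.8 * bq.clauses().size())); // e.g. 80% of the optional clauses
}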
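
To make the "start strict, then relax" idea concrete, here is a minimal plain-Java sketch that simulates it without Lucene. It is only an illustration under stated assumptions: the class and method names are made up for this example, tokenization is a naive split on non-alphanumeric characters, and the 100/80/70/60/50 schedule is just one possible choice.

```java
import java.util.*;

class MinShouldMatchSketch {
    // Lowercase and split on anything that is not a letter or digit (naive tokenizer).
    static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Return the documents that contain at least minMatch of the query tokens.
    static List<String> search(List<String> docs, Set<String> query, int minMatch) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            Set<String> common = tokens(doc);
            common.retainAll(query);
            if (common.size() >= minMatch) hits.add(doc);
        }
        return hits;
    }

    // Try 100% first (pure AND), then relax step by step down to 50%.
    static List<String> relaxedSearch(List<String> docs, String queryText) {
        Set<String> query = tokens(queryText);
        for (int pct : new int[] {100, 80, 70, 60, 50}) {
            int minMatch = (int) Math.ceil(query.size() * pct / 100.0);
            List<String> hits = search(docs, query, minMatch);
            if (!hits.isEmpty()) return hits;
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Stanford University School of Medicine, Palo Alto, CA USA",
            "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA",
            "School of Medicine, Harvard University, Boston MA",
            "Brigham & Women's, Harvard University School of Medicine, Boston, MA",
            "Harvard University, Cambridge MA");
        String query = "School of Medicine, Stanford University, Palo Alto, CA";
        for (String hit : relaxedSearch(docs, query)) System.out.println(hit);
    }
}
```

On the five sample affiliations from the original post, the strict pass already returns the two Stanford entries. Dropping straight to 50% would also admit the two Harvard "School of Medicine" entries, which is why it seems safer to relax the threshold only when the stricter query finds nothing.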






RE: Lucene query with long strings

steve_rowe
In reply to this post by Aaron Schon
Hi Aaron,

Your "false positives" comments point to a mismatch between what you're currently asking Lucene for (any document matching any one of the terms in the query) and what you want (only fully "correct" matches).

You need to identify the terms of the query that MUST match and tell Lucene about it ("+" syntax is understood by QueryParser to mean a required term).

If your queries come from sources that don't reliably match the indexed values, you may need to use synonyms to map between e.g. "California" and "CA", and then require that at least one of the synonyms matches (e.g. "+(California CA)").
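
As a rough illustration of the "+" and synonym-group syntax, the following plain-Java sketch builds such a query string. The class name, the method name and the tiny synonym table are hypothetical and exist only for this example; the resulting string would still have to be passed to QueryParser. The sketch writes an explicit OR inside each group so that the group keeps its meaning even if the default operator has been switched to AND.

```java
import java.util.*;

class RequiredTermsSketch {
    // Hypothetical synonym table; a real one would come from your own data.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("ca", Arrays.asList("ca", "california"));
        SYNONYMS.put("ma", Arrays.asList("ma", "massachusetts"));
    }

    // Build a query string in which every given term is required ("+").
    // A term with known synonyms becomes a required group, e.g. +(ca OR california).
    static String requiredQuery(List<String> mustTerms) {
        StringBuilder sb = new StringBuilder();
        for (String term : mustTerms) {
            if (sb.length() > 0) sb.append(' ');
            List<String> alternatives = SYNONYMS.get(term.toLowerCase());
            if (alternatives == null) {
                sb.append('+').append(term);
            } else {
                sb.append("+(").append(String.join(" OR ", alternatives)).append(')');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(requiredQuery(Arrays.asList("stanford", "CA")));
    }
}
```

Terms that contain query-syntax characters would need escaping before being placed in such a string; that step is left out here to keep the sketch short.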

Steve

On 03/23/2010 at 5:08 PM, Aaron Schon wrote:

> Any suggestions on how to approach this? Is SpanNear the way to go? Any
> other ideas on why I get 0 results?


Re: Lucene query with long strings

Shashi Kant-2
In reply to this post by Aaron Schon
Add the common terms such as "University", "School", "Medicine",
"Institute" etc. to the stopword list, so you are left with "Stanford",
"Palo Alto" etc.

Then use Ahmet's suggestion of calling
BooleanQuery.setMinimumNumberShouldMatch() with (say) 75% of the number
of terms in the query string.

Finally, if you wish to be very precise, you can loop through the hits
from the collector and use a string comparison algorithm like
Jaro-Winkler, Levenshtein etc. as a second-level filter.
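
A minimal sketch of such a second-level filter, using Levenshtein distance in plain Java. The class name, the normalization to a 0..1 similarity and the threshold parameter are assumptions of this example, not something prescribed by Lucene; the hits are represented simply as the stored affiliation strings.

```java
import java.util.*;

class SecondLevelFilterSketch {
    // Classic dynamic-programming Levenshtein edit distance (two-row variant).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means the strings are identical.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Keep only the hits whose case-insensitive similarity to the query reaches the threshold.
    static List<String> filter(List<String> hits, String query, double threshold) {
        List<String> kept = new ArrayList<>();
        for (String hit : hits) {
            if (similarity(hit.toLowerCase(), query.toLowerCase()) >= threshold) kept.add(hit);
        }
        return kept;
    }
}
```

Edit distance on whole strings penalizes reordered parts ("School of Medicine, Stanford University" versus "Stanford University School of Medicine"), so the threshold probably needs to be fairly lenient, or the comparison applied per comma-separated part. The cost is quadratic in string length for every hit examined.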



On Tue, Mar 23, 2010 at 5:08 PM, Aaron Schon <[hidden email]> wrote:

> Any suggestions on how to approach this? Is SpanNear the way to go? Any other ideas on why I get 0 results?

Re: Lucene query with long strings

Grant Ingersoll-2

On Mar 24, 2010, at 9:20 AM, Shashi Kant wrote:

> Add the common terms such as "University", "School", "Medicine",
> "Institute" etc. to stopwords list, so you are left with Stanford,
> "Palo Alto" etc.

I don't know if I would remove them, but you might consider using the CommonGrams or n-gram approach, whereby you associate these "stop words" with the words around them.
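
To show what associating common words with their neighbours could look like, here is a small plain-Java sketch. It is loosely modeled on what a CommonGrams-style filter is understood to do (keep every word and add a joined bigram wherever one of the two neighbours is a common word); the class name, the "_" separator, the token order and the list of common words are assumptions of this sketch, not the actual filter implementation.

```java
import java.util.*;

class CommonGramsSketch {
    // Hypothetical list of very frequent words in the Affiliation field.
    static final Set<String> COMMON = new HashSet<>(
        Arrays.asList("university", "school", "of", "medicine", "institute"));

    // Emit every word, plus a joined bigram wherever either neighbour is a common word.
    static List<String> commonGrams(List<String> words) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            out.add(w);
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (COMMON.contains(w) || COMMON.contains(next)) {
                    out.add(w + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(commonGrams(Arrays.asList("stanford", "university", "school", "of", "medicine")));
        System.out.println(commonGrams(Arrays.asList("harvard", "university")));
    }
}
```

With tokens like these, a query can match on "stanford_university" rather than on the very frequent "university" alone, which is the kind of discrimination the stop-word removal was trying to achieve without discarding the words.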

>
> Then use Ahmet's suggestion of using a booleanquery
> .setMinimumNumberShouldMatch() to (say) 75% of the query string
> length.
>
> Finally, if you wish to be very precise, you can loop through the hits
> collector and use a string comparison algorithm like Jaro-Winkler,
> Levenstein etc. for a second-level filter.

Note, this approach will be slow.




--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
