Machine Learning for search

Machine Learning for search

Joe Obernberger
Hi All - One of the really neat features of Solr 6 is the ability to
create machine learning models (information gain) and then use those
models as a query.  If I want a user to be able to execute a query for
the text "Hawaii" and use a machine learning model related to weather
data, how can I correctly rank the results?  It looks like I would need
to classify all the documents in some date range (assuming the query is
date restricted), look at probability_d, and pick the top n
documents.  Is there a better way to do this?

I'm using a stream like this:
classify(model(models,id="WeatherModel",cacheMillis=5000),
         search(COL1,
                df="FULL_DOCUMENT",
                q="Hawaii AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
                fl="ClusterText,id",
                sort="id asc",
                rows="10000"),
         field="ClusterText")

This sends the query to all the shards, each of which can return at most
10,000 docs.

Thanks!

-Joe

Re: Machine Learning for search

Joel Bernstein
Can you describe the weather model?

In general the idea is to rerank the top N docs, because it will be too
slow to classify the whole result set.

In this scenario the search engine ranking will already be returning
relevant candidate documents and the model is only used to get a better
ordering of the top docs.
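
As a rough sketch of that reranking idea, the expression from the original post could be changed to pull only the best-scoring candidates and then reorder them by the classifier's probability_d. The candidate count (100), the final count (10), the inner top() used to cap the merged shard results, and the use of sort="score desc" with score added to fl are all assumptions for illustration, not something confirmed in this thread; check them against your Solr version.

```text
top(n="10",
    classify(model(models, id="WeatherModel", cacheMillis=5000),
             top(n="100",
                 search(COL1,
                        df="FULL_DOCUMENT",
                        q="Hawaii AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
                        fl="ClusterText,id,score",
                        sort="score desc",
                        rows="100"),
                 sort="score desc"),
             field="ClusterText"),
    sort="probability_d desc")
```

Because rows applies per shard, the inner top() keeps the classifier from seeing more than 100 documents in total, which is what bounds the classification cost.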



Joel Bernstein
http://joelsolr.blogspot.com/

In reply to Joe Obernberger's message of Tue, Aug 22, 2017 at 12:32 PM.

Re: Machine Learning for search

Joe Obernberger
Thank you Joel.  I'm really having a good time with the machine learning
component in Solr.  In this case, the weather model was built by
classifying tweets as positive or negative.  I started by searching for
tweets with terms like tornado, storm, forecast, typhoon, hurricane,
blizzard, snow, lightning, flood warning, etc., and marking those as
positive.  Then I grabbed some random tweets about Trump, ISIS,
Kardashian, etc. to use as negatives.  At that point I started to
classify data and refine the model (adding more positive/negative
examples) as more data came into the system.

I hope that helps.  The model works very well at this point with just
650 manually classified tweets (split about evenly between positive and
negative) and 150 terms.

I like your idea about using the model to re-rank the top n search
results.  That said, the results can be significantly 'better' if I
classify more data and reorder based on high probability scores, but as
you pointed out, that comes at the cost of much slower searches.  In some
cases, I suspect a user may want to search with just a model and no
search terms, but in those cases it may be best to classify data as
it comes in.  I guess it's a toss-up as to which is more important:
high probability from the classifier or high rank from the search engine.
Thanks Joel.

-Joe


In reply to Joel Bernstein's message of 8/23/2017 3:08 PM.

Fwd: Machine Learning for search

Joel Bernstein
I forgot to include the users list in my response below:
---------------

Interesting. I've been meaning to test the classifier in a similar way but
haven't had the time.

Basically, what you did was create two classes:

1) A positive class
2) A very noisy negative class of "other stuff"

It was unclear from my reading on logistic regression whether this would
actually work. So I'm excited to hear that the classifier is indeed
providing good results with a noisy negative class, because this is a very
useful scenario.

One thing you may want to consider is taking some features from the model
and using them at query time. This would provide results that are better
candidates to fit the model and then you may not have to rerank such a
large set.
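
One possible way to sketch this: add a handful of weather terms to the query as an optional clause, so that documents matching them score higher and the candidate set handed to the classifier can be smaller. The terms below are borrowed from the seed terms Joe listed and only stand in for whatever features the model actually weights most heavily. The rows value and the score sort are also assumptions.

```text
top(n="10",
    classify(model(models, id="WeatherModel", cacheMillis=5000),
             search(COL1,
                    df="FULL_DOCUMENT",
                    q="+Hawaii +DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z] ClusterText:(tornado storm hurricane typhoon blizzard)",
                    fl="ClusterText,id,score",
                    sort="score desc",
                    rows="50"),
             field="ClusterText"),
    sort="probability_d desc")
```

The two clauses prefixed with + stay mandatory, while the ClusterText clause is optional and only influences ranking, so documents that mention Hawaii but none of the weather terms can still be returned.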

Joel Bernstein
http://joelsolr.blogspot.com/

In reply to Joe Obernberger's message of Wed, Aug 23, 2017 at 6:02 PM.