model building


model building

Joe Obernberger
I'm trying to build a model using tweets.  I've manually tagged 30
tweets as threatening, and 50 random tweets as non-threatening.  When I
build the model with:

update(models2, batchSize="50",
              train(UNCLASS,
                       features(UNCLASS,
                                      q="ProfileID:PROFCLUST1",
                                      featureSet="threatFeatures3",
                                      field="ClusterText",
                                      outcome="out_i",
                                      positiveLabel=1,
                                      numTerms=250),
                       q="ProfileID:PROFCLUST1",
                       name="threatModel3",
                       field="ClusterText",
                       outcome="out_i",
                       maxIterations="100"))

It appears to work, but all of the idfs_ds values are identical.  The
terms_ss values look reasonable, but nearly all of the weights_ds are 1.0.
out_i is -1 for non-threatening tweets and +1 for threatening tweets.
I'm trying to follow along with Joel Bernstein's excellent post here:
http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html
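One way I've been sanity-checking what was stored is to read the model
tuples back out of the models2 collection.  This is just a rough sketch,
assuming the model documents carry the name_s and iteration_i fields
used in that blog post:

search(models2,
           q="name_s:threatModel3",
           fl="name_s, iteration_i, terms_ss, weights_ds, idfs_ds",
           sort="iteration_i desc")

The first tuple should be the final iteration, which makes degenerate
values like all-identical idfs_ds easy to spot.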

Tips?

Thank you!

-Joe


Re: model building

Joe Obernberger
If I put the training data into its own collection and use q="*:*", then
it works correctly.  Is that a requirement?
Thank you.

-Joe

Re: model building

Joel Bernstein
I've only tested with the training data in its own collection, but it was
designed to support multiple training sets in the same collection.

I suspect your training set is too small to produce a reliable model.
The training sets we tested with were considerably larger.
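As a quick size check, a stats() expression can count the documents
matching your training query.  A minimal sketch, assuming the stats
stream source is available in your version:

stats(UNCLASS,
          q="ProfileID:PROFCLUST1",
          count(*))

If that count is in the dozens rather than the thousands, the weights
may not settle into anything meaningful.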

All the idfs_ds values being the same seems odd though. The idfs_ds in
particular were designed to be accurate when there are multiple training
sets in the same collection.

Joel Bernstein
http://joelsolr.blogspot.com/

Re: model building

Tim Casey
Joe,

To do this correctly and soundly, you will need to sample the data and
mark each sample as threatening or neutral.  You can expand on this quite
a bit, but that would be a good start.  Then draw a second set of
samples and see how you did: use one set to train and the other to validate.
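If you want to stay inside Solr for the sampling, the random() stream
source is one option.  A rough sketch, assuming random() is available in
your version and the labels live in the out_i field mentioned above:

random(UNCLASS,
            q="ProfileID:PROFCLUST1",
            fl="id, ClusterText, out_i",
            rows="100")

Hold those tuples out of training, score them with the finished model,
and compare against out_i; precision/recall on the held-out set tells
you far more than eyeballing the weights.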

What you are doing now is probably just noise from a model's point of
view, and it will probably not make much difference how you
index/query/model through that noise.

I don't mean this critically, just plainly: the less mathematical rigor
you bring to this process, the more anecdotal the result.

tim

Re: model building

Joe Obernberger
Thank you, Tim - I appreciate the tips.  At this point I'm just trying
to understand how to use it.  The 30 tweets that I've selected so far
are, in fact, threatening.  The things people say!  My favorite so far
is 'disingenuous twat waffle'.  No kidding.

The issue that I'm having is not with the model, it's with creating the
model from a query other than *:*.

Example:

update(models2, batchSize="50",
              train(TRAINING,
                       features(TRAINING,
                                      q="*:*",
                                      featureSet="threat1",
                                      field="ClusterText",
                                      outcome="out_i",
                                      positiveLabel=1,
                                      numTerms=100),
                       q="*:*",
                       name="threat1",
                       field="ClusterText",
                       outcome="out_i",
                       maxIterations="100"))

Works great - it makes a model, the model works, and I see reasonable
results.  However, say I've tagged a training set inside a larger
collection called COL1 with a field called JoeID, like this:

update(models2, batchSize="50",
              train(COL1,
                       features(COL1,
                                      q="JoeID:Training",
                                      featureSet="threat2",
                                      field="ClusterText",
                                      outcome="out_i",
                                      positiveLabel=1,
                                      numTerms=1000),
                       q="JoeID:Training",
                       name="threat2",
                       field="ClusterText",
                       outcome="out_i",
                       maxIterations="100"))

This does not work as expected.  I can query the COL1 collection for
JoeID:Training and get exactly the result set that I want to train on,
but the model creation does not seem to work.  At this point, if I want
to make a model, I need to create a collection, put the training set
into it, and then train on *:*.  That works, but I'm not sure it's how
this is supposed to work.
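For what it's worth, the copy step in that workaround can itself be done
with a streaming expression.  A sketch of what I mean, assuming a
TRAINING collection with a compatible schema already exists and that id,
ClusterText, and out_i have docValues so the /export handler can stream
them:

update(TRAINING, batchSize="500",
              search(COL1,
                       q="JoeID:Training",
                       fl="id, ClusterText, out_i",
                       sort="id asc",
                       qt="/export"))

followed by a commit on TRAINING; after that, training with q="*:*"
behaves like the first example above.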

-Joe

Re: model building

Joel Bernstein
I did a review of the code and it was definitely written to support having
multiple training sets in the same collection. So, it sounds like something
is not working as designed.

I planned on testing out model building with different types of training
sets anyway, so I'll comment on my findings in the ticket.

Joel Bernstein
http://joelsolr.blogspot.com/