Twitter Classification

Twitter Classification

Jason Rutherglen
We've got Newsgroup classification. I'm kind of interested in
creating a Twitter classification system, or at least playing
around with it. Also, as a relevant, growing, large data
set, Twitter seems to fit well with Hadoop-based machine
learning algorithms... Just throwing this out into the wild!

Re: Twitter Classification

zhao zhendong
Hi Jason,

That's awesome. Do you have any deeper thoughts on this topic?


On Tue, Jan 19, 2010 at 11:35 PM, Jason Rutherglen <
[hidden email]> wrote:

> We've got Newsgroup classification. I'm kinda of interested in
> creating a Twitter classification system, or at least playing
> around with it. Also I think as a relevant growing large data
> set, it seems Twitter fit well with Hadoop based machine
> learning algorithms... Just throwing out into the wild!
>



--
-------------------------------------------------------------

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

>>>>>>><><><><><><><><<><>><><<<<<<

Re: Twitter Classification

Ian Holsman (Lists)
In reply to this post by Jason Rutherglen
On 1/20/10 2:35 AM, Jason Rutherglen wrote:
> We've got Newsgroup classification. I'm kinda of interested in
> creating a Twitter classification system, or at least playing
> around with it. Also I think as a relevant growing large data
> set, it seems Twitter fit well with Hadoop based machine
> learning algorithms... Just throwing out into the wild!
>
>    
Hi Jason.
I think the biggest issues here are twofold.

1. Access to the data, although I'm sure the ASF could work something
out here.
2. Training data. Wouldn't you need a set of 'tweets' classified in some
manner? Or were you thinking of using a different data source to base it on?

Re: Twitter Classification

Hannes Carl Meyer-2
Hi Jason,
to get access to the Twitter Data you could use the Twitter Streaming API:
http://apiwiki.twitter.com/Streaming-API-Documentation
Regards
Hannes

On Wed, Jan 20, 2010 at 10:02 PM, Ian Holsman <[hidden email]> wrote:

> On 1/20/10 2:35 AM, Jason Rutherglen wrote:
>
>> We've got Newsgroup classification. I'm kinda of interested in
>> creating a Twitter classification system, or at least playing
>> around with it. Also I think as a relevant growing large data
>> set, it seems Twitter fit well with Hadoop based machine
>> learning algorithms... Just throwing out into the wild!
>>
>>
>>
> Hi Jason.
> I think the biggest issues here are twofold.
>
> 1. access to the data, although I'm sure the ASF could work something out
> here
> 2. training data. wouldn't you need a set of 'tweets' classified in some
> manner? or were you thinking of using a different data source to base it on?
>

Re: Twitter Classification

Jason Rutherglen
Right, I think this answers the previous questions? There are a
couple of main APIs a workbench could tie into. One is the
streaming API, the other is the older Search API:
http://apiwiki.twitter.com/Twitter-Search-API-Method%3A-search

Ted mentioned that simply playing with the data visually is
the best way to start. Perhaps we can build some helper tools?

As far as classification goes, it seems like search via Twitter is
going to become fairly useless quickly, so value-added
search, or perhaps personalized search via classification,
could be more handy. I could see various vertical web sites
classifying Tweets into categories based on their own custom
trained models. So rather than a one-size-fits-all model, I'm
thinking some easy open source tools (like Mahout) will allow
anyone to build many different models to assist in organizing a stream of
Tweets. What happens after that is part of the fun!

> 1. access to the data, although I'm sure the ASF could work
> something out here

I think we're providing software here; I can't see downloading
the data into ASF repositories. Mahout being on Hadoop is great
for archived Tweets, and then some realtime algorithms could be
useful for the streaming data.

> 2. training data. wouldn't you need a set of 'tweets'
> classified in some manner? or were you thinking of using a
> different data source to base it on?

It'd be nice to develop a workbench to easily build the training
set. Then allow easy retraining, which should occur quite often
with Twitter.

> Do you have any deeper thinkings about this topic?

We can try things out... I think Twitter offers some unique
challenges to machine learning. Ted, do you agree?


On Wed, Jan 20, 2010 at 1:10 PM, Hannes Carl Meyer
<[hidden email]> wrote:

> Hi Jason,
> to get access to the Twitter Data you could use the Twitter Streaming API:
> http://apiwiki.twitter.com/Streaming-API-Documentation
> Regards
> Hannes
>
> On Wed, Jan 20, 2010 at 10:02 PM, Ian Holsman <[hidden email]> wrote:
>
>> On 1/20/10 2:35 AM, Jason Rutherglen wrote:
>>
>>> We've got Newsgroup classification. I'm kinda of interested in
>>> creating a Twitter classification system, or at least playing
>>> around with it. Also I think as a relevant growing large data
>>> set, it seems Twitter fit well with Hadoop based machine
>>> learning algorithms... Just throwing out into the wild!
>>>
>>>
>>>
>> Hi Jason.
>> I think the biggest issues here are twofold.
>>
>> 1. access to the data, although I'm sure the ASF could work something out
>> here
>> 2. training data. wouldn't you need a set of 'tweets' classified in some
>> manner? or were you thinking of using a different data source to base it on?
>>
>

Re: Twitter Classification

Ted Dunning
I call them opportunities.

On Wed, Jan 20, 2010 at 4:24 PM, Jason Rutherglen <
[hidden email]> wrote:

> I think Twitter offers some unique
> challenges to machine learning, Ted do you agree?
>
>


--
Ted Dunning, CTO
DeepDyve

Re: Twitter Classification

Jason Rutherglen
Right, the power of positive thinking

On Wed, Jan 20, 2010 at 4:26 PM, Ted Dunning <[hidden email]> wrote:

> I call them opportunities.
>
> On Wed, Jan 20, 2010 at 4:24 PM, Jason Rutherglen <
> [hidden email]> wrote:
>
>> I think Twitter offers some unique
>> challenges to machine learning, Ted do you agree?
>>
>>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Twitter Classification

Olivier Grisel-3
In reply to this post by Ian Holsman (Lists)
2010/1/20 Ian Holsman <[hidden email]>:

> On 1/20/10 2:35 AM, Jason Rutherglen wrote:
>>
>> We've got Newsgroup classification. I'm kinda of interested in
>> creating a Twitter classification system, or at least playing
>> around with it. Also I think as a relevant growing large data
>> set, it seems Twitter fit well with Hadoop based machine
>> learning algorithms... Just throwing out into the wild!
>>
>>
>
> Hi Jason.
> I think the biggest issues here are twofold.
>
> 1. access to the data, although I'm sure the ASF could work something out
> here

Firehose (the live complete twitter stream) is going to be open to the
public this year. In the meantime it is possible to
gain access to a sample stream and to perform ad hoc search queries on
specific terms or user profiles.

> 2. training data. wouldn't you need a set of 'tweets' classified in some
> manner? or were you thinking of using a different data source to base it on?

I see two obvious sources for labels in the twitter data:

 - #hashtags placed by the users themselves (the 1000 most popular
hashtags or so must be consensual enough to extract signal from noise)
 - the twitter lists flagging users and, by transitivity, their average
tweet content. Again, the top recurring list names must mean something
fairly universal.

There is also the location data of the authors, in case you want to
learn a model of the sentiment of discussions by country, for
instance.
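
A minimal sketch of the first label source, assuming tweets are already available as plain strings. The function name `hashtag_labels` and the `top_n` parameter are made up for illustration; the default of 1000 just mirrors the rough estimate above.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def hashtag_labels(tweets, top_n=1000):
    """Turn tweets into (text, label) pairs using popular hashtags as labels."""
    # Count every hashtag, case-insensitively, across the whole sample.
    counts = Counter(tag.lower() for t in tweets for tag in HASHTAG.findall(t))
    popular = {tag for tag, _ in counts.most_common(top_n)}

    examples = []
    for t in tweets:
        tags = sorted({tag.lower() for tag in HASHTAG.findall(t)})
        # Strip the hashtags from the text so the label does not leak
        # into the features, then normalise whitespace.
        text = " ".join(HASHTAG.sub("", t).split())
        for tag in tags:
            if tag in popular:
                examples.append((text, tag))
    return examples
```

A tweet carrying several popular hashtags yields one example per hashtag, which may or may not be what you want depending on whether the classifier is 1 of n or k of n.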

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: Twitter Classification

Ted Dunning
Sampling should be plenty for demo purposes.

Learning language models by using the geo code as a starting point sounds
like a quick thing to try.

Clustering with the tags you mentioned as a seed would be very interesting
as well.

On Wed, Jan 20, 2010 at 5:16 PM, Olivier Grisel <[hidden email]> wrote:

> > 1. access to the data, although I'm sure the ASF could work something out
> > here
>
> Firehose (the live complete twitter stream) is going to be open to the
> public this year. In the meantime it is possible to
> gain access to a sample stream and to perform ad hoc search queries on
> specific terms or user profiles.
>



--
Ted Dunning, CTO
DeepDyve

Re: Twitter Classification

Olivier Grisel-3
In reply to this post by Olivier Grisel-3
2010/1/21 Olivier Grisel <[hidden email]>:

> 2010/1/20 Ian Holsman <[hidden email]>:
>> On 1/20/10 2:35 AM, Jason Rutherglen wrote:
>>>
>>> We've got Newsgroup classification. I'm kinda of interested in
>>> creating a Twitter classification system, or at least playing
>>> around with it. Also I think as a relevant growing large data
>>> set, it seems Twitter fit well with Hadoop based machine
>>> learning algorithms... Just throwing out into the wild!
>>>
>>>
>>
>> Hi Jason.
>> I think the biggest issues here are twofold.
>>
>> 1. access to the data, although I'm sure the ASF could work something out
>> here
>
> Firehose (the live complete twitter stream) is going to be open to the
> public this year. In the meantime it is possible to
> gain access to a sample stream and to perform ad hoc search queries on
> specific terms or user profiles.

BTW, I just stumbled upon the following project to dump a twitter
statuses stream directly to HDFS:

  http://github.com/ieure/Twidoop

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: Twitter Classification

Robin Anil
In reply to this post by Olivier Grisel-3
What are you trying to classify? Some options are:

Topics? Label? Sentiments? Type (humour, boredom, self expression, ad, spam,
to name a few)? Location (town, city, state, country)?

Re: Twitter Classification

Jason Rutherglen
Robin,

I was thinking of how to build a demo that'd work as follows.
The demo isn't for any particular purpose, other than that the end
result could be somewhat useful to people who sell things. The
frictionless social selling model is outlined in a random blog
post:
http://jasonrutherglen.wordpress.com/category/socialselling/
which is my own rant about eBay being too expensive, and CL
requiring too much redundant work.

1) Use categories from Craigslist

2) Get all #forsale Tweets via the Twitter streaming API

3) Manually build the classification models (this is where tools,
aka a workbench, will help)

4) Classify, see if the Tweets actually go to the correct
categories (probably repeat 3 several times)

5) If 4 works, then dump these into Solr for a demo web thing
that somewhat resembles Craigslist
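
A rough sketch of how steps 2, 4 and 5 might hang together, assuming tweets arrive as dicts with `id` and `text` keys. Everything named here is hypothetical: `to_listings`, the `KEYWORDS` table and `keyword_classify` are placeholders, and the keyword matcher only stands in for a trained model from step 3. The output is a list of plain dicts that could later be sent to an indexer; no Solr client call is shown.

```python
import re

# Match the #forsale hashtag as a whole token, in any letter case.
FOR_SALE = re.compile(r"(?<!\w)#forsale\b", re.IGNORECASE)

# Placeholder category vocabulary; a real run would use trained models.
KEYWORDS = {
    "bikes": {"bike", "bicycle"},
    "furniture": {"couch", "sofa", "table"},
}

def keyword_classify(text):
    """Stand-in classifier: return every category whose keywords appear."""
    words = set(re.findall(r"\w+", text.lower()))
    return {cat for cat, kws in KEYWORDS.items() if words & kws}

def to_listings(tweets, classify=keyword_classify):
    """Filter #forsale tweets, classify them, and shape index-ready documents."""
    listings = []
    for tweet in tweets:
        if not FOR_SALE.search(tweet["text"]):
            continue  # step 2: keep only #forsale tweets
        listings.append({
            "id": tweet["id"],
            "text": tweet["text"],
            # step 4: a tweet may land in zero, one, or several categories
            "categories": sorted(classify(tweet["text"])),
        })
    return listings
```

Passing a different `classify` callable is where a real model would plug in, so the filtering and document shaping do not change when the classifier does.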

Not sure if a Tweet could fall under multiple categories. We'd
want to allow multiple classification systems; I'm thinking here
of the Stanford log-linear classifier, among others. If this
worked I'd personally want to start using it too, because CL is
too painful.

Jason

On Thu, Jan 21, 2010 at 2:29 AM, Robin Anil <[hidden email]> wrote:
> What are you trying to classify? Some options are:
>
> Topics? Label? Sentiments? Type(Humour, boredom, self expression, ad, spam..
> to name a few) Location(town, city, state, country)?
>

Re: Twitter Classification

Ted Dunning
The SGD and Pegasos classifiers would be ideal for this.

The features should include the poster as well as the content.  Geo and date
might be useful.
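
To make the feature suggestion concrete, here is a minimal sketch of binary logistic regression trained by SGD over sparse features that include the poster as well as the content. It is plain Python for illustration, not Mahout's implementation, and it does not cover Pegasos; the names `features`, `train_sgd` and `predict`, the `user`/`text` keys, and the learning-rate and epoch defaults are all assumptions.

```python
import math
import random
from collections import defaultdict

def features(tweet):
    """Sparse binary features: one per content word, plus one for the poster."""
    feats = {"word=" + w for w in tweet["text"].lower().split()}
    feats.add("user=" + tweet["user"])
    return feats

def predict(weights, feats):
    """Probability of the positive class under the current weights."""
    score = sum(weights.get(f, 0.0) for f in feats)
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-score))

def train_sgd(examples, epochs=20, rate=0.5, seed=0):
    """Train on (tweet, label) pairs, label in {0, 1}, one example at a time."""
    weights = defaultdict(float)
    rng = random.Random(seed)
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)
        for tweet, label in examples:
            feats = features(tweet)
            error = label - predict(weights, feats)
            # Gradient step for log loss; every active feature has value 1.
            for f in feats:
                weights[f] += rate * error
    return weights
```

Geo and date could be added the same way, for example as extra string features such as a coarse region or a day-of-week bucket.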

On Thu, Jan 21, 2010 at 11:42 AM, Jason Rutherglen <
[hidden email]> wrote:

> Not sure if a Tweet could fall under multiple categories. We'd
> want to allow multiple classification systems, I'm thinking here
> of the Stanford log-linear classifier, among others. If this
> worked I'd personally want to start using it to because CL is
> too painful.
>



--
Ted Dunning, CTO
DeepDyve

Re: Twitter Classification

Ted Dunning
In reply to this post by Jason Rutherglen
Building the classification models could actually be an interesting exercise
for clustering as much as for classification.

The steps could be:

a) cluster into 2x necessary number of clusters

b) label each cluster according to a CL-derived category

c) mark positive and negative examples in the cluster according to labels
applied in (b)

d) train on labeled examples, possibly run a cluster step starting  from the
trained models
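
Steps (a) to (c) can be sketched as below, assuming some clustering algorithm has already produced one cluster id per tweet. The names `clusters_needed` and `build_training_set`, and the `rejected` argument (indices a reviewer marked as not fitting their cluster's label), are hypothetical; the clustering itself and the training in step (d) are left out.

```python
def clusters_needed(categories):
    """Step (a): ask the clusterer for twice as many clusters as categories."""
    return 2 * len(categories)

def build_training_set(tweets, assignments, cluster_labels, rejected=frozenset()):
    """Steps (b) and (c): turn labelled clusters into labelled examples.

    tweets         -- list of tweet texts
    assignments    -- cluster id for each tweet, same order as tweets
    cluster_labels -- {cluster id: category} chosen by a person in step (b)
    rejected       -- indices of tweets marked as negative for their cluster
    Returns a list of (text, category, is_positive) tuples.
    """
    examples = []
    for index, (text, cluster) in enumerate(zip(tweets, assignments)):
        category = cluster_labels.get(cluster)
        if category is None:
            continue  # cluster was left unlabelled in step (b)
        examples.append((text, category, index not in rejected))
    return examples
```

Because two clusters can map to the same category, asking for 2x the clusters leaves room for a category to split into sub-groups before labelling.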

On Thu, Jan 21, 2010 at 11:42 AM, Jason Rutherglen <
[hidden email]> wrote:

> 2) Get all #forsale Tweets via the Twitter streaming API
>
> 3) Manually build the classification models (this is where tools
> will aka a workbench will help)
>



--
Ted Dunning, CTO
DeepDyve

Re: Twitter Classification

Jason Rutherglen
In reply to this post by Ted Dunning
> The SGD and Pegasos classifiers would be ideal for this.

For multiple categories? From a user perspective, classifying
into multiple categories would be real sweet because it'd save
time and be better than how CL behaves today (eg, uni-category).
Also Solr/Lucene easily supports searching on items that are in
multiple categories.

On the geo front, it'd be handy to have a mobile app that
attaches GPS coordinates to Tweets posted from one's PC. Seems
like a useful app for Tweetdeck? It'd be brutally easy: there'd
be a background app posting coordinates to a DB, which the
Tweetdeck app gets one's latest coordinates from. This would
make the whole business of posting Tweets with locations a lot easier
(eg, it'd happen automatically, instead of manually).

On Thu, Jan 21, 2010 at 12:08 PM, Ted Dunning <[hidden email]> wrote:

> The SGD and Pegasos classifiers would be ideal for this.
>
> The features should include the poster as well as the content.  Geo and date
> might be useful.
>
> On Thu, Jan 21, 2010 at 11:42 AM, Jason Rutherglen <
> [hidden email]> wrote:
>
>> Not sure if a Tweet could fall under multiple categories. We'd
>> want to allow multiple classification systems, I'm thinking here
>> of the Stanford log-linear classifier, among others. If this
>> worked I'd personally want to start using it to because CL is
>> too painful.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Twitter Classification

Ted Dunning
Sure.

If you want 1 of n categories, then train a single n-way classifier.

If you want k of n, train n binary classifiers or hack the 1 of n classifier
slightly to remove the soft-max function.

If there are a relatively small number of category combinations, train on
each combination as a 1 of n target.

There are some fancier algorithms that make use of the categorical
combinations, but we don't need to worry about those to start.
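
The difference between the options can be sketched from the per-category scores a linear model would produce. This is only one reading of "remove the soft-max": each score goes through its own sigmoid and is thresholded independently. The function names, the 0.5 threshold and the `+`-joined combination label are assumptions for illustration.

```python
import math

def softmax(scores):
    """Normalise raw scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def one_of_n(scores):
    """1 of n: categories compete, exactly one wins."""
    probs = softmax(scores)
    return max(probs, key=probs.get)

def k_of_n(scores, threshold=0.5):
    """k of n: each category is judged on its own, so any number can fire."""
    return sorted(c for c, s in scores.items() if sigmoid(s) >= threshold)

def combination_target(categories):
    """Third option: treat each category combination as a single class label."""
    return "+".join(sorted(categories))
```

With scores such as `{"bikes": 2.0, "furniture": 1.5, "cars": -3.0}`, the 1 of n reading picks only `bikes`, while the k of n reading keeps both `bikes` and `furniture`.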

On Thu, Jan 21, 2010 at 2:55 PM, Jason Rutherglen <
[hidden email]> wrote:

> > The SGD and Pegasos classifiers would be ideal for this.
>
> For multiple categories? From a user perspective, classifying
> into multiple categories would be real sweet because it'd save
> time and be better than how CL behaves today (eg, uni-category).
> Also Solr/Lucene easily supp




--
Ted Dunning, CTO
DeepDyve

Re: Twitter Classification

zhao zhendong
In reply to this post by Jason Rutherglen
Currently, the Pegasos Classifier supports multiple categories.

On Fri, Jan 22, 2010 at 6:55 AM, Jason Rutherglen <
[hidden email]> wrote:

> > The SGD and Pegasos classifiers would be ideal for this.
>
> For multiple categories? From a user perspective, classifying
> into multiple categories would be real sweet because it'd save
> time and be better than how CL behaves today (eg, uni-category).
> Also Solr/Lucene easily supports searching on items that are in
> multiple categories.
>
> On the geo front, it'd be handy to have a mobile app that
> attaches GPS coordinates to Tweets posted from one's PC. Seems
> like a useful app for Tweetdeck? It'd be brutally easy, there'd
> be a background app posting coordinates to a DB, that the
> Tweetdeck app gets one's latest coordinates from. This would
> help make the whole posting Tweets with locations a lot easier
> (eg, it'd happen automatically, instead of manually).
>
> On Thu, Jan 21, 2010 at 12:08 PM, Ted Dunning <[hidden email]>
> wrote:
> > The SGD and Pegasos classifiers would be ideal for this.
> >
> > The features should include the poster as well as the content.  Geo and
> date
> > might be useful.
> >
> > On Thu, Jan 21, 2010 at 11:42 AM, Jason Rutherglen <
> > [hidden email]> wrote:
> >
> >> Not sure if a Tweet could fall under multiple categories. We'd
> >> want to allow multiple classification systems, I'm thinking here
> >> of the Stanford log-linear classifier, among others. If this
> >> worked I'd personally want to start using it to because CL is
> >> too painful.
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



--
-------------------------------------------------------------

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

>>>>>>><><><><><><><><<><>><><<<<<<

Re: Twitter Classification

tdguest
This post has NOT been accepted by the mailing list yet.

   Very interesting discussion! Are there any tweet classification systems existing right now? Would love to check them out!

   I have a site called http://www.tweetdynamics.com that categorizes tweets from users all over the world into topics. The classification is based on a semi-supervised classification method. It would be great if you could check this site out and send me feedback at abhilasha@querydynamics.com

Regards,
Abhilasha