Hi,

Here is my proposal. Hope you can give me some advice. Thanks a lot!

*Overview*

Among the ten machine learning algorithms discussed by Cheng-Tao Chu et al. [2], I am particularly interested in Logistic Regression (LR). I would like to implement an LR program on Hadoop that can classify both binary and multiclass data. LR is an effective and robust algorithm, as well known as Naive Bayes; its principle is clear and its formulas are simple, so it is widely used. However, its time complexity is higher than that of many other algorithms, and we must handle every numerical error that can occur while computing the matrix inverse. If LR were integrated into Mahout, so that it could take advantage of MapReduce's speed, many machine learning researchers would be spared building a time-consuming program just to get a benchmark.

*Project*

Logistic Regression has been widely used in binary classification, and many brilliant new learning algorithms are built on LR. So I will not only implement the basic version of LR; I also have to design the architecture carefully, so that its core function can be changed flexibly, so that it can receive cost-sensitive data (with a 'cost' feature), and so that it can adapt easily to other frameworks. [2] can be a good reference framework.

In more detail, my implementation of LR will support the following features:

- *Basic binary classification*

This function is almost the same as the one described in [2]. It receives data in an input format like SVMlight's [3], outputs a model for the training set, and then uses that model to predict the test data.

- *Simplified binary classification*

In practice, many experiments use LR only as a benchmark and do not need to compute the Hessian matrix in every iteration. Instead, we often substitute a small constant float *h*, regarded as a step length, for the inverse Hessian. This is because computing the Hessian (a procedure that includes several multiplications of large matrices, whose complexity is really high) is the most time-consuming part of the whole algorithm. Also, due to precision problems, the matrix inversion can get out of control over thousands of iterations, while using a small constant float in those cases is safer and more accurate.

However, how to specify *h* is a problem. Usually we estimate it so that the change to the weight vector will not be too large, and then try some values of *h* manually. Of course, we can leave this specification to the user, that is, provide an interface for this parameter; but we can also automatically estimate a proper range for *h* in, say, the first five iterations and pick one value from it.
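Below is a minimal, sequential sketch of this fixed-step variant; it also contains the automatic stop described in the next item. All names are illustrative rather than existing Mahout APIs, and a real implementation would distribute the gradient computation over MapReduce:

    public class SimpleLogisticRegression {

        private final double h;       // fixed step length replacing the inverse Hessian
        private final double eps;     // stop once the weight vector moves less than this
        private final int maxIters;
        private double[] w;

        public SimpleLogisticRegression(double h, double eps, int maxIters) {
            this.h = h;
            this.eps = eps;
            this.maxIters = maxIters;
        }

        private static double sigmoid(double z) {
            return 1.0 / (1.0 + Math.exp(-z));
        }

        private static double dot(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) {
                s += a[i] * b[i];
            }
            return s;
        }

        /** x[i] is a dense feature vector, y[i] is its label in {0, 1}. */
        public void train(double[][] x, int[] y) {
            w = new double[x[0].length];
            for (int t = 0; t < maxIters; t++) {
                // Gradient of the log-likelihood: sum_i (y_i - sigma(w.x_i)) x_i.
                double[] grad = new double[w.length];
                for (int i = 0; i < x.length; i++) {
                    double err = y[i] - sigmoid(dot(w, x[i]));
                    for (int j = 0; j < w.length; j++) {
                        grad[j] += err * x[i][j];
                    }
                }
                // Fixed-step update w <- w + h * grad, in place of a Newton step.
                double moved = 0.0;
                for (int j = 0; j < w.length; j++) {
                    double step = h * grad[j];
                    w[j] += step;
                    moved += step * step;
                }
                // Automatic stop: quit once the weights barely change.
                if (Math.sqrt(moved) < eps) {
                    break;
                }
            }
        }

        public double predict(double[] xi) {
            return sigmoid(dot(w, xi));
        }
    }

A per-sample cost would enter this loop as a simple multiplier on err, as shown under 'Cost-sensitive data' below.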

- *Automatic stop*

Like the previous feature, this is for the user's convenience: when the change in the weight vector is small enough, the iteration should stop (the sketch above already includes such a check).

- *Cost-sensitive data*

In some special cases we may not treat every training sample equally; a more important sample is assigned a larger weight and thus has a bigger influence on the predictor. This function is easy to implement, but it plays an important role when LR is integrated into other advanced frameworks.
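Concretely, with a per-sample cost c_i (c_i = 1 recovers ordinary LR), the gradient above just becomes weighted:

    \nabla \ell_c(w) = \sum_i c_i \, \big( y_i - \sigma(w^T x_i) \big) \, x_i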

- *Multiclass case*

The principle of multiclass classification is similar to that of binary LR, but it has more applications in the real world. Its time complexity is about K^3 times that of the binary case, where K is the number of classes, because the Hessian in the Newton step grows from d x d to roughly Kd x Kd and inversion is cubic.
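For reference, the standard multiclass (softmax) form of the model, with one weight vector w_k per class:

    P(y = k \mid x; W) = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}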

In my opinion, all the features listed above belong to 'traditional' LR. As far as I know, LR can also be used in some other settings, described below:

- *Discriminative learning*

In [7], the authors proposed a novel algorithm that uses both training and test data to build a better predictor. Their intention is to use the unlabeled test data to correct the difference between the distributions of the training and test data. This phenomenon, in which training and test data follow different distributions, is quite common, and it is caused by sample selection bias [6]. Especially in cross-domain experiments, discriminative learning can improve LR's performance noticeably, so I would like to add this function to LR.
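One common way to formalize this correction is importance weighting under covariate shift; note this is the underlying idea rather than the exact method of [7], which learns the weights discriminatively instead of estimating the two densities:

    \ell_{\text{shift}}(w) = \sum_i \frac{p_{\text{test}}(x_i)}{p_{\text{train}}(x_i)} \log p(y_i \mid x_i; w)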

- *Positive-only learning*

In this case, the training data consists of labeled data that are all positive examples, plus unlabeled data. This situation may occur when obtaining comprehensive negative training data is too expensive. For example, if users want to pick out one kind of web page, such as 'homepages', from all web pages, they may simply submit some homepages, which form the positive training data; so we get only positively labeled data and a large amount of unlabeled data. Positive-only learning (POL) is also very useful for filtering spam email. Because the definition of 'spam' is subjective, every spam mail reported by one user can be considered a positive example, and the mail from all users can be seen as unlabeled data. So we need not bother to find exactly the ham mails for each user, nor worry that one person's ham may be another's spam.

There are effective ways to solve this problem, such as [8]. I can use a similar algorithm to extract key negative features and then use them to classify.
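Purely as an illustration, here is a hedged sketch of a simple two-step scheme in the spirit of [8] (PEBL itself uses an SVM and a more careful first step); it reuses the SimpleLogisticRegression class sketched earlier, and the 0.5 threshold is an arbitrary placeholder:

    import java.util.ArrayList;
    import java.util.List;

    public class PositiveOnlyTrainer {

        /** pos: positive feature vectors; unl: unlabeled feature vectors. */
        public static SimpleLogisticRegression train(double[][] pos, double[][] unl,
                                                     double h, double eps, int iters) {
            // Step 1: treat every unlabeled example as negative and fit a first model.
            SimpleLogisticRegression first = fit(pos, unl, h, eps, iters);

            // Keep only the unlabeled examples the first model is confident are
            // negative ("reliable negatives"); 0.5 is just a placeholder threshold.
            List<double[]> reliableNeg = new ArrayList<>();
            for (double[] u : unl) {
                if (first.predict(u) < 0.5) {
                    reliableNeg.add(u);
                }
            }

            // Step 2: retrain on the positives plus the reliable negatives.
            return fit(pos, reliableNeg.toArray(new double[0][]), h, eps, iters);
        }

        private static SimpleLogisticRegression fit(double[][] pos, double[][] neg,
                                                    double h, double eps, int iters) {
            double[][] x = new double[pos.length + neg.length][];
            int[] y = new int[x.length];   // labels default to 0 (negative)
            for (int i = 0; i < pos.length; i++) { x[i] = pos[i]; y[i] = 1; }
            for (int i = 0; i < neg.length; i++) { x[pos.length + i] = neg[i]; }
            SimpleLogisticRegression lr = new SimpleLogisticRegression(h, eps, iters);
            lr.train(x, y);
            return lr;
        }
    }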

These two extensions of LR are important, but they are not the only possibilities. I hope my implementation of LR can give users as much convenience and as many options as possible.

Besides, I think that during this project I may help with some related work:

- *Matrix calculation*

LR needs a lot of matrix calculation. From Mahout's JIRA I can see that some matrix-related work has already begun, and I will be very glad if I can participate in this area.

LR uses two major matrix operations: multiplication and inversion. For multiplication, we can use a simple divide-and-conquer algorithm, Strassen's, to reduce the time complexity from O(n^3) to about O(n^2.81) when both matrices are n x n [4]. Although it is not the asymptotically fastest method, it is much easier to implement than faster alternatives such as Coppersmith-Winograd. As for inversion, precision is crucial. Basically, I will use LU decomposition, which is widely used in mathematical software; but for details such as how to deal with division by zero, I will consult others.
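For reference, Strassen's scheme multiplies two 2 x 2 block matrices with seven block products instead of eight, giving the recurrence T(n) = 7 T(n/2) + O(n^2) and hence the O(n^{\log_2 7}) \approx O(n^{2.81}) bound:

    M_1 = (A_{11} + A_{22})(B_{11} + B_{22})
    M_2 = (A_{21} + A_{22}) B_{11}
    M_3 = A_{11} (B_{12} - B_{22})
    M_4 = A_{22} (B_{21} - B_{11})
    M_5 = (A_{11} + A_{12}) B_{22}
    M_6 = (A_{21} - A_{11})(B_{11} + B_{12})
    M_7 = (A_{12} - A_{22})(B_{21} + B_{22})

    C_{11} = M_1 + M_4 - M_5 + M_7 \qquad C_{12} = M_3 + M_5
    C_{21} = M_2 + M_4 \qquad C_{22} = M_1 - M_2 + M_3 + M_6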

- *Generalized linear models*

Logistic regression is one of a class of models known as generalized linear models (GLMs). When I described my idea of implementing LR to Mahout members, they suggested that I design my architecture carefully so that it can implement various kinds of generalized linear regression, such as Poisson regression, all in one stroke. So, besides LR, I am willing to implement some other generalized linear models.
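The common structure that the architecture should expose is essentially the link function g: a GLM models g(E[y | x]) = w^T x, so logistic and Poisson regression differ only in their link (and the corresponding likelihood):

    g(\mathbb{E}[y \mid x]) = w^T x
    \text{logistic:}\; g(\mu) = \log \frac{\mu}{1 - \mu} \qquad \text{Poisson:}\; g(\mu) = \log \mu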

*Functional Milestones*

- The LR classifier will be able to do the basic binary classification described in [2].
- The LR classifier will be able to do multiclass classification.
- The LR classifier will be able to stop automatically and estimate *h*.
- Implement Poisson regression and some other generalized linear models.
- Implement discriminative learning.
- Implement positive-only learning.

*Biography*

I am an undergraduate student at Shanghai Jiao Tong University. My major is computer science, and I have taken most of the key courses, such as Probability, Algorithms, and UML & Design Patterns. My programming ability is quite good: in 2005 and 2006 I participated in the ACM/ICPC and won championships at the Korea and Vietnam sites, respectively. Now I am a research member of the Apex Lab at SJTU, and I am interested in machine learning, especially classification [9]. Though I am not familiar with open-source software development, I will try my best to finish this job.

*References*

[1] http://www.stat.columbia.edu/~gelman/standardize/

[2] C.T. Chu, S.K. Kim, Y.A. Lin, Y.Y. Yu, G. Bradski, and A.Y. Ng. Map-Reduce for machine learning on multicore. In *Advances in Neural Information Processing Systems*.

[3] http://svmlight.joachims.org/

[4] S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani. *Algorithms*, pages 66-67.

[5] J.J. Heckman. Sample selection bias as a specification error. *Econometrica*, 47:153-161, 1979.

[6] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In *Proceedings of the Twenty-First International Conference on Machine Learning*, pages 114-121, 2004.

[7] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In *Proceedings of the 24th International Conference on Machine Learning*, 2007.

[8] H. Yu, J. Han, and K.C.C. Chang. PEBL: positive example based learning for web page classification using SVM. In *Proceedings of the Eighth ACM SIGKDD*, 2002.

[9] X. Lin, W.Y. Dai, Y. Jiang, and Y. Qiang. Classifying Chinese web pages based on English training data. In *Proceedings of the 17th International World Wide Web Conference*, 2008.

Yun Jiang.