Can Mahout Be Used for Real-Time Processing?

Can Mahout Be Used for Real-Time Processing?

Tim Bass
Dear Team,

Ref:  http://www.thecepblog.com/2009/01/27/predicting-events-with-logisitic-regression/#comment-22772

Can Mahout / Hadoop / Mapreduce be used for real-time processing?

Thanks.

Yours sincerely, Tim

Re: Can Mahout Be Used for Real-Time Processing?

Ted Dunning
With off-line learning systems like logistic regression, you can definitely
work with real-time events, but the learning happens in a batch process, not
in real-time.

The part that Hadoop and Mahout can help with is the off-line portion.  The
on-line portion of logistic regression is generally so fast and so trivially
parallelizable that you don't need to worry about fancy stuff like
map-reduce.

So the answer is yes, but not the way you mean it in your question.
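
A minimal sketch of what that on-line portion might look like, assuming the weights were produced earlier by some off-line batch job. The names `WEIGHTS`, `BIAS` and `handle_event` are hypothetical and are not part of Mahout or Hadoop; the point is only that scoring one event is a dot product and a sigmoid, with no shared state between events.

```python
import math


def score(weights, bias, features):
    """Return P(y = 1 | features) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical parameters, standing in for the output of an off-line batch job.
WEIGHTS = [0.8, -1.2, 0.3]
BIAS = -0.5


def handle_event(features):
    """Score one incoming event; each call is independent of every other."""
    return score(WEIGHTS, BIAS, features)
```

Because each call depends only on its own input and on fixed parameters, events can be spread across any number of workers without coordination.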

On Tue, Jan 27, 2009 at 9:21 AM, Tim Bass <[hidden email]> wrote:

> Dear Team,
>
> Ref:
> http://www.thecepblog.com/2009/01/27/predicting-events-with-logisitic-regression/#comment-22772
>
> Can Mahout / Hadoop / Mapreduce be used for real-time processing?
>
> Thanks.
>
> Yours sincerely, Tim
>



--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: Can Mahout Be Used for Real-Time Processing?

Ted Dunning
In reply to this post by Tim Bass
Btw… in real production systems such as fraud detection, it is not generally
acceptable for any model to adapt in an on-line fashion. Any model changes
have to be extensively tested and verified to avoid disastrous surprises.

This is a business requirement that is pretty robust even in the face of
arguments of improved performance. The perceived risk is simply too large to
stomach. This perception comes from a sober assessment of history where
experience shows that even carefully vetted models built using off-line
methods (which are easier to get right than on-line models) do not always
improve performance and sometimes decrease performance when deployed on real
decision traffic.

In addition, it is common for there to be constraints on model behavior that
are very difficult to encode into a learning algorithm whether off-line or
on-line.

There are many other applications where on-line learning can plausibly be
used (think spam detection), but these are generally applications that do
not have significant business rule or regulatory components. It is also
surprisingly common for on-line learning to have little or no performance
benefit compared to relatively frequent off-line updates.

Off-line updates also have the advantage of being amenable to techniques
such as map-reduce. The key benefits of map-reduce are not simply
parallelism. The first benefit is that almost all access to disk or memory
is highly sequential in nature. This can result in several orders of
magnitude in performance improvement. A second benefit is that map-reduce
programs are typically nearly scale-free. This means that higher performance
can be dialed in at run-time. Off-line updates in many cases also provide
better convergence properties, which lead directly to compute savings.
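
To illustrate why an off-line update fits the map-reduce pattern, here is a rough sketch assuming a logistic regression model trained by batch gradient descent. The gradient of the log-loss is a sum over examples, so each partition can be scanned sequentially on its own and the partial sums added afterwards. The function names are made up for this illustration and this is not how Mahout or Hadoop implement it.

```python
import math
from functools import reduce


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def map_partition(weights, partition):
    """Map step: partial log-loss gradient from one sequential pass over a partition."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (p - label) * x
    return grad


def reduce_gradients(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [x + y for x, y in zip(a, b)]


def batch_gradient(weights, partitions):
    """Full-batch gradient; assumes at least one partition is supplied."""
    partials = (map_partition(weights, part) for part in partitions)
    return reduce(reduce_gradients, partials)
```

Splitting the same data into more partitions changes only how many map calls run, not the result (up to floating-point rounding), which is the sense in which more machines can be added without touching the program.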

Overall, then, the situations where on-line learning is clearly better are
really pretty limited.



Re: Can Mahout Be Used for Real-Time Processing?

Jason Rennie-2
This is a minor clarification at best, so feel free to ignore...

It might be worth noting the difference between (1) what machine learning
people call an "on-line" algorithm, and (2) updating a model (e.g. logistic
regression) in real-time. It sounded like the OP was talking about #2.

An on-line ML algorithm is one with constant memory overhead that trains by
seeing the data as a single stream of examples. Real-time updating of a batch
algorithm (like logistic regression) instead stores the entire data set. One
can achieve real-time updates with such an algorithm by performing a few
gradient-descent-like updates for each new example. However, to satisfy the
"real-time" goal, it is likely necessary to be able to store the entire
example set in memory. Even then, it could be slow if the number of examples
is in the millions. It's not Java, but Python's scipy.optimize provides some
nice, fast, general optimization routines (such as conjugate gradients and
L-BFGS).
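
A rough sketch of option #2, assuming plain full-batch gradient descent written in pure Python rather than the scipy.optimize routines. The class name, learning rate and step count are arbitrary choices for illustration. Every example is kept in memory, and each new arrival triggers a few passes over all of them.

```python
import math


class IncrementalLogisticRegression:
    """Batch logistic regression nudged by a few gradient steps per new example."""

    def __init__(self, n_features, learning_rate=0.1, steps_per_example=3):
        self.weights = [0.0] * n_features
        self.examples = []  # the entire example set is held in memory
        self.learning_rate = learning_rate
        self.steps_per_example = steps_per_example

    def predict(self, features):
        """Return P(y = 1 | features) using a numerically stable sigmoid."""
        z = sum(w * x for w, x in zip(self.weights, features))
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    def _gradient(self):
        """Mean log-loss gradient over ALL stored examples: O(N * d) work."""
        grad = [0.0] * len(self.weights)
        for features, label in self.examples:
            err = self.predict(features) - label
            for j, x in enumerate(features):
                grad[j] += err * x
        n = len(self.examples)
        return [g / n for g in grad]

    def add_example(self, features, label):
        """Store the new example, then take a few gradient descent steps."""
        self.examples.append((features, label))
        for _ in range(self.steps_per_example):
            grad = self._gradient()
            self.weights = [
                w - self.learning_rate * g for w, g in zip(self.weights, grad)
            ]
```

The cost of each `add_example` call grows linearly with the number of stored examples, which is why this gets slow once the set reaches the millions.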

Jason

On Tue, Jan 27, 2009 at 1:07 PM, Ted Dunning <[hidden email]> wrote:

> Btw… in real production systems such as fraud detection, it is not
> generally
> acceptable for any model to adapt in an on-line fashion. Any model changes
> have to be extensively tested and verified to avoid disastrous surprises.



--
Jason Rennie
Research Scientist, ITA Software
http://www.itasoftware.com/