[jira] Created: (MAHOUT-18) Embrace interoperability with other softwares

[jira] Created: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org
Embrace interoperability with other softwares
---------------------------------------------

                 Key: MAHOUT-18
                 URL: https://issues.apache.org/jira/browse/MAHOUT-18
             Project: Mahout
          Issue Type: New JIRA Project
            Reporter: Shunkai Fu
            Priority: Minor


This is an open issue. It relates to all components, existing or yet to be created.

ML or DM models normally have two phases: training and scoring (or predicting). If we treat "updating" as an independent phase, there are three.

There is plenty of other ML/DM software out there. We want Mahout users to be able to import models built with other software, update them, and/or use them for scoring. To achieve this, we need to recognize the commonly used formats.

Besides, users may choose Mahout because it is fast at learning. Once a model is ready, they may want to export the trained model, view it with a visualization tool, or import it into other software or applications for scoring (or predicting). In this case, export to a widely recognized format is expected.

Finally, importing and exporting will not affect the ongoing projects, so developers of other components need not worry about this.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580199#action_12580199 ]

Ted Dunning commented on MAHOUT-18:
-----------------------------------


What are the possible formats?

Do any of the formats express parallel execution?

What are the criteria that we should use to decide which formats to support?



Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

shunkai.fu
One known format is PMML (http://www.dmg.org/products.html).

-----Original Message-----
From: Ted Dunning (JIRA) [mailto:[hidden email]]
Sent: March 19, 2008 8:56
To: [hidden email]
Subject: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares


Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

shunkai.fu
In reply to this post by JIRA jira@apache.org
In my view, the input and output formats have nothing to do with parallel
execution. It is the nature of the model and the design of the learning
algorithm that determine how execution is parallelized.

Criteria may include:
(1) Maturity;
(2) Reasonableness;
(3) Wide acceptance;
(4) Ease of extension;
(5) Suitability for Mahout;
(6) more...

[jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580212#action_12580212 ]

Grant Ingersoll commented on MAHOUT-18:
---------------------------------------

How does  this relate to MAHOUT-8?  Seems like that is a similar thing, trying to define common I/O, or am I misinterpreting?

Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

shunkai.fu
MAHOUT-8 is about the description of physical data instances.

Here I refer to the model representation. For example, with Naive Bayes, we
may need to store the number of instances of (Class = 1) and (Class = 0), as
well as the counts for cases such as (Class = 1 | X_i = 0). With this
information, we can restore the model for updating or scoring.

Another case is the Bayesian Network. We need to store the graph structure as
well as the conditional probability tables if we want to reinstall it
somewhere.
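
As a rough illustration of the Naive Bayes case above, the following Java
sketch keeps nothing but the counts and uses them for both updating and
scoring. The class and method names are hypothetical and not part of Mahout;
binary features, two classes, and add-one smoothing are assumptions made only
to keep the example small.

```java
// Hypothetical sketch: a Naive Bayes model represented purely by counts.
// Assumes two classes (0/1) and binary features; not Mahout code.
class NaiveBayesCounts {
    // classCount[c] = number of training instances with Class = c
    final long[] classCount = new long[2];
    // featureCount[i][v][c] = number of instances with X_i = v and Class = c
    final long[][][] featureCount;

    NaiveBayesCounts(int numFeatures) {
        featureCount = new long[numFeatures][2][2];
    }

    // "Updating": fold one more labeled instance into the stored counts.
    void update(int[] x, int label) {
        classCount[label]++;
        for (int i = 0; i < x.length; i++) {
            featureCount[i][x[i]][label]++;
        }
    }

    // "Scoring": P(Class = 1 | x), computed from the counts alone.
    // Add-one smoothing is an assumption of this sketch.
    double probabilityOfClassOne(int[] x) {
        long total = classCount[0] + classCount[1];
        double[] logScore = new double[2];
        for (int c = 0; c < 2; c++) {
            logScore[c] = Math.log((classCount[c] + 1.0) / (total + 2.0));
            for (int i = 0; i < x.length; i++) {
                logScore[c] += Math.log(
                        (featureCount[i][x[i]][c] + 1.0) / (classCount[c] + 2.0));
            }
        }
        // Normalize in a numerically stable way.
        double max = Math.max(logScore[0], logScore[1]);
        double e0 = Math.exp(logScore[0] - max);
        double e1 = Math.exp(logScore[1] - max);
        return e1 / (e0 + e1);
    }
}
```

Because the two count arrays are the whole model, writing them out and
reading them back is enough to restore it elsewhere, whatever file format
ends up being chosen.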

Shunkai

-----Original Message-----
From: Grant Ingersoll (JIRA) [mailto:[hidden email]]
Sent: March 19, 2008 10:04
To: [hidden email]
Subject: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares


[jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580254#action_12580254 ]

Isabel Drost commented on MAHOUT-18:
------------------------------------


 
> How does this relate to MAHOUT-8? Seems like that is a similar thing, trying to define common I/O, or am I misinterpreting?

I think this is misinterpreted. Maybe the explanation on the PMML website helps:

PMML describes the inputs to data mining models, the transformations used prior to prepare data for data mining, and the parameters which define the models themselves.

I wonder whether "the inputs" here means meta-information about the input data or the dataset itself.

According to the FAQ, PMML is supported by JSR 73 (see MAHOUT-8):
> PMML is complementary to many other data mining standards. It's XML interchange formats is supported by several other standards, such as
> XML for Analysis, JSR 73, and SQL/MM Part 6: Data Mining.

Karl, you had a look at the FAQ, can you confirm this?


> What are the criteria that we should use to decide which formats to support?

I think one criterion should be how expressive the format is, the second should be the number of tools supporting the format. Of course there is an obvious criterion as well: The format should at least be open ;)

The group developing the format is part of the standards group xml.org, so there is some standardization process backing it up.

I wonder what is supported by the format and what cannot be expressed.

Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

shunkai.fu
A typical PMML file describes (1) the variables of the model; (2) the built
model; (3) some custom definitions.
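
To make the three parts concrete, here is a small Java sketch that prints a
skeletal PMML document for a one-variable linear regression. The class name,
the field names "x" and "y", and the extension name are hypothetical. The
element names (DataDictionary, RegressionModel, Extension) come from the PMML
standard, but the output has not been validated against the PMML schema, and
reading the custom definitions as PMML's Extension element is an assumption.

```java
// Hypothetical sketch: emit a skeletal PMML document as a string.
// Not validated against the PMML schema; for orientation only.
class PmmlSkeleton {
    static String regressionModel(double intercept, double coefficient) {
        StringBuilder xml = new StringBuilder();
        xml.append("<PMML version=\"3.2\" xmlns=\"http://www.dmg.org/PMML-3_2\">\n");
        xml.append("  <Header description=\"hypothetical example\"/>\n");
        // (1) The variables of the model.
        xml.append("  <DataDictionary numberOfFields=\"2\">\n");
        xml.append("    <DataField name=\"x\" optype=\"continuous\" dataType=\"double\"/>\n");
        xml.append("    <DataField name=\"y\" optype=\"continuous\" dataType=\"double\"/>\n");
        xml.append("  </DataDictionary>\n");
        // (2) The built model: y = intercept + coefficient * x.
        xml.append("  <RegressionModel functionName=\"regression\">\n");
        // (3) A custom definition, carried in an Extension element.
        xml.append("    <Extension name=\"trainedBy\" value=\"hypothetical-tool\"/>\n");
        xml.append("    <MiningSchema>\n");
        xml.append("      <MiningField name=\"x\"/>\n");
        xml.append("      <MiningField name=\"y\" usageType=\"predicted\"/>\n");
        xml.append("    </MiningSchema>\n");
        xml.append("    <RegressionTable intercept=\"").append(intercept).append("\">\n");
        xml.append("      <NumericPredictor name=\"x\" coefficient=\"")
           .append(coefficient).append("\"/>\n");
        xml.append("    </RegressionTable>\n");
        xml.append("  </RegressionModel>\n");
        xml.append("</PMML>\n");
        return xml.toString();
    }
}
```

Calling `PmmlSkeleton.regressionModel(1.0, 2.0)` returns the document text; a
real exporter would more likely use an XML library and a schema check.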

-----Original Message-----
From: Isabel Drost (JIRA) [mailto:[hidden email]]
Sent: March 19, 2008 14:56
To: [hidden email]
Subject: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares


Re: Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Thilo Goetz
In reply to this post by shunkai.fu
What are the licensing conditions for PMML?  I looked,
but couldn't find anything on the website.

Thanks,
Thilo

shunkai.fu wrote:

> You can find some known format, PMML (http://www.dmg.org/products.html)

Re: Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

shunkai.fu
I think it is an open standard, but not everyone has the right to submit proposals.

-----Original Message-----
From: Thilo Goetz [mailto:[hidden email]]
Sent: March 19, 2008 15:35
To: [hidden email]
Subject: Re: Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Re: Re: Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Jason Rennie-2
Looks like the standard already defines formats for some popular models,
including SVMs, regression, and NNs.

Unclear to me how anyone could prevent us from using the standard unless it
were patented.  Copyright only protects works of art, which would include
specific PMML files, but not the format.  One thing I noticed is that open
source projects are allowed to take part in the PMML process for free...

My interpretation of PMML is that it represents a model.  As others have
mentioned, prediction models (e.g. classification, regression; not
clustering) basically have two parts: (1) learning, where the training data
is used to train (optimize parameters for) the model, (2) prediction, where
values are assigned to data points (documents/genes/etc.) based on the
model.  In some cases (e.g. Naive Bayes, kNN), the "learning" is virtually
non-existent and simply involves transforming the training data into a form
that makes prediction easy/efficient.  In other cases (e.g. SVM, ordinal
regression, NN, non-naive Bayesian Network), learning involves non-trivial
optimization, often requiring much more memory & computation than that of
prediction, and there is value in being able to "save" a model for use
elsewhere.

The format is, of course, algorithm specific, so it's probably best to
consider writing PMML support on an algorithm-by-algorithm basis...

Jason

[jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580644#action_12580644 ]

Karl Wettin commented on MAHOUT-18:
-----------------------------------

Isabel Drost - 18/Mar/08 11:54 PM
{quote}
PMML describes the inputs to data mining models, the transformations used prior to prepare data for data mining, and the parameters which define the models themselves.

I wonder whether "the inputs" here means meta information to input data or the dataset itself.
{quote}

I think both. PMML seems to be an XML schema for feature attributes, data transformation, classifier parameter values, etc. It also defines a sparse/dense matrix for instance data. All in the same XML file.

{quote}
According to the FAQ is implemented by JSR-73 (see Mahout-8):
> > PMML is complementary to many other data mining standards. It's XML interchange formats is supported by several other standards, such as
> > XML for Analysis, JSR 73, and SQL/MM Part 6: Data Mining.

Karl, you had a look at the FAQ, can you confirm this?
{quote}

JSR 73 says: http://jcp.org/en/jsr/detail?id=73
{quote}
JDMAPI will be based on a highly-generalized, object-oriented, data mining conceptual model leveraging emerging data mining standards such OMG's CWM, SQL/MM for Data Mining, and DMG's PMML. The JDMAPI model will support four conceptual areas that are generally of key interest to users of data mining systems: settings, models, transformations, and results.
{quote}

I have very little idea what these meta-models really are. I also suppose they expect whoever implements JSR 73 to also implement whatever reads and writes all these formats, but I'm just guessing here.

To me all this is a bit of overkill, at least right now. But something is needed. I have seen others speak of similar things, and I sort of need it right now. When calculating the Jaccard index on the vector spaces of two text documents I store the values in a Mahout Vector along with a Map<Feature, Index>.  (I could just store them in a Map<Feature, Double>, but I thought it would be nice if others wanted to use the distance class.)

If one implements this map as a new class and fills it with documentation on what it represents in JSR 73, PMML, CWM and so on, then at least people who want to dig in will know where to start.
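
A minimal Java sketch of the arrangement described above, for readers who want
something concrete: a feature-to-index map plus plain vectors, with the Jaccard
index computed on the vectors. Everything here is hypothetical: `double[]`
stands in for a Mahout Vector, `Map<String, Integer>` for the
Map<Feature, Index>, and treating a feature as present when its value is
non-zero (set-based Jaccard) is an assumption, since the comment does not say
how the values are weighted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: feature-to-index map plus vectors, and a Jaccard
// index computed on the vectors. Not Mahout code.
class JaccardSketch {
    // Assign each distinct feature (here: a token) the next free index.
    static Map<String, Integer> buildIndex(String[]... docs) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String[] doc : docs) {
            for (String token : doc) {
                index.putIfAbsent(token, index.size());
            }
        }
        return index;
    }

    // Turn a document into a vector with 1.0 at the index of each feature.
    static double[] toVector(String[] doc, Map<String, Integer> index) {
        double[] v = new double[index.size()];
        for (String token : doc) {
            Integer i = index.get(token);
            if (i != null) {
                v[i] = 1.0;
            }
        }
        return v;
    }

    // Jaccard index: |A intersect B| / |A union B| over non-zero positions.
    static double jaccard(double[] a, double[] b) {
        int intersection = 0;
        int union = 0;
        for (int i = 0; i < a.length; i++) {
            boolean inA = a[i] != 0.0;
            boolean inB = b[i] != 0.0;
            if (inA && inB) intersection++;
            if (inA || inB) union++;
        }
        return union == 0 ? 0.0 : (double) intersection / union;
    }
}
```

Keeping the map separate from the vectors means the distance code only ever
sees numeric vectors, which is the reuse argument made in the comment; the map
is what an export format would need to describe.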



Re: Re: Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Thilo Goetz
In reply to this post by Jason Rennie-2
Jason Rennie wrote:
> Looks like the format already has formats for some popular models, including
> SVM, regression, NNs.
>
> Unclear to me how anyone could prevent us from using the standard unless it
> were patented.  

Exactly.  Usually in a truly open standard, the companies and
individuals that contribute waive any patent rights on the
standard.  Given what you can get patents on these days, there
might well be some protected IP lurking there.

It's just strange that an organization with members like this
does not provide very clear and up-front statements about their
IP/licensing policy (or none that I could find, anyway).

I'm probably overreacting.  All I'm trying to say is: before
you spend a lot of time on this, find out what the deal is.

--Thilo

 > Copyright only protects works of art, which would include

> specific PMML files, but not the format.  One thing I noticed is that open
> source projects are allowed to take part in the PMML process for free...
>
> My interpretation of PMML is that it represents a model.  As others have
> mentioned, prediction models (e.g. classification, regression; not
> clustering) basically have two parts: (1) learning, where the training data
> is used to train (optimize parameters for) the model, (2) prediction, where
> values are assigned to data points (documents/genes/etc.) based on the
> model.  In some cases (e.g. Naive Bayes, kNN), the "learning" is virtually
> non-existent and simply involves transforming the training data into a form
> that makes prediction easy/efficient.  In other cases (e.g. SVM, ordinal
> regression, NN, non-naive Bayesian Network), learning involves non-trivial
> optimization, often requiring much more memory & computation than that of
> prediction, and there is value in being able to "save" a model for use
> elsewhere.
>
> The format is, of course, algorithm specific, so it's probably best to
> consider writing a PMML on an algorithm-by-algorithm basis...
>
> Jason
>
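The split Jason describes (an expensive learning phase, a cheap prediction phase, and a saved model in between) can be sketched roughly as below. This is only an illustration under stated assumptions: JSON stands in for PMML, the model is a one-variable least-squares regression, and the function names `train`, `export_model`, `import_model` and `predict` are hypothetical, not part of Mahout or of any PMML library.

```python
import json


def train(points):
    """Learning phase: fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return {
        "type": "linear_regression",
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


def export_model(model):
    """Serialize the learned parameters (JSON here is a stand-in for PMML)."""
    return json.dumps(model, sort_keys=True)


def import_model(text):
    """Load parameters written by export_model, possibly in another program."""
    model = json.loads(text)
    if model.get("type") != "linear_regression":
        raise ValueError("unsupported model type: %r" % model.get("type"))
    return model


def predict(model, x):
    """Prediction phase: needs only the saved parameters, not the training data."""
    return model["slope"] * x + model["intercept"]


training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
saved = export_model(train(training_data))
restored = import_model(saved)
print(predict(restored, 3.0))
```

The saved text is algorithm specific, as noted above: a different model type would need its own fields and its own import check.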


[jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581661#action_12581661 ]

Isabel Drost commented on MAHOUT-18:
------------------------------------

> To me all this is a bit of overkill, at least right now. But something is needed. I have seen other speak of similar things and sort of need it right
> now.

I think it is overkill for algorithms mainly used for data exploration - e.g. clustering is used for exploring large amounts of data, grouping it into manageable pieces. Once we start working on algorithms that create models of the data that are later applied to new incoming data (stuff like classification, regression, ...) we will need some way to store the resulting model. If that model can later be imported into one of the standard tools - all the better.

Maybe it is possible to start out with supporting just a subset of the standard that is really relevant for us?



[jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778765#action_12778765 ]

Sean Owen commented on MAHOUT-18:
---------------------------------

Same, sounds like something to archive?



[jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778781#action_12778781 ]

Ted Dunning commented on MAHOUT-18:
-----------------------------------


This will be important someday.

At that time, we should open a new JIRA and implement it.  Right now, we are working on getting relevant capabilities.  Until we have them, interchange is fruitless.



Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Andrew Wang-5
Sorry to disturb you guys! I am a data mining programmer who has used WEKA
(http://www.cs.waikato.ac.nz/ml/weka/) for several years. This function may
be very useful for those of us who would like to import models trained in
WEKA into Mahout.
I look forward to it.

On Tue, Nov 17, 2009 at 5:02 PM, Ted Dunning (JIRA) <[hidden email]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778781#action_12778781]
>
> Ted Dunning commented on MAHOUT-18:
> -----------------------------------
>
>
> This will be important someday.
>
> At that time, we should open a new JIRA and implement it.  Right now, we
> are working on getting relevant capabilities.  Until we have them,
> interchange is fruitless.

Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Ted Dunning
Can you say more about what you really need?



On Tue, Nov 17, 2009 at 1:16 AM, Andrew Wang <[hidden email]> wrote:

> Sorry to disturb you guys! I am a data mining programmer who has used WEKA
> (http://www.cs.waikato.ac.nz/ml/weka/) for several years. This function may
> be very useful for those of us who would like to import models trained in
> WEKA into Mahout.
> I look forward to it.
>



--
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Andrew Wang-5
As you know, I am new to Mahout. Suppose I have a model trained in WEKA
using one of its classifiers. If Mahout had some way to import that model
and use it in subsequent processing, that would be very cool.
Sorry for the confusion!

On Tue, Nov 17, 2009 at 5:20 PM, Ted Dunning <[hidden email]> wrote:

> Can you say more about what you really need?

Re: [jira] Commented: (MAHOUT-18) Embrace interoperability with other softwares

Robin Anil
Weka has its own training and classification algorithms, and many of them.
In fact, Mahout currently has only the Naive Bayes implementation and
an improved version of it. What I am saying is that a model trained
using Naive Bayes, or any classifier where classification reduces to
multiplying the document vector by the label vectors in the model,
could (at least in theory) be imported from Weka into Mahout. I am not
sure that is true for an SVM model or for anything else that Weka has
but Mahout doesn't. But it's a nice suggestion. We could take a look at
the Weka data format and see.
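
To make the "document vector times label vectors" point concrete, here is a minimal sketch of multinomial Naive Bayes scoring over an imported model. Everything in it is an assumption made for illustration: the dictionary layout is hypothetical and is neither Weka's nor Mahout's actual model format, and the vocabulary, priors and term probabilities are made-up numbers.

```python
import math

# Hypothetical imported model: a shared vocabulary plus one weight vector per
# label. Each weight is log P(term | label); the numbers are invented.
model = {
    "vocabulary": ["ball", "goal", "vote", "party"],
    "labels": {
        "sports": {
            "log_prior": math.log(0.5),
            "log_weights": [math.log(p) for p in (0.4, 0.4, 0.1, 0.1)],
        },
        "politics": {
            "log_prior": math.log(0.5),
            "log_weights": [math.log(p) for p in (0.1, 0.1, 0.4, 0.4)],
        },
    },
}


def to_vector(term_counts, vocabulary):
    """Turn a {term: count} mapping into a dense count vector."""
    return [term_counts.get(term, 0) for term in vocabulary]


def classify(model, term_counts):
    """Score each label as log prior + (document vector . label vector)."""
    doc = to_vector(term_counts, model["vocabulary"])
    scores = {}
    for label, params in model["labels"].items():
        dot = sum(c * w for c, w in zip(doc, params["log_weights"]))
        scores[label] = params["log_prior"] + dot
    # The label with the highest score wins.
    return max(scores, key=scores.get)


print(classify(model, {"goal": 2, "ball": 1}))
```

Because scoring needs nothing beyond the vocabulary and the per-label vectors, any tool that can write those numbers out could in principle hand such a model to another tool. A kernel SVM would also need its support vectors and kernel parameters, which is part of why the same shortcut may not carry over.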


Robin


On Tue, Nov 17, 2009 at 3:14 PM, Andrew Wang <[hidden email]> wrote:

> As you know, I am new to Mahout. Suppose I have a model trained in WEKA
> using one of its classifiers. If Mahout had some way to import that model
> and use it in subsequent processing, that would be very cool.
> Sorry for the confusion!