[jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Kenneth William Krugler (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss reassigned MAHOUT-12:
---------------------------------

    Assignee: Dawid Weiss

> Point formatting and parsing improved (StringBuilder, no need for trailing comma).
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-12
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-12
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Trivial
>         Attachments: mah-12.patch
>
>
> Added test case to point class, improved parsing (no need to recompile the pattern all over again) and concatenation of points (stringbuilder used internally).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
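
The issue description above mentions two changes: the parsing pattern is no longer recompiled on every call, and points are concatenated with a StringBuilder. A minimal sketch of that idea follows. The class name PointFormatSketch, the bracketed text layout, and the tolerance for the older extra commas are assumptions made for illustration; this is not the contents of mah-12.patch or the actual Mahout Point class.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the Mahout Point class.
final class PointFormatSketch {
    // Compiled once and reused, rather than compiled again on every parse call.
    private static final Pattern SEPARATOR = Pattern.compile("\\s*,\\s*");

    private PointFormatSketch() {}

    // Formats {1.0, 2.5} as "[1.0, 2.5]", with no leading or trailing comma.
    static String format(double[] point) {
        StringBuilder out = new StringBuilder("[");
        for (int i = 0; i < point.length; i++) {
            if (i > 0) {
                out.append(", ");
            }
            out.append(point[i]);
        }
        return out.append(']').toString();
    }

    // Parses "[1.0, 2.5]"; empty tokens are skipped, so the older
    // "[, 1.0, 2.5, ]" layout (assumed here) is accepted as well.
    static double[] parse(String text) {
        String body = text.trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("Expected [v1, v2, ...]: " + text);
        }
        body = body.substring(1, body.length() - 1).trim();
        String[] tokens = SEPARATOR.split(body);
        double[] values = new double[tokens.length];
        int count = 0;
        for (String token : tokens) {
            if (!token.isEmpty()) {
                values[count++] = Double.parseDouble(token);
            }
        }
        return Arrays.copyOf(values, count);
    }
}
```

Keeping the compiled Pattern in a static field means the regular expression is compiled once per class load, which matters when parse runs once per input record.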


RE: Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Jeff Eastman-2
The main reason I put in the trailing comma (and also the leading comma
after [) is so that it is easy to slurp the resulting data into Excel
spreadsheets. Without the extra delimiters, the [] characters mix with
the data values and manual editing is required.
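
To illustrate the point about delimiters, here is a small sketch that splits both layouts on commas, roughly as a comma-delimited import would. The helper name CsvCellsSketch and the exact text layouts are assumptions for illustration; how a particular spreadsheet treats the cells is not modelled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only.
final class CsvCellsSketch {
    private CsvCellsSketch() {}

    // Splits a line on commas and trims each piece, keeping empty cells.
    static List<String> cells(String line) {
        return Arrays.stream(line.split(",", -1))
                .map(String::trim)
                .collect(Collectors.toList());
    }
}
```

With the extra commas, cells("[, 1.0, 2.0, ]") yields four cells: "[", "1.0", "2.0" and "]", so the brackets sit in their own columns. Without them, cells("[1.0, 2.0]") yields "[1.0" and "2.0]", where the brackets are fused with the first and last values.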

That said, the whole issue of formatting for Point (to be replaced with
Vector soon) and Matrix is a minimalist hack and begs for more
consideration. I do think the Excel use case is something that ought to
be addressed as we move forward.

Jeff

-----Original Message-----
From: Dawid Weiss (JIRA) [mailto:[hidden email]]
Sent: Thursday, March 06, 2008 5:01 AM
To: [hidden email]
Subject: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).



Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Dawid Weiss

The Excel scenario doesn't really convince me, Jeff. For one thing, I don't have
Excel, though that is a minor issue. For another, I don't think anyone will
actually import data that's supposed to be very large (that's why we do it in
Hadoop, isn't it?) into a spreadsheet.

In fact, I did have more thoughts about keeping the data as strings in general.
I would much prefer to have records (Hadoop records) or their subclasses
instead. They offer good flexibility, and you could pass in your own subrecords
if you wished to attach some payload to the data points. But I decided it was
too late to pursue this.

D.


RE: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Jeff Eastman-2
Ted noted an easy fix to my Excel use case that I wasn't aware of, so my
point is agreeably moot.

I concur that we ought to have additional Writable representations to
make intra-Hadoop transfers more streamlined. This is certainly *not*
too late to pursue. I would encourage you to propose a record for Point
(which is in trunk) and these could be added to Vector and Matrix later
(once we get past the diff-diffing stage).
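
As a rough illustration of what a binary record for Point might look like, the sketch below mirrors the write(DataOutput) and readFields(DataInput) method pair of Hadoop's Writable interface. To stay runnable without Hadoop on the classpath it does not implement that interface, and the class name PointRecordSketch and the field layout (a length followed by the doubles) are assumptions, not a proposed Mahout format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record for illustration only; not a Mahout or Hadoop class.
final class PointRecordSketch {
    double[] values = new double[0];

    // Writes the dimension count followed by each coordinate.
    void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double value : values) {
            out.writeDouble(value);
        }
    }

    // Reads the fields back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readDouble();
        }
    }

    // Round-trips a point through a byte array, standing in for a transfer.
    static double[] roundTrip(double[] point) {
        try {
            PointRecordSketch source = new PointRecordSketch();
            source.values = point;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));

            PointRecordSketch copy = new PointRecordSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy.values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Compared with the string form, a record like this avoids formatting and re-parsing every coordinate each time a point moves between tasks.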

Jeff

-----Original Message-----
From: Dawid Weiss [mailto:[hidden email]]
Sent: Friday, March 07, 2008 12:27 AM
To: [hidden email]
Subject: Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).



Re: [jira] Assigned: (MAHOUT-12) Point formatting and parsing improved (StringBuilder, no need for trailing comma).

Dawid Weiss

> I concur that we ought to have additional Writable representations to
> make intra-Hadoop transfers more streamlined. This is certainly *not*
> too late to pursue. I would encourage you to propose a record for Point
> (which is in trunk) and these could be added to Vector and Matrix later
> (once we get past the diff-diffing stage).

I am going to go through the Mahout issues tomorrow. Let's get the minor things
out of the way first, then we will focus on further refactorings. I think it makes
a lot of sense to have custom Writables (and a way to pass a user subclass to our
jobs, so that these can be further inherited). The way I imagine this could look
is something like this (pseudo code of course):

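KMeansJob job = new KMeansJob();
// Any class that extends MahoutDataType (UserKey is a hypothetical user subclass).
job.setInputKeyClass(UserKey.class);
// Any class that extends OtherMahoutDataType (UserValue is likewise hypothetical).
job.setInputValueClass(UserValue.class);
...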

Note that this way the user can pass arbitrary records that subclass
Mahout-defined classes and the job can still freely manipulate them. I think
this would be pretty neat.
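
A minimal, self-contained sketch of this idea in plain Java is shown below. Every type in it (MahoutDataType as written here, LabeledPoint, KMeansJobSketch) is a hypothetical stand-in rather than an existing Mahout or Hadoop class, and only the value-class setter is shown; a key-class setter would follow the same pattern.

```java
// All types below are hypothetical stand-ins for illustration only.
abstract class MahoutDataType {
    abstract double[] coordinates();
}

// A user-defined record that carries an extra payload next to the coordinates.
class LabeledPoint extends MahoutDataType {
    String label = "unlabeled";
    double[] values = {0.0, 0.0};

    @Override
    double[] coordinates() {
        return values;
    }
}

final class KMeansJobSketch {
    private Class<? extends MahoutDataType> inputValueClass;

    // The bounded wildcard accepts any subclass of the framework-defined type.
    void setInputValueClass(Class<? extends MahoutDataType> type) {
        this.inputValueClass = type;
    }

    // The job instantiates the user's subclass reflectively, yet only needs
    // the base type to manipulate it.
    MahoutDataType newInputValue() {
        try {
            return inputValueClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + inputValueClass, e);
        }
    }
}
```

A caller would write job.setInputValueClass(LabeledPoint.class). The job then works with coordinates() through the base type while the label payload travels along untouched.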

D.