[jira] Created: (NUTCH-139) Standard metadata property names in the ParseData metadata

classic Classic list List threaded Threaded
68 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
Standard metadata property names in the ParseData metadata
-----------------------------------------------------------

         Key: NUTCH-139
         URL: http://issues.apache.org/jira/browse/NUTCH-139
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev    
 Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
    Reporter: Chris A. Mattmann
 Assigned to: Chris A. Mattmann
     Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6


Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
and "conTeNT_TyPE" and all the permutations are really the same). What about
if I named it "Content     Type", or "ContentType"?

 I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.

 The properties would be defined at the top of the ParseData class, something like:

 public class ParseData{

   .....

    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";

   ....

}


In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.

I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360389 ]

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

According to Andrzej:

"I agree, too. Perhaps we should use the names as they appear in the  Dublin Core for those properties that are defined there - just prepended  them with "X-nutch-" in order to avoid name-clashes with other  properties (e.g. blindly copied from the protocol headers). "


I will follow this notation when I devleop the code.

Thanks,
  Chris


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6

>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Chris A. Mattmann updated NUTCH-139:
------------------------------------

    Priority: Minor  (was: Major)

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6

>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Chris A. Mattmann updated NUTCH-139:
------------------------------------

    Attachment: NUTCH-139.Mattmann.patch.txt

Okay Folks,

 Here's the patch file. Phew. Spent the whole day working on this today. Finally got all the unit tests passing :-) It was pretty difficult to figure out all the different places that the metadata properties were being used, but I think I nailed it. I'd be happy to hear from anybody that finds differently of course. So, yeah, please test this in your environments, and let me know what you guys think. I didn't test extensively, just made sure the unit tests passed.

  The biggest changes were to the protocol and parsing layer, but overall I also had to update the index-more plugin. This patch also includes several removals of unused imports in Nutch classes (good ol' Eclipse, I love it how it tells you that ;) ). Also, I fixed a bug that I just reported in the parse-rtf plugin where the ParseImpl was not being constructed correctly (or maybe I had an old version?). Anyways, check it out and let me know what you think.

Thanks Guys!

Cheers,
  Chris


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360645 ]

Doug Cutting commented on NUTCH-139:
------------------------------------

I'm confused as to why all of the constant names have "X_nutch" in them.  I'd expect to see something like that in their string values, but their names are already qualified by org.apache.nutch.ParseData, no?  Also, it would be easier if these were all defined in an interface, something like MetadataNames.  That way a class can "implement" that interface and then simply use the short names in code, e.g. CONTENT_TYPE, AUTHOR, etc.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360659 ]

Jerome Charron commented on NUTCH-139:
--------------------------------------

+1 with Doug comments:

* Remove X_nutch to constants names
* Add "X-nutch-" prefix to constants values
* Move constants definitions to a MetadataNames interface.

I take it.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360681 ]

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Doug, Jerome,

>I'm confused as to why all of the constant names have "X_nutch" in them. I'd expect to see something like that in their string values, but their
> names are already qualified by org.apache.nutch.ParseData, no?

Err, whoops. My fault. I misinterpreted what Andrzej was saying his comment:

"I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties that are defined there - just prepended them with "X-nutch-" in order to avoid name-clashes with other properties (e.g. blindly copied from the protocol headers). "

I'll fix this right quick.

>Also, it would be easier if these were all defined in an interface, something like MetadataNames. That way a class can "implement" that interface
>and then simply use the short names in code, e.g. CONTENT_TYPE, AUTHOR, etc.

Yuppers, I agree on this one too. In fact, while I was making the patch, I was thinking in my head ("hey this would probably be a good idea to have in its own interface class..."), but since no one objected to my initial proposition to the dev list to put in into ParseData, I just put them there. So, yeah I'll fix this right quick as well.

Updated patch...on its way!


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Jerome Charron updated NUTCH-139:
---------------------------------

    Attachment: NUTCH-139.jc.review.patch.txt

Here is a new patch from Chris. I reviewed it, tested it.
From my point of view, all seems to be ok.
So if no objections, I will commit it during the day.

Regards
Jérôme

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360901 ]

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

I have an objection, in fact I think the patches miss the main point of using of prefixed property names.

In this patch only some of the property names, specifically those corresponding to the Dublin Core, are prefixed with PREFIX. Why? The original reason for introducing the prefix was this: as Nutch processes the raw data, it extracts certain metadata, either directly or using heuristics (like with LANG or content type). In order to distinguish these values from the original raw values, the metadata processed by Nutch was to be prefixed by "X-nutch-", and all other metadata that we don't use was to be left alone as it was.

So, e.g. the Content-Type metadata is sometimes wrong. Nutch checks this with e.g. the mime-type detection plugin, and it should put the final value of Content-Type in metadata - but under the name of "X-nutch-Content-Type", in order to avoid overwriting the original value (Chris's comment in MSWordParser.java reflects this doubt - that's the reason for prefixing).

Now, this convention is not followed in the patches. E.g. LANG is missing (should be PREFIX + "lang"). CharEncodingForConversion doesn't have a prefix either. Properties extracted in plugins (e.g. msword, zip, file, etc) are put under the standard, non-prefixed names, thus overwriting the original values.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360902 ]

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

Thanks for taking time to take a look at the patch.
In fact, we have some discussion with Chris about this point
(that's why I don't commit the patch directly, I already have some doubts about this).

I will check right now how to handle things in this way.


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360906 ]

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

Here are more comments about my doubts, and how to handle metadata names.

if for instance a protocol plugin doesn't have any Content-Length information (no header like in FTP), then is should compute the content-length and add it in X-nutch-content-length attribute.
But what do you suggest if a protocol have a Content-Length header (HTTP may provide one)?
My feeling is adding the two metadata:
1. One for the Content-Length header in the Content-Length attribute
2. One for the real Content-Length (computed) in the X-nutch-content-length attribute.

In other words and more generally:
* When adding a native protocol header, if an equivalent x-nutch attribute exists in MetadataNames, then it must be added too with the same value, or with a more precise value.
* If no header information is available, tries to fill the more x-nutch attribute the protocol level can.

Do you agree with that?



> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360909 ]

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Yes, that was again the reason for prefixing - we want to keep as much of the original metadata as we can, to facilitate various processing in plugins that know how to handle unreliable data. But at the same we need a place to keep the metadata that we believe in.

So, first of all the protocol plugins collect any metadata "as is" from the wire, and put the original values in properties. Then, those plugins that know what they are doing should put the reliable metadata under the "X-nutch-" namespace. If some metadata is required (e.g. content type), then the code that is responsible for handling this metadata should ensure that we have a reliable value under the "X-nutch-" namespace, even if the original value is missing.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360920 ]

Jerome Charron commented on NUTCH-139:
--------------------------------------

And why not using the fact that the ContentProperties object can now handles multi-valued properties.
Each piece of code that wants to add some more reliable content about a property simply add its own value to the property => the first value is the raw one (for instance from the protocol level) and the more you iterate over values of the property, the more youo have a reliable value (the last one should be the more reliable, and is generally the interesting one, or for other reasons, the original value may be needed, then it is simply the first value).

Yes, you loose one information with this solution: You cannot ensure that the first value of a multi-valued property is the one from the protocol level.
But it avoid searching the same kind of information (the content-type for instance) using many properties names ("Content-Type" for the protocol level and "X-Nutch-Content-Type" for other levels).

We can extend the Multi-Valued Properties by adding a provider attribute while adding a property:
public void addProperty(key, value, provider).
The provider can be one of PROTOCOL, CONTENT, OTHER, .... for instance (to be defined)



> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360929 ]

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Andrzej,

> I have an objection, in fact I think the patches miss the main point of using of prefixed property names.

D'oh!

> In this patch only some of the property names, specifically those corresponding to the Dublin Core, are prefixed with PREFIX. Why?

Well the reason behind this was kind of like this. I wanted the metadata property names to be reusable, across the protocol level code, the parser code, pretty much anywhere that you used what I would call  "common" metadata properties in Nutch. Now, at the protocol level especially, there were bits and pieces of code like, "readHeaders("Content-type"), or String someValue = getHeader("Content-length"), blah blah blah", where the code was physically reading properties that were already written to an object, and that nutch has no control over. In these cases, in order to make all the calls synonomous, e.g., a call to readHeaders("Content-type") gets replaced by readHeaders(CONTENT_TYPE), I couldn't use the "_X_nutch" prefix on the names, because I didn't write the value into those objects originally.

On the other hand, anywhere that I was able to physically add metadata properties that were under our control, at the protocol level, or parsing level, etc., in particular, all of the DC properties, we had control as to how they were getting added into the properties object that was being passed around: both input control, and control over where it was being read, so we could use the X_nutch prefix.

So, in my mind I saw two distinct types of standard metadata properties: those which we can control both the input and output data flow from, and those which we really can only control the output  data flow from.

> The original reason for introducing the prefix was this: as Nutch processes the raw data, it extracts certain metadata, either directly or > using heuristics (like with LANG or content type). In order to distinguish these values from the original raw values, the metadata
> processed by Nutch was to be prefixed by "X-nutch-", and all other metadata that we don't use was to be left alone as it was.

This was followed to the T, except for the case above, which I mention and which you pointed out. For example, what would have happened if I put CONTENT_TYPE="X_nutch_content_type", and then I had a call in getHeaders(CONTENT_TYPE) in the protocol level? Well, since we don't ever put CONTENT_TYPE into the headers properties object, that would really never help us, and then everywhere we read CONTENT_TYPE, the value would have nothing.

> So, e.g. the Content-Type metadata is sometimes wrong. Nutch checks this with e.g. the mime-type detection plugin, and it should
> put the final value of Content-Type in metadata - but under the name of "X-nutch-Content-Type", in order to avoid overwriting the
> original value (Chris's comment in MSWordParser.java reflects this doubt - that's the reason for prefixing).

Yup, exactly. Good job catching that comment!

> Now, this convention is not followed in the patches. E.g. LANG is missing (should be PREFIX + "lang").

Not sure I follow this one. In the patch, there's a line:

 public static final String LANGUAGE = NUTCH_PREFIX + "language";

?



> CharEncodingForConversion
> doesn't have a prefix either. Properties extracted in plugins (e.g. msword, zip, file, etc) are put under the standard, non-prefixed
> names, thus overwriting the original values.

This isn't really true at all. I didn't overwrite any of the original values. In fact, no values are really overwritten at all. There are only two cases really:

1. Places where I standardized on how the names are read: you see these at the bottom of MetadataNames.java. These are properties that we don't really have control over how they got written into properties object, or properties that I at least couldn't figure out how they got placed into the properties objects at their particular layers. In this case, I've omitted the NUTCH_PREFIX in order to make reading/(post-writing) of the properties work fine.

2. Places where I standardized on how the names are read/written. These are at the top of MetadataNames.java. I could find the entire data flow in and out of the properties objects at the respective layers for all of these properties, and what's why they have the X-nutch Prefix.  Make sense?





> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360931 ]

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hmm,

 Okay, I just finished reading the rest of the comments :-) Sorry, just woke up out here in Los  Angeles. Okay, I think I understand what you guys are getting at here. X-nutch should be the "reliable" metadata that we create, i.e., control the input and output dataflow from in Nutch right now. The other names, such as "Content-type", "Content-length" that are written at the protocol-layer, and that Nutch doesn't control how they are written into the properties objects, those names should be left alone then, no? Is that the jist of it. So, you guys propose we would have something like:

//some protocol layer plugin
...
String contentType = getHeader("Content-type");

//CONTENT_TYPE = "X-nutch-content-type";
propertiesObject.put(CONTENT_TYPE, contentType);

?

If this is the case, then I would still point out that there are still metadata names like "Content-type", that would be good to standardize on at the protocol level (shameless self plug of what I already did ;) ) on how they are read. You could call these other metadata names, since they aren't prefixed with X-nutch, like some other class, or something, but I think it's still important at the protocol level to just standarize the code if nothing less there. So the above example would become:

//some protocol layer plugin
...
String contentType = getHeader(PROTO_LYR_READ_ONLY_CONTENT_TYPE); //you're such a non-standard property, aren't you

//CONTENT_TYPE = "X-nutch-content-type";
propertiesObject.put(CONTENT_TYPE, contentType);


So, I think it's a good compromise to not only standardize on what we write/read, but also what we read only. Of course, I'm open to comments on this. :-)


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360933 ]

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

I like Jerome's proposal of using the new ContentProperties class; this could save a lot of work, especially this naming mess - and it seems to satisfy all requirements (having standard metadata names, preserving the original metadata, enabling Nutch to "shadow" the original values with the authoritative values).

A small comment, though: plugins that detect malformed or misnamed values should copy them to standard names. E.g. if a header name should be "Content-Type", and the real header name is "ContentType", then the plugin should copy this value to the proper name.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361041 ]

Jerome Charron commented on NUTCH-139:
--------------------------------------

Ok, Chris and me will implement MetadataNames in this way.
Just some few comments:

I plan to move the MetadataNames to a class rather than an interface. Two reasons:

1.1 I don't like the design of implementing an interface in order to import some constants in a class: It gives some javadoc with a lot of class with many public constants defined without any really needs to show these constants in the javadoc.

1.2 I want to add an utility method in MetadataNames that tries to find the appropriate Nutch normalized metadata name from a string. It will be based on the Levenshtein Distance (available in commons-lang). More about Levenshtein Distance at http://www.merriampark.com/ld.htm




> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361043 ]

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Regarding the move to a class with public static fields: I don't have any problem with that.

Regarding the Levenshtein distance: I think we can do even better, before we resort to such generic methods:

1) bring all property names to lowercase
2) remove any non-letters

Example: "Content-type" vs. "ContentType":

1) "content-type" vs. "contenttype".
2) "contenttype" vs. "contenttype" -> match

These two steps could be simply implemented in a custom Comparator for the ContentProperties.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361045 ]

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

Do you read in my mind?
Yes of course, that's the way I want to do it: First checks for the most common cases (lower cases + keeps only letters), then use the Levenshtein distance if needed (last chance).
Regards

Jérôme

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Jerome Charron updated NUTCH-139:
---------------------------------

    Attachment: NUTCH-139.060105.patch

Attached (NUTCH-139.060105.patch) is a new patch for this issue.
Thanks for reviewing it so that I can commit it as soon as possible.

Regards

Jérôme

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

1234