[jira] Closed: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-349?page=all ]

Sami Siren closed NUTCH-349.

    Resolution: Duplicate

I guess this was already done in NUTCH-383.

> Port Nutch to use Hadoop Text instead of UTF8
> ---------------------------------------------
>                 Key: NUTCH-349
>                 URL: http://issues.apache.org/jira/browse/NUTCH-349
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki
> Currently Nutch uses the org.apache.hadoop.io.UTF8 class to store and read Strings. This class was deprecated in Hadoop 0.5.0, and the Text class should be used instead. Sooner or later we will need to move Nutch to use this class instead of UTF8.
> This raises numerous issues regarding the compatibility of existing data in CrawlDB, LinkDB and segments. I can see two ways to solve this:
> * add code in readers of respective formats to convert UTF8->Text on the fly. New writers would only use Text. This is less than ideal, because it complicates the code, and also at some point in time the UTF8 class will be removed.
> * create a converter (to be maintained as long as UTF8 exists), which converts existing data in bulk from UTF8 to Text. This requires an additional processing step when upgrading to convert all existing data to the new format.
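
The two options above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Nutch or Hadoop implementation: the class `StringFormatMigration`, the version-byte record layout and all method names are hypothetical. JDK `writeUTF`/`readUTF` stands in for the legacy UTF8 encoding, and an int-length-prefixed standard UTF-8 string stands in for Text. The real Hadoop classes have their own wire formats.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a record format that is migrating string encodings.
final class StringFormatMigration {
    static final byte LEGACY = 1;  // assumption: marks old (UTF8-style) records
    static final byte CURRENT = 2; // assumption: marks new (Text-style) records

    // Old writer: version byte, then 2-byte length + modified UTF-8 (JDK writeUTF).
    static byte[] writeLegacy(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(LEGACY);
        out.writeUTF(s);
        out.flush();
        return buf.toByteArray();
    }

    // New writer: version byte, then 4-byte length + standard UTF-8 bytes.
    static byte[] writeCurrent(String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(CURRENT);
        out.writeInt(bytes.length);
        out.write(bytes);
        out.flush();
        return buf.toByteArray();
    }

    // Option 1: the reader accepts both encodings and converts on the fly.
    static String read(DataInputStream in) throws IOException {
        byte version = in.readByte();
        switch (version) {
            case LEGACY:
                return in.readUTF();
            case CURRENT:
                byte[] bytes = new byte[in.readInt()];
                in.readFully(bytes);
                return new String(bytes, StandardCharsets.UTF_8);
            default:
                throw new IOException("Unknown record version: " + version);
        }
    }

    // Option 2: a one-off bulk converter that rewrites every record in the new
    // encoding, so that regular readers only ever need to handle CURRENT.
    static byte[] convert(byte[] records) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(records));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (in.available() > 0) {
            out.write(writeCurrent(read(in)));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] oldData = writeLegacy("http://example.com/");
        byte[] newData = convert(oldData);
        System.out.println(read(new DataInputStream(new ByteArrayInputStream(newData))));
    }
}
```

In this sketch the bulk converter reuses the dual-format reader, which matches the trade-off described in the issue: option 1 keeps the legacy branch in every reader indefinitely, while option 2 confines it to an upgrade step that can be dropped once the old class is removed.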

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira