[jira] [Commented] (NUTCH-2746) Basic URL normalizer to normalize Unicode domain names


Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951956#comment-16951956 ]

Sebastian Nagel commented on NUTCH-2746:
----------------------------------------

A PR is open. URLUtil does provide methods for this, but none of the URL normalizers use them. The patch tries to keep the overhead minimal and does the IDN conversion only if necessary. BasicURLNormalizer already operates on the parts of the URL (host, path, query), which makes additional parsing or splitting of URLs unnecessary.
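
To illustrate the "only if necessary" idea, here is a minimal sketch, not the code from the PR. It assumes the host has already been split off from the URL and uses the JDK's java.net.IDN; the class and method names (IdnHostSketch, hasNonAscii, normalizeHost) are hypothetical, as is the choice to leave a host unchanged when conversion fails.

```java
import java.net.IDN;

public class IdnHostSketch {

  /** Returns true if the host contains at least one non-ASCII character. */
  static boolean hasNonAscii(String host) {
    for (int i = 0; i < host.length(); i++) {
      if (host.charAt(i) > 127) {
        return true;
      }
    }
    return false;
  }

  /**
   * Converts a Unicode host name to its ASCII (Punycode) form.
   * Pure-ASCII hosts take the fast path and are returned unchanged,
   * so the IDN conversion is skipped for the vast majority of URLs.
   */
  static String normalizeHost(String host) {
    if (!hasNonAscii(host)) {
      return host;
    }
    try {
      return IDN.toASCII(host);
    } catch (IllegalArgumentException e) {
      // Assumption for this sketch: keep the host as-is if it cannot be converted.
      return host;
    }
  }

  public static void main(String[] args) {
    System.out.println(normalizeHost("www.example.org"));
    // "b\u00fccher" is "bücher" written with a Unicode escape
    System.out.println(normalizeHost("b\u00fccher.example"));
  }
}
```

Because the host is already available as a separate part, no second pass over the full URL string is needed.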

> Basic URL normalizer to normalize Unicode domain names
> ------------------------------------------------------
>
>                 Key: NUTCH-2746
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2746
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> The BasicURLNormalizer (plugin urlnormalizer-basic) currently cannot normalize IDNs (Unicode host/domain names).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)