Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue

Rushikesh K
Hello everyone,
I am having an issue while crawling a Spanish website: some of the accented
characters are not being converted properly.
Here is an example: Infección (wrong one) should be Infección (correct).

Note: This is with the *Bayan Group Extractor plugin*. Is there any change that
I need to make so that the characters are converted correctly?

--
Regards
Rushikesh M
.Net Developer

RE: Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue

Yossi Tamari
Hi Rushikesh,

I don't have any experience with this specific plugin, but I have run into similar problems, which had two possible causes:
1. It is possible that this specific site does not properly declare what encoding it is using, and the browser guesses the correct one.
2. You may have run across https://issues.apache.org/jira/browse/NUTCH-1807. I solved a similar problem by setting the environment variable LC_ALL to en_US.UTF-8 for all Hadoop processes (more specifically, adding `export LC_ALL=en_US.UTF-8` in ~hadoop/.bashrc on all Hadoop machines solved the problem for me).
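
To make the two causes above a bit more concrete, here is a minimal Python sketch (not Nutch code, and not something from the plugin). It shows what typically happens when UTF-8 bytes are decoded with the wrong charset, and a rough way to check whether a page declares its encoding at all. The function names `mojibake` and `declared_charset` are made up for illustration, and ISO-8859-1 as the wrongly assumed charset is only an assumption; the charset actually used in a given setup may differ.

```python
import re


def mojibake(text, actual="utf-8", assumed="iso-8859-1"):
    """Encode text with its real charset, then decode the bytes with a wrong one."""
    return text.encode(actual).decode(assumed)


def declared_charset(content_type, body):
    """Return the charset named in the Content-Type header or in a <meta> tag.

    Returns None when neither declares one, which is the situation where
    a crawler has to guess. This is a rough regex check, not a full HTML parser.
    """
    match = re.search(r"charset=([\w-]+)", content_type or "", re.I)
    if match:
        return match.group(1).lower()
    # Only inspect the start of the document, where <meta> tags normally live.
    head = body[:1024].decode("ascii", errors="ignore")
    match = re.search(r"<meta[^>]*charset=[\"']?([\w-]+)", head, re.I)
    return match.group(1).lower() if match else None


# UTF-8 bytes read as ISO-8859-1 give the typical garbled form.
print(mojibake("Infección"))

# A page with no charset in the header and no <meta> declaration.
print(declared_charset("text/html", b"<html><head><title>x</title></head>"))
```

If the garbled text in the crawl output looks like the first printed value, the bytes were most likely valid UTF-8 that got decoded with a single-byte default charset somewhere along the way, which would be consistent with the locale fix described in point 2.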

        Yossi.

> -----Original Message-----
> From: Rushi [mailto:[hidden email]]
> Sent: 25 January 2018 16:32
> To: [hidden email]; Mark Vega <[hidden email]>
> Subject: Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue