how to get the parsetext to be UTF-8 ?

beansproud
Hi,
I'm crawling some Chinese pages, and when I dump the parsetext, the characters display incorrectly as '?'.
Can anybody tell me how to make the output UTF-8?

thanks!
Re: how to get the parsetext to be UTF-8 ?

brainstorm-2-2
The parse text extracted from the Nutch command line is UTF-8 by default
(it works for me with Russian characters, for instance). Perhaps you are
referring to the text seen through Tomcat; in that case, you can fix it:

http://wiki.apache.org/nutch/GettingNutchRunningWithUtf8
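
As a rough illustration of where a '?' can come from in Java output in general (this thread does not confirm that it is the cause here): when text is encoded with a charset that cannot represent a character, Java substitutes '?'. The sketch below is not Nutch code; the class name `EncodingDemo` and the helper `roundTrip` are hypothetical names used only for this example.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;

public class EncodingDemo {
    // Encode the text to bytes in the named charset, then decode it back.
    // Characters the charset cannot represent are replaced while encoding.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) throws Exception {
        // Two Chinese characters, written as escapes so that the encoding
        // of this source file does not affect the demonstration.
        String sample = "\u4e2d\u6587";

        // The charset the JVM uses when none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // ISO-8859-1 has no Chinese characters, so each one becomes '?'.
        System.out.println("ISO-8859-1 lossy: "
                + roundTrip(sample, "ISO-8859-1").equals("??"));

        // UTF-8 can represent the text, so it survives unchanged.
        System.out.println("UTF-8 intact: "
                + roundTrip(sample, "UTF-8").equals(sample));

        // Writing through a stream with an explicit UTF-8 encoding avoids
        // depending on the platform default (the terminal must also be
        // set to UTF-8 for the characters to display correctly).
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(sample);
    }
}
```

If the default charset printed in the first line is not UTF-8, starting the JVM with `-Dfile.encoding=UTF-8` may be worth trying; whether that applies to a particular Nutch setup is an assumption to verify.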

Regards,
Roman

On Fri, Jul 11, 2008 at 3:37 PM, beansproud <[hidden email]> wrote:

>
> Hi,
> I'm crawling some Chinese pages, and when I dump the parsetext, the
> characters display incorrectly as '?'.
> Can anybody tell me how to make the output UTF-8?
>
> thanks!
> --
> View this message in context: http://www.nabble.com/how-to-get-the-parsetext-to-be-UTF-8---tp18404034p18404034.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
Re: how to get the parsetext to be UTF-8 ?

brainstorm-2-2
If the last URL did not fix the problem, you can contribute to a
similar (perhaps the same?) issue on JIRA:

http://issues.apache.org/jira/browse/NUTCH-540

On Sun, Jul 13, 2008 at 8:35 PM, brainstorm <[hidden email]> wrote:

> The parse text extracted from the Nutch command line is UTF-8 by default
> (it works for me with Russian characters, for instance). Perhaps you are
> referring to the text seen through Tomcat; in that case, you can fix it:
>
> http://wiki.apache.org/nutch/GettingNutchRunningWithUtf8
>
> Regards,
> Roman