Parsed content in form of special characters

David Philip
Hi,

  For some specific URLs, the fetched content comes out as special
characters. Is this a character encoding issue? Does anything need to be
configured at the Nutch parsing level?


*url:*
http://service.sony.com.cn/vaio/Announcments/33412.htm

*content extracted is something like this:*
 SONY China
Service-关于建议使用正宗索尼电æº�适é…�器的声明   &nbsp
首页   新闻与公告   产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“� 其他产å“�
æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
按照产å“�åž‹å�·æ�œç´¢ 关键字     选择产å“�系列 / åž‹å�·
选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
其他产å“..................

*title: *
SONY China Service-关于建议使用正宗索尼电�适�器的声明


Thanks - David

Re: Parsed content in form of special characters

kiran chitturi
Hi David,

Which version of Nutch are you using? If 2.x, which backend are you using?



Re: Parsed content in form of special characters

Tejas Patil
In reply to this post by David Philip
I don't think so. The tool you are using to view this must support the
desired languages. I had the same problem while looking at pages with
Chinese content over PuTTY. Installing language packs and tweaking the
PuTTY settings made it go away. I don't recall the exact steps or details,
as I did that about a year ago.
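
One way to tell a viewer problem from mis-decoded data is to check whether the stored text can be reversed into sensible characters. This is a rough sketch, not anything from Nutch: `undo_mojibake` is a hypothetical helper, and windows-1252 (`cp1252`) as the wrong charset is an assumption. A viewer without font support usually shows boxes or question marks, while text that reverses cleanly like this was probably decoded with the wrong charset before it was stored.

```python
from typing import Optional


def undo_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> Optional[str]:
    """Return repaired text if `text` looks like `right`-encoded bytes that
    were decoded as `wrong`; return None if it does not look that way."""
    try:
        repaired = text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # not reversible, so probably not this kind of garbling
    return repaired if repaired != text else None


stored = "\u00e4\u00ba\u00a7"        # the three characters "产" seen in the dump
print(undo_mojibake(stored))          # a single CJK character if the data is mis-decoded
print(undo_mojibake("SONY China"))    # plain ASCII is unaffected, so None
```

Sequences in the dump that already contain the replacement character (�) have lost information and cannot be reversed this way.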



Re: Parsed content in form of special characters

David Philip
In reply to this post by kiran chitturi
Hi Kiran,

  I am using Nutch 1.6, with Solr 3.6 for indexing and search.

Thanks -David




Re: Parsed content in form of special characters

David Philip
In reply to this post by Tejas Patil
Hi Tejas,

   I used the readseg command:

bin/nutch readseg -dump test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test -nogenerate -noparse -nofetch -noparsedata

It generated the dump file. I then used less/cat:

/Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt

and viewed the content as a text file (in gedit).


Below is a brief excerpt of that text file (test459.txt):

Recno:: 0
URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm

ParseText::
 SONY China
Service-关于建议使用正宗索尼电æº�适é…�器的声明   &nbsp
首页   新闻与公告   产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
�影�产� 家庭影�产� 家庭音�产� 其他产�
æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
按照产å“�åž‹å�·æ�œç´¢ 关键字     选择产å“�系列 / åž‹å�·
选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
其他产� 选择产��类别 选择产�系列
/..........................
(The output is quite large, so I didn't paste everything.)


Content::
Version: -1
url: http://service.sony.com.cn/vaio/Announcments/33412.htm
base: http://service.sony.com.cn/vaio/Announcments/33412.htm
contentType: application/xhtml+xml
metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
Content-Type=text/html Connection=close
Content:


Thanks - David


Re: Parsed content in form of special characters

David Philip
I am attaching the extracted text file. I am not sure whether you can receive and view it.

My observation:
When I compared the extracted text with the page source of the URL (using View Source), almost everything looks the same except the data in the ParseText:: section of the extracted text.


Thanks -David



Attachment: test155.txt (95K)

Re: Parsed content in form of special characters

David Philip
Hi,

  I also crawled this URL,
http://service.sony.com.cn/vaio/Announcments/45962.htm, and it has the
same issue.

The extracted title looks like this: SONY China
Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�通çŸ

It was supposed to look like this:
<title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>

Only specific URLs like the one above have this special-characters problem.
For the rest, the extracted characters are correct. For example,
http://service.sony.com.cn/9380.htm parses properly.


Thanks David.



Re: Parsed content in form of special characters

Rajinimaski
Hi David,

     Try setting the property parser.character.encoding.default to utf-8
in nutch-site.xml. If you have already done this, make sure that you have
added URIEncoding="UTF-8" to the Tomcat connector before running the
bin/nutch solrindex command to index into Solr.

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>

Tomcat:
<Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8445" URIEncoding="UTF-8" />
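
To double-check that the override is actually present in nutch-site.xml, the value can be read back with a small script. This is a minimal sketch and not part of Nutch: `read_property` is a hypothetical helper, and the inline `sample` stands in for the real file, which is assumed here to use the usual `<configuration>`/`<property>` layout.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_property(conf_xml: str, name: str) -> Optional[str]:
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Stand-in for the contents of conf/nutch-site.xml (assumed layout).
sample = """<configuration>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
</configuration>"""

print(read_property(sample, "parser.character.encoding.default"))
print(read_property(sample, "some.other.property"))
```

If the first call prints None for your real file, the override is missing and the parser will use whatever default ships with your Nutch version.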


Thanks & Regards
Rajani Maski




Re: Parsed content in form of special characters

amuseme
In reply to this post by David Philip
Hi David

The parse-html plugin tries to detect the encoding of the HTML it parses.
For the page http://service.sony.com.cn/vaio/Announcments/33412.htm the
EncodingDetector class cannot detect the encoding, so the parser falls back
to the default character encoding. You can set the property
parser.character.encoding.default to utf-8 to fix this problem temporarily.

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

I tested it on my machine, and the output looks like this:

gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch plugin
parse-html org.apache.nutch.parse.html.HtmlParser ~/Downloads/45962.htm
data: Version: 5
Status: success(1,0)
Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知

.....
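
The garbled text earlier in the thread is consistent with UTF-8 bytes being decoded with a single-byte Western charset. The sketch below reproduces that effect. It assumes the fallback is windows-1252 (Python codec `cp1252`), which is the stock value of parser.character.encoding.default as far as I recall; please check nutch-default.xml in your own version. `decode_with` is a hypothetical helper used only for this illustration.

```python
def decode_with(raw: bytes, charset: str) -> str:
    """Decode raw bytes with the given charset; undecodable bytes become U+FFFD."""
    return raw.decode(charset, errors="replace")


original = "产品"                      # two characters taken from the page title
raw = original.encode("utf-8")         # six bytes, as a UTF-8 page would send them

garbled = decode_with(raw, "cp1252")   # assumed fallback: one character per byte
correct = decode_with(raw, "utf-8")    # correct charset: the original text

print(garbled)
print(correct)
```

With the wrong charset, each of the six bytes becomes its own character and the one byte that windows-1252 does not define becomes �, which gives the same "产å“�" pattern seen in the dump. With utf-8 the two original characters come back.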







Re: Parsed content in form of special characters

amuseme
Hi

The problem occurs in HtmlParser#sniffCharacterEncoding. It reads the
'charset' parameter from the meta tag within the first CHUNK_SIZE bytes of
the page.

On the page http://service.sony.com.cn/vaio/Announcments/33412.htm, the
meta tag comes after the first 2000 (CHUNK_SIZE) bytes, so the page
encoding is not detected. Unfortunately, CHUNK_SIZE is not configurable.
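
The effect can be sketched as follows. This is a simplified illustration, not the actual Nutch code: `sniff_charset` and the regular expression are hypothetical, and only the 2000-byte limit comes from the description above.

```python
import re
from typing import Optional

CHUNK_SIZE = 2000  # limit quoted in this thread

# Hypothetical, simplified pattern; the real parser's pattern may differ.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def sniff_charset(content: bytes, chunk_size: int = CHUNK_SIZE) -> Optional[str]:
    """Return the charset declared in a meta tag found within the first
    chunk_size bytes, or None if no declaration is found there."""
    match = META_CHARSET.search(content[:chunk_size])
    return match.group(1).decode("ascii").lower() if match else None


meta = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
early = b"<html><head>" + meta + b"</head></html>"
# Padding (for example long comments or scripts) pushes the tag past the limit.
late = b"<html><head><!--" + b" " * 3000 + b"-->" + meta + b"</head></html>"

print(sniff_charset(early))                        # declaration is inside the chunk
print(sniff_charset(late))                         # declaration is past the chunk
print(sniff_charset(late, chunk_size=len(late)))   # found once the whole page is scanned
```

When the sniffing step returns nothing and no other hint is available, the parser has to use the configured default, which is why setting parser.character.encoding.default to utf-8 works around the problem for this site.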





Re: Parsed content in form of special characters

David Philip
Hi,

   Thank you, Rajani Maski and feng lu. It worked for me. I had already
done the Tomcat setting but had missed the Nutch setting.
Thank you very much.

Thanks - David



On Thu, Mar 14, 2013 at 3:16 PM, feng lu <[hidden email]> wrote:

> Hi
>
> The problem occurs in HtmlParser#sniffCharacterEncoding. It reads the
> 'charset' parameter of the meta tag from the first CHUNK_SIZE
> bytes.
>
> In this page, http://service.sony.com.cn/vaio/Announcments/33412.htm, the
> meta tag lies past the first 2000 (CHUNK_SIZE) bytes, so the page encoding
> is not detected. This CHUNK_SIZE parameter cannot be configured.
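
The behaviour described in the quoted reply can be sketched roughly as follows. This is a simplified Python illustration, not Nutch's actual Java code: the regular expression and the function name sniff_charset are hypothetical, and only the 2000-byte CHUNK_SIZE comes from the thread.

```python
import re

CHUNK_SIZE = 2000  # size quoted in the thread; a plain constant in this sketch

# Hypothetical, simplified pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def sniff_charset(content: bytes, chunk_size: int = CHUNK_SIZE):
    """Return the charset declared in a meta tag within the first
    chunk_size bytes, or None if no declaration is found there."""
    match = META_CHARSET.search(content[:chunk_size])
    return match.group(1).decode("ascii").lower() if match else None


meta = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

# Meta tag near the top of the page: the declaration is found.
early = b"<html><head>" + meta + b"</head></html>"

# Meta tag pushed past the first 2000 bytes by 3200 bytes of padding:
# the declaration is missed, so a caller would use its default encoding.
late = b"<html><head>" + b"<!-- padding -->" * 200 + meta + b"</head></html>"

print(sniff_charset(early))
print(sniff_charset(late))
```

With a larger window (for example chunk_size=len(late)) the same declaration would be found, which is why a fixed, non-configurable limit matters for pages with long headers.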
>
>
>
>
> On Thu, Mar 14, 2013 at 5:18 PM, feng lu <[hidden email]> wrote:
>
> > Hi David
> >
> > The problem is that parseHtml detects the encoding of the HTML it parses.
> > The encoding of the page
> > http://service.sony.com.cn/vaio/Announcments/33412.htm cannot be
> > detected by the EncodingDetector class, so it is set to the default
> > character encoding. Maybe you can set the property
> > parser.character.encoding.default to utf-8 to fix this problem temporarily.
> >
> > <property>
> >   <name>parser.character.encoding.default</name>
> >   <value>utf-8</value>
> >   <description>The character encoding to fall back to when no other
> >   information is available</description>
> > </property>
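
Why a wrong fallback encoding produces this kind of garbage can be shown with a short Python sketch. It assumes the page bytes are UTF-8 and that the fallback in effect was a single-byte Western encoding such as windows-1252; the thread does not state which default was configured, so treat that as an assumption.

```python
# Two characters that appear in the page title ("product").
original = "产品"

# The bytes as a UTF-8 page would serve them: 3 bytes per character.
raw = original.encode("utf-8")

# Decoding with the wrong single-byte fallback (an assumption, see above).
# Each byte becomes its own character; bytes with no windows-1252 mapping
# are replaced by U+FFFD, the replacement character.
garbled = raw.decode("windows-1252", errors="replace")

# Decoding with the right encoding, as when the default is set to utf-8.
correct = raw.decode("utf-8")

print(garbled)
print(correct)
```

The garbled string has six characters for the two original ones, which matches the general shape of the extracted text shown in this thread.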
> >
> > I tested it on my computer and the output looks like this:
> >
> > gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch
> > plugin parse-html org.apache.nutch.parse.html.HtmlParser
> > ~/Downloads/45962.htm
> > data: Version: 5
> > Status: success(1,0)
> > Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知
> >
> > .....
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 14, 2013 at 3:23 PM, David Philip <
> [hidden email]
> > > wrote:
> >
> >> Hi,
> >>
> >>   I crawled this
> >> url<http://service.sony.com.cn/vaio/Announcments/45962.htm> and
> >> it has the same issue.
> >>
> >> Title extracted is in this format:SONY China
> >>
> >>
> Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�通çŸ
> >>
> >> It was supposed to be like this :
> >> <title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
> >>
> >> Specific urls like the one above have this special-characters problem. For
> >> the rest, the extracted characters are correct, e.g. this
> >> url<http://service.sony.com.cn/9380.htm> is parsed properly.
> >>
> >>
> >> Thanks David.
> >>
> >>
> >> On Thu, Mar 14, 2013 at 12:17 PM, David Philip
> >> <[hidden email]>wrote:
> >>
> >> > I am attaching the extracted text file. Not sure if you can receive and
> >> > view it.
> >> >
> >> > My observation:
> >> > When I compared the extracted text with the source of the page
> >> > url<http://service.sony.com.cn/vaio/Announcments/33412.htm>
> >> > (using view source), almost everything looks the same other than the
> >> > data in the ParseText:: section of the extracted text.
> >> >
> >> >
> >> > Thanks -David
> >> >
> >> >
> >> >
> >> > On Thu, Mar 14, 2013 at 11:59 AM, David Philip <
> >> > [hidden email]> wrote:
> >> >
> >> >> Hi Tejas,
> >> >>
> >> >>    I used the readseg command: bin/nutch readseg -dump
> >> >> test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
> >> >> -nogenerate -noparse -nofetch -noparsedata
> >> >>
> >> >> It generated the dump file, then I used the less/cat command:
> >> >> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt
> >> >> and viewed the content as a text file (gedit).
> >> >>
> >> >>
> >> >> Below is brief of that text file(test459.txt):
> >> >>
> >> >> Recno:: 0
> >> >> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >>
> >> >> ParseText::
> >> >>  SONY China
> >> >> Service-关于建议使用正宗索尼电�适�器的声明
> &nbsp
> >> >> 首页   新闻与公告   产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
> >> >> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
> æ•°ç
> >> >> �影�产� 家庭影�产� 家庭音�产� 其他产�
> >> >> æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> >> 按照产å“�åž‹å�·æ�œç´¢ 关键字     选择产å“�系列 / åž‹å�·
> >> >> 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
> >> >> 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
> >> 家庭音�产�
> >> >> 其他产� 选择产��类别 选择产�系列
> >> >> /..........................
> >> >> This is rather long, so I didn't paste everything.
> >> >>
> >> >>
> >> >> Content::
> >> >> Version: -1
> >> >> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >> contentType: application/xhtml+xml
> >> >> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
> >> >> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
> >> >> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
> >> >> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
> >> >> Content-Type=text/html Connection=close
> >> >> Content:
> >> >>
> >> >>
> >> >> Thanks - David
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <
> >> [hidden email]>wrote:
> >> >>
> >> >>> I don't think so. The tool that you are using to view this must have
> >> >>> support for the desired languages. I had the same problem while
> >> >>> looking at pages with Chinese content over PuTTY. Installing
> >> >>> language packs and tweaking PuTTY settings made this go away. I
> >> >>> don't recall the exact steps / details as I did that about a year back.
> >> >>>
> >> >>>
> >> >>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
> >> >>> <[hidden email]>wrote:
> >> >>>
> >> >>> > Hi,
> >> >>> >
> >> >>> >   For some specific urls, the content fetched is in the form of
> >> >>> > special characters. Is it a character encoding issue? Do any
> >> >>> > settings need to be done at the Nutch parsing level?
> >> >>> >
> >> >>> >
> >> >>> > *url:*
> >> >>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >>> >
> >> >>> > *content extracted is something like this: *
> >> >>> > *
> >> >>> > *
> >> >>> >  SONY China
> >> >>> > Service-关于建议使用正宗索尼电�适�器的声明
> >> &nbsp
> >> >>> > 首页   新闻与公告   产å“�支æŒ�
> 个人电脑�周边产�
> >> >>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
> >> >>> > æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
> >> 其他产�
> >> >>> > æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> >>> > 按照产å“�åž‹å�·æ�œç´¢ 关键字     选择产å“�系列 / åž‹å�·
> >> >>> > 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
> >> >>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
> >> >>> 家庭音�产�
> >> >>> > 其他产å“..................
> >> >>> >
> >> >>> > *title: *
> >> >>> > SONY China
> >> >>> Service-关于建议使用正宗索尼电�适�器的声明
> >> >>> >
> >> >>> >
> >> >>> > Thanks - David
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >
> >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>