html parsers and windows-1251 (ukrainian)

html parsers and windows-1251 (ukrainian)

Ilia S. Yatsenko
Hello,

Sorry for my poor English.

In summaries for Ukrainian pages in windows-1251, I see incorrect
pseudo-graphic characters in place of the letters that are not present
in Russian.

With Russian, everything in the summary is fine.

For example, see the cached version:
http://search.kvitka.info/cached.jsp?idx=0&id=48679

At the top you can find the original document URL and see the difference :)

How can I fix this? Can anybody help me with this issue?

Re: html parsers and windows-1251 (ukrainian)

kkrugler
>I see incorrect characters pseudo graphics instead characters (which not
>present in Russian) in summaries for Ukrainian 1251.
>
>With Russian languages in summary all fine.
>
>For example cached version http://search.kvitka.info/cached.jsp?idx=0&id=48679
>
>On top you can find original document url and see difference :)
>
>How can I fix that or can anybody help me with next issue?

Both pages look OK to me, though I don't read Ukrainian - sorry :)

I'm running Mac OS X 10.3 w/the Safari browser. I can send you the
screenshot of the summary if you'd like.

When I looked at the source of the original page
(http://www.prosvita.kiev.ua/posyl_m.htm) I didn't see any charset
specified. I'm guessing it should have an explicit CP 1251 in there,
versus forcing browsers to guess.

In the summary page generated by Nutch
(http://search.kvitka.info/cached.jsp?idx=0&id=48679) it explicitly
specifies UTF-8.

So my guess is that your browser either can't handle UTF-8, or you've
got it configured to assume CP 1251.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

RE: html parsers and windows-1251 (ukrainian)

Ilia S. Yatsenko
The cached version contains "??ц????" - the "?" sign should look like the
English character "i".

I checked the meta tags on this page; it does not have a charset in the head
tag. But the default charset in Nutch is 1251. When the head tag specifies
charset windows-1251, Ukrainian is fine :)


-----Original Message-----
From: Ken Krugler [mailto:[hidden email]]
Sent: Monday, July 25, 2005 10:18 PM
To: [hidden email]
Subject: Re: html parsers and windows-1251 (ukrainian)

whats used from the segments dir when searching

em-13
In reply to this post by kkrugler
I'm trying to grasp something here and need a quick confirmation of the
following; a yes/no would suffice:

When searching and generating summaries, Tomcat uses only:
1. <segment>/index
2. <segment>/parse_text

When retrieving the "cached" copy of the document, Tomcat uses:
1. <segment>/parse_data

Are <segment>/fetcher and <segment>/content used during the searching
stage?


Regards,
E.


RE: html parsers and windows-1251 (ukrainian)

kkrugler
In reply to this post by Ilia S. Yatsenko
>In cached version present "éÙÑ¢–Ñ¢ÈÌËÈ" - sign "Ñ¢" should be like English
>character "i".

If you send me a screenshot of exactly what it
should look like, I can verify that it's being
displayed properly with my browser.

If you do this, it's probably best to send the
image to me directly, versus posting it to the
entire list.

>I checked meta tags in this page it not have charset in head tag.

By "this page" you mean the original page, right?
The Nutch-generated search result page has the
UTF-8 charset specified.

>But
>default charset in nutch is 1251.

If you mean the parser.character.encoding.default
property is set to "windows-1251", I believe this
is only used by the HTML parser when a fetched
page doesn't have any explicit charset
information. I don't think it has anything to do
with the encoding of pages generated by Nutch.
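
A minimal sketch of the fallback described above, written in Python rather than Nutch's own Java. The function name `pick_charset` and the regular expression are illustrative assumptions, not Nutch code; only the property name and the "windows-1251" value come from this thread.

```python
import re

# Hypothetical stand-in for the parser.character.encoding.default property.
DEFAULT_ENCODING = "windows-1251"

# Rough pattern for a charset declared in a <meta> tag. A real parser also
# looks at HTTP headers and other hints; this is only an illustration.
_META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def pick_charset(raw_html, default=DEFAULT_ENCODING):
    """Return the charset declared in the page, or the default if none is declared."""
    match = _META_CHARSET.search(raw_html)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


declared = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=KOI8-U"></head></html>'
)
undeclared = b"<html><head><title>no charset here</title></head></html>"

print(pick_charset(declared))    # charset taken from the page itself
print(pick_charset(undeclared))  # falls back to the configured default
```

The point of the sketch is that the default only matters for pages like the second one, which declare nothing.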

>when head tag have charset windows-1251
>Ukrainian is fine :)

I'm not sure what you mean by this...were you
able to force Nutch to generate pages using the
1251 character encoding?

-- Ken



Re: whats used from the segments dir when searching

Piotr Kosiorowski
In reply to this post by em-13
Hello,
<segment>/index - used during searching
<segment>/parse_text - used for summary generation
<segment>/parse_data - returns metadata for a given page (used e.g. during
cached content display to determine the content type)
<segment>/content - cached content
<segment>/fetcher - returns e.g. anchors for a given page
Regards,
Piotr
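
The list above can be restated as a small lookup table. This is only a sketch in Python: the feature labels ("search", "summary", "cached", "anchors") and the helper `dirs_needed` are made-up names for illustration, and the mapping simply mirrors Piotr's description rather than anything read from the Nutch source.

```python
# Segment subdirectories grouped by the search-time feature that reads them,
# following the list in the message above. Feature labels are hypothetical.
SEGMENT_USES = {
    "search": ["index"],
    "summary": ["parse_text"],
    "cached": ["parse_data", "content"],
    "anchors": ["fetcher"],
}


def dirs_needed(features):
    """Return the sorted segment subdirectories needed for the given features."""
    needed = set()
    for feature in features:
        needed.update(SEGMENT_USES[feature])
    return sorted(needed)


print(dirs_needed(["search", "summary"]))
print(dirs_needed(["cached"]))
```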




RE: html parsers and windows-1251 (ukrainian)

kkrugler
In reply to this post by Ilia S. Yatsenko
>Hello
>
>>By "this page" you mean the original page, right?
>
>Yes, you are correct; the original page does not have any charset information.
>
>>If you mean the parser.character.encoding.default
>>property is set to "windows-1251",
>
>Yes, I mean "parser.character.encoding.default" in nutch-site.xml.
>
>>I'm not sure what you mean by this...were you
>>able to force Nutch to generate pages using the
>>1251 character encoding?
>
>I have other pages in Ukrainian. If a page has charset info in the head tag,
>all non-Russian characters seem to display correctly.

I think I see the problem.

Your original web page is missing charset info,
_and_ the correct charset to specify is "KOI8-U",
not "windows-1251".

When Nutch analyzes the page, it's going to
assume 1251 because of the
parser.character.encoding.default property value,
and thus its conversion to UTF-8 will be wrong
for the specific character that you mention.

So then when Nutch's summary page is generated
(and correctly tagged as UTF-8), you'll see an
incorrect character.
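
The mechanism can be sketched in a few lines of Python. This assumes, as suggested above, that the page's bytes are KOI8-U; `to_utf8` is a hypothetical helper, not a Nutch function, and `cp1251` is Python's codec name for windows-1251.

```python
def to_utf8(raw, assumed_charset):
    """Decode raw bytes with the assumed charset and re-encode them as UTF-8."""
    return raw.decode(assumed_charset).encode("utf-8")


letter = "\u0456"              # Ukrainian "i"-like letter that Russian does not use
raw = letter.encode("koi8_u")  # the byte as a KOI8-U page would serve it

right = to_utf8(raw, "koi8_u").decode("utf-8")  # charset guessed correctly
wrong = to_utf8(raw, "cp1251").decode("utf-8")  # charset taken from the default

print(right == letter)  # the letter survives the conversion
print(wrong == letter)  # the letter has been replaced by something else
print(repr(wrong))
```

Both results are valid UTF-8, so tagging the generated page as UTF-8 is correct either way; the damage happens earlier, when the wrong source charset is assumed.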

-- Ken

