Encoding the content got from Fetcher

Encoding the content got from Fetcher

Santiago Pérez
Hej,

I am new to Nutch and need some help with a problem for which I cannot find clear documentation.

During crawling, when each FetcherThread gets the content, it is formatted in a way that deletes the newline characters ("\n") and transforms useful Spanish characters such as á, é, í, ó, ú, ñ, ü into garbled sequences like: Ã?¡, Ã?³, Ã?­, Ã?³, Ã?º, Ã?±, Ã?¼.

I would like to know whether it is possible to change this default encoding (is it UTF-8?) to the one that I need (ASCII, I guess).

Thanks in advance ;)

Broken segments ?

Mischa Tuffield
Hello All,

I was wondering if there is any way to check the integrity of a segment. As it stands, I can't create the index I want because a number of my segments fail with errors like the ones below.

Is there any way to check whether my segments are OK? I guess I could always re-fetch them if need be.

Regards, and thanks in advance :)

Mischa


<!--
java.io.IOException: Could not obtain block: blk_8431627671702898365_95075 file=/user/nutch/crawl/segments/20091012145602/crawl_generate/part-00000
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:166)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:161)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

...

java.io.IOException: Could not obtain block: blk_7970643458650610887_21674 file=/user/nutch/crawl/segments/20090618111426/content/part-00003/data
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat.getRecordReader(SegmentMerger.java:150)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
-->


On 26 Nov 2009, at 12:03, Santiago Pérez wrote:

>
> Hej,
>
> I am a newbie in Nutch and I need some help with a problem because I do not
> find clear documentation.
>
> In crawling proccess when the each of the FetcherThread get the content,
> this is in formatted in a way which deletes the new line characters ("\n")
> and transform useful characters in Spanish as á,é,í,ó,ú,ñ,ü in the default
> encoding like: Ã?¡, Ã?³, Ã?­, Ã?³, Ã?º, Ã?±, Ã?¼.
>
> I would like to know if it is possible to set this default encoding (is
> UTF-8?) to the one that I need (ASCII I guess).
>
> Thanks in advance ;)

___________________________________
Mischa Tuffield
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD


Re: Broken segments ?

Andrzej Białecki-2
Mischa Tuffield wrote:
> Hello All,

http://people.apache.org/~hossman/#threadhijack

"When starting a new discussion on a mailing list, please do not reply
to an existing message, instead start a fresh email.  Even if you change
the subject line of your email, other mail headers still track which
thread you replied to and your question is "hidden" in that thread and
gets less attention.   It makes following discussions in the mailing
list archives particularly difficult."


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Encoding the content got from Fetcher

Fadzi Ushewokunze-2
In reply to this post by Santiago Pérez
Hi,

Have you tried changing this property?

parser.character.encoding.default



>
> Hej,
>
> I am a newbie in Nutch and I need some help with a problem because I do
> not
> find clear documentation.
>
> In crawling proccess when the each of the FetcherThread get the content,
> this is in formatted in a way which deletes the new line characters ("\n")
> and transform useful characters in Spanish as á,é,í,ó,ú,ñ,ü in the
> default
> encoding like: Ã?¡, Ã?³, Ã?­, Ã?³, Ã?º, Ã?±, Ã?¼.
>
> I would like to know if it is possible to set this default encoding (is
> UTF-8?) to the one that I need (ASCII I guess).
>
> Thanks in advance ;)



Re: Encoding the content got from Fetcher

Santiago Pérez
Yes, I tried setting it in that configuration file to the Latin encoding Windows-1250, but the value of this property does not affect the encoding of the content (I also tried a nonexistent encoding and the result is the same...)

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish users surely have...)

Thanks

Re: Encoding the content got from Fetcher

Andrzej Białecki-2
Santiago Pérez wrote:

> Yes, I tried in that configuration file setting with the latin encoding
> Windows-1250, but the value of this property does not affect to the encoding
> of the content (I also tried with unexistent encoding and the result is the
> same...)
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>Windows-1250</value>
>   <description>The character encoding to fall back to when no other
> information
>   is available</description>
> </property>
>
> Has anyone had the same problem? (Hungarian o Polish people sure...)

The appearance of characters that you quoted in your other email
indicates that the problem may be the opposite - your pages seem to use
UTF-8, and you are trying to convert them using Windows-1250 ... Try
putting UTF-8 in this property, and see what happens.
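
To illustrate the point above, here is a minimal sketch (not Nutch code) of how UTF-8 bytes decoded with a single-byte charset turn each accented character into two wrong ones. The class and method names are made up for the example. ISO-8859-1 stands in for the wrong charset only because every JVM ships it; Windows-1250 would likely give different but equally wrong characters.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode text as UTF-8, then decode those bytes with the wrong (single-byte) charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two accented letters (a-acute, n-tilde), escaped so the source file encoding does not matter.
        String original = "\u00e1\u00f1";
        String garbled = misdecode(original);
        // Each 2-byte UTF-8 character became two characters.
        System.out.println(garbled.length());
        // Reversing the wrong decoding step restores the text.
        System.out.println(repair(garbled).equals(original));
    }
}
```

If the garbled text begins with "Ã" for every accented letter, as in the quoted output, that is a strong hint that the bytes were UTF-8 all along.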

Generally speaking, pages should declare their encoding, either in HTTP
headers or in <meta> tags, but often this declaration is either missing
or completely wrong. Nutch uses ICU4J CharsetDetector plus its own
heuristic (in util.EncodingDetector and in HtmlParser) that tries to
detect character encoding if it's missing or even if it's wrong - but
this is a tricky issue and sometimes results are unpredictable.
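
As a rough illustration of what "detect, else fall back" can look like, the sketch below uses only the JDK. It is a simplified stand-in, not Nutch's EncodingDetector and not ICU4J's CharsetDetector: it checks whether the bytes are well-formed UTF-8 and otherwise uses a caller-supplied fallback charset, which plays the role of parser.character.encoding.default. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class FallbackDecoder {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: prefer UTF-8 when the bytes validate, else use the fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        Charset chosen = isValidUtf8(bytes) ? StandardCharsets.UTF_8 : fallback;
        return new String(bytes, chosen);
    }

    public static void main(String[] args) {
        // The same word encoded two ways; the n-tilde is escaped to keep the source ASCII-only.
        byte[] utf8 = "ma\u00f1ana".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "ma\u00f1ana".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        // 0xF1 followed by an ASCII byte is not valid UTF-8, so this check fails.
        System.out.println(isValidUtf8(latin1));
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

A check like this cannot tell single-byte encodings apart from each other, which is one reason real detectors rely on statistical heuristics and can still guess wrong.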


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Encoding the content got from Fetcher

Santiago Pérez
I had already tried with:

<property>
  <name>parser.character.encoding.default</name>
  <value>UTF-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>

and the output of System.out.println(content.toString()); is still the HTML code with the incorrect encoding...
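
One thing that may be worth ruling out is the printing step itself. The sketch below is JDK-only and makes two assumptions not confirmed in this thread: that content.toString() decodes the raw bytes with the platform default charset, and that the console's default encoding is not UTF-8. It decodes the bytes with an explicit charset and writes them through a UTF-8 PrintStream. The class and method names are hypothetical, and the byte array is a stand-in for the fetched page bytes.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class PrintDecoded {
    // Decode raw page bytes with an explicit charset instead of the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Write text to the given stream as UTF-8, regardless of the default encoding.
    static void printUtf8(OutputStream out, String text) {
        try {
            PrintStream ps = new PrintStream(out, true, "UTF-8");
            ps.println(text);
            ps.flush();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the fetched bytes; in a crawler these would come from the fetched content.
        byte[] raw = "<p>Espa\u00f1a</p>".getBytes(StandardCharsets.UTF_8);
        printUtf8(System.out, decodeUtf8(raw));
    }
}
```

If the text prints correctly this way, the stored bytes are fine and only the decoding or console step was wrong; if it is still garbled, the bytes were probably already damaged before this point.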