readseg dump and non-ASCII characters


readseg dump and non-ASCII characters

Michael Coffey
Greetings Nutchlings,
I have been using readseg-dump successfully to retrieve content crawled by Nutch, but I have one significant problem: many non-ASCII characters appear as '???' in the dumped text file. This happens fairly frequently in the headlines of news sites that I crawl, for things like quotes, apostrophes, and dashes.
Am I doing something wrong, or is this a known bug? I use a Python UTF-8 decoder, so it would be nice if everything were UTF-8.
Here is the command that I use to dump each segment (using Nutch 1.12):
    bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
It is so close to working perfectly!

Re: readseg dump and non-ASCII characters

Sebastian Nagel
Hi Michael,

From the arguments I guess you're interested in the raw/binary HTML content, right?
After a closer look I have no simple answer:

 1. HTML has no fixed encoding - it could be anything; pageA may have a different
    encoding than pageB.

 2. That's different for parsed text: it's a Java String internally.

 3. "readseg -dump" converts all data to a Java String using the default platform
    encoding. On Linux, with these locales installed, you may get different results for:
       LC_ALL=en_US.utf8  ./bin/nutch readseg -dump
       LC_ALL=en_US       ./bin/nutch readseg -dump
       LC_ALL=ru_RU       ./bin/nutch readseg -dump
    If in doubt, try setting your platform encoding to UTF-8. Most pages nowadays are UTF-8.
    Btw., this behavior isn't ideal; it should be fixed as part of NUTCH-1807.

 4. A more reliable solution would require detecting the HTML encoding (the code is available
    in Nutch) and then converting the byte[] content using the right encoding.
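
The idea in point 4 can be sketched in plain Java. This is only an illustration and not the Nutch detection code: the class CharsetSniffer and its methods are hypothetical names, and the sketch looks only at a charset declaration inside a <meta> tag near the start of the page, falling back to UTF-8. Real detection would also need to consider the HTTP Content-Type header and byte-level heuristics.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch.
class CharsetSniffer {
    // Finds charset=NAME inside a <meta ...> tag (both the short and the http-equiv form).
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    // Guess the charset of raw page bytes; assumption: fall back to UTF-8.
    static Charset sniff(byte[] content) {
        int n = Math.min(content.length, 2048);
        // ISO-8859-1 maps every byte to exactly one char, so this decode cannot fail.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Convert the byte[] content to a String using the sniffed charset.
    static String decode(byte[] content) {
        return new String(content, sniff(content));
    }

    public static void main(String[] args) {
        byte[] page = "<meta charset=\"ISO-8859-1\"><p>caf\u00e9</p>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page).name());
    }
}
```

Once the content is a correctly decoded String, it can be written out in a single known encoding such as UTF-8, independent of what each page used.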

Best,
Sebastian



On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-dump successfully to retrieve content crawled by nutch, but I have one significant problem: many non-ASCII characters appear as '???' in the dumped text file. This happens fairly frequently in the headlines of news sites that I crawl, for things like quotes, apostrophes, and dashes.
> Am I doing something wrong, or is this a known bug? I use a python utf8 decoder, so it would be nice if everything were UTF8.
> Here is the command that I use to dump each segment (using nutch 1.12):
>     bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
> It is so close to working perfectly!
>


Re: readseg dump and non-ASCII characters

Michael Coffey
Thanks for the note, Sebastian. Yes, it is the fetched HTML that I parse using Python-based tools after getting it from readseg. This is an alternative I decided to use after having struggled with raw-binary-content and Solr.
I figured the problem was readseg either decoding or encoding improperly, but I didn't know which. Your point #3 seems to say it's the decode that goes wrong because it doesn't consider the encoding of the fetched page.
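
To illustrate how the decode and the encode can each contribute, here is a minimal sketch. It assumes the page bytes are UTF-8 and the platform default is US-ASCII (for example, when no UTF-8 locale is set). The class RoundTripDemo and its method are made up for the illustration; this is not the SegmentReader code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; simulates a dump under a given platform charset.
class RoundTripDemo {
    // Decode raw bytes with `platform`, write them back with `platform`,
    // then read the result as UTF-8, as a downstream UTF-8 consumer would.
    static String dumpAs(byte[] raw, Charset platform) {
        String decoded = new String(raw, platform);   // undecodable bytes become U+FFFD
        byte[] written = decoded.getBytes(platform);  // unencodable chars become '?'
        return new String(written, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+2019 (right single quote) takes three bytes in UTF-8.
        byte[] raw = "It\u2019s".getBytes(StandardCharsets.UTF_8);
        System.out.println(dumpAs(raw, StandardCharsets.US_ASCII));
        System.out.println(dumpAs(raw, StandardCharsets.UTF_8));
    }
}
```

With US-ASCII, each of the three bytes of the apostrophe is replaced on decode and then written as '?', which gives "It???s". With UTF-8 on both sides the text survives unchanged.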

A follow-up: I don't quite understand how "LC_ALL=en_US.utf8" would apply to a Hadoop job. Does it somehow propagate to all nodes in the cluster? Would it work just as well, or better, to use "-Dfile.encoding=UTF8" in the bin/nutch command?
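
One way to check what a given JVM actually picked up is to print its effective settings on the machine in question. A small sketch using only the standard library (the class name EncodingReport is made up): the LC_ALL environment variable and the file.encoding system property are separate inputs, and Charset.defaultCharset() reports the charset the JVM ended up with.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class; prints the encoding-related settings of this JVM.
class EncodingReport {
    static String report() {
        return "LC_ALL (environment)     = " + System.getenv("LC_ALL") + "\n"
             + "file.encoding (property) = " + System.getProperty("file.encoding") + "\n"
             + "default charset          = " + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running the compiled class once with LC_ALL set in the environment and once with -Dfile.encoding on the java command line shows which setting that JVM honors. Whether either setting reaches the task JVMs of a Hadoop job depends on the cluster configuration, which this sketch does not cover.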

From: Sebastian Nagel <[hidden email]>
To: [hidden email]
Sent: Wednesday, November 15, 2017 5:18 AM
Subject: Re: readseg dump and non-ASCII characters
   

Re: readseg dump and non-ASCII characters

Michael Coffey
In reply to this post by Sebastian Nagel
I'm not sure it's practical to go around to all the Hadoop machines and change their default encoding settings. I'm also not sure it wouldn't break something else!

I'm wondering if there's a simple fix I could make to the source code to make nutch.segment.SegmentReader use UTF-8 as a default when reading the segment data.



In SegmentReader.java, the only obvious file-reading code I see is in this append function.
  private int append(FileSystem fs, Configuration conf, Path src,
      PrintWriter writer, int currentRecordNumber) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        fs.open(src)));
    try {
      String line = reader.readLine();
      while (line != null) {
        if (line.startsWith("Recno:: ")) {
          line = "Recno:: " + currentRecordNumber++;
        }
        writer.println(line);
        line = reader.readLine();
      }
      return currentRecordNumber;
    } finally {
      reader.close();
    }
  }


SegmentReader has three different lines that create an OutputStreamWriter. Two of those explicitly use "UTF-8", but the one that creates a PrintWriter implicitly uses default encoding.

If I insert a "UTF-8" arg into the InputStreamReader and OutputStreamWriter constructors, should that work? Is it likely to break something else?
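
A self-contained sketch of that change, assuming the dumped part files are UTF-8 text. It mirrors the append function above but reads from a plain InputStream instead of fs.open(src), so it runs without Hadoop. The class name Utf8Append and the helper utf8Writer are hypothetical, and this is not the actual Nutch patch.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the relevant parts of SegmentReader.
class Utf8Append {
    // PrintWriter that always encodes as UTF-8, regardless of the platform default.
    static PrintWriter utf8Writer(OutputStream out) {
        return new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    // Same loop as append(), but the reader decodes with an explicit charset.
    static int append(InputStream in, PrintWriter writer, int currentRecordNumber)
            throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            while (line != null) {
                if (line.startsWith("Recno:: ")) {
                    line = "Recno:: " + currentRecordNumber++;
                }
                writer.println(line);
                line = reader.readLine();
            }
            return currentRecordNumber;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] part = "Recno:: 0\nIt\u2019s a headline\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = utf8Writer(out);
        int next = append(new ByteArrayInputStream(part), writer, 7);
        writer.flush();
        System.out.println("next record number: " + next);
        System.out.println("bytes written: " + out.size());
    }
}
```

An explicit charset on both the reader and the writer makes the copy independent of the platform default. It cannot repair part files that were already written with '?' in place of the original characters, so the code that produces those files would need the same treatment.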









RE: readseg dump and non-ASCII characters

Yossi Tamari
Hi Michael,

Not directly answering this question, but keep in mind that as mentioned in the issue Sebastian referenced, there are many more places in Nutch that have the same problem, so setting LC_ALL is probably a good idea in general (until that issue is fixed...).
If you're worried about other applications, I believe passing `-DLC_ALL=en_US.utf8` as a parameter to all Nutch jobs should also work.

        Yossi.


> -----Original Message-----
> From: Michael Coffey [mailto:[hidden email]]
> Sent: 14 December 2017 20:30
> To: [hidden email]
> Subject: Re: readseg dump and non-ASCII characters