getting content from url - encoding problem

getting content from url - encoding problem

Onur Deniz

Hi,

I am using Nutch only to crawl some websites; I am not using the search facility.
I run Nutch purely through its command-line options. I have not changed any source code (only configuration files such as the URL filter).
I call the command-line options from scripts and execute those scripts from Java using Runtime.getRuntime().exec(...). (A slightly longer route, but at first it seemed easier than running from Eclipse.)

I know how to get the content/parse text of a URL on the command line ( bin/nutch readseg -get .... ).

Getting the parse text works, because Nutch handles the encoding of the site. But when I try to get the raw content of the page with the command (bin/nutch readseg -get), I run into an encoding problem:
the page is in windows-1254, but I think the command returns the content in UTF-8, because some special characters (ş, ç, ğ, ü, ı) are displayed as the replacement character ( <?> ).

So, my questions are:
How does the command (bin/nutch readseg -get ... -nofetch -nogenerate -noparse -noparsedata -noparsetext) return the content of the page? I mean, does it decode the content according to the page's encoding, or does it return the content in UTF-8 by default?

Any suggestions? Any solutions?

Thanks all.


        onur deniz
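
For readers driving the command from Java as described above, one thing worth ruling out is the Java side decoding the child process output with the wrong charset. The sketch below is not from the thread: the class name RawProcessOutput and the script name in the usage note are hypothetical. It reads the process output as raw bytes, so no charset conversion happens in the calling program.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class RawProcessOutput {

    /** Drains a stream into a byte array without any charset decoding. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Runs a command and returns its standard output as raw bytes. */
    static byte[] capture(String... command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send stderr to the parent's stderr so log lines do not mix into the data.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        try (InputStream stdout = process.getInputStream()) {
            byte[] bytes = readFully(stdout);
            process.waitFor();
            return bytes;
        }
    }
}
```

A call such as RawProcessOutput.capture("sh", "get-content.sh") (script name assumed) returns exactly what the script wrote. This only preserves the bytes the command emits; if the command itself has already substituted characters, the original page bytes cannot be recovered on the Java side.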



Re: getting content from url - encoding problem

Onur Deniz
Hi again,

(Again, I call the command from a script, and the script from Java using Runtime.getRuntime().exec(...).)

I looked at the byte values of the output of the command (bin/nutch readseg -get ... ...).
For example, in the 'content' of the link http://canli.sporx.com/live2/20070407_kayseri_fb/ , which is encoded in windows-1254, every non-ASCII character (the Turkish special characters ş, ğ, ü, ö, ç, ı, Ş, Ğ...) has the SAME hex value (ffffffef ffffffbf ffffffbd).
From that I concluded that Nutch converts the content into a default charset (UTF-8) internally, and the command returns those converted bytes.

With the help of my superior, we also concluded that this command invokes SegmentReader.java.
Within this class, in the method public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter), there is something like:
    while (values.hasNext()) {
      Object value = ((ObjectWritable) values.next()).get(); // unwrap
      if (value instanceof CrawlDatum) {
  .....
      } else if (value instanceof Content) {
        dump.append("\nContent::\n").append(((Content) value).toString());
      } else if....
At line 177, the expression ((Content) value).toString() seems to me to convert the binary content to a string. This may be the cause.

(Well, I don't know much about Nutch, the map-reduce technique, or encoding basics. The inferences above are all I can manage with my limited knowledge of Nutch, so please correct me if I am wrong at any point.)

So, how can I get the content of a URL as bytes using command-line options? (Or can I?)
If not, is there any other way?


thanks


onur deniz
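
The byte pattern reported above is consistent with U+FFFD, the Unicode replacement character, whose UTF-8 encoding is EF BF BD; the leading ffffff is what Java prints when a negative byte is widened to an int before hex formatting. The following is a minimal sketch in plain Java (no Nutch classes; the class name ReplacementCharDemo is made up for illustration) of how a windows-1254 byte ends up as that sequence when it is decoded as UTF-8 and written out again. Whether Nutch's Content.toString() does exactly this is the poster's hypothesis, not something this sketch proves.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {

    /** Hex dump that masks each byte so negative values are not sign-extended. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes the bytes as UTF-8, then re-encodes the resulting string as UTF-8. */
    static byte[] roundTripThroughUtf8(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = {(byte) 0xFE}; // the letter 'ş' in windows-1254

        byte[] mangled = roundTripThroughUtf8(page);
        System.out.println(toHex(mangled));                  // ef bf bd
        System.out.println(Integer.toHexString(mangled[0])); // ffffffef (unmasked byte)

        // Decoding with the page's real charset keeps the character (U+015F).
        String correct = new String(page, Charset.forName("windows-1254"));
        System.out.println(Integer.toHexString(correct.charAt(0))); // 15f
    }
}
```

Once a byte has been replaced by U+FFFD the original value is gone, which would explain why every Turkish character shows the same three bytes.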





Re:Re: getting content from url - encoding problem

郑世强
If you are using the protocol-httpclient plugin (recommended), you can write
some code in the package org.apache.nutch.protocol.httpclient to get the
content.

For example, in the file Http.java you can call the
HttpResponse.getContent() method to get the content as a byte[]. I have
tried this in nutch-0.7.2.
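
Once the content is available as a byte[], it still has to be decoded with the page's own charset. The sketch below uses only the JDK and is an illustration, not Nutch code: the class DeclaredCharsetDecoder and its methods are hypothetical, and it assumes the charset is declared in a Content-Type value such as "text/html; charset=windows-1254". Real pages may declare it elsewhere (for example in a meta tag), which this sketch does not handle.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeclaredCharsetDecoder {

    // Matches charset=NAME, optionally quoted, in a Content-Type value.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the Content-Type value, or the fallback. */
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                }
            }
        }
        return fallback;
    }

    /** Decodes raw page bytes with the declared charset (UTF-8 if none is usable). */
    static String decode(byte[] content, String contentType) {
        return new String(content, charsetFrom(contentType, StandardCharsets.UTF_8));
    }
}
```

For instance, DeclaredCharsetDecoder.decode(bytes, "text/html; charset=windows-1254") would turn the byte 0xFE into ş instead of a replacement character.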

Re:Re: getting content from url - encoding problem

Onur Deniz
Thanks for the reply!

I was actually looking first for a command-line or Nutch-external solution, but it seems there is no solution like that.

Well, since I have to get my hands dirty with the Nutch source code, I will try your solution.

By the way, are there any tutorials on how to use nutch.jar from a Java application? I mean, how to set the environment variables or (Nutch-internal) configuration the way the bin/nutch commands do.

Regards


onur deniz  



Re: Re:Re: getting content from url - encoding problem

郑世强

I don't use nutch.jar in a Java application. I just built a Java project in Eclipse from the Nutch source code,
so I am sorry that I can't help you with that.

You can try building a Java project in Eclipse. That way you can use every package in Nutch. The Nutch wiki may help you.
------------------------------------------------------
2008-09-02
