
How to index response time for a URL?


Eyeris
Hi all.
I need to get and index the response time for each URL that Nutch crawls.
I have added a responseTime field in Solr for this value.

Is there any way to do this with configuration only, or do I need to write my own plugin to extract this value from the CrawlDatum key "_rs_"?
Any help with the steps would be appreciated.

I have set the http.store.responsetime property to true. What am I missing?

This is the property in my nutch-site.xml:

<property>
  <name>http.store.responsetime</name>
  <value>true</value>
  <description>Enables us to record the response time of the
  host which is the time period between start connection to end
  connection of a pages host. The response time in milliseconds
  is stored in CrawlDb in CrawlDatum's meta data under key &quot;_rs_&quot;
  </description>
</property>

After that I added the key to the property below, but when I run parsechecker I don't see any data related to responseTime in the output.

<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>&quot;_rs_&quot;</value>
  <description>Comma-separated list of parse metadata keys to transfer to the crawldb (NUTCH-779).
   Assuming for instance that the languageidentifier plugin is enabled, setting the value to 'lang'
   will copy both the key 'lang' and its value to the corresponding entry in the crawldb.
  </description>
</property>

Re: [MASSMAIL]how to index response time for a url ?

Eyeris
Can anybody help me with this, please?
Is this only happening to me?

----- Original message -----
From: "Eyeris Rodriguez Rueda" <[hidden email]>
To: [hidden email]
Sent: Sunday, 29 January 2017 22:28:01
Subject: [MASSMAIL]how to index response time for a url ?


RE: [MASSMAIL]how to index response time for a url ?

Markus Jelsma-2
In reply to this post by Eyeris
I am not sure what is going on, but those HTML entities (&quot;) certainly do not belong there; _rs_ is good enough. You also need index-metadata, and you need to have the indexer add _rs_ to your index.

<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>&quot;_rs_&quot;</value>
  <description>Comma-separated list of parse metadata keys to transfer to the crawldb (NUTCH-779).
   Assuming for instance that the languageidentifier plugin is enabled, setting the value to 'lang'
   will copy both the key 'lang' and its value to the corresponding entry in the crawldb.
  </description>
</property>
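
A minimal sketch of the index-metadata step mentioned above, as it might look in nutch-site.xml. This is an illustration under assumptions, not a verified setup: the plugin.includes value is only an example list (keep your own plugins and add metadata to the index-(...) group), and index.db.md is assumed to be the property that index-metadata reads for CrawlDb metadata keys in your Nutch version.

```xml
<configuration>
  <!-- Assumption: example plugin list only. Keep your existing plugins and
       add "metadata" to the index-(...) group so index-metadata is loaded. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Assumption: index-metadata copies the listed CrawlDb metadata keys
       into the indexed document. The key is written bare, without quotes. -->
  <property>
    <name>index.db.md</name>
    <value>_rs_</value>
  </property>
</configuration>
```

With settings like these the indexed document would be expected to carry a field named _rs_, provided the value is actually present in the CrawlDatum metadata at indexing time.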

 
 
-----Original message-----

> From:Eyeris Rodriguez Rueda <[hidden email]>
> Sent: Tuesday 31st January 2017 14:32
> To: [hidden email]
> Subject: Re: [MASSMAIL]how to index response time for a url ?

Re: [MASSMAIL]how to index response time for a url ?

Eyeris
Thanks for the help, Markus.
I read the description of this property (below). It says that the CrawlDatum stores that value, so I thought it was necessary to take the response time from it.
I will try using only the _rs_ key.

<property>
  <name>http.store.responsetime</name>
  <value>true</value>
  <description>Enables us to record the response time of the
  host which is the time period between start connection to end
  connection of a pages host. The response time in milliseconds
  is stored in CrawlDb in CrawlDatum's meta data under key &quot;_rs_&quot;
  </description>
</property>
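
The key is called _rs_, while the Solr field from the first post is called responseTime, so the two names need to be connected somewhere. One possible way, sketched under the assumption of a Nutch 1.x setup that still uses conf/solrindex-mapping.xml and that the indexer emits a document field named _rs_, is a mapping entry like this:

```xml
<mapping>
  <fields>
    <!-- existing field mappings stay unchanged -->
    <!-- Assumption: "_rs_" is the field name produced at indexing time,
         "responseTime" is the field already defined in the Solr schema. -->
    <field dest="responseTime" source="_rs_"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

If your version has no such mapping file, naming the Solr field _rs_ directly would avoid the need for a mapping at all.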







----- Original message -----
From: "Markus Jelsma" <[hidden email]>
To: [hidden email]
Sent: Tuesday, 31 January 2017 9:55:10
Subject: RE: [MASSMAIL]how to index response time for a url ?


Re: [MASSMAIL]how to index response time for a url ?

katta surendra babu
Hi,

Can anyone help me crawl a JSON website with Nutch 2.3 and an HBase database?

On Tue, Jan 31, 2017 at 8:01 PM, Eyeris Rodriguez Rueda <[hidden email]>
wrote:



--
Thanks & Regards
Surendra Babu Katta

RE: [MASSMAIL]how to index response time for a url ?

Markus Jelsma-2
In reply to this post by Eyeris
Ah, I see. The docs are misleading; the quotes are not meant to be copied verbatim.
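
For illustration, here is the property from the first post with the value written as the bare key, as suggested earlier in the thread. Only the value changes; the description is shortened here, and whether this property is sufficient on its own is not settled in the thread.

```xml
<property>
  <name>db.parsemeta.to.crawldb</name>
  <!-- bare key: no &quot; entities and no literal quote characters -->
  <value>_rs_</value>
  <description>Comma-separated list of parse metadata keys to transfer to the crawldb (NUTCH-779).</description>
</property>
```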


 
 