make responseTime native in nutch


Eyeris
Hi all.
Nutch has a configuration option that allows saving the response time for every URL that is fetched; this value is stored in the CrawlDatum under the key _rs_, but it is not indexed.
It would be very useful to index this value as well.
The response time is valuable in many use cases, and it is easy to make this native in Nutch.
A small change to the index-basic plugin (or another indexing plugin) can make this happen.


// index responseTime for each URL if http.store.responsetime is true
boolean storeResponseTime = conf.getBoolean("http.store.responsetime", true);
if (storeResponseTime) {
  Writable responseTime = datum.getMetaData().get(new Text("_rs_"));
  if (responseTime != null) {
    doc.add("responseTime", responseTime.toString());
  }
}
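For context, the property referenced above is the one that makes the fetcher store the response time in the first place. A sketch of the nutch-site.xml entry; the description text here is paraphrased, so check nutch-default.xml for the authoritative wording:

```xml
<property>
  <name>http.store.responsetime</name>
  <value>true</value>
  <description>If true, the fetcher stores the response time of each
  request in the CrawlDatum metadata under the key _rs_.</description>
</property>
```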

I can create the Jira ticket and a patch for this.
What do you think about it?
La @universidad_uci es Fidel. Los jóvenes no fallaremos.
#HastaSiempreComandante
#HastalaVictoriaSiempre

RE: make responseTime native in nutch

Markus Jelsma-2
Try this:

<property>
  <name>index.db.md</name>
  <value></value>
  <description>
     Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
     Can be used to index values propagated from the seeds with the plugin urlmeta
  </description>
</property>

And enable the index-metadata plugin (IIRC), and you are good to go!

Cheers,
Markus
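Putting the suggestion together: a sketch of the two relevant nutch-site.xml entries. The plugin.includes value below is only an illustrative assumption; merge index-metadata into whatever plugin list your installation already uses.

```xml
<property>
  <name>index.db.md</name>
  <value>_rs_</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```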

 
 

Re: [MASSMAIL]RE: make responseTime native in nutch

Eyeris
Hello Markus.
I have tried your recommendation using
<property>
  <name>index.db.md</name>
  <value>_rs_</value>
  <description>
     Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
     Can be used to index values propagated from the seeds with the plugin urlmeta
  </description>
</property>


but I get the exception below from the indexer.
******************************************************************
2017-02-06 18:18:28,905 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: on
2017-02-06 18:18:29,024 INFO  more.MoreIndexingFilter - Reading content type mappings from file contenttype-mapping.txt
2017-02-06 18:18:29,849 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: content dest: content
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: title dest: title
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: host dest: host
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: segment dest: segment
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: boost dest: boost
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: digest dest: digest
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.description dest: description
2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.keywords dest: keywords
2017-02-06 18:18:30,134 WARN  mapred.LocalJobRunner - job_local15168888_0001
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
******************************************************************

I have looked at the index-metadata plugin, and I think the problem occurs when the Writable object is forcibly cast to Text.
See:
https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58

**********************************
// add the fields from crawldb
    if (dbFieldnames != null) {
      for (String metatag : dbFieldnames) {
        Text metadata = (Text) datum.getMetaData().get(new Text(metatag));
        if (metadata != null)
          doc.add(metatag, metadata.toString());
      }
    }
***************************************
Line 58 needs to be changed so that the value is not cast to Text, for example:

Writable metadata = datum.getMetaData().get(new Text(metatag));
if (metadata != null)
  doc.add(metatag, metadata.toString());
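The underlying issue can be demonstrated without Hadoop: when heterogeneous value types are stored under string keys, casting every value to one concrete type fails, while calling toString() on the common supertype works for all of them. The sketch below uses plain java.util types as stand-ins for Hadoop's MapWritable, Text, and IntWritable (an assumption for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataToStringDemo {
  // Stand-in for CrawlDatum metadata: values may be of different types,
  // just as Hadoop metadata values may be Text, IntWritable, etc.
  static String fieldValue(Map<String, Object> metadata, String key) {
    Object value = metadata.get(key); // common supertype, no cast
    return value == null ? null : value.toString();
  }

  public static void main(String[] args) {
    Map<String, Object> metadata = new LinkedHashMap<>();
    metadata.put("_rs_", 123);           // stored as an Integer (like IntWritable)
    metadata.put("source", "seed-list"); // stored as a String (like Text)

    // Casting metadata.get("_rs_") to String would throw ClassCastException;
    // toString() on the supertype handles both values uniformly.
    System.out.println(fieldValue(metadata, "_rs_"));   // 123
    System.out.println(fieldValue(metadata, "source")); // seed-list
  }
}
```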

If you agree, I can create the Jira ticket and a patch for this.



Re: [MASSMAIL]RE: make responseTime native in nutch

Sebastian Nagel
Hi,

> Y have looked index-metadata plugin and i think that the problem is when Writable object is forced to Text
> see
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58

Good catch! The metadata value must implement Writable but need not be an instance of Text. The actual class doesn't matter; the toString() method should return something meaningful.
Please open a Jira issue to fix this problem in index-metadata.

@Eyeris: it's up to you whether to open a separate issue to make the response time a "built-in" index field. We could add it to index-more, which already provides fields related to crawling: last-modified date, MIME type, and content length.

Best,
Sebastian



Re: [MASSMAIL]RE: make responseTime native in nutch

Eyeris
Thanks Sebastian.
I have opened a ticket for the problem in index-metadata. This is the URL:
https://issues.apache.org/jira/browse/NUTCH-2357



Adding responseTime to index-more looks great.
A new field is also needed on the indexer side (Solr, Elasticsearch); otherwise Nutch will throw an exception.

One more thing.
Several plugins add fields to the document that will be indexed, but I can't find one place that describes every field Nutch sends to the index.
index-basic sends domain, host, url, content, title, cache, and tstamp.
index-more sends type, date, and contentLength.
And there are others.
I have looked into the Nutch code, and I think Nutch doesn't use schema.xml.
Is there any way to know all the fields Nutch sends to the indexer (Solr or other), apart from reading the code of every index-* plugin?
If I delete one of these fields in Solr, Nutch throws an exception, because every field name is hard-coded, for example:

doc.add("host", host);

I think that before adding a field to a document, Nutch should check whether that field is present in schema.xml.



Re: [MASSMAIL]RE: make responseTime native in nutch

Sebastian Nagel-2
Hi,

> https://issues.apache.org/jira/browse/NUTCH-2357

thanks!

> There are some plugins that add fields to doc that will be indexed, but i can't find one place that describe every fields that nutch send to index.

conf/schema.xml should list all fields filled by the core indexer or by any of the indexing filter plugins.

Yes, the names are hard-coded; a change to the name of an index field must be applied to both the schema and the Java code.

Ideally, the fields are also listed in the wiki:
  https://wiki.apache.org/nutch/IndexStructure
(but currently some plugins and their fields are not listed there)

> I think that before to add a field to a doc, nutch should check if that field is present in schema.xml or not.

Both IndexingFilters (which add fields) and indexer plugins (interface IndexWriter) are plugins and should work independently. A possible improvement could be to add methods which let
- indexing filters announce the fields they fill, and
- index writers list the required or optionally accepted fields.
The indexing job could then check in advance for undeclared index fields.

Best,
Sebastian
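A minimal sketch of such a pre-flight check, using plain Java sets; the method names here are hypothetical and not part of any existing Nutch interface:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class FieldCheckDemo {
  // Fields an indexing filter announces it will fill (hypothetical API).
  static Set<String> announced(String... fields) {
    return new LinkedHashSet<>(Set.of(fields));
  }

  // Return the announced fields that no index writer has declared,
  // i.e. the fields that would make the indexing job fail later.
  static Set<String> undeclared(Set<String> announcedFields, Set<String> declaredFields) {
    Set<String> missing = new LinkedHashSet<>(announcedFields);
    missing.removeAll(declaredFields);
    return missing;
  }

  public static void main(String[] args) {
    Set<String> announcedFields = announced("host", "title", "responseTime");
    Set<String> declaredFields = announced("host", "title", "content");
    // responseTime is announced by a filter but declared by no writer:
    System.out.println(undeclared(announcedFields, declaredFields)); // [responseTime]
  }
}
```

With such a check, the indexing job could fail fast with a clear message instead of throwing an exception deep inside an index writer.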

