get CrawlDatum

6 messages

get CrawlDatum

Uroš Gruber-2
Hi,

Could someone point me to how to get the CrawlDatum for the key URL in
ParseOutputFormat.write [83]?
I would like to add data to link URLs, but this data depends on the data
of the URL being crawled.

I hope I was clear enough about my problem.

regards

Uros

Re: get CrawlDatum

Andrzej Białecki-2
Uroš Gruber wrote:
> Hi,
>
> Could someone point me to how to get the CrawlDatum for the key URL in
> ParseOutputFormat.write [83]?
> I would like to add data to link URLs, but this data depends on the data
> of the URL being crawled.

You can't, because that instance of CrawlDatum is not available at this
point. Either you need to provide it on the input to the map/reduce job
(but then you will have to change the input and output formats), or you
should prepare this information in advance during parsing and put it
into ParseData.metadata.
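A minimal sketch of the second approach Andrzej suggests, i.e. recording the datum-derived value while it is still in hand and recovering it later from metadata alone. All names here are illustrative, and a plain HashMap stands in for Nutch's Metadata class:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: pass per-URL data from the fetch stage to the output stage
// through the parse metadata, since the CrawlDatum itself is not
// available in ParseOutputFormat.write. A HashMap stands in for Nutch's
// Metadata class; the "hop" key and the method names are assumptions.
public class MetadataPassThrough {

    // Fetcher side: record the datum-derived value while we still have it.
    static void beforeParsing(Map<String, String> contentMeta, int hop) {
        contentMeta.put("hop", Integer.toString(hop));
    }

    // Output side: recover the value from metadata alone, with a default
    // for records written before the field existed.
    static int recoverHop(Map<String, String> contentMeta) {
        String hop = contentMeta.get("hop");
        return hop == null ? 0 : Integer.parseInt(hop);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        beforeParsing(meta, 3);          // fetch stage
        System.out.println(recoverHop(meta)); // prints 3
    }
}
```

The point of the pattern is that the metadata map travels with the content through the segment, so any later stage that sees the parse output can read the value back without access to the original CrawlDatum.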

>
> I hope I was clear enough about my problem.
I hope so too ;)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: get CrawlDatum

Uroš Gruber-2
Andrzej Bialecki wrote:

> Uroš Gruber wrote:
>> Hi,
>>
>> Could someone point me to how to get the CrawlDatum for the key URL in
>> ParseOutputFormat.write [83]?
>> I would like to add data to link URLs, but this data depends on the data
>> of the URL being crawled.
>
> You can't, because that instance of CrawlDatum is not available at
> this point. Either you need to provide it on the input to the
> map/reduce job (but then you will have to change the input and output
> formats), or you should prepare this information in advance during
> parsing and put it into ParseData.metadata.
ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that datum holds info about the URL being fetched.

>>
>> I hope I was clear enough about my problem.
> I hope so too ;)
>



Re: get CrawlDatum

Andrzej Białecki-2
Uroš Gruber wrote:
> ParseData.metadata sounds nice, but I think I'm lost again :)
> If I understand the code flow, the best place would be in Fetcher [262],
> but I'm not sure that datum holds info about the URL being fetched.

On the input to the fetcher you get a URL and a CrawlDatum (originally
coming from the crawldb). Check, for example, how the segment name is
passed around in metadata; you can use the same method.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: get CrawlDatum

Uroš Gruber-2
Andrzej Bialecki wrote:

> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand the code flow, the best place would be in Fetcher [262],
>> but I'm not sure that datum holds info about the URL being fetched.
>
> On the input to the fetcher you get a URL and a CrawlDatum (originally
> coming from the crawldb). Check, for example, how the segment name is
> passed around in metadata; you can use the same method.
>
Hi,

I made a draft patch, but there are still some problems I see. I know
the code needs to be cleaned up and tested, but right now I don't know
what number to set for external URLs. For internal links it works great.

The whole idea of these changes:

Injected URLs always get hop 0. While fetching/updating/generating, the
hop value is incremented by 1 (I still have no idea what to do with
external links). Then I can add a config value such as max_hop to limit
the fetcher and generator from creating more URLs.

This way it's possible to limit crawling vertically.

Comments are welcome.

regards,

Uros
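The wire-format change in the patch below relies on write and readFields staying symmetric. A self-contained sketch of that pattern (field and class names are illustrative, not Nutch's):

```java
import java.io.*;

// Sketch of the Writable-style symmetry the CrawlDatum change depends
// on: every field written in write() must be read back in readFields()
// in the same order and with the same type. Names are illustrative.
public class HopRecord {
    byte status;
    float score;
    int hop;               // the newly added field

    void write(DataOutput out) throws IOException {
        out.writeByte(status);
        out.writeFloat(score);
        out.writeInt(hop);  // new field, written after score...
    }

    void readFields(DataInput in) throws IOException {
        status = in.readByte();
        score = in.readFloat();
        hop = in.readInt(); // ...so it must be read after score, too
    }

    static HopRecord roundTrip(HopRecord r) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        r.write(new DataOutputStream(buf));
        HopRecord copy = new HopRecord();
        copy.readFields(new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        HopRecord r = new HopRecord();
        r.status = 1; r.score = 1.0f; r.hop = 2;
        System.out.println(roundTrip(r).hop); // prints 2
    }
}
```

One caveat worth noting: CrawlDatum guards newer fields behind a version check (`if (version > 2)`), while the draft patch reads hop unconditionally, so data written before the change would not deserialize cleanly without a version bump.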

Index: java/org/apache/nutch/crawl/CrawlDatum.java
===================================================================
--- java/org/apache/nutch/crawl/CrawlDatum.java (revision 437981)
+++ java/org/apache/nutch/crawl/CrawlDatum.java (working copy)
@@ -57,6 +57,7 @@
   private byte status;
   private long fetchTime = System.currentTimeMillis();
   private byte retries;
+  private int hop;
   private float fetchInterval;
   private float score = 1.0f;
   private byte[] signature = null;
@@ -82,6 +83,8 @@
   public byte getStatus() { return status; }
   public void setStatus(int status) { this.status = (byte)status; }
 
+  public int getHop() { return hop; }
+  public void setHop(int hop) { this.hop = hop; }
   public long getFetchTime() { return fetchTime; }
   public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
 
@@ -151,6 +154,7 @@
     retries = in.readByte();
     fetchInterval = in.readFloat();
     score = in.readFloat();
+    hop = in.readInt();
     if (version > 2) {
       modifiedTime = in.readLong();
       int cnt = in.readByte();
@@ -186,6 +190,7 @@
     out.writeByte(retries);
     out.writeFloat(fetchInterval);
     out.writeFloat(score);
+    out.writeInt(hop);
     out.writeLong(modifiedTime);
     if (signature == null) {
       out.writeByte(0);
@@ -210,6 +215,7 @@
     this.score = that.score;
     this.modifiedTime = that.modifiedTime;
     this.signature = that.signature;
+    this.hop = that.hop;
     this.metaData = new MapWritable(that.metaData); // make a deep copy
   }
 
@@ -290,6 +296,7 @@
     buf.append("Retries since fetch: " + getRetriesSinceFetch() + "\n");
     buf.append("Retry interval: " + getFetchInterval() + " days\n");
     buf.append("Score: " + getScore() + "\n");
+    buf.append("Hop: " + getHop() + "\n");
     buf.append("Signature: " + StringUtil.toHexString(getSignature()) + "\n");
     buf.append("Metadata: " + (metaData != null ? metaData.toString() : "null") + "\n");
     return buf.toString();
Index: java/org/apache/nutch/crawl/Injector.java
===================================================================
--- java/org/apache/nutch/crawl/Injector.java (revision 437981)
+++ java/org/apache/nutch/crawl/Injector.java (working copy)
@@ -77,6 +77,7 @@
         value.set(url);                           // collect it
         CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, interval);
         datum.setScore(scoreInjected);
+        datum.setHop(0);
         try {
           scfilters.initialScore(value, datum);
         } catch (ScoringFilterException e) {
Index: java/org/apache/nutch/fetcher/Fetcher.java
===================================================================
--- java/org/apache/nutch/fetcher/Fetcher.java (revision 437981)
+++ java/org/apache/nutch/fetcher/Fetcher.java (working copy)
@@ -260,6 +260,8 @@
       Metadata metadata = content.getMetadata();
       // add segment to metadata
       metadata.set(SEGMENT_NAME_KEY, segmentName);
+
+      metadata.set("hop", Integer.toString(datum.getHop()));
       // add score to content metadata so that ParseSegment can pick it up.
       try {
         scfilters.passScoreBeforeParsing(key, datum, content);
Index: java/org/apache/nutch/parse/ParseOutputFormat.java
===================================================================
--- java/org/apache/nutch/parse/ParseOutputFormat.java (revision 437981)
+++ java/org/apache/nutch/parse/ParseOutputFormat.java (working copy)
@@ -85,8 +85,8 @@
           String fromHost = null;
           String toHost = null;          
           textOut.append(key, new ParseText(parse.getText()));
-          
           ParseData parseData = parse.getData();
+          String pd = parseData.getContentMeta().get("hop");
           // recover the signature prepared by Fetcher or ParseSegment
           String sig = parseData.getContentMeta().get(Fetcher.SIGNATURE_KEY);
           if (sig != null) {
@@ -151,6 +151,7 @@
               }
               continue;
             }
+            target.setHop(pd == null ? 1 : Integer.parseInt(pd) + 1);
             crawlOut.append(targetUrl, target);
             if (adjust != null) crawlOut.append(key, adjust);
           }

RE: get CrawlDatum

HUYLEBROECK Jeremy RD-ILAB-SSF-2
In reply to this post by Uroš Gruber-2

My current solution is a modified Fetcher that puts info into the Parse
metadata in the output method.

This info can then be used during parsing and so on.
As Andrzej said, I also had to create my own OutputFormat.


-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Wednesday, August 30, 2006 12:59 AM
To: [hidden email]
Subject: Re: get CrawlDatum

Uroš Gruber wrote:
> Hi,
>
> Could someone point me to how to get the CrawlDatum for the key URL in
> ParseOutputFormat.write [83]?
> I would like to add data to link URLs, but this data depends on the data
> of the URL being crawled.

You can't, because that instance of CrawlDatum is not available at this point. Either you need to provide it on the input to the map/reduce job (but then you will have to change the input and output formats), or you should prepare this information in advance during parsing and put it into ParseData.metadata.

>
> I hope I was clear enough about my problem.
I hope so too ;)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com