Why isn't this working?


Paul Tomblin
After applying the patch I sent earlier, I got it so that it correctly
skips downloading pages that haven't changed.  And after doing the
generate/fetch/updatedb loop, and merging the segments with mergeseg,
dumping the segment file seems to show that it still has the old
content as well as the new content.  But when I then ran the
invertlinks and index step, the resulting index consists of very small
files compared to the files from the previous crawl, indicating that
it only indexed the stuff that it had newly fetched.  I tried the
NutchBean, and sure enough it could only find things I knew were on
the newly loaded pages, and couldn't find things that occur hundreds
of times on the pages that haven't changed.  "merge" doesn't seem to
help, since the resulting merged index is still the same size as
before merging.

Is there a way to fix this, or should I just admit that Nutch is
hopelessly broken when it comes to trying to avoid hitting pages that
haven't changed and roll out my changes?

--
http://www.linkedin.com/in/paultomblin

Re: Why isn't this working?

Alex McLintock
I've been wondering about this problem. When you did the invertlinks
and index steps, did you run them on just the current/most recent
segment or on all the segments?

Presumably this is why you tried to do a merge?

Alex

2009/8/10 Paul Tomblin <[hidden email]>:

> After applying the patch I sent earlier, I got it so that it correctly
> skips downloading pages that haven't changed.  And after doing the
> generate/fetch/updatedb loop, and merging the segments with mergeseg,
> dumping the segment file seems to show that it still has the old
> content as well as the new content.  But when I then ran the
> invertlinks and index step, the resulting index consists of very small
> files compared to the files from the previous crawl, indicating that
> it only indexed the stuff that it had newly fetched.  I tried the
> NutchBean, and sure enough it could only find things I knew were on
> the newly loaded pages, and couldn't find things that occur hundreds
> of times on the pages that haven't changed.  "merge" doesn't seem to
> help, since the resulting merged index is still the same size as
> before merging.

Re: Why isn't this working?

Paul Tomblin
I followed the script (with minor variations) in the wiki at
http://wiki.apache.org/nutch/Crawl
However, I think I found another bug. Apply this patch and it will
index pages with a status of STATUS_FETCH_NOTMODIFIED as well as
STATUS_FETCH_SUCCESS.

Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java (revision 802632)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java (working copy)
@@ -84,8 +84,10 @@
         if (CrawlDatum.hasDbStatus(datum))
           dbDatum = datum;
         else if (CrawlDatum.hasFetchStatus(datum)) {
-          // don't index unmodified (empty) pages
-          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+          /*
+           * Where did this person get the idea that unmodified pages are empty?
+          // don't index unmodified (empty) pages
+          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
             fetchDatum = datum;
         } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                    CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
     }

     if (!parseData.getStatus().isSuccess() ||
-        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+        (fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
       return;
     }

Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (revision 802632)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (working copy)
@@ -124,11 +124,15 @@
         reqStr.append("\r\n");
       }

-      reqStr.append("\r\n");
       if (datum.getModifiedTime() > 0) {
         reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
         reqStr.append("\r\n");
       }
+      else if (datum.getFetchTime() > 0) {
+          reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getFetchTime()));
+          reqStr.append("\r\n");
+      }
+      reqStr.append("\r\n");

       byte[] reqBytes= reqStr.toString().getBytes();

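The HttpResponse change does two things: it moves the blank line that terminates the HTTP headers so it comes after the If-Modified-Since header rather than before it (otherwise the header lands in the request body and the server ignores it), and it falls back to the last fetch time when no modification time was recorded. A small Python sketch of that fallback, with a hypothetical function name and the standard library doing the HTTP date formatting; for simplicity it takes Unix epoch seconds rather than Nutch's millisecond timestamps:

```python
from email.utils import formatdate

def if_modified_since_header(modified_time, fetch_time):
    """Mirror the patched logic: prefer the recorded modification time,
    fall back to the last fetch time, and send no conditional header on
    a first-ever fetch (both timestamps zero)."""
    ts = modified_time if modified_time > 0 else fetch_time
    if ts <= 0:
        return None  # nothing known yet; fetch unconditionally
    # usegmt=True yields an RFC 1123 date ("Mon, 10 Aug 2009 ... GMT"),
    # the format HTTP requires for If-Modified-Since.
    return "If-Modified-Since: " + formatdate(ts, usegmt=True)

# A page fetched before, but with no Last-Modified recorded, still gets
# a conditional request keyed off the previous fetch time.
print(if_modified_since_header(0, 1249862400))
```

The fallback matters because many servers never send Last-Modified, so without it those pages would be re-downloaded unconditionally on every crawl.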

On Tue, Aug 11, 2009 at 5:35 AM, Alex McLintock<[hidden email]> wrote:

> I've been wondering about this problem. When you did the invertlinks
> and index steps did you do it just on the current/most recent segment
> or all the segments?
>
> Presumably this is why you tried to do a merge?
>
> Alex
>
> 2009/8/10 Paul Tomblin <[hidden email]>:
>> After applying the patch I sent earlier, I got it so that it correctly
>> skips downloading pages that haven't changed.  And after doing the
>> generate/fetch/updatedb loop, and merging the segments with mergeseg,
>> dumping the segment file seems to show that it still has the old
>> content as well as the new content.  But when I then ran the
>> invertlinks and index step, the resulting index consists of very small
>> files compared to the files from the previous crawl, indicating that
>> it only indexed the stuff that it had newly fetched.  I tried the
>> NutchBean, and sure enough it could only find things I knew were on
>> the newly loaded pages, and couldn't find things that occur hundreds
>> of times on the pages that haven't changed.  "merge" doesn't seem to
>> help, since the resulting merged index is still the same size as
>> before merging.
>



--
http://www.linkedin.com/in/paultomblin