Can we add this to nutch?


Can we add this to nutch?

misc

Hi all-

    I asked for this before but no one answered, so I will try again.

    I have included an svn diff with a small proposed change that would let users track found-but-filtered content during a crawl.  This is useful both as a diagnostic tool (to see what we are skipping) and as a way to discover the content and links a page or site points to without actually downloading that content.
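As a rough sketch of the diagnostic use: the lines this patch logs could be pulled back out of the task log afterwards.  The log location, log4j line layout, and the sample line below are illustrative assumptions, not verified Nutch output.

```shell
# Illustrative only: fabricate one log line in the shape the patch would
# emit, then extract the skipped URL from it.  The real log file and
# layout depend on your log4j configuration.
printf '%s\n' \
  '2007-11-09 12:00:01 INFO  parse.ParseOutputFormat - filtering externalLink http://other.example.com/a linked to by http://example.com/' \
  > sample.log
grep -o 'filtering externalLink [^ ]*' sample.log | awk '{print $3}'
```

The same grep on "filtering content" would list the URLs rejected by the URL filters rather than by the external-link check.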

    I have set the log level to info; perhaps it should be debug.
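Rather than hard-coding the choice between info and debug in the patch, the level could be made switchable per class.  This is a sketch assuming Nutch's stock conf/log4j.properties is in use and that the class lives at org.apache.nutch.parse.ParseOutputFormat:

```properties
# Raise or lower verbosity for just this class, without recompiling.
log4j.logger.org.apache.nutch.parse.ParseOutputFormat=DEBUG
```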

    I think this would be a useful addition for many users.

    Could someone make this change?  If I am misunderstanding something and there is already a better way to do this, what is it?

                        see you
                            -Jim


Index: ParseOutputFormat.java
===================================================================
--- ParseOutputFormat.java      (revision 593619)
+++ ParseOutputFormat.java      (working copy)
@@ -193,17 +193,20 @@
                 toHost = null;
               }
               if (toHost == null || !toHost.equals(fromHost)) { // external links
+                LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
                 continue; // skip it
               }
             }
             try {
               toUrl = normalizers.normalize(toUrl,
                           URLNormalizers.SCOPE_OUTLINK); // normalize the url
-              toUrl = filters.filter(toUrl);   // filter the url
-              if (toUrl == null) {
-                continue;
-              }
-            } catch (Exception e) {
+              String filteredUrl = filters.filter(toUrl);   // filter the url
+              if (filteredUrl == null) {
+                LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
+                continue;
+              }
+              toUrl = filteredUrl;
+            } catch (Exception e) {
               continue;
             }
             CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);

Re: Can we add this to nutch?

Dennis Kubes-2
The best way to get this included is to open a JIRA ticket and attach
your patch.  One or more of the committers, time allowing, will
then take a look at it for inclusion.

Dennis Kubes
