Reporter interface


Reporter interface

Andrew McNabb
I'm looking at the Reporter interface, and I would like to verify my
understanding of what it is.  It appears to me that Reporter.setStatus()
is called periodically during an operation to give a human-readable
description of progress so far.  Is that correct?

If so, is there a reason that RecordWriter.close() requires a Reporter
(are there situations where it takes a long time)?  Also, is there a
standard "NullReporter" class for situations where updating is not
needed?

If not, what does setStatus() do exactly?

Thanks in advance for the clarification.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: Reporter interface

Stefan Groschupf-2

On 07.01.2006 at 00:43, Andrew McNabb wrote:

> I'm looking at the Reporter interface, and I would like to verify my
> understanding of what it is.  It appears to me that Reporter.setStatus()
> is called periodically during an operation to give a human-readable
> description of progress so far.  Is that correct?
>
yes.
> If so, is there a reason that RecordWriter.close() requires a Reporter
> (are there situations where it takes a long time)?  Also, is there a
> standard "NullReporter" class for situations where updating is not
> needed?
>
I guess yes.

Stefan

Re: Reporter interface

Doug Cutting-2
In reply to this post by Andrew McNabb
Andrew McNabb wrote:
> I'm looking at the Reporter interface, and I would like to verify my
> understanding of what it is.  It appears to me that Reporter.setStatus()
> is called periodically during an operation to give a human-readable
> description of progress so far.  Is that correct?

Yes.  These strings appear in the web interface and in logs.

Reporter also has another function: to tell the MapReduce system that
things are not hung, that progress is still being made.  If an
individual operation (map, reduce, close) may take longer than the task
timeout (10 minutes by default?), then the Reporter should be called, or
the task will be assumed to be hung and will be killed.

> If so, is there a reason that RecordWriter.close() requires a Reporter
> (are there situations where it takes a long time)?

Some reduce processes (e.g., Lucene indexing) write to temporary local
files and then copy their final output to NDFS on close.
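
As a rough illustration of both points, a long-running close() might report
progress while it copies locally buffered output to NDFS.  This is only a
sketch: localParts and copyToNDFS are made-up names, and the Reporter itself
is handed in by the MapReduce framework rather than created by user code.

// Hypothetical RecordWriter.close(); calling the framework-supplied
// Reporter during the slow copy keeps the task from being judged hung
// and killed.  "localParts" and "copyToNDFS" are illustrative only.
public void close(Reporter reporter) throws IOException {
  for (int i = 0; i < localParts.length; i++) {
    copyToNDFS(localParts[i]);                                 // potentially slow
    reporter.setStatus("copied " + (i + 1) + " of "
                       + localParts.length + " files to NDFS"); // signals progress
  }
}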

> Also, is there a
> standard "NullReporter" class for situations where updating is not
> needed?

A NullReporter would be easy to define, but I'm not sure why you ask,
since Reporters are not usually created by user code but rather by the
MapReduce system.
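
For reference, the do-nothing Reporter asked about above could be as small as
the following sketch, which assumes Reporter declares nothing beyond
setStatus(String):

// Minimal sketch of a "NullReporter": satisfies the Reporter interface
// but discards every status update.
public class NullReporter implements Reporter {
  public void setStatus(String status) {
    // intentionally do nothing
  }
}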

Doug

Re: Reporter interface

Andrew McNabb
On Mon, Jan 09, 2006 at 11:45:09AM -0800, Doug Cutting wrote:
> A NullReporter would be easy to define, but I'm not sure why you ask,
> since Reporters are not usually created by user code but rather by
> the MapReduce system.
>

One of the great things about open source is that projects can be used
for unintended purposes.  In fact, Nutch works well for parallel
computing in general, not just for web indexing.  Apparently Google has
thousands of projects that use MapReduce.

I'm using Nutch right now (and I love it), but I currently have very
little interest in web indexing.  I have a project with a custom Mapper
and Reducer, and I needed to be able to read in the data from a
SequenceFile, which led me to the issue I emailed about.

I'd send you a patch with a NullReporter, but it's only four or five
lines. :)

Thanks for everything.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: Reporter interface

Doug Cutting-2
Andrew McNabb wrote:
> One of the great things about open source is that projects can be used
> for unintended purposes.  In fact, Nutch works well for parallel
> computing in general, not just for web indexing.  Apparently Google has
> thousands of projects that use MapReduce.

The plan is to move NDFS and MapReduce from Nutch to a new Lucene
sub-project, probably sometime in the next few months.

> I'm using Nutch right now (and I love it), but I currently have very
> little interest in web indexing.  I have a project with a custom Mapper
> and Reducer, and I needed to be able to read in the data from a
> SequenceFile, which led me to the issue I emailed about.
>
> I'd send you a patch with a NullReporter, but it's only four or five
> lines. :)

I'm still not clear why one might need a NullReporter.

Doug

HTMLMetaProcessor a bug?

Gal Nitzan
Hi,

I was going over the code and I noticed the following in

class org.apache.nutch.parse.html.HTMLMetaProcessor

method getMetaTagsHelper

the following code would fail if the meta tags were in upper case

        Node nameNode = attrs.getNamedItem("name");
        Node equivNode = attrs.getNamedItem("http-equiv");
        Node contentNode = attrs.getNamedItem("content");
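
To make the concern concrete: getNamedItem() matches the stored node name
exactly, so with a DOM that preserved the original attribute case, a tag
written as <META NAME="robots" CONTENT="noindex"> would expose attributes
named "NAME" and "CONTENT", and the lookups above would come back null.

// Illustration only -- assumes a DOM that preserves attribute case:
Node nameNode = attrs.getNamedItem("name");       // null: stored name is "NAME"
Node contentNode = attrs.getNamedItem("content"); // null: stored name is "CONTENT"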


G.



Re: Reporter interface

Andrew McNabb
In reply to this post by Doug Cutting-2
On Mon, Jan 09, 2006 at 03:28:45PM -0800, Doug Cutting wrote:
>
> I'm still not clear why one might need a NullReporter.

To be clearer, I should be a little more specific.  I had to read in
from a SequenceFile to interpret the results of a string of MapReduce
stages.  Here's a simplified snippet.  In this case I made a Reporter
called nullreporter that just does nothing.

SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter);

I don't like having to specify a Reporter to getRecordReader().
Actually, as I've thought more about it, it's probably a bad idea to
make a NullReporter class (although that might be better than nothing).
Maybe a better solution would be simply to allow null to be passed in,
but before calling setStatus(), check to make sure that it isn't null.
Is that a good idea?
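
For concreteness, the check being proposed would just be a null guard around
each status update inside the record reader; a sketch:

// Illustrative only -- "recordCount" is a hypothetical local variable,
// and the surrounding reader code is not the actual Nutch source.
if (reporter != null) {
  reporter.setStatus("read " + recordCount + " records");
}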

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: Reporter interface

Doug Cutting-2
Andrew McNabb wrote:
> SequenceFileInputFormat inputformat = new SequenceFileInputFormat();
> RecordReader in = inputformat.getRecordReader(fshandle, split[i], logjob, nullreporter);

To read sequence files directly outside of MapReduce, just use
SequenceFile directly, e.g., something like:

MyKey key = new MyKey();
MyValue value = new MyValue();

SequenceFile.Reader reader =
   new SequenceFile.Reader(NutchFileSystem.get("local"), "file");

while (reader.next(key, value)) {
   ... process key/value pair ...
}

Wouldn't that be simpler?

Doug

Re: HTMLMetaProcessor a bug?

Jérôme Charron
In reply to this post by Gal Nitzan
> the following code would fail if the meta tags were in upper case
>
>         Node nameNode = attrs.getNamedItem("name");
>         Node equivNode = attrs.getNamedItem("http-equiv");
>         Node contentNode = attrs.getNamedItem("content");

This code works correctly, because the Nutch HTML parser uses the Xerces
HTMLDocumentImpl implementation, which lowercases attribute names
(whereas element names are uppercased).
For consistency, and to decouple the Nutch HTML parser a little from the
Xerces implementation, I suggest changing these lines to something like:
Node nameNode = null;
Node equivNode = null;
Node contentNode = null;
for (int i=0; i<attrs.getLength(); i++) {
  Node attr = attrs.item(i);
  String attrName = attr.getNodeName().toLowerCase();
  if (attrName.equals("name")) {
    nameNode = attr;
  } else if (attrName.equals("http-equiv")) {
    equivNode = attr;
  } else if (attrName.equals("content")) {
    contentNode = attr;
  }
}


Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/

Re: HTMLMetaProcessor a bug?

Gal Nitzan
Thanks. I was checking something with the default parser from the JDK...

On Tue, 2006-01-10 at 11:06 +0100, Jérôme Charron wrote:

> > the following code would fail if the meta tags were in upper case
> >
> >         Node nameNode = attrs.getNamedItem("name");
> >         Node equivNode = attrs.getNamedItem("http-equiv");
> >         Node contentNode = attrs.getNamedItem("content");
>
> This code works correctly, because the Nutch HTML parser uses the Xerces
> HTMLDocumentImpl implementation, which lowercases attribute names
> (whereas element names are uppercased).
> For consistency, and to decouple the Nutch HTML parser a little from the
> Xerces implementation, I suggest changing these lines to something like:
> Node nameNode = null;
> Node equivNode = null;
> Node contentNode = null;
> for (int i=0; i<attrs.getLength(); i++) {
>   Node attr = attrs.item(i);
>   String attrName = attr.getNodeName().toLowerCase();
>   if (attrName.equals("name")) {
>     nameNode = attr;
>   } else if (attrName.equals("http-equiv")) {
>     equivNode = attr;
>   } else if (attrName.equals("content")) {
>     contentNode = attr;
>   }
> }
>
>
> Jérôme
>
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/



Re: Reporter interface

Andrew McNabb
In reply to this post by Doug Cutting-2
On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote:

> To read sequence files directly outside of MapReduce, just use
> SequenceFile directly, e.g., something like:
>
> MyKey key = new MyKey();
> MyValue value = new MyValue();
>
> SequenceFile.Reader reader =
>   new SequenceFile.Reader(NutchFileSystem.get("local"), "file");
>
> while (reader.next(key, value)) {
>   ... process key/value pair ...
> }
>
> Wouldn't that be simpler?
Who knows?  Maybe it would be. :)

With the approach that you just described, what's the easiest way to get
all of the files in a directory (the full output of a reduce)?

Thanks.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: Reporter interface

Doug Cutting-2
Andrew McNabb wrote:

> On Mon, Jan 09, 2006 at 05:00:00PM -0800, Doug Cutting wrote:
>
>>To read sequence files directly outside of MapReduce, just use
>>SequenceFile directly, e.g., something like:
>>
>>MyKey key = new MyKey();
>>MyValue value = new MyValue();
>>
>>SequenceFile.Reader reader =
>>  new SequenceFile.Reader(NutchFileSystem.get("local"), "file");
>>
>>while (reader.next(key, value)) {
>>  ... process key/value pair ...
>>}
>>
>>Wouldn't that be simpler?
>
>
> Who knows?  Maybe it would be. :)
>
> With the approach that you just described, what's the easiest way to get
> all of the files in a directory (the full output of a reduce)?

NutchFileSystem fs = NutchFileSystem.get();
File[] files = fs.listFiles(directory);

Doug
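
Putting the two snippets in this thread together, reading the whole output of
a reduce might look roughly like this.  It is only a sketch: it reuses the
class names quoted above, and the constructor arguments and the use of
File.getPath() are assumptions.

NutchFileSystem fs = NutchFileSystem.get();
File[] files = fs.listFiles(directory);      // e.g. part-00000, part-00001, ...

MyKey key = new MyKey();
MyValue value = new MyValue();

for (int i = 0; i < files.length; i++) {
  SequenceFile.Reader reader =
      new SequenceFile.Reader(fs, files[i].getPath());
  while (reader.next(key, value)) {
    // ... process key/value pair ...
  }
  reader.close();
}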

Re: HTMLMetaProcessor a bug?

Doug Cutting-2
In reply to this post by Jérôme Charron
Jérôme Charron wrote:

> For consistency, and to decouple the Nutch HTML parser a little from the
> Xerces implementation, I suggest changing these lines to something like:
> Node nameNode = null;
> Node equivNode = null;
> Node contentNode = null;
> for (int i=0; i<attrs.getLength(); i++) {
>   Node attr = attrs.item(i);
>   String attrName = attr.getNodeName().toLowerCase();
>   if (attrName.equals("name")) {
>     nameNode = attr;
>   } else if (attrName.equals("http-equiv")) {
>     equivNode = attr;
>   } else if (attrName.equals("content")) {
>     contentNode = attr;
>   }
> }

+1

Re: Reporter interface

Andrew McNabb
In reply to this post by Doug Cutting-2
On Tue, Jan 10, 2006 at 08:44:46AM -0800, Doug Cutting wrote:
>
> NutchFileSystem fs = NutchFileSystem.get();
> File[] files = fs.listFiles(directory);
>

Thanks.  I'll try doing it this way instead of how I was doing it
earlier.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868


Re: HTMLMetaProcessor a bug?

Gal Nitzan
In reply to this post by Doug Cutting-2
Because I needed to add two more fields from the meta tags in the HTML
page, I have revised some of the code in HTMLMetaProcessor and in
DOMContentUtils.

I believe it is a little more generic than the existing code and than
Jérôme's sample above (see DOMContentUtils.GetMetaAttributes), since the
existing code can only handle http-equiv or name...

Since I am not too familiar with svn, I am pasting it below this email;
it might be useful to someone.

On Tue, 2006-01-10 at 08:48 -0800, Doug Cutting wrote:

> Jérôme Charron wrote:
> > For consistency, and to decouple the Nutch HTML parser a little from the
> > Xerces implementation, I suggest changing these lines to something like:
> > Node nameNode = null;
> > Node equivNode = null;
> > Node contentNode = null;
> > for (int i=0; i<attrs.getLength(); i++) {
> >   Node attr = attrs.item(i);
> >   String attrName = attr.getNodeName().toLowerCase();
> >   if (attrName.equals("name")) {
> >     nameNode = attr;
> >   } else if (attrName.equals("http-equiv")) {
> >     equivNode = attr;
> >   } else if (attrName.equals("content")) {
> >     contentNode = attr;
> >   }
> > }
>
> +1
>


/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.nutch.parse.html;

import java.net.URL;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.nutch.parse.Outlink;

import org.w3c.dom.*;

/**
 * A collection of methods for extracting content from DOM trees.
 * <p/>
 * This class holds a few utility methods for pulling content out of
 * DOM nodes, such as getOutlinks, getText, etc.
 */
public class DOMContentUtils {

  public static class LinkParams {
    public String elName;
    public String attrName;
    public int childLen;

    public LinkParams(String elName, String attrName, int childLen) {
      this.elName = elName;
      this.attrName = attrName;
      this.childLen = childLen;
    }

    public String toString() {
      return "LP[el=" + elName + ",attr=" + attrName + ",len=" +
childLen + "]";
    }
  }

  public static HashMap linkParams = new HashMap();

  static {
    linkParams.put("a", new LinkParams("a", "href", 1));
    linkParams.put("area", new LinkParams("area", "href", 0));
    linkParams.put("form", new LinkParams("form", "action", 1));
    linkParams.put("frame", new LinkParams("frame", "src", 0));
    linkParams.put("iframe", new LinkParams("iframe", "src", 0));
    linkParams.put("script", new LinkParams("script", "src", 0));
    linkParams.put("link", new LinkParams("link", "href", 0));
    linkParams.put("img", new LinkParams("img", "src", 0));
  }

  /**
   * This method takes a {@link StringBuffer} and a DOM {@link Node},
   * and will append all the content text found beneath the DOM node to
   * the <code>StringBuffer</code>.
   * <p/>
   * <p/>
   * <p/>
   * If <code>abortOnNestedAnchors</code> is true, DOM traversal will
   * be aborted and the <code>StringBuffer</code> will not contain
   * any text encountered after a nested anchor is found.
   * <p/>
   * <p/>
   *
   * @return true if nested anchors were found
   */
  public static final boolean getText(StringBuffer sb, Node node,
                                      boolean abortOnNestedAnchors) {
    if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
      return true;
    }
    return false;
  }


  /**
   * This is a convenience method, equivalent to {@link
   * #getText(StringBuffer,Node,boolean) getText(sb, node, false)}.
   */
  public static final void getText(StringBuffer sb, Node node) {
    getText(sb, node, false);
  }

  // returns true if abortOnNestedAnchors is true and we find nested
  // anchors
  private static final boolean getTextHelper(StringBuffer sb, Node node,
                                             boolean abortOnNestedAnchors,
                                             int anchorDepth) {
    if ("script".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if ("style".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if (abortOnNestedAnchors && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (anchorDepth > 1)
        return true;
    }
    if (node.getNodeType() == Node.COMMENT_NODE) {
      return false;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      // cleanup and trim the value
      String text = node.getNodeValue();
      text = text.replaceAll("\\s+", " ");
      text = text.trim();
      if (text.length() > 0) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }
    }
    boolean abort = false;
    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        if (getTextHelper(sb, children.item(i),
            abortOnNestedAnchors, anchorDepth)) {
          abort = true;
          break;
        }
      }
    }
    return abort;
  }

  /**
   * This method takes a {@link StringBuffer} and a DOM {@link Node},
   * and will append the content text found beneath the first
   * <code>title</code> node to the <code>StringBuffer</code>.
   *
   * @return true if a title node was found, false otherwise
   */
  public static final boolean getTitle(StringBuffer sb, Node node) {
    if ("body".equalsIgnoreCase(node.getNodeName())) // stop after HEAD
      return false;

    if (node.getNodeType() == Node.ELEMENT_NODE) {
      if ("title".equalsIgnoreCase(node.getNodeName())) {
        getText(sb, node);
        return true;
      }
    }
    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        if (getTitle(sb, children.item(i))) {
          return true;
        }
      }
    }
    return false;
  }

  public static final String GetMetaAttributes(Node node, String nodeName,
                                               String nodeValue) {
    String ret = null;
    if ("body".equalsIgnoreCase(node.getNodeName()))
      return ret;

    if (node.getNodeType() == Node.ELEMENT_NODE) {
      if ("meta".equalsIgnoreCase(node.getNodeName())) {
        if (!node.hasAttributes())
          return ret;

        NamedNodeMap attr = node.getAttributes();

        if (attr.getLength() != 2)
          return ret;

        Node n1 = attr.item(0);
        Node n2 = attr.item(1);

        if (nodeName.equalsIgnoreCase(n1.getNodeName()))
        {
          if (!nodeValue.equalsIgnoreCase(n1.getNodeValue()))
            return ret;

          if (!"content".equalsIgnoreCase(n2.getNodeName()))
            return ret;

          ret = n2.getNodeValue().toLowerCase();

          return ret;
        }

        if (nodeName.equalsIgnoreCase(n2.getNodeName()))
        {
          if (!nodeValue.equalsIgnoreCase(n2.getNodeValue()))
            return ret;

          if (!"content".equalsIgnoreCase(n1.getNodeName()))
            return ret;

          ret = n1.getNodeValue().toLowerCase();

          return ret;
        }

        return ret;
      }
    }

    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        if ((ret = GetMetaAttributes(children.item(i), nodeName,
                                     nodeValue)) != null) {
          return ret;
        }
      }
    }

    return ret;
  }

  /**
   * If the node contains a BASE tag then its HREF is returned.
   */
  public static final URL getBase(Node node) {

    // is this node a BASE tag?
    if (node.getNodeType() == Node.ELEMENT_NODE) {

      if ("body".equalsIgnoreCase(node.getNodeName())) // stop after
HEAD
        return null;


      if ("base".equalsIgnoreCase(node.getNodeName())) {
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
          Node attr = attrs.item(i);
          if ("href".equalsIgnoreCase(attr.getNodeName())) {
            try {
              return new URL(attr.getNodeValue());
            } catch (MalformedURLException e) {
            }
          }
        }
      }
    }

    // does it contain a base tag?
    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        URL base = getBase(children.item(i));
        if (base != null)
          return base;
      }
    }

    // no.
    return null;
  }


  private static boolean hasOnlyWhiteSpace(Node node) {
    String val = node.getNodeValue();
    for (int i = 0; i < val.length(); i++) {
      if (!Character.isWhitespace(val.charAt(i)))
        return false;
    }
    return true;
  }

  // this only covers a few cases of empty links that are symptomatic
  // of nekohtml's DOM-fixup process...
  private static boolean shouldThrowAwayLink(Node node, NodeList children,
                                             int childLen, LinkParams params) {
    if (childLen == 0) {
      // this has no inner structure
      if (params.childLen == 0) return false;
      else return true;
    } else if ((childLen == 1)
        && (children.item(0).getNodeType() == Node.ELEMENT_NODE)
        && (params.elName.equalsIgnoreCase(children.item(0).getNodeName()))) {
      // single nested link
      return true;

    } else if (childLen == 2) {

      Node c0 = children.item(0);
      Node c1 = children.item(1);

      if ((c0.getNodeType() == Node.ELEMENT_NODE)
          && (params.elName.equalsIgnoreCase(c0.getNodeName()))
          && (c1.getNodeType() == Node.TEXT_NODE)
          && hasOnlyWhiteSpace(c1)) {
        // single link followed by whitespace node
        return true;
      }

      if ((c1.getNodeType() == Node.ELEMENT_NODE)
          && (params.elName.equalsIgnoreCase(c1.getNodeName()))
          && (c0.getNodeType() == Node.TEXT_NODE)
          && hasOnlyWhiteSpace(c0)) {
        // whitespace node followed by single link
        return true;
      }

    } else if (childLen == 3) {
      Node c0 = children.item(0);
      Node c1 = children.item(1);
      Node c2 = children.item(2);

      if ((c1.getNodeType() == Node.ELEMENT_NODE)
          && (params.elName.equalsIgnoreCase(c1.getNodeName()))
          && (c0.getNodeType() == Node.TEXT_NODE)
          && (c2.getNodeType() == Node.TEXT_NODE)
          && hasOnlyWhiteSpace(c0)
          && hasOnlyWhiteSpace(c2)) {
        // single link surrounded by whitespace nodes
        return true;
      }
    }

    return false;
  }

  /**
   * This method finds all anchors below the supplied DOM
   * <code>node</code>, and creates appropriate {@link Outlink}
   * records for each (relative to the supplied <code>base</code>
   * URL), and adds them to the <code>outlinks</code> {@link
   * ArrayList}.
   * <p/>
   * <p/>
   * <p/>
   * Links without inner structure (tags, text, etc) are discarded, as
   * are links which contain only single nested links and empty text
   * nodes (this is a common DOM-fixup artifact, at least with
   * nekohtml).
   */
  public static final void getOutlinks(URL base, ArrayList outlinks,
                                       Node node) {

    NodeList children = node.getChildNodes();
    int childLen = 0;
    if (children != null)
      childLen = children.getLength();

    if (node.getNodeType() == Node.ELEMENT_NODE) {
      LinkParams params =
          (LinkParams) linkParams.get(node.getNodeName().toLowerCase());
      if (params != null) {
        if (!shouldThrowAwayLink(node, children, childLen, params)) {

          StringBuffer linkText = new StringBuffer();
          getText(linkText, node, true);

          NamedNodeMap attrs = node.getAttributes();
          String target = null;
          boolean noFollow = false;
          boolean post = false;
          for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            String attrName = attr.getNodeName();
            if (params.attrName.equalsIgnoreCase(attrName)) {
              target = attr.getNodeValue();
            } else if ("rel".equalsIgnoreCase(attrName) &&
                "nofollow".equalsIgnoreCase(attr.getNodeValue())) {
              noFollow = true;
            } else if ("method".equalsIgnoreCase(attrName) &&
                "post".equalsIgnoreCase(attr.getNodeValue())) {
              post = true;
            }
          }
          if (target != null && !noFollow && !post)
            try {
              URL url = new URL(base, target);
              outlinks.add(new Outlink(url.toString(),
                  linkText.toString().trim()));
            } catch (MalformedURLException e) {
              // don't care
            }
        }
        // this should not have any children, skip them
        if (params.childLen == 0) return;
      }
    }
    for (int i = 0; i < childLen; i++) {
      getOutlinks(base, outlinks, children.item(i));
    }
  }

}

-------------------------------------
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.nutch.parse.html;

import java.net.URL;
import java.util.Properties;

import org.apache.nutch.parse.HTMLMetaTags;
import org.w3c.dom.*;

/**
 * Class for parsing META Directives from DOM trees.  This class
 * handles specifically Robots META directives (all, none, nofollow,
 * noindex), finding BASE HREF tags, and HTTP-EQUIV no-cache
 * instructions. All meta directives are stored in an HTMLMetaTags
 * instance.
 */
public class HTMLMetaProcessor {

  /**
   * Utility class with indicators for the robots directives "noindex"
   * and "nofollow", and HTTP-EQUIV/no-cache
   */

  /**
   * Sets the indicators in <code>robotsMeta</code> to appropriate
   * values, based on any META tags found under the given
   * <code>node</code>.
   */
  public static final void getMetaTags(
      HTMLMetaTags metaTags, Node node, URL currURL) {

    metaTags.reset();
    getMetaTagsHelper(metaTags, node, currURL);
  }

  /**
   * Collect available meta tags from HTML page
   * @param metaTags
   * @param node
   * @param currURL
   */
  private static final void getMetaTagsHelper(HTMLMetaTags metaTags,
                                              Node node, URL currURL) {
    String content;
    int index;

    content = DOMContentUtils.GetMetaAttributes(node, "name",
"description");
    metaTags.setDescription(content);

    content = DOMContentUtils.GetMetaAttributes(node, "name",
"keywords");
    metaTags.setKeywords(content);

    content = DOMContentUtils.GetMetaAttributes(node, "http-equiv",
"pragma");
    if (content != null) {
      index = content.indexOf("no-cache");
      if (index >= 0)
        metaTags.setNoCache();
    }

    content = DOMContentUtils.GetMetaAttributes(node, "http-equiv",
"refresh");
    if (content != null) {
      index = content.indexOf(';');

      String time = null;
      if (index == -1) { // just the refresh time
        time = content;
      } else
        time = content.substring(0, index);
      try {
        metaTags.setRefreshTime(Integer.parseInt(time));
        // skip this if we couldn't parse the time
        metaTags.setRefresh(true);
      } catch (Exception e) {
        ;
      }

      URL refreshUrl = null;
      if (metaTags.getRefresh() && index != -1) { // set the URL
        index = content.indexOf("url=");
        if (index == -1) { // assume a mis-formatted entry with just the url
          index = content.indexOf(';') + 1;
        } else index += 4;
        if (index != -1) {
          String url = content.substring(index);
          try {
            refreshUrl = new URL(url);
          } catch (Exception e) {
            // XXX according to the spec, this has to be an absolute
            // XXX url. However, many websites use relative URLs and
            // XXX expect browsers to handle that.
            // XXX Unfortunately, in some cases this may create a
            // XXX infinitely recursive paths (a crawler trap)...
            // if (!url.startsWith("/")) url = "/" + url;
            try {
              refreshUrl = new URL(currURL, url);
            } catch (Exception e1) {
              refreshUrl = null;
            }
          }
        }
      }
      if (metaTags.getRefresh()) {
        if (refreshUrl == null) {
          // apparently only refresh time was present. set the URL
          // to the same URL.
          refreshUrl = currURL;
        }
        metaTags.setRefreshHref(refreshUrl);
      }
    } // refresh

    content = DOMContentUtils.GetMetaAttributes(node, "name", "robots");
    if (content != null) {
      index = content.indexOf("none");
      if (index >= 0) {
        metaTags.setNoIndex();
        metaTags.setNoFollow();
      }
      index = content.indexOf("all");
      if (index >= 0) {
        // do nothing...
      }
      index = content.indexOf("noindex");
      if (index >= 0) {
        metaTags.setNoIndex();
      }

      index = content.indexOf("nofollow");
      if (index >= 0) {
        metaTags.setNoFollow();
      }
    }

    URL url = DOMContentUtils.getBase(node);

    if (url != null) {
      metaTags.setBaseHref(url);
    }
  }
}
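
For anyone who wants to try the pasted version, a minimal usage sketch might
look like the following.  Here "root" and "pageUrl" stand for the DOM tree and
URL of the fetched page, and the two getters are assumed to exist on the
revised HTMLMetaTags to match the setDescription()/setKeywords() calls above.

HTMLMetaTags metaTags = new HTMLMetaTags();
HTMLMetaProcessor.getMetaTags(metaTags, root, pageUrl);  // root: parsed DOM, pageUrl: page URL

String description = metaTags.getDescription();  // assumed getter
String keywords = metaTags.getKeywords();        // assumed getter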