Need help in updating url in runtime in []

I'm trying to fix a Nutch "bug" in the fetcher.
I don't know if you've noticed, but if you try to fetch sites that don't
have the "www" prefix in their URL, and these sites registered only the
"www" host and not the bare domain, the Nutch fetch will fail. (I know it's
not a bug, but I would like Nutch to handle this case.)

So I've written a hardcoded snippet in the fetcher's run() method:

 public void run() {
      synchronized (Fetcher.this) {activeThreads++;} // count threads
      ... some code here....
              redirecting = false;
              Protocol protocol = this.protocolFactory.getProtocol(
              ProtocolOutput output = protocol.getProtocolOutput(url,
              ProtocolStatus status = output.getStatus();
              Content content = output.getContent();
              ParseStatus pstatus = null;
              // here comes my code (the fetcher has thrown an exception)
              if ( status.getCode() == ProtocolStatus.EXCEPTION ) {
                  String urlPrefix = "";
                  String newurl = url.toString();
         ("this is newurl: " + newurl);
                  Pattern urlRegex =
                  Matcher urlMatcher = urlRegex.matcher(url.toString());
                  if (urlMatcher.find())
                     urlPrefix =;
                  newurl = newurl.replaceAll("http://" + urlPrefix, "http://www." + urlPrefix);

                  // Now I have the new, fixed URL with the "www" prefix,
                  // but I don't know how to update 'url', which is of type Text,
                  // without damaging the rest of the data it holds (like contentType).
                  // Here is the command I'm unsure about:
                  url.set(newurl); // ???
              }


              switch(status.getCode()) {

              case ProtocolStatus.SUCCESS:        // got a page

              pstatus = output(url, datum, content, status, CrawlDatum.S*

             ..... some more code....
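
The "www" rewrite itself can be tried outside the fetcher. Below is a minimal standalone sketch of that step only, assuming plain http/https URLs without user-info; the class name WwwPrefix and the method addWww are hypothetical and not part of Nutch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WwwPrefix {
    // Group 1 is the scheme ("http://" or "https://"), group 2 is everything
    // after it, starting at the host.
    private static final Pattern SCHEME_AND_REST =
        Pattern.compile("^(https?://)(.*)$", Pattern.CASE_INSENSITIVE);

    /** Returns the URL with "www." inserted before the host, or the input unchanged. */
    public static String addWww(String url) {
        Matcher m = SCHEME_AND_REST.matcher(url);
        if (!m.matches()) {
            return url; // not an http(s) URL: leave it alone
        }
        String rest = m.group(2);
        if (rest.regionMatches(true, 0, "www.", 0, 4)) {
            return url; // already has the prefix
        }
        return m.group(1) + "www." + rest;
    }

    public static void main(String[] args) {
        System.out.println(addWww("http://example.com/page"));
        System.out.println(addWww("http://www.example.com/page"));
    }
}
```

Unlike String.replaceAll, which treats its first argument as a regular expression, this version does not depend on the host being free of regex metacharacters such as ".".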

Thank you,


Eyal Edri