Redirects to subdomains

sangeet
I came across an issue where the main page of a site redirects to a
subdomain, and the redirect doesn't get followed during the crawl. The URL
http://www.mercenarytrader.com redirects to
https://members.mercenarytrader.com, which doesn't get followed.
In nutch-site.xml I have db.ignore.external.links set to 'true'
and db.ignore.external.links.mode set to 'byDomain', since I only want
to crawl within the domain, including subdomains.

I found this redirect code in FetcherThread, which causes the issue.
Instead of comparing the domains, it compares the hosts, and they don't
match, i.e. members.mercenarytrader.com doesn't match
www.mercenarytrader.com. Is there an existing issue that has been logged
for this?

  String origHost = new URL(urlString).getHost().toLowerCase();
  String newHost = new URL(newUrl).getHost().toLowerCase();
  if (ignoreExternalLinks) {
    if (!origHost.equals(newHost)) {
      if (LOG.isDebugEnabled()) {
        LOG.debug(" - ignoring redirect " + redirType + " from "
            + urlString + " to " + newUrl
            + " because external links are ignored");
      }
      return null;
    }
  }
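
To make the difference between the two comparisons concrete, here is a minimal, self-contained sketch (not the Nutch code and not a proposed patch). The class name, the `byDomain` flag, and the `naiveDomain` helper are hypothetical; `naiveDomain` simply keeps the last two host labels, which is an assumption that breaks for suffixes such as co.uk, so a real fix would need a public-suffix-aware domain lookup.

```java
import java.net.URI;
import java.util.Locale;

public class RedirectScopeSketch {

    /** Hypothetical stand-in for a real domain lookup: keeps the last two host labels. */
    public static String naiveDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        if (labels.length <= 2) {
            return String.join(".", labels);
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /** Returns true if the redirect target would be dropped as "external". */
    public static boolean isExternalRedirect(String fromUrl, String toUrl, boolean byDomain) {
        String fromHost = URI.create(fromUrl).getHost().toLowerCase(Locale.ROOT);
        String toHost = URI.create(toUrl).getHost().toLowerCase(Locale.ROOT);
        if (byDomain) {
            // Compare registered domains, so subdomains of the same site match.
            return !naiveDomain(fromHost).equals(naiveDomain(toHost));
        }
        // Compare full host names, as the quoted snippet does.
        return !fromHost.equals(toHost);
    }

    public static void main(String[] args) {
        String from = "http://www.mercenarytrader.com";
        String to = "https://members.mercenarytrader.com";
        System.out.println("byHost external:   " + isExternalRedirect(from, to, false));
        System.out.println("byDomain external: " + isExternalRedirect(from, to, true));
    }
}
```

With the URLs from this post, the host comparison reports the redirect as external, while the domain comparison does not.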

Re: Redirects to subdomains

Sebastian Nagel
Hi,

Looks like this was overlooked when
https://issues.apache.org/jira/browse/NUTCH-2069 was implemented.
Please open an issue on
  https://issues.apache.org/jira/browse/NUTCH
to report it.

As a temporary workaround, try setting
  http.redirect.max = 0
Redirects are then treated the same as links.
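
As a rough illustration of why the workaround helps, the sketch below models the choice the setting controls. It is a simplified model under stated assumptions, not Nutch's implementation: the class, the `Action` enum, and the `decide` method are hypothetical names, and only the rule "0 or less means the redirect target is recorded like a link instead of being followed right away" is taken from the description of the setting.

```java
public class RedirectPolicySketch {

    /** Hypothetical outcomes for a redirect seen while fetching a page. */
    public enum Action { FOLLOW_NOW, QUEUE_AS_LINK, GIVE_UP }

    /**
     * maxRedirect plays the role of http.redirect.max;
     * redirectsSoFar counts redirects already followed for this page.
     */
    public static Action decide(int maxRedirect, int redirectsSoFar) {
        if (maxRedirect <= 0) {
            // Not followed immediately: the target is handled like an ordinary link.
            return Action.QUEUE_AS_LINK;
        }
        return redirectsSoFar < maxRedirect ? Action.FOLLOW_NOW : Action.GIVE_UP;
    }

    public static void main(String[] args) {
        System.out.println(decide(0, 0)); // workaround setting
        System.out.println(decide(5, 2)); // limit of 5, two redirects followed so far
        System.out.println(decide(5, 5)); // limit of 5 already reached
    }
}
```

In this model a queued target no longer passes through the redirect-specific host check, and it is also no longer bounded by a redirect limit.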

Thanks,
Sebastian

On 03/08/2017 04:20 PM, [hidden email] wrote:

> I came across an issue where the main page of a site redirects to a
> subdomain which doesn't get followed during the crawl. [...]


Re: Redirects to subdomains

sangeet
Thanks. Unfortunately, the workaround won't work for us, since we don't
want to follow more than 5 redirects. I opened an issue: NUTCH-2365.

On Thu, 2017-03-09 at 10:03 +0100, Sebastian Nagel wrote:

> As a temporary work-around, try to set
>   http.redirect.max = 0
> Redirects are then treated same as links.