Redirection behavior


Redirection behavior

prateek sachdeva
Hi,

I am currently using Nutch 1.16 with the properties below -



db.ignore.external.links=true
db.ignore.external.links.mode=byDomain
db.ignore.also.redirects=false
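(For context, these are set in conf/nutch-site.xml in the usual property format - a sketch of my config trimmed to just these entries:)

```xml
<!-- Excerpt from conf/nutch-site.xml (sketch) -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>db.ignore.also.redirects</name>
  <value>false</value>
</property>
```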

When I crawl websites that redirect (HTTP 301) with Nutch (for example,
https://zyfro.com/ and http://wikipedia.com/), I see that the redirect
target URL is not captured by Nutch. Even the outlinks point to the
original URL I provided, and the status returned is 200.
So my questions are:
1. How do I capture the new URL?
2. Is there a way for Nutch to record the 301 status and the new URL,
and then crawl the redirected content?

Here are the CrawlDatum and ParseData structures for http://wikipedia.com/,
which redirects to wikipedia.org:

    CrawlDatum:
      Version: 7
      Status: 33 (fetch_success)
      Fetch time: Wed May 05 17:35:29 UTC 2021
      Modified time: Thu Jan 01 00:00:00 UTC 1970
      Retries since fetch: 0
      Retry interval: 31536000 seconds (365 days)
      Score: 2.0
      Signature: null
      Metadata:
        _ngt_=1620235730883
        _depth_=1
        _http_status_code_=200
        _pst_=success(1), lastModified=1620038693000
        _rs_=410
        Content-Type=text/html
        _maxdepth_=1000
        nutch.protocol.code=200

    ParseData:
      Version: 5
      Status: success(1,0)
      Title: Wikipedia
      Outlinks: 1
        outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
      Content Metadata:
        _depth_=1
        Server=ATS/8.0.8
        nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
        Server-Timing=cache;desc="hit-front", host;desc="cp1081"
        Permissions-Policy=interest-cohort=()
        Last-Modified=Mon, 03 May 2021 10:44:53 GMT
        Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
        X-Cache-Status=hit-front
        Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
        Age=27826
        Content-Type=text/html
        X-Cache=cp1079 hit, cp1081 hit/578233
        Connection=keep-alive
        _maxdepth_=1000
        X-Client-IP=108.174.5.114
        Date=Wed, 05 May 2021 09:51:42 GMT
        nutch.crawl.score=2.0
        Accept-Ranges=bytes
        nutch.segment.name=20210505173059
        Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
        NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
        ETag=W/"11e90-5c16aa6d9b068"
        Vary=Accept-Encoding
        X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
        _fst_=33
      Parse Metadata:
        CharEncodingForConversion=utf-8
        OriginalCharEncoding=utf-8
        _depth_=1
        viewport=initial-scale=1,user-scalable=yes
        metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
        metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
        description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
        _maxdepth_=1000


Thanks
Prateek

Re: Redirection behavior

Sebastian Nagel
Hi Prateek,

(sorry, I pressed the wrong reply button, so redirecting the discussion back to user@nutch)


 > I am not sure what I am missing.

Well, URL filters?  Robots.txt?  Don't know...


 > I am currently using Nutch 1.16

Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1]) which caused the Fetcher
not to follow redirects, but it was already fixed in Nutch 1.15.

I've retried using Nutch 1.16:
- using -Dplugin.includes='protocol-okhttp|parse-html'
    FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
    FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
    FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)

Note: there might be an issue using protocol-http (-Dplugin.includes='protocol-http|parse-html')
together with Nutch 1.16:
    FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
    FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
    Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
    FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
    FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
    Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
    Failed to get protocol output java.net.SocketException: Socket is closed
         at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
         at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
    FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed

But it's not reproducible using Nutch master / 1.18 - as it relates to HTTPS/SSL it's likely fixed by NUTCH-2794 [2].

In any case, could you try to reproduce the problem using Nutch 1.18?

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2550
[2] https://issues.apache.org/jira/browse/NUTCH-2794


On 5/6/21 11:54 AM, prateek wrote:

> Thanks for your reply Sebastian.
>
> I am using http.redirect.max=5 for my setup.
> In the seed URLs, I am only passing http://wikipedia.com/ and https://zyfro.com/. The CrawlDatum
> and ParseData shared in my earlier email are from the http://wikipedia.com/ URL.
> I don't see the other redirected URLs in the logs or segments. Here is my log -
>
> 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
> 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
> 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
> 2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
> 2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
> 2021-05-05 17:35:29,095 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
> 2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> 2021-05-05 17:35:30,189 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/
> 2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available
>
> I am not sure what I am missing.
>
> Regards
> Prateek
>
>
> On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hi Prateek,
>
>     could you share information about all pages/URLs in the redirect chain?
>
>     http://wikipedia.com/
>     https://wikipedia.com/
>     https://www.wikipedia.org/
>
>     If I'm not wrong, the shown CrawlDatum and ParseData stem from
>     https://www.wikipedia.org/ and carry _http_status_code_=200.
>     So it looks like the redirects have been followed.
>
>     Note: all 3 URLs should have records in the segment and the CrawlDb.
>
>     I've also verified that the above redirect chain is followed by Fetcher
>     with the following settings (passed on the command-line via -D) using
>     Nutch master (1.18):
>        -Dhttp.redirect.max=3
>        -Ddb.ignore.external.links=true
>        -Ddb.ignore.external.links.mode=byDomain
>        -Ddb.ignore.also.redirects=false
>
>     Fetcher log snippets:
>        FetcherThread 51 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
>        FetcherThread 51 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>        FetcherThread 51 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>
>     Just in case: what's the value of the property http.redirect.max ?
>
>     Best,
>     Sebastian
>
>


Re: Redirection behavior

prateek sachdeva
Thanks. I am using a custom HTTP plugin, so I will debug with 1.16 to see
what's causing it. Thanks for your help.

Regards
Prateek


Re: Redirection behavior

prateek sachdeva
Hi,

Just to close this thread - I figured out that the issue was caused by
Apache HttpClient's (HttpClientBuilder) default behavior of following
redirects itself. Disabling that behavior solved the problem:


    HttpClientBuilder builder = HttpClientBuilder.create();
    builder.disableRedirectHandling();
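(Side note for anyone using the plain JDK client rather than Apache HttpClient: the equivalent switch there is the redirect policy. A minimal sketch - the class and method names below are mine for illustration, not code from my plugin:)

```java
import java.net.http.HttpClient;

public class RedirectConfigDemo {

    // Build a client that surfaces 3xx responses to the caller instead
    // of silently following them, so the crawler itself can record the
    // 301 and the Location target before deciding what to fetch next.
    static HttpClient nonFollowingClient() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER)
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = nonFollowingClient();
        System.out.println("followRedirects = " + client.followRedirects());
    }
}
```

With Redirect.NEVER the client hands the 301 response back unchanged, which is the behavior a fetcher needs in order to see the redirect chain at all.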

Now I am able to see all the redirected URLs being crawled as expected.
Thanks for the help.

Regards
Prateek
On Thu, May 6, 2021 at 11:42 AM prateek <[hidden email]> wrote:

> Thanks.. I am using a custom http plugin. So I will debug with 1.16 to see
> what's causing it. Thanks for your help
>
> Regards
> Prateek
>
> On Thu, May 6, 2021 at 11:26 AM Sebastian Nagel <
> [hidden email]> wrote:
>
>> Hi Prateek,
>>
>> (sorry, I pressed the wrong reply button, so redirecting the discussion
>> back to user@nutch)
>>
>>
>>  > I am not sure what I am missing.
>>
>> Well, URL filters?  Robots.txt?  Don't know...
>>
>>
>>  > I am currently using Nutch 1.16
>>
>> Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1])
>> which caused Fetcher
>> not to follow redirects. But it was fixed already in Nutch 1.15.
>>
>> I've retried using Nutch 1.16:
>> - using -Dplugin.includes='protocol-okhttp|parse-html'
>>     FetcherThread 43 fetching http://wikipedia.com/ (queue crawl
>> delay=3000ms)
>>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl
>> delay=3000ms)
>>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl
>> delay=3000ms)
>>
>> Note: there might be an issue using protocol-http
>> (-Dplugin.includes='protocol-http|parse-html')
>> together with Nutch 1.16:
>>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl
>> delay=3000ms)
>>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl
>> delay=3000ms)
>>     Couldn't get robots.txt for https://wikipedia.com/:
>> java.net.SocketException: Socket is closed
>>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl
>> delay=3000ms)
>>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl
>> delay=3000ms)
>>     Couldn't get robots.txt for https://www.wikipedia.org/:
>> java.net.SocketException: Socket is closed
>>     Failed to get protocol output java.net.SocketException: Socket is
>> closed
>>          at
>> sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
>>          at
>> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
>>          at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
>>          at
>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
>>          at
>> org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
>>     FetcherThread 43 fetch of https://www.wikipedia.org/ failed with:
>> java.net.SocketException: Socket is closed
>>
>> But it's not reproducible using Nutch master / 1.18 - as it relates to
>> HTTPS/SSL it's likely fixed by NUTCH-2794 [2].
>>
>> In any case, could you try to reproduce the problem using Nutch 1.18 ?
>>
>> Best,
>> Sebastian
>>
>> [1] https://issues.apache.org/jira/browse/NUTCH-2550
>> [2] https://issues.apache.org/jira/browse/NUTCH-2794
>>
>>
>> On 5/6/21 11:54 AM, prateek wrote:
>> > Thanks for your reply Sebastian.
>> >
>> > I am using http.redirect.max=5 for my setup.
>> > In the seed URL, I am only passing http://wikipedia.com/ <
>> http://wikipedia.com/> and https://zyfro.com/ <https://zyfro.com/> .
>> CrawlDatum
>> > and ParseData shared in my earlier email are from http://wikipedia.com/
>> <http://wikipedia.com/> url.
>> > I don't see the other redirected URL's in the logs or segments. Here is
>> my log -
>> >
>> > /2021-05-05 17:35:23,854 INFO [main]
>> org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode :
>> byHost
>> > 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher:
>> Fetcher: throughput threshold: -1
>> > 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher:
>> Fetcher: throughput threshold retries: 5
>> > *2021-05-05 17:35:23,855 INFO [main]
>> org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching
>> http://wikipedia.com/
>> > <http://wikipedia.com/> (queue crawl delay=1000ms)*
>> >
>> > *2021-05-05 17:35:29,095 INFO [main]
>> org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching
>> https://zyfro.com/
>> > <https://zyfro.com/> (queue crawl delay=1000ms)*
>> > 2021-05-05 17:35:29,095 INFO [main] com.**.nutchplugin.http.Http:
>> fetching https://zyfro.com/robots.txt <https://zyfro.com/robots.txt>
>> > 2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher:
>> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0,
>> > fetchQueues.getQueueCount=1
>> > 2021-05-05 17:35:30,189 INFO [main] com.**.nutchplugin.http.Http:
>> fetching https://zyfro.com/ <https://zyfro.com/>
>> > 2021-05-05 17:35:30,786 INFO [main]
>> org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work
>> available/
>> >
>> > I am not sure what I am missing.
>> >
>> > Regards
>> > Prateek
>> >
>> >
>> > On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel <
>> [hidden email] <mailto:[hidden email]>> wrote:
>> >
>> >     Hi Prateek,
>> >
>> >     could you share information about all pages/URLs in the redirect
>> chain?
>> >
>> >     http://wikipedia.com/ <http://wikipedia.com/>
>> >     https://wikipedia.com/ <https://wikipedia.com/>
>> >     https://www.wikipedia.org/ <https://www.wikipedia.org/>
>> >
>> >     If I'm not wrong, the shown  CrawlDatum and ParseData stems from
>> >     https://www.wikipedia.org/ <https://www.wikipedia.org/> and is
>> _http_status_code_=200.
>> >     So, looks like the redirects have been followed.
>> >
>> >     Note: all 3 URLs should have records in the segment and the CrawlDb.
>> >
>> >     I've also verified that the above redirect chain is followed by
>> Fetcher
>> >     with the following settings (passed on the command-line via -D)
>> using
>> >     Nutch master (1.18):
>> >        -Dhttp.redirect.max=3
>> >        -Ddb.ignore.external.links=true
>> >        -Ddb.ignore.external.links.mode=byDomain
>> >        -Ddb.ignore.also.redirects=false
>> >
>> >     Fetcher log snippets:
>> >        FetcherThread 51 fetching http://wikipedia.com/ <
>> http://wikipedia.com/> (queue crawl delay=3000ms)
>> >        FetcherThread 51 fetching https://wikipedia.com/ <
>> https://wikipedia.com/> (queue crawl delay=3000ms)
>> >        FetcherThread 51 fetching https://www.wikipedia.org/ <
>> https://www.wikipedia.org/> (queue crawl delay=3000ms)
>> >
>> >     Just in case: what's the value of the property http.redirect.max ?
>> >
>> >     Best,
>> >     Sebastian
>> >
>> >
>> >     On 5/5/21 8:09 PM, prateek wrote:
>> >      > Hi,
>> >      >
>> >      > I am currently using Nutch 1.16 with the properties below -
>> >      >
>> >      > db.ignore.external.links=true
>> >      > db.ignore.external.links.mode=byDomain
>> >      > db.ignore.also.redirects=false
>> >      >
>> >      > When I am crawling websites that redirect (301 HTTP code) using
>> >      > Nutch (for example, https://zyfro.com/ and http://wikipedia.com/),
>> >      > I see that the new redirected URL is not captured by Nutch. Even
>> >      > the outlinks point to the original URL provided, and the status
>> >      > returned is 200.
>> >      > So my questions are:
>> >      > 1. How do I capture the new URL?
>> >      > 2. Is there a way to allow Nutch to capture the 301 status and
>> >      > the new URL, and then crawl the related content?
>> >      >
>> >      > Here is the CrawlDatum and ParseData structure for
>> >      > http://wikipedia.com/, which gets redirected to wikipedia.org.
>> >      >
>> >      > CrawlDatum :
>> >      > Version: 7
>> >      > Status: 33 (fetch_success)
>> >      > Fetch time: Wed May 05 17:35:29 UTC 2021
>> >      > Modified time: Thu Jan 01 00:00:00 UTC 1970
>> >      > Retries since fetch: 0
>> >      > Retry interval: 31536000 seconds (365 days)
>> >      > Score: 2.0
>> >      > Signature: null
>> >      > Metadata:
>> >      >   _ngt_=1620235730883 _depth_=1 _http_status_code_=200
>> >      >   _pst_=success(1), lastModified=1620038693000 _rs_=410
>> >      >   Content-Type=text/html _maxdepth_=1000 nutch.protocol.code=200
>> >      >
>> >      > ParseData :
>> >      > Version: 5
>> >      > Status: success(1,0)
>> >      > Title: Wikipedia
>> >      > Outlinks: 1
>> >      >   outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
>> >      > Content Metadata:
>> >      >   _depth_=1 Server=ATS/8.0.8
>> >      >   nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
>> >      >   Server-Timing=cache;desc="hit-front", host;desc="cp1081"
>> >      >   Permissions-Policy=interest-cohort=()
>> >      >   Last-Modified=Mon, 03 May 2021 10:44:53 GMT
>> >      >   Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
>> >      >   X-Cache-Status=hit-front
>> >      >   Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
>> >      >   Age=27826 Content-Type=text/html
>> >      >   X-Cache=cp1079 hit, cp1081 hit/578233
>> >      >   Connection=keep-alive _maxdepth_=1000 X-Client-IP=108.174.5.114
>> >      >   Date=Wed, 05 May 2021 09:51:42 GMT nutch.crawl.score=2.0
>> >      >   Accept-Ranges=bytes nutch.segment.name=20210505173059
>> >      >   Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
>> >      >   NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
>> >      >   ETag=W/"11e90-5c16aa6d9b068" Vary=Accept-Encoding
>> >      >   X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
>> >      >   _fst_=33
>> >      > Parse Metadata:
>> >      >   CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 _depth_=1
>> >      >   viewport=initial-scale=1,user-scalable=yes
>> >      >   metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
>> >      >   description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
>> >      >   _maxdepth_=1000
>> >      >
>> >      >
>> >      > Thanks
>> >      > Prateek
>> >      >
>> >
>>
>>