[jira] [Commented] (NUTCH-2648) Make configurable whether TLS/SSL certificates are checked by protocol plugins

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642968#comment-16642968 ]

Sebastian Nagel commented on NUTCH-2648:
----------------------------------------

Hi [~markus17], it should work for protocol-http, protocol-httpclient and protocol-okhttp. I've tested all three plugins using parsechecker; here is the output for protocol-httpclient:
{noformat}
% bin/nutch parsechecker -Dhttp.tls.certificates.check=true -Dplugin.includes='protocol-httpclient|parse-tika' https://ingevd.waarbenjij.nu/kaart/5000179/dag-4
...
Fetch failed with protocol status: exception(16), lastModified=0: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

% bin/nutch parsechecker -Dhttp.tls.certificates.check=false -Dplugin.includes='protocol-httpclient|parse-tika'
...
Status: success(1,0)
...{noformat}
Regarding protocol-htmlunit and the two Selenium-based protocol plugins: this is now tracked in NUTCH-2649.

?? Maybe it's also an idea to add a dummy trust manager in Nutch's base code??

Yes, or in lib-http. While implementing this for protocol-okhttp, I thought about bundling the DummyTrustManager functionality in one place. But the code of the plugins overlaps only partially, so I was lazy here.
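
For illustration, here is a minimal sketch of what such a shared helper might look like. It is not the code used by any of the Nutch plugins: the names TlsContextSketch, TrustAllManager and createContext are hypothetical, and the boolean argument only stands in for the value of http.tls.certificates.check. It relies solely on the standard javax.net.ssl API.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

/** Hypothetical helper (not part of Nutch): builds an SSLContext with or without certificate checks. */
final class TlsContextSketch {

  /** Trust manager that accepts every certificate chain. Insecure by design. */
  static final class TrustAllManager implements X509TrustManager {
    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
      // Intentionally empty: client certificates are never rejected.
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
      // Intentionally empty: server certificates are never rejected.
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
      return new X509Certificate[0];
    }
  }

  /**
   * @param checkCertificates assumed to mirror the http.tls.certificates.check property
   */
  static SSLContext createContext(boolean checkCertificates) throws GeneralSecurityException {
    SSLContext context = SSLContext.getInstance("TLS");
    // Passing null selects the JVM's default trust managers, which validate the chain.
    TrustManager[] trustManagers =
        checkCertificates ? null : new TrustManager[] {new TrustAllManager()};
    context.init(null, trustManagers, null);
    return context;
  }

  public static void main(String[] args) throws Exception {
    SSLContext lenient = createContext(false);
    SSLContext strict = createContext(true);
    System.out.println("lenient: " + lenient.getProtocol() + ", strict: " + strict.getProtocol());
  }
}
```

A plugin would still need to hand the resulting context (or its socket factory) to its own HTTP client, and that wiring differs per client library, which is presumably where the code stops overlapping.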

> Make configurable whether TLS/SSL certificates are checked by protocol plugins
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-2648
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2648
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> (see discussion in NUTCH-2647)
> It should be possible to enable/disable TLS/SSL certificate validation centrally for all http/https protocol plugins by a single configuration property.
> Some use cases (e.g., crawling a site to detect insecure pages) may require that TLS/SSL certificates are checked. Also, a broader, unrestricted web crawl may skip sites with invalid certificates, as this can be an indicator of the quality of a site.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)