Crawling pages behind SSO authentication (SAML/OIDC)


abhay
Hello,

We are using Nutch to crawl intranet pages behind SSO authentication.

I would like to know if anyone has used or updated the protocol-httpclient
plugin for crawling pages behind SSO authentication.

The SSO flow redirects page requests to the SSO server for login and
optionally asks for second-factor authentication such as TOTP.
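
The TOTP step itself does not need a browser. Below is a minimal sketch of RFC 6238 code generation in Python, assuming the identity provider uses the common defaults (HMAC-SHA1, 30-second step, 6 digits) and that the crawler account's base32 secret is available. The function name `totp` and its parameters are illustrative and not part of Nutch.

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret_b32: str, at_time: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Return the RFC 6238 TOTP code (HMAC-SHA1) for a base32 secret."""
    # Normalise the secret: drop spaces, upper-case, restore '=' padding.
    cleaned = secret_b32.replace(" ", "").upper()
    cleaned += "=" * (-len(cleaned) % 8)
    key = base64.b32decode(cleaned)

    # The moving factor is the number of whole time steps since the epoch.
    now = time.time() if at_time is None else at_time
    counter = int(now) // step

    # HOTP (RFC 4226): HMAC over the 8-byte big-endian counter,
    # then dynamic truncation to a 31-bit integer.
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF

    return str(value % 10 ** digits).zfill(digits)
```

If the provider is configured with a different hash, step, or digit count, those values would have to be changed to match.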

We have been using a custom plugin (which calls a Node.js service) that
uses Google's Puppeteer to drive a Chromium browser for the login and OTP
handling. This is much slower and might not be required, as many of these
pages are rendered on the server side (so dynamic rendering isn't required).
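
If the pages are server-rendered, one possible arrangement is to do the browser login once and then fetch pages with a plain HTTP client that replays the session cookies. The sketch below is in Python and only covers picking the cookies for a URL and building the `Cookie` header. It assumes the cookies are exported as dictionaries with `name`, `value`, `domain`, `path`, `expires`, and `secure` fields (the shape Puppeteer reports, with `expires` of -1 for session cookies). The function name `cookie_header` is hypothetical, and the domain and path matching is simplified compared with RFC 6265.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def cookie_header(cookies: List[Dict], url: str, now: float) -> str:
    """Build a Cookie header value for `url` from exported browser cookies."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    selected = []
    for cookie in cookies:
        # Simplified domain match: exact host or a parent domain.
        domain = cookie.get("domain", "").lstrip(".")
        if not (host == domain or host.endswith("." + domain)):
            continue
        # Simplified path match: the request path starts with the cookie path.
        if not path.startswith(cookie.get("path", "/")):
            continue
        # Secure cookies are only sent over HTTPS.
        if cookie.get("secure") and parts.scheme != "https":
            continue
        # Assumption: expires == -1 (or missing) marks a session cookie.
        expires = cookie.get("expires", -1)
        if expires not in (-1, None) and expires <= now:
            continue
        selected.append("{}={}".format(cookie["name"], cookie["value"]))
    return "; ".join(selected)
```

Whether this works depends on the SSO setup: it only helps when the application accepts the session cookie on its own and does not tie the session to browser-side state.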

Thank you
Abhay Ratnaparkhi

Re: Crawling pages behind SSO authentication (SAML/OIDC)

lewis john mcgibbney-2
Hi Abhay,

This is a problem space we looked at a while ago and made quite a bit of progress on.

Firstly, the protocol-httpclient plugin has been considered in a deprecated state for a while.
https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
I'm pretty sure that it will NOT cater for your use case. More information on the functionality and limits of this plugin can be found at
https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes
Some more recent initiatives can be found at
https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication

Now, some of the plugins which may be used/adapted for your use case include

1. https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit - customizable through https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java 

2. Both of the following:
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
Some documentation exists at https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction

Admittedly, I've not tried to run these plugins against a modern SSO site recently. I suspect that some dependency updates would not go amiss, so please take that into consideration.

Your note regarding the time it takes for the 'chaining' of systems together to achieve the login is well made. This was easily observed and needs a more consolidated/calculated approach IMHO.

I would be interested to discuss this further with you...

hth
lewismc

On 2021/06/07 02:45:54, Abhay Ratnaparkhi <[hidden email]> wrote:

> [...]

Re: Crawling pages behind SSO authentication (SAML/OIDC)

abhay
Thank you Lewis for your reply.

I initially looked into the protocol-htmlunit
<https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit>
and protocol-interactiveselenium
<https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium>
plugins you mentioned.

Based on Selenium, I created a microservice (which handles all the required
SSO redirections, OTP handling, etc.) and hosted it with a Selenium Grid in
the Kubernetes cluster for scaling.
I found that we couldn't scale this approach beyond a certain point, and the
Selenium hub in the Selenium Grid cannot be scaled horizontally.

Later we switched to using Puppeteer <https://github.com/puppeteer/puppeteer>
to drive headless Chrome and scaled this in Kubernetes using browserless
<https://github.com/browserless/chrome>.
We developed a Nutch plugin to call these hosted APIs. This helps, but it is
still very slow compared to the traditional httpclient approach.
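
One way the browser cost might be reduced is to run the browser login once per host and reuse the resulting session for later fetches until it expires. Below is a minimal sketch in Python, assuming a `login` callable (hypothetical; in this setup it would call the hosted Puppeteer API) that returns a cookie string for a host, and a session lifetime that is known or can be estimated. The class name `SessionCache` and its methods are illustrative and not part of Nutch.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class SessionCache:
    """Cache one authenticated session cookie per host for a fixed TTL."""

    def __init__(self, login: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._login = login          # performs the slow browser-based login
        self._ttl = ttl_seconds      # assumed lifetime of an SSO session
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}
        self._lock = threading.Lock()

    def cookie_for(self, host: str) -> str:
        """Return a cached cookie, logging in again only when it has expired."""
        with self._lock:
            now = self._clock()
            entry = self._entries.get(host)
            if entry is not None and now - entry[0] < self._ttl:
                return entry[1]
            cookie = self._login(host)
            self._entries[host] = (now, cookie)
            return cookie

    def invalidate(self, host: str) -> None:
        """Drop a session, e.g. after a fetch is redirected back to the SSO server."""
        with self._lock:
            self._entries.pop(host, None)
```

A fetcher would call `cookie_for(host)` before each request and `invalidate(host)` when a response indicates the session is no longer valid, so the slow login runs once per session rather than once per page.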

As this is a common problem in the intranet environment, I was wondering
how people are handling this. I would be happy to discuss this further.

Thank you
Abhay





On Wed, Jun 9, 2021 at 6:41 PM Lewis John McGibbney <[hidden email]>
wrote:

> [...]

Re: Crawling pages behind SSO authentication (SAML/OIDC)

lewis john mcgibbney-2
Yes, you are hitting the exact same problems that we did. This presents a
major, persistent challenge for using Nutch across the enterprise as, quite
frankly, it doesn’t scale.
I’m going to take next week to look into this specific issue and see
what I can come up with.
By any chance, are you able to share your K8s configuration management here?
Are you using Helm?
Are you running Nutch in K8s or via some other deployment?
Next week I’m also looking into building our CloudFormation template for
Nutch on EMR with Ranger included, and will donate this to the Nutch
project.

On Sat, Jun 12, 2021 at 17:36 <[hidden email]> wrote:

> [...]
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc