protocol-selenium plug-in incompatible with downstream plugins


Michael Portnoy
The pages that I'm crawling are dynamically generated (i.e. rendered with
JavaScript), so I am using the `protocol-selenium` plugin instead of
`protocol-http`, as described at
https://wiki.apache.org/nutch/AdvancedAjaxInteraction.

Problem:

protocol-selenium uses lib-selenium which, unlike protocol-http (which
returns the full page source), returns only the content inside the page's
<body> tag. This in turn prevents downstream plugins from parsing items
such as meta tags and the page title, which normally live outside the
<body>.

My solution (part):

--- a/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
+++ b/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
@@ -160,7 +160,7 @@ public class HttpWebClient {
-      return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
+      return driver.getPageSource();
   }
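To make the effect of the one-line change concrete, here is a minimal,
dependency-free sketch (no Selenium; plain string slicing over a
hypothetical page) contrasting the full page source with only the <body>
inner HTML:

```java
public class BodyVsSource {

    // Rough stand-in for findElement(By.tagName("body")).getAttribute("innerHTML"):
    // extract only what sits between <body> and </body>.
    static String bodyInnerHtml(String pageSource) {
        int start = pageSource.indexOf("<body>") + "<body>".length();
        int end = pageSource.indexOf("</body>");
        return pageSource.substring(start, end);
    }

    public static void main(String[] args) {
        // A hypothetical rendered page, as getPageSource() would return it.
        String pageSource =
            "<html><head><title>My Page</title>"
          + "<meta name=\"description\" content=\"demo\"/></head>"
          + "<body><p>Hello</p></body></html>";

        String bodyOnly = bodyInnerHtml(pageSource);

        System.out.println(bodyOnly);                       // <p>Hello</p>
        System.out.println(pageSource.contains("<title>")); // true
        System.out.println(bodyOnly.contains("<title>"));   // false: title and
                                                            // meta tags are lost
    }
}
```

Downstream parse plugins that look for <title> or <meta> can therefore
only find them when the whole page source is returned.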

Question:

Has anyone run into a similar issue, and how did you overcome it? I would
think this is a common problem, and I am wondering if there is a (good)
reason why lib-selenium has not been patched to date.

Thank you,
Michael

Re: protocol-selenium plug-in incompatible with downstream plugins

Chris Mattmann
Michael, thanks for this; the below sounds like a worthwhile patch. I'll try to test it out
this week and see if it improves crawling for sites like the ones you mention. My belief is that
the domains in which we were testing this in Nutch were such that the only JavaScript to be
rendered was potentially body JavaScript, or possibly we kicked the problem down into the Selenium
"handler" interface, in which a call could similarly be made to grab the whole page.

(check out protocol-interactiveselenium)
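The handler idea can be sketched roughly as follows. This is a
self-contained illustration in which a small stub stands in for Selenium's
WebDriver; the actual interface and names in protocol-interactiveselenium
may differ:

```java
// Sketch of the protocol-interactiveselenium "handler" idea.
// The real Nutch handler receives a Selenium WebDriver; here a tiny
// functional stub stands in so the sketch is self-contained.
interface FakeDriver {            // illustrative stand-in for WebDriver
    String getPageSource();
}

interface SeleniumHandler {       // illustrative stand-in for the handler interface
    String processDriver(FakeDriver driver);
}

public class WholePageHandler implements SeleniumHandler {
    // Return the entire rendered document, not just the <body> contents,
    // so <head> items (title, meta tags) survive for downstream parsers.
    public String processDriver(FakeDriver driver) {
        return driver.getPageSource();
    }

    public static void main(String[] args) {
        FakeDriver driver = () ->
            "<html><head><title>t</title></head><body>hi</body></html>";
        System.out.println(new WholePageHandler().processDriver(driver));
    }
}
```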

Cheers,
Chris




On 10/25/17, 2:06 PM, "Michael Portnoy" <[hidden email]> wrote:
