[jira] [Commented] (NUTCH-2739) indexer-elastic: Upgrade ES and migrate to REST client


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976626#comment-16976626 ]

ASF GitHub Bot commented on NUTCH-2739:
---------------------------------------

sebastian-nagel commented on issue #484: NUTCH-2739 : Upgrade ES and migrate to REST client
URL: https://github.com/apache/nutch/pull/484#issuecomment-555061506
 
 
   > This will mock the client itself. But we need to mock the server with requests and response. So can we go ahead and not do the tests at all?
   
   Well, the previous non-REST test implemented a client which did not send anything to the server but just returned a successful response or (if `clusterSaturated` was set to true) a temporary failure.
   
   But I'm OK with removing the Test class if rewriting it for the REST client is too much work.
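   
   For reference, the behaviour of the old test client described above can be sketched in plain Java. This is only an illustration: `BulkOutcome`, `BulkClient` and `StubBulkClient` are hypothetical names, not classes from Nutch or Elasticsearch, and only the `clusterSaturated` flag is taken from the old test.
   
   ```java
   import java.util.List;
   
   /** Hypothetical outcome type for this sketch; not an Elasticsearch class. */
   enum BulkOutcome { SUCCESS, TEMPORARY_FAILURE }
   
   /** Hypothetical minimal client abstraction the index writer would talk to. */
   interface BulkClient {
       BulkOutcome sendBulk(List<String> documents);
   }
   
   /** Stub client: never contacts a server, only returns a canned outcome. */
   final class StubBulkClient implements BulkClient {
       private final boolean clusterSaturated;
       private int calls = 0;
   
       StubBulkClient(boolean clusterSaturated) {
           this.clusterSaturated = clusterSaturated;
       }
   
       @Override
       public BulkOutcome sendBulk(List<String> documents) {
           calls++; // record the call so a test can check how often the writer retried
           return clusterSaturated ? BulkOutcome.TEMPORARY_FAILURE : BulkOutcome.SUCCESS;
       }
   
       int calls() {
           return calls;
       }
   }
   ```
   
   A test built on such a stub could check that the writer retries on `TEMPORARY_FAILURE`. It would not exercise the REST request/response path, which is the concern raised in the quoted question.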
   
   I've tested the PR, but in the initial rounds indexing failed for about 50% of the pages/documents:
   ```
   [2019-11-18T12:56:46,803][DEBUG][o.e.a.b.TransportShardBulkAction] [vagran] [nutch][0] failed to execute bulk item (index) index {[nutch][_doc][http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html], source[{"{date=Mon Jun 09 15:03:28 CEST 2014, type=[text/html, text, html], title=apache-nutch 2.2.1 API, url=http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html, content=apache-nutch 2.2.1 API\n<H2> Frame Alert</H2> <P> This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client. <BR> Link to<A HREF=\"overview-summary.html\">Non-frame version.</A>\n, search=apache-nutch 2.2.1 API, tstamp=Thu Jul 26 16:50:11 CEST 2018, segment=20180726164932, digest=8b8785f9cec87c0376a7fa940e0e3a6c, host=nutch.apache.org, boost=1.0, id=http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html, lastModified=Mon Jun 09 15:03:28 CEST 2014}":"doc"}]}
   ```
   
   I fixed it by using XContentBuilder to pass the document as JSON to the ES client; you'll find the necessary changes in [this branch](https://github.com/sebastian-nagel/nutch/tree/NUTCH-2739). Also:
   - updated the description of how to upgrade the dependencies in the plugin.xml and added a few exclusions for dependencies already provided by Nutch core.
   - changed the default properties in index-writers.xml.template so that the indexer-elastic plugin works out-of-the-box with default settings.
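   
   To illustrate the serialization problem: in the log entry above, the `source[...]` part appears to hold the whole document as one JSON key (`{"{date=..., ...}":"doc"}`), which looks like the output of a Java `Map.toString()`. The plain-Java sketch below contrasts that shape with a field-by-field JSON object. It deliberately does not use the Elasticsearch API; `DocJsonSketch` and its methods are hypothetical names, and the hand-written serializer only stands in for what XContentBuilder does in the actual fix.
   
   ```java
   import java.util.Map;
   
   final class DocJsonSketch {
   
       /** Escapes the characters that must be escaped inside a JSON string. */
       static String escape(String s) {
           StringBuilder sb = new StringBuilder();
           for (int i = 0; i < s.length(); i++) {
               char c = s.charAt(i);
               switch (c) {
                   case '"':  sb.append("\\\""); break;
                   case '\\': sb.append("\\\\"); break;
                   case '\n': sb.append("\\n"); break;
                   case '\r': sb.append("\\r"); break;
                   case '\t': sb.append("\\t"); break;
                   default:
                       if (c < 0x20) {
                           sb.append(String.format("\\u%04x", (int) c));
                       } else {
                           sb.append(c);
                       }
               }
           }
           return sb.toString();
       }
   
       /** Problematic shape: the map's toString() becomes a single field name. */
       static String asSingleKey(Map<String, String> doc) {
           return "{\"" + escape(doc.toString()) + "\":\"doc\"}";
       }
   
       /** Intended shape: one JSON field per document field. */
       static String asFields(Map<String, String> doc) {
           StringBuilder sb = new StringBuilder("{");
           boolean first = true;
           for (Map.Entry<String, String> e : doc.entrySet()) {
               if (!first) {
                   sb.append(',');
               }
               sb.append('"').append(escape(e.getKey())).append("\":\"")
                 .append(escape(e.getValue())).append('"');
               first = false;
           }
           return sb.append('}').toString();
       }
   }
   ```
   
   For a map with `title` and `host` entries, `asSingleKey` yields an object with one long key and the value `"doc"`, while `asFields` yields an object with separate `title` and `host` fields, which is the form an index mapping can work with.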
   
   So far, I haven't run any tests at scale. This should be done to make sure we are able to index millions of documents with the given settings.
   
   Please have a look at my changes. Can you integrate them into your branch?
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> indexer-elastic: Upgrade ES and migrate to REST client
> ------------------------------------------------------
>
>                 Key: NUTCH-2739
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2739
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> The indexer-elastic plugin is based on 5.3.0 and should be upgraded to the most recent Elasticsearch version (7.3.0 or upwards).
> [TransportClient|https://www.elastic.co/guide/en/elasticsearch/client/java-api/7.3/transport-client.html] has been deprecated in ES 7.x and will be removed in 8.x. We should migrate to using the [REST client|https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.3/java-rest-high.html] and also check whether this would obsolete the indexer-elastic-rest plugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)