Preparing to release Nutch 1.15 ?

Sebastian Nagel
Hi all,

Almost 80 fixes and improvements are done now, including:

NUTCH-2375: upgrade to the new MapReduce API
  This was a huge change affecting more than 10,000 lines of code. Thanks, Omkar!
  There were some regressions, but those are resolved now. Tests in
  pseudo-distributed mode [1] succeeded, as did a mid-size test crawl (180
  million pages) on a Hadoop cluster.
  It would be great if anybody could test the Nutch master in combination with
  a non-HDFS file system (e.g. s3://)! Please let us know whether this works. Thanks!

NUTCH-1480: Multiple index writer instances with different configurations
  Thanks to Roannel, it's now possible to index into multiple Solr or Elasticsearch
  instances. With NUTCH- (needs to be reviewed), the routing of documents
  to the indexes will also be configurable.

NUTCH-2583: Ralf contributed a huge upgrade of dependencies.
   Nutch now compiles and runs on Java 9 and 10. Only errors in unit tests
   still need to be addressed, in NUTCH-2596.

And two important issues are almost ready to be committed:

NUTCH-2549: a long list of fixes and improvements to protocol-http. Thanks to
   Gerard Bouchard!

NUTCH-2576: plugin protocol-okhttp, a new HTTP protocol implementation based
   on the okhttp library. Supports HTTP/2.


The full list of fixes and improvements is available at [2].

I plan to work through the remaining 70 open issues over the next few
days, and hope to commit/resolve 15-25 of them and move the rest
to Nutch 1.16.

Please vote for the issues you want included. If there are open
pull requests, it will help if these can be merged, the unit tests
pass, and any review comments are addressed. Thanks!

If there are any objections or blockers, please also let us know!

I also plan to run a test crawl on Hadoop in the middle of this week,
but any help with testing is welcome.

Note that the tutorial needs to be updated (will be done after 1.15
is finally released) to reflect the changes related to NUTCH-1480.


Thanks,
Sebastian


[1] https://github.com/sebastian-nagel/nutch-test-single-node-cluster
[2] https://issues.apache.org/jira/projects/NUTCH/versions/12342302

Re: Preparing to release Nutch 1.15 ?

kamaci
+1


On Wed, 13 Jun 2018 at 21:04, Joe Obernberger <[hidden email]> wrote:
Woot!


On 6/11/2018 11:55 AM, Chris Mattmann wrote:
> ++1!
>
> Sounds great.
>
> Cheers,
> Chris
>
> From: Sebastian Nagel <[hidden email]>
> Reply-To: "[hidden email]" <[hidden email]>
> Date: Monday, June 11, 2018 at 7:35 AM
> To: "[hidden email]" <[hidden email]>
> Cc: "[hidden email]" <[hidden email]>
> Subject: Preparing to release Nutch 1.15 ?

Re: Preparing to release Nutch 1.15 ?

Omkar Reddy-2
+1

On 14 June 2018 at 03:09, Furkan KAMACI <[hidden email]> wrote:
+1

