[jira] [Commented] (NUTCH-2531) Unclear steps in Nutch2 Tutorial

David Pilato (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973428#comment-16973428 ]

Sebastian Nagel commented on NUTCH-2531:

Hi [~balaShashanka], there are no plans, as there are indeed zero committers actively working on 2.x right now. Of course, there is a small chance that new (or old) contributors will start working on the 2.x branch again. But the entire question is better discussed on the Nutch mailing list, not here (this is a bug tracker). Thanks!

> Unclear steps in Nutch2 Tutorial
> --------------------------------
>                 Key: NUTCH-2531
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2531
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Krzysztof Madejski
>            Priority: Minor
>             Fix For: 2.5
> I was trying to install Nutch based on this tutorial: [https://wiki.apache.org/nutch/Nutch2Tutorial]
> Issues I've found:
> In Obtaining Software and Configuration:
> 1. _"Specify the [...] along with all of the other Configuration options suggested within the [Nutch 1.x tutorial|http://wiki.apache.org/nutch/NutchTutorial]."_
>   It would be better to copy the necessary configuration. I have no idea which settings exactly should be copied.
> 2. _"In addition add the missing hbase-common-0.98.8-hadoop2.jar transitive dependency, this is a bug in gora-hbase 0.6.1 as described [here|https://github.com/apache/gora/pull/21]. This bug is removed in current Gora development."_
>   What does this step require from me? Should I add something to the dependencies? In which file? This point is written in an informative manner. It should either be deleted or replaced with a clear instruction.
> 3. _"*N.B.* It's probably worth checking and setting all your usual configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before progressing."_
>   It's my first install. There is no such thing as a "usual configuration".
> In "Invoke Nutch":
> 1. "nutch readdb" doesn't return anything meaningful apart from its usage message:
> ./nutch readdb
> Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
>  [-crawlId <id>] [-content] [-headers] [-links] [-text]
>  -crawlId <id> - the id to prefix the schemas to operate on,
>  (default: storage.crawl.id)
>  -stats [-sort] - print overall statistics to System.out
>  [-sort] - list status sorted by host
>  -url <url> - print information on <url> to System.out
>  -dump <out_dir> [-regex regex] - dump the webtable to a text file in
>  <out_dir>
>  -content - dump also raw content
>  -headers - dump protocol headers
>  -links - dump links
>  -text - dump extracted text
>  [-regex] - filter on the URL of the webtable entry

This message was sent by Atlassian Jira