[jira] Created: (NUTCH-640) confusing description "set it to Integer.MAX_VALUE"

[jira] Created: (NUTCH-640) confusing description "set it to Integer.MAX_VALUE"

Sebastian Nagel (Jira)
confusing description "set it to Integer.MAX_VALUE"
---------------------------------------------------

                 Key: NUTCH-640
                 URL: https://issues.apache.org/jira/browse/NUTCH-640
             Project: Nutch
          Issue Type: Improvement
          Components: documentation
    Affects Versions: 0.9.0
            Reporter: Stijn Vermeeren


This property "indexer.max.tokens" has the following description in nutch-default.xml :

" The maximum number of tokens that will be indexed for a single field
  in a document. This limits the amount of memory required for
  indexing, so that collections with very large files will not crash
  the indexing process by running out of memory.

  Note that this effectively truncates large documents, excluding
  from the index tokens that occur further in the document. If you
  know your source documents are large, be sure to set this value
  high enough to accommodate the expected size. If you set it to
  Integer.MAX_VALUE, then the only limit is your memory, but you
  should anticipate an OutOfMemoryError."

Apparently, "set it to Integer.MAX_VALUE" here means <<substitute the integer value of Integer.MAX_VALUE>>, and not <<put the text "Integer.MAX_VALUE" between the value tags>>. I think this is very confusing and the description should be improved.

I first put <value>Integer.MAX_VALUE</value> in my configuration, and it took a long time to figure out what was wrong, especially since Nutch silently fell back to the default value of 10000 instead of reporting an error.
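
To illustrate the silent fallback reported here, the following is a minimal sketch, assuming the configuration lookup behaves like a lenient integer getter that returns its default whenever the text does not parse as a number. The class MaxTokensFallback and the helper getIntOrDefault are hypothetical names for illustration only; they are not the Nutch or Hadoop API.

```java
public class MaxTokensFallback {

    // Hypothetical helper (not the Nutch/Hadoop API): parses the raw text of a
    // property value and returns defaultValue when the text is missing or is
    // not a valid decimal integer.
    public static int getIntOrDefault(String raw, int defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // The literal text "Integer.MAX_VALUE" ends up here: it is a Java
            // identifier, not a number, so the default is used without any error.
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        // Literal text between the value tags: falls back to the default.
        System.out.println(getIntOrDefault("Integer.MAX_VALUE", 10000)); // prints 10000
        // Numeric value of Integer.MAX_VALUE: parsed as intended.
        System.out.println(getIntOrDefault("2147483647", 10000));        // prints 2147483647
    }
}
```

Under that assumption, only the numeric form 2147483647 reaches the indexer; the symbolic name is indistinguishable from any other unparsable value.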

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-640) confusing description "set it to Integer.MAX_VALUE"

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stijn Vermeeren updated NUTCH-640:
----------------------------------

    Priority: Minor  (was: Major)


[jira] Updated: (NUTCH-640) confusing description "set it to Integer.MAX_VALUE"

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-640:
--------------------------------

    Attachment: NUTCH-640.patch

You are right that it is confusing, but asking users to substitute the numeric value of Integer.MAX_VALUE would also be unnecessarily difficult.

The attached patch instead changes the configuration description to use -1 instead of Integer.MAX_VALUE. The Indexer is also modified to check for negative values of indexer.max.tokens and treat them as Integer.MAX_VALUE.
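
The behavior described for the patch can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in NUTCH-640.patch: the class MaxTokensResolver and the method resolveMaxTokens are hypothetical names, and the sketch only shows the mapping of negative settings to Integer.MAX_VALUE.

```java
public class MaxTokensResolver {

    // Hypothetical helper (not the patch's actual code): any negative value of
    // indexer.max.tokens means "no limit" and is mapped to Integer.MAX_VALUE;
    // zero and positive values are used as configured.
    public static int resolveMaxTokens(int configured) {
        return configured < 0 ? Integer.MAX_VALUE : configured;
    }

    public static void main(String[] args) {
        System.out.println(resolveMaxTokens(-1));    // prints 2147483647
        System.out.println(resolveMaxTokens(10000)); // prints 10000
    }
}
```

With this convention a user writes -1 between the value tags and never needs to know the numeric value of Integer.MAX_VALUE.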


[jira] Assigned: (NUTCH-640) confusing description "set it to Integer.MAX_VALUE"

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney reassigned NUTCH-640:
-----------------------------------

    Assignee: Doğacan Güney


[jira] Closed: (NUTCH-640) confusing description "set it to Integer.MAX_VALUE"

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-640.
-------------------------------

    Resolution: Fixed

Committed as of rev. 701052.


[jira] Commented: (NUTCH-640) confusing description "set it to Integer.MAX_VALUE"

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636524#action_12636524 ]

Hudson commented on NUTCH-640:
------------------------------

Integrated in Nutch-trunk #588 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/588/])
     - confusing description "set it to Integer.MAX_VALUE"

