Not correct usage of key name(segments) in the RestAPI.

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Not correct usage of key name(segments) in the RestAPI.

Semyon Semyonov
Dear community,

I would like to suggest an improvement/ask an advice about an API key usage(Nutch 1.13)

The class metadata/Nutch.java contains two possible parameters for InvertLinks job in api.
    line 97: public static final String ARG_SEGMENTDIR = "segment_dir"
    line 99 : public static final String ARG_SEGMENT = "segment";

An example of usage can be found in nutch/crawl/LinkDB.java(see bellow), witch the following logic:
The ARG_SEGMENTDIR  - a directory of all segments 
but Nutch.ARG_SEGMENT - an array of paths to segments instead of single path. Therefore if a string is passed instead of array of strings it throws an exception.

I belive it contradicts the naming convention and misleads the users(Nutch.ARG_SEGMENT is a singular noun, but it is used as a plural).
My proposal is either to add new variable ARG_SEGMENTS(plural) or to replace the the logic to the single path processing.

nutch/crawl/LinkDB.java:
lines 383 : 
    if(args.containsKey(Nutch.ARG_SEGMENTDIR)) {
      Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
      if(segDir instanceof Path) {
        segmentsDir = (Path) segDir;
      }
      else {
        segmentsDir = new Path(segDir.toString());
      }
      FileStatus[] paths = fs.listStatus(segmentsDir,
          HadoopFSUtil.getPassDirectoriesFilter(fs));
      segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
    }
    else if(args.containsKey(Nutch.ARG_SEGMENT)) {
      Object segments = args.get(Nutch.ARG_SEGMENT);
      ArrayList<String> segmentList = new ArrayList<>();
      if(segments instanceof ArrayList) {
        segmentList = (ArrayList<String>)segments;
      }
      for(String segment: segmentList) {
        segs.add(new Path(segment));
      }
    }

Semyon.
Reply | Threaded
Open this post in threaded view
|

Re: Not correct usage of key name(segments) in the RestAPI.

Sebastian Nagel
Hi Semyon,

thanks!  Please open an issue on
  https://issues.apache.org/jira/projects/NUTCH
and if possible provide a patch or a pull-request
on github with the Jira issue id "NUTCH-XXX" in the title.

You're right the usage of ARG_SEGMENT isn't consistent
 single path in
   Fetcher
   ParseSegment
 list of paths in
   CrawlDb
   LinkDb
   IndexingJob

> My proposal is either to add new variable ARG_SEGMENTS(plural) or
This would break the existing API, but ok if properly logged and not
silently ignored. That's bad with the current behavior: if ARG_SEGMENT
isn't an instance of ArrayList no error is shown.

> to replace the the logic to the single path processing.
Not my favorite option since some of the tools are made to work with multiple segments.

Third option: could check the type of ARG_SEGMENT (Path/String vs. (Array)List)

But the decision is on you.

Thanks,
Sebastian

On 10/11/2017 03:38 PM, Semyon Semyonov wrote:

> Dear community,
>
> I would like to suggest an improvement/ask an advice about an API key usage(Nutch 1.13)
>
> The class metadata/Nutch.java contains two possible parameters for InvertLinks job in api.
>     line 97: public static final String ARG_SEGMENTDIR = "segment_dir"
>     line 99 : public static final String ARG_SEGMENT = "segment";
>
> An example of usage can be found in nutch/crawl/LinkDB.java(see bellow), witch the following logic:
> The ARG_SEGMENTDIR  - a directory of all segments 
> but Nutch.ARG_SEGMENT - an array of paths to segments instead of single path. Therefore if a string
> is passed instead of array of strings it throws an exception.
>
> I belive it contradicts the naming convention and misleads the users(Nutch.ARG_SEGMENT is a singular
> noun, but it is used as a plural).
> My proposal is either to add new variable ARG_SEGMENTS(plural) or to replace the the logic to the
> single path processing.
>
> nutch/crawl/LinkDB.java:
> lines 383 : 
>     if(args.containsKey(Nutch.ARG_SEGMENTDIR)) {
>       Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
>       if(segDir instanceof Path) {
>         segmentsDir = (Path) segDir;
>       }
>       else {
>         segmentsDir = new Path(segDir.toString());
>       }
>       FileStatus[] paths = fs.listStatus(segmentsDir,
>           HadoopFSUtil.getPassDirectoriesFilter(fs));
>       segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
>     }
>     else if(args.containsKey(Nutch.ARG_SEGMENT)) {
>       Object segments = args.get(Nutch.ARG_SEGMENT);
>       ArrayList<String> segmentList = new ArrayList<>();
>       if(segments instanceof ArrayList) {
>         segmentList = (ArrayList<String>)segments;
>       }
>       for(String segment: segmentList) {
>         segs.add(new Path(segment));
>       }
>     }
>
> Semyon.

Reply | Threaded
Open this post in threaded view
|

Re: Not correct usage of key name(segments) in the RestAPI.

Semyon Semyonov
Thanks for the suggestion.
An issue with an attached path is here:
https://issues.apache.org/jira/browse/NUTCH-2441
 
Sent: Thursday, October 12, 2017 at 11:17 AM
From: "Sebastian Nagel" <[hidden email]>
To: [hidden email]
Subject: Re: Not correct usage of key name(segments) in the RestAPI.
Hi Semyon,

thanks! Please open an issue on
https://issues.apache.org/jira/projects/NUTCH
and if possible provide a patch or a pull-request
on github with the Jira issue id "NUTCH-XXX" in the title.

You're right the usage of ARG_SEGMENT isn't consistent
single path in
Fetcher
ParseSegment
list of paths in
CrawlDb
LinkDb
IndexingJob

> My proposal is either to add new variable ARG_SEGMENTS(plural) or
This would break the existing API, but ok if properly logged and not
silently ignored. That's bad with the current behavior: if ARG_SEGMENT
isn't an instance of ArrayList no error is shown.

> to replace the the logic to the single path processing.
Not my favorite option since some of the tools are made to work with multiple segments.

Third option: could check the type of ARG_SEGMENT (Path/String vs. (Array)List)

But the decision is on you.

Thanks,
Sebastian

On 10/11/2017 03:38 PM, Semyon Semyonov wrote:
> Dear community,
>
> I would like to suggest an improvement/ask an advice about an API key usage(Nutch 1.13)
>
> The class metadata/Nutch.java contains two possible parameters for InvertLinks job in api.
>     line 97: public static final String ARG_SEGMENTDIR = "segment_dir"
>     line 99 : public static final String ARG_SEGMENT = "segment";
>
> An example of usage can be found in nutch/crawl/LinkDB.java(see bellow), witch the following logic:
> The ARG_SEGMENTDIR  - a directory of all segments 
> but Nutch.ARG_SEGMENT - an array of paths to segments instead of single path. Therefore if a string
> is passed instead of array of strings it throws an exception.
>
> I belive it contradicts the naming convention and misleads the users(Nutch.ARG_SEGMENT is a singular
> noun, but it is used as a plural).
> My proposal is either to add new variable ARG_SEGMENTS(plural) or to replace the the logic to the
> single path processing.
>
> nutch/crawl/LinkDB.java:
> lines 383 : 
>     if(args.containsKey(Nutch.ARG_SEGMENTDIR)) {
>       Object segDir = args.get(Nutch.ARG_SEGMENTDIR);
>       if(segDir instanceof Path) {
>         segmentsDir = (Path) segDir;
>       }
>       else {
>         segmentsDir = new Path(segDir.toString());
>       }
>       FileStatus[] paths = fs.listStatus(segmentsDir,
>           HadoopFSUtil.getPassDirectoriesFilter(fs));
>       segs.addAll(Arrays.asList(HadoopFSUtil.getPaths(paths)));
>     }
>     else if(args.containsKey(Nutch.ARG_SEGMENT)) {
>       Object segments = args.get(Nutch.ARG_SEGMENT);
>       ArrayList<String> segmentList = new ArrayList<>();
>       if(segments instanceof ArrayList) {
>         segmentList = (ArrayList<String>)segments;
>       }
>       for(String segment: segmentList) {
>         segs.add(new Path(segment));
>       }
>     }
>
> Semyon.