possibly wrong code in class org.apache.nutch.indexer.IndexerMapReduce, nutch-1.13


Junqiang Zhang
Hello,

I am using Nutch version 1.13. The logical operators in lines 259 to 262
of the class org.apache.nutch.indexer.IndexerMapReduce may be used
incorrectly.

The operators used are OR (||). I think the correct operators should be
AND (&&). The comment on line 261 says "only have inlinks", which also
suggests that the operators should be AND.

Lines 259 to 262 of org.apache.nutch.indexer.IndexerMapReduce are copied below.

    if (fetchDatum == null || dbDatum == null || parseText == null
        || parseData == null) {
      return; // only have inlinks
    }
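
To make the difference between the two operators concrete, here is a minimal sketch that is not taken from Nutch. The class `GuardComparison` and both method names are hypothetical, and plain `Object` parameters stand in for the real Nutch types. It only shows which combinations of missing values each variant would skip.

```java
public class GuardComparison {

    // Guard as written in the quoted snippet: skip if ANY value is missing.
    public static boolean skipWithOr(Object fetchDatum, Object dbDatum,
                                     Object parseText, Object parseData) {
        return fetchDatum == null || dbDatum == null
            || parseText == null || parseData == null;
    }

    // Alternative under discussion: skip only if ALL values are missing.
    public static boolean skipWithAnd(Object fetchDatum, Object dbDatum,
                                      Object parseText, Object parseData) {
        return fetchDatum == null && dbDatum == null
            && parseText == null && parseData == null;
    }

    public static void main(String[] args) {
        Object present = new Object();

        // Everything missing: both variants skip the record.
        System.out.println(skipWithOr(null, null, null, null));
        System.out.println(skipWithAnd(null, null, null, null));

        // Only parseData missing: the two variants disagree.
        System.out.println(skipWithOr(present, present, present, null));
        System.out.println(skipWithAnd(present, present, present, null));
    }
}
```

With AND, a record that is missing only some of the four values is not skipped and continues to the rest of the method.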

Could the development team please have a look at the code and determine
whether it is wrong? Thanks.

Best,
Junqiang

Re: possibly wrong code in class org.apache.nutch.indexer.IndexerMapReduce , nutch-1.13

Sebastian Nagel
Hi Junqiang,

Thanks for the careful code review.

Well, the answer isn't that trivial. In general, the code is right,
because if any of the values is null, the indexing or scoring filters
called later will fail with a NullPointerException (NPE).
However, the comment is wrong or incomplete:
- "only have inlinks" is surely the most common case
  (the links have not yet been added to the CrawlDb)
- but there are other ways the condition may become true:
  * a URL / CrawlDatum was removed from the CrawlDb
    (in combination with a parallelized workflow)
  * parsing was skipped or failed
    (need to check whether this can happen)
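
The point about the later NPE can be sketched as follows. This is only an illustration under stated assumptions, not Nutch code: the class `NullGuardDemo`, the methods `filter` and `process`, and the use of `String` as a stand-in for the real datum and parse types are all hypothetical.

```java
public class NullGuardDemo {

    // Hypothetical stand-in for a later indexing or scoring filter that
    // dereferences its input without a null check.
    public static int filter(String parseText) {
        return parseText.length(); // throws NullPointerException if parseText is null
    }

    // Processes one record and returns -1 if the guard skips it.
    // useOr selects the existing || guard; otherwise the && variant is used.
    public static int process(String fetchDatum, String dbDatum,
                              String parseText, String parseData,
                              boolean useOr) {
        boolean skip = useOr
            ? (fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null)
            : (fetchDatum == null && dbDatum == null
                && parseText == null && parseData == null);
        if (skip) {
            return -1; // record is incomplete, nothing to index
        }
        return filter(parseText);
    }

    public static void main(String[] args) {
        // Simulates "parsing skipped or failed": only parseText is missing.
        System.out.println(process("fetch", "db", null, "data", true));

        try {
            process("fetch", "db", null, "data", false);
        } catch (NullPointerException e) {
            System.out.println("NPE with the && guard");
        }
    }
}
```

In this sketch the || guard returns early for the incomplete record, while the && guard lets it through and the stand-in filter throws.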

Feel free to open an issue on http://issues.apache.org/jira/NUTCH
so that the code gets better documentation and comments.

Thanks,
Sebastian


(In reply to Junqiang Zhang's message of 09/09/2017 01:24 PM.)


Re: possibly wrong code in class org.apache.nutch.indexer.IndexerMapReduce , nutch-1.13

Sebastian Nagel
Sorry, the right link to open an issue is
   https://issues.apache.org/jira/projects/NUTCH

Thanks,
Sebastian

(In reply to Sebastian Nagel's message of 09/10/2017 12:58 PM.)