[jira] [Created] (NUTCH-2440) DbResource does not accept crawlid

JIRA jira@apache.org
Tulay Muezzinoglu created NUTCH-2440:

             Summary: DbResource does not accept crawlid
                 Key: NUTCH-2440
                 URL: https://issues.apache.org/jira/browse/NUTCH-2440
             Project: Nutch
          Issue Type: Bug
          Components: REST_api
    Affects Versions: 2.3, 2.4
            Reporter: Tulay Muezzinoglu
            Priority: Critical
             Fix For: 2.4

DbResource instantiates its DbReaders with a null crawlid. This prevents queries from reaching the correct table/collection when a crawlid was set during fetch.

For example, in MongoDB all data is stored in the "webpage" collection by default. If you set the crawlid to "tech" for the fetch, all data is stored in the "tech_webpage" collection instead. But a REST call to the /db endpoint cannot specify a crawlid, so it queries the "webpage" collection.
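
The naming rule described above can be sketched as follows. This is only an illustration of the reported behavior: the class CrawlIdCollectionSketch and the method collectionFor are hypothetical names, not part of Nutch or Gora.

```java
// Hypothetical sketch: mirrors the collection naming described in this report,
// not the actual Nutch/Gora implementation.
final class CrawlIdCollectionSketch {

    // Returns the collection name that would hold the data for a given crawlid.
    static String collectionFor(String crawlId) {
        String base = "webpage";
        if (crawlId == null || crawlId.isEmpty()) {
            return base; // no crawlid: the default collection
        }
        return crawlId + "_" + base; // crawlid is used as a prefix
    }

    public static void main(String[] args) {
        // Collection written when the fetch runs with crawlid "tech"
        System.out.println(collectionFor("tech")); // prints tech_webpage
        // Collection a reader created with a null crawlid would query
        System.out.println(collectionFor(null));   // prints webpage
    }
}
```

The mismatch between the two printed names is the bug: the data lives in one collection while the /db endpoint reads another.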

I see two options: either DbFilter can be changed to carry the crawlid, or the resource path can include it. I am open to suggestions and can then submit a PR.
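
A rough sketch of the two options, to make the discussion concrete. Everything here is an assumption for illustration: FilterSketch and ReaderSketch are hypothetical stand-ins for the request filter and the reader, the confId default is made up, and the /db/{crawlId} path shape is only one possible layout. It does not use the real Nutch classes or JAX-RS annotations.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of both proposals; not the actual DbResource code.
final class DbResourceCrawlIdSketch {

    // Stand-in for the request filter, extended with a crawlId field (option 1).
    static final class FilterSketch {
        String confId = "default"; // assumed default, for illustration only
        String crawlId;            // null means "no crawlid"
    }

    // Stand-in for a reader bound to a single crawlid.
    static final class ReaderSketch {
        final String crawlId;
        ReaderSketch(String crawlId) { this.crawlId = crawlId; }
    }

    // One reader per (confId, crawlid) pair instead of one per confId.
    private final Map<String, ReaderSketch> readers = new HashMap<>();

    // Option 1: the crawlid arrives in the filter sent with the request.
    ReaderSketch readerFor(FilterSketch filter) {
        return readerFor(filter.confId, filter.crawlId);
    }

    // Option 2: the crawlid arrives from the resource path,
    // e.g. a hypothetical /db/{crawlId}.
    ReaderSketch readerFor(String confId, String crawlId) {
        String key = confId + "/" + (crawlId == null ? "" : crawlId);
        return readers.computeIfAbsent(key, k -> new ReaderSketch(crawlId));
    }

    public static void main(String[] args) {
        DbResourceCrawlIdSketch resource = new DbResourceCrawlIdSketch();
        FilterSketch filter = new FilterSketch();
        filter.crawlId = "tech";
        System.out.println(resource.readerFor(filter).crawlId); // prints tech
    }
}
```

Either way, the reader lookup has to be keyed by the crawlid as well; otherwise a reader created for one crawl would be reused for requests that target another.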

This message was sent by Atlassian JIRA