[jira] Created: (NUTCH-201) add support for subcollections

[jira] Created: (NUTCH-201) add support for subcollections

Nick Burch (Jira)
add support for subcollections
------------------------------

         Key: NUTCH-201
         URL: http://issues.apache.org/jira/browse/NUTCH-201
     Project: Nutch
        Type: New Feature
    Versions: 0.8-dev    
    Reporter: Sami Siren
 Assigned to: Sami Siren
    Priority: Minor
     Fix For: 0.8-dev


A subcollection is a subset of an index. Subcollections are defined
by URL patterns in the form of a whitelist and a blacklist: to be included
in a subcollection, a page must match the whitelist and not the blacklist.
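
The membership rule can be sketched in a few lines of Python. This is only an illustration, not the plugin's code: the function name `in_subcollection` is hypothetical, and it assumes that a pattern "matches" when it occurs as a plain substring of the URL, which the description above does not spell out.

```python
def in_subcollection(url, whitelist, blacklist):
    """Return True if url matches a whitelist pattern and no blacklist pattern.

    Assumption (not from the issue text): a pattern matches when it
    occurs as a plain substring of the URL.
    """
    whitelisted = any(pattern in url for pattern in whitelist)
    blacklisted = any(pattern in url for pattern in blacklist)
    return whitelisted and not blacklisted


if __name__ == "__main__":
    # A page under lucene.apache.org with an empty blacklist is included.
    print(in_subcollection("http://lucene.apache.org/java/docs/",
                           ["http://lucene.apache.org/"], []))
```

With an empty whitelist nothing is ever included, since the page has to match the whitelist first.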

Subcollection definitions are read from the file subcollections.xml.
The format is as follows (imagine that you are crawling all
the virtual hosts under apache.org and you want to tag pages matching the
URL pattern "http://lucene.apache.org/" as part of the subcollection
lucene):

<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
       <subcollection>
               <name>lucene</name>
               <id>lucene</id>
               <whitelist>http://lucene.apache.org/</whitelist>
               <blacklist />
       </subcollection>
</subcollections>
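
As a rough sketch of how such a file could be consumed, the Python below parses the definition above and lists the subcollection ids a URL belongs to. It is not the plugin's implementation: `load_subcollections` and `subcollections_for` are hypothetical names, and two things are assumed that the description does not state, namely that multiple patterns in one element are separated by whitespace and that matching is a plain substring test.

```python
import xml.etree.ElementTree as ET

# The sample definition from this issue, embedded so the sketch is self-contained.
SUBCOLLECTIONS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
       <subcollection>
               <name>lucene</name>
               <id>lucene</id>
               <whitelist>http://lucene.apache.org/</whitelist>
               <blacklist />
       </subcollection>
</subcollections>
"""


def load_subcollections(xml_bytes):
    """Parse subcollection definitions into a list of plain dicts."""
    root = ET.fromstring(xml_bytes)
    definitions = []
    for node in root.findall("subcollection"):
        definitions.append({
            "id": (node.findtext("id") or "").strip(),
            "name": (node.findtext("name") or "").strip(),
            # Assumption: patterns inside one element are whitespace-separated.
            "whitelist": (node.findtext("whitelist") or "").split(),
            "blacklist": (node.findtext("blacklist") or "").split(),
        })
    return definitions


def subcollections_for(url, definitions):
    """Return the ids of every subcollection the URL belongs to."""
    ids = []
    for definition in definitions:
        # Assumption: a pattern matches when it is a substring of the URL.
        whitelisted = any(p in url for p in definition["whitelist"])
        blacklisted = any(p in url for p in definition["blacklist"])
        if whitelisted and not blacklisted:
            ids.append(definition["id"])
    return ids


definitions = load_subcollections(SUBCOLLECTIONS_XML.encode("utf-8"))

if __name__ == "__main__":
    print(subcollections_for("http://lucene.apache.org/java/docs/", definitions))
```

An empty `<blacklist />` element yields an empty pattern list, so it never excludes a page.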

The plugin contains an indexing filter, a query filter, and supporting classes.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-201) add support for subcollections

Nick Burch (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-201?page=all ]

Sami Siren updated NUTCH-201:
-----------------------------

    Attachment: subcollections-1.patch

[jira] Updated: (NUTCH-201) add support for subcollections

Nick Burch (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-201?page=all ]

Sami Siren updated NUTCH-201:
-----------------------------

    Attachment: subcollections.2.patch

- Added the missing class
- Updated for the Nutch -> Hadoop API changes

[jira] Resolved: (NUTCH-201) add support for subcollections

Nick Burch (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-201?page=all ]
     
Sami Siren resolved NUTCH-201:
------------------------------

    Resolution: Fixed

Just committed this.
