Created: (NUTCH-447) Dmoz Structure Parser Tool

Created: (NUTCH-447) Dmoz Structure Parser Tool

JIRA jira@apache.org
Dmoz Structure Parser Tool
--------------------------

                 Key: NUTCH-447
                 URL: https://issues.apache.org/jira/browse/NUTCH-447
             Project: Nutch
          Issue Type: New Feature
    Affects Versions: 0.9.0
         Environment: all platforms
            Reporter: Dennis Kubes
         Assigned To: Dennis Kubes
            Priority: Minor


This is a tool that takes the dmoz structure RDF file and returns a listing of the categories.  The categories returned can be limited by depth or by a regular expression pattern.  This tool borrows heavily from the DmozParser.
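
To make the description above concrete, here is a minimal Java sketch of the idea. It is not the DmozStructureParser from the patch. It assumes the structure RDF file marks each category with a Topic element whose r:id attribute holds the category path (for example Top/Arts/Music). The class name DmozCategoryLister, the method listCategories, and the sample data are hypothetical. Depth is counted here as the number of path segments.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not the class in dmoz-structure.patch.
class DmozCategoryLister {

    /**
     * Collects category paths from Topic elements in document order.
     * maxDepth <= 0 disables the depth limit; pattern == null disables the regex filter.
     */
    static List<String> listCategories(InputStream rdf, int maxDepth, Pattern pattern)
            throws Exception {
        final List<String> categories = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                // Assumption: categories appear as <Topic r:id="Top/...">.
                if (!"Topic".equals(qName)) {
                    return;
                }
                String path = atts.getValue("r:id");
                if (path == null || path.isEmpty()) {
                    return;
                }
                // Depth = number of '/'-separated segments, so "Top/Arts" has depth 2.
                int depth = path.split("/").length;
                if (maxDepth > 0 && depth > maxDepth) {
                    return;
                }
                if (pattern != null && !pattern.matcher(path).find()) {
                    return;
                }
                categories.add(path);
            }
        };
        // The default factory is not namespace-aware, so qName carries the raw tag name.
        SAXParserFactory.newInstance().newSAXParser().parse(rdf, handler);
        return categories;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the assumed format.
        String sample =
            "<RDF xmlns:r='http://www.w3.org/TR/RDF/' xmlns='http://dmoz.org/rdf/'>"
            + "<Topic r:id='Top'/>"
            + "<Topic r:id='Top/Arts'/>"
            + "<Topic r:id='Top/Arts/Music'/>"
            + "<Topic r:id='Top/Science'/>"
            + "</RDF>";
        byte[] bytes = sample.getBytes(StandardCharsets.UTF_8);

        // Limit by depth: prints [Top, Top/Arts, Top/Science]
        System.out.println(listCategories(new ByteArrayInputStream(bytes), 2, null));

        // Limit by regular expression: prints [Top/Arts, Top/Arts/Music]
        System.out.println(listCategories(new ByteArrayInputStream(bytes), 0,
                Pattern.compile("^Top/Arts")));
    }
}
```

A streaming SAX handler is used in this sketch because the real structure file is large, so building a full DOM tree would likely be wasteful. The two filters are independent and can be combined.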

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Updated: (NUTCH-447) Dmoz Structure Parser Tool

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-447:
-------------------------------

    Attachment: dmoz-structure.patch

Patch that contains the DmozStructureParser class.

Commented: (NUTCH-447) Dmoz Structure Parser Tool

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474663 ]

Otis Gospodnetic commented on NUTCH-447:
----------------------------------------

Is the idea to limit crawling to links under a certain category, as opposed to crawling all links in Dmoz?

Commented: (NUTCH-447) Dmoz Structure Parser Tool

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474713 ]

Dennis Kubes commented on NUTCH-447:
------------------------------------

This tool is for people who need a defined category structure or want to grab all or part of the dmoz category structure without URLs.  You could certainly then use this list as the topic list for the DmozParser to crawl only under a certain category.

[jira] Closed: (NUTCH-447) Dmoz Structure Parser Tool

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-447.
------------------------------


Closed

[jira] Resolved: (NUTCH-447) Dmoz Structure Parser Tool

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-447.
--------------------------------

    Resolution: Won't Fix

The tool is available as a patch in JIRA, so there is no need to add it to the main trunk.
