[jira] Created: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

Parth (Jira)
Process Sitemap data in text, rss or xml format as well as OAI-PMH
------------------------------------------------------------------

         Key: NUTCH-158
         URL: http://issues.apache.org/jira/browse/NUTCH-158
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Versions: 0.8-dev    
    Reporter: byron miller
    Priority: Minor


Add support to the fetcher to look for sitemap files, download them and process them into webdb.

Perhaps define a robots.txt directive that points to a sitemap in a standard format (RSS, XML, or plain text with one URL per line) and process that.
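
To make the idea concrete, here is a minimal sketch in Python (not Nutch code) of how a fetcher might discover and read a plain-text sitemap. The directive name `Sitemap` and the helper names `find_sitemap_urls` and `parse_text_sitemap` are assumptions made for illustration; the issue does not specify any of them.

```python
from urllib.parse import urljoin


def find_sitemap_urls(robots_txt, base_url):
    """Return the URLs named by 'Sitemap:' lines in a robots.txt body.

    The 'Sitemap' directive name is an assumption for this sketch.
    """
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        if field.strip().lower() == "sitemap" and value:
            # Resolve relative paths against the robots.txt location.
            found.append(urljoin(base_url, value))
    return found


def parse_text_sitemap(body):
    """Return the unique http(s) URLs of a one-URL-per-line sitemap, in order."""
    urls = []
    seen = set()
    for raw in body.splitlines():
        url = raw.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments
        if not url.startswith(("http://", "https://")):
            continue  # ignore anything that is not an absolute web URL
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The resulting list is the kind of input that could then be injected into the webdb; how that injection would be wired into the fetcher is left open here.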

I would love to see someone stomp out proprietary sitemap features and stop things from being as Google-specific as they are today :)

* RSS format/Atom Format (standard)
* XML meta description
* OAI-PMH meta description (http://www.openarchives.org/OAI/openarchivesprotocol.html)
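
As a rough illustration of handling the first two formats in the list above, the Python sketch below pulls link URLs out of an RSS, Atom, or generic XML document using only the standard library. The function name `extract_links` and the `loc` element for a generic XML sitemap are assumptions of this sketch, and OAI-PMH is not covered.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_links(xml_text):
    """Return unique link URLs found in an RSS, Atom, or XML document.

    RSS carries the URL as the text of <link>; Atom carries it in the
    href attribute of <link>. A <loc> element is assumed for a generic
    XML sitemap.
    """
    root = ET.fromstring(xml_text)
    links = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "link":
            url = elem.get("href") or (elem.text or "").strip()
        elif name == "loc":
            url = (elem.text or "").strip()
        else:
            continue
        if url and url not in links:
            links.append(url)
    return links
```

Note that for RSS this also returns the channel-level link, not only the per-item links; a real implementation would probably want to tell the two apart.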

Perhaps even a "pre-crawler" that scours for these and injects them into the webdb to help build your link map, so you could even index just the topN.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

Parth (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-158?page=comments#action_12365483 ]

raghavendra prabhu commented on NUTCH-158:
------------------------------------------

This is an important feature.

We should be able to automatically insert the links parsed out of a sitemap into the webdb.

But currently, if we enable parse-rss and crawl these links, don't they get added?
