Generating a sitemap

Generating a sitemap

Ian M. Evans
Been testing nutch to crawl for solr and I was wondering if anyone had
already worked on a system for getting the urls out of solr and generating
an XML sitemap for Google.

Re: Generating a sitemap

hossman

: Been testing nutch to crawl for solr and I was wondering if anyone had
: already worked on a system for getting the urls out of solr and generating
: an XML sitemap for Google.

it's pretty easy to just paginate through all docs in solr, so you could
do that -- but I'd be really surprised if Nutch wasn't also logging all the
URLs it indexed, so you could just post-process that log to build the
sitemap as well.



-Hoss
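A minimal sketch of the pagination approach Hoss describes, assuming a hypothetical Solr index with a stored "url" field. The fetch_page() stub stands in for a real HTTP call such as select?q=*:*&fl=url&start=N&rows=100 against your own Solr instance; the core name, field name, and sample URLs are illustrative, not from the thread.

```python
# Page through every doc in a (stubbed) Solr index and emit a
# Google-style XML sitemap. Replace fetch_page() with a real HTTP
# request, e.g. to:
#   http://localhost:8983/solr/select?q=*:*&fl=url&wt=json&start=N&rows=100
from xml.sax.saxutils import escape

ROWS = 100  # page size, like Solr's rows= parameter

def fetch_page(start, rows):
    """Stub standing in for a Solr query; returns a slice of doc dicts."""
    docs = [{"url": "http://example.com/page%d" % i} for i in range(250)]
    return docs[start:start + rows]

def build_sitemap():
    entries = []
    start = 0
    while True:
        page = fetch_page(start, ROWS)
        if not page:  # an empty page means we've walked past numFound
            break
        for doc in page:
            entries.append("  <url><loc>%s</loc></url>" % escape(doc["url"]))
        start += ROWS
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(entries) + "\n</urlset>\n")
```

The same loop works for post-processing a Nutch log instead: swap fetch_page() for a line reader over the log file.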


Re: Generating a sitemap

Jon Baer
It's also possible to use the Velocity contrib response writer and page through the results, wrapping them in sitemap elements.

BTW, generating a sitemap was a big reason for our switch from GSA to Solr, because (for some reason) the map took way too long to generate (even for simple requests).

If you page through w/ Solr (i.e. rows=100&wt=velocity&v.template=sitemap), it's fairly painless to build on cron.

- Jon

On Mar 18, 2010, at 6:25 PM, Chris Hostetter wrote:



Re: Generating a sitemap

Erik Hatcher-4
Jon -

Very cool use of VelocityResponseWriter!

Would you happen to have a sitemap.vm template to contribute? I realize there'd need to be an external URL configurable, but this would be trivially added as a request parameter and leveraged in the template.

        Erik

p.s. Anyone else using VelocityResponseWriter out there? Sitemaps is a great use of it. I've also got a report of a big company in Brazil using it for e-mail generation of search results. I'm in the process of baking VrW into the main Solr example (it's there on trunk, basically), and more examples are better.

On Mar 18, 2010, at 7:40 PM, Jon Baer wrote:



Re: Generating a sitemap

Jon Baer
Unfortunately it's a pretty domain-specific thing (URLs, content, etc.), and there are also limits at certain points (see ...), but we took CNN.com as a model, for example:

http://www.cnn.com/video_sitemap_index.xml
http://www.cnn.com/sitemap_videos_0001.xml

Then you just line up the big 3 w/ the static URLs, etc.

http://en.wikipedia.org/wiki/Sitemaps (the submission URLs are there)
http://www.bing.com/toolbox/posts/archive/2009/10/09/submit-a-sitemap-to-bing.aspx

In general though, it's great to create custom handlers and use Velocity templates for pretty much anything; it's also great for prototyping.

- Jon

On Mar 19, 2010, at 8:55 AM, Erik Hatcher wrote:



Re: Generating a sitemap

Doki
Hi all,

     Hate to bring forward a zombified thread (Mar 2010 though, not too bad), but I'm also tasked with generating a sitemap for items indexed in a Solr index. I've been at this job for only a few weeks, so Solr and Lucene are all new to me, but I think my path forward is to create a RequestHandler that creates a flat datafile upon request, then write a script (PHP) that calls this handler, reformats the data into the appropriate XML format, and posts it for Google to find and crawl. Attach this script to a crontab entry (daily, weekly, whatever schedule Google Webmaster Tools has set for the site), and boom! Problem solved.
     Anyone else tried this method? Any successes, failures, advice, etc.?

Dave
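Dave's flat-datafile step can be sketched without a custom RequestHandler by using Solr's CSV response writer as the flat file (e.g. select?q=*:*&fl=url,last_modified&wt=csv) and reformatting it into sitemap XML. The field names and the sample row below are assumptions for illustration:

```python
# Reformat a CSV export of indexed docs (url + last_modified columns
# assumed) into sitemap XML -- the "reformat" half of Dave's cron job.
import csv
import io
from xml.sax.saxutils import escape

def csv_to_sitemap(csv_text):
    reader = csv.DictReader(io.StringIO(csv_text))
    out = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for row in reader:
        out.append("  <url>")
        out.append("    <loc>%s</loc>" % escape(row["url"]))
        if row.get("last_modified"):
            # Trim the ISO timestamp to the YYYY-MM-DD form sitemaps accept.
            out.append("    <lastmod>%s</lastmod>" % row["last_modified"][:10])
        out.append("  </url>")
    out.append("</urlset>")
    return "\n".join(out)

sample = "url,last_modified\nhttp://example.com/a,2010-03-18T00:00:00Z\n"
```

The resulting file is what the cron job would write to the webroot for Google to fetch.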