Crawling an SCM to update a Solr index


Crawling an SCM to update a Solr index

Van Tassell, Kristian
Hello everyone,

I'm in the process of pulling together requirements for an SCM (source code management) crawling mechanism for our Solr index. I probably don't need to argue the need for a crawler, but to be specific: we have an index that receives its updates from a custom-built application. I would, however, like to periodically crawl the SCM to ensure the index is up to date. In addition, if updates are made that require a complete reindex (such as schema.xml modifications), I could use this crawler to update everything, or just specific areas.

I'm wondering if there are any initiatives, tools (like Nutch), or whitepapers out there that crawl an SCM. More specifically, I'm looking for a Perforce solution. I'm guessing there is nothing specific, and I'm prepared to design for our own requirements, but I wanted to check with the Solr community before getting too far in.

I'm most likely going to build the solution to interact with the SCM directly (via its API) rather than syncing the SCM repository to the filesystem and crawling it that way, since there could be filesystem problems syncing the data, and because there may be relevant metadata that can be retrieved from the SCM.

Thanks in advance for any information you may have,
Kristian
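For concreteness, the API-driven direction described above could be sketched roughly like this: parse Perforce's `p4 fstat` tagged output into documents ready for a Solr update. This is a minimal sketch, not anything from the thread; the Solr field names (`id`, `rev`, `mtime`) are placeholder assumptions to be matched against the real schema.xml, and a real crawler would invoke `p4` (or the Perforce API) instead of using a canned sample.

```python
def parse_fstat(output: str):
    """Parse `p4 fstat` tagged output (lines like '... depotFile //depot/a.c')
    into one dict per file, splitting records on blank lines."""
    docs, current = [], {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            if current:
                docs.append(current)
                current = {}
            continue
        if line.startswith("... "):
            key, _, value = line[4:].partition(" ")
            current[key] = value
    if current:
        docs.append(current)
    return docs

def to_solr_doc(record):
    """Map a Perforce fstat record onto placeholder Solr fields."""
    return {
        "id": record["depotFile"],
        "rev": int(record.get("headRev", 0)),
        "mtime": int(record.get("headTime", 0)),
    }

# Canned `p4 fstat` output standing in for a real Perforce call.
sample = """\
... depotFile //depot/main/foo.c
... headRev 3
... headTime 1334952000

... depotFile //depot/main/bar.c
... headRev 1
... headTime 1334950000
"""

solr_docs = [to_solr_doc(r) for r in parse_fstat(sample)]
```

Using the depot path as the Solr `id` makes the periodic crawl idempotent: re-crawling simply overwrites each file's document, and comparing `headRev` against what is already indexed tells you whether a file needs re-fetching at all.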

Re: Crawling an SCM to update a Solr index

Otis Gospodnetic-2
Kristian,

For what it's worth, for http://search-lucene.com and http://search-hadoop.com we simply check out the source code from the SCM and index from the file system. It works reasonably well. The only issue I can recall us having is with the source code organization under SCM: modules get moved around, and sometimes this requires us to update things on our end to match those changes.

Otis
----
Performance Monitoring for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
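The checkout-then-index approach Otis describes amounts to walking a synced working copy and building one document per source file. A rough sketch under assumed conventions (the field names and extension filter are placeholders):

```python
import os
import tempfile

def crawl_tree(root, extensions=(".java", ".py", ".xml")):
    """Yield a placeholder Solr document for each matching file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                body = f.read()
            yield {
                "id": os.path.relpath(path, root),  # path relative to checkout
                "body": body,
                "mtime": int(os.path.getmtime(path)),
            }

# Demo against a throwaway "checkout" instead of a real sync.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    with open(os.path.join(root, "src", "Foo.java"), "w") as f:
        f.write("class Foo {}")
    docs = list(crawl_tree(root))
```

Note that using checkout-relative paths as IDs is exactly what makes the module-reorganization problem Otis mentions painful: when a module moves, every file under it gets a new ID, and the old documents must be explicitly deleted from the index.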




RE: Crawling an SCM to update a Solr index

Van Tassell, Kristian
Otis,

Thanks for the input! Were it not for the metadata I need to extract, and the slight possibility that a sync or filesystem error/inconsistency could occur, I would take that same route.

-Kristian
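For the "keep the index up to date" part of the plan above, one common pattern is to poll Perforce for changelists newer than the last one indexed, then fetch metadata only for files touched by those changes. A sketch of the bookkeeping, assuming `p4 changes`-style output (the changelist numbers and descriptions here are invented):

```python
def new_changes(changes_output: str, last_seen: int):
    """Return changelist numbers newer than last_seen, oldest first.
    Expects `p4 changes` output lines such as:
    "Change 1234 on 2012/04/20 by user@client 'message'"
    """
    nums = []
    for line in changes_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Change" and parts[1].isdigit():
            nums.append(int(parts[1]))
    return sorted(n for n in nums if n > last_seen)

# Canned output standing in for a real `p4 changes` call.
sample = """\
Change 1236 on 2012/04/20 by kristian@ws 'fix schema'
Change 1235 on 2012/04/19 by otis@ws 'move module'
Change 1234 on 2012/04/18 by kristian@ws 'initial'
"""
```

Persisting `last_seen` (for example, as a document in the index itself) keeps the crawler incremental between runs, while a full crawl after a schema.xml change simply resets it to zero.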
