Indexing source code files


Dharmalingam
I am working on a search mechanism to link a requirement (i.e., a query) to source code files (i.e., documents). For that purpose, I indexed the source code files using Lucene. Contrary to a traditional natural-language search scenario, we search for code files that are relevant to a given requirement. One problem here is that the source files usually contain a lot of abbreviations, words joined by _, or combinations of words and/or abbreviations (e.g., getAccountBalanceTbl). I am wondering whether any of you have already indexed (source) files or documents that contain that kind of word.

Re: Indexing source code files

Mathieu Lecarme
You need a specific Tokenizer.
You will use several Fields: class, method, comments, code, javadoc.
Some fields can use a standard tokenizer (comments); others need a specific
one that splits oneJavaWord into several words.

M.
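
A minimal sketch of this multi-field approach, assuming the Lucene 2.x-era API that was current for this thread (the field layout and analyzer choices are illustrative; the "code" field would normally get a custom identifier-splitting analyzer, with WhitespaceAnalyzer standing in as a placeholder here):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SourceFileIndexer {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer for natural-language fields (comments, javadoc);
        // a placeholder analyzer for the "code" field, to be swapped for one
        // that splits identifiers such as getAccountBalanceTbl.
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("code", new WhitespaceAnalyzer());

        IndexWriter writer = new IndexWriter("code-index", analyzer, true);

        Document doc = new Document();
        doc.add(new Field("path", "src/Account.java",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("comments", "Returns the account balance table.",
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("code", "public Table getAccountBalanceTbl() { ... }",
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}

At query time, passing the same PerFieldAnalyzerWrapper to the QueryParser keeps the requirement text analyzed consistently with the index.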



Re: Indexing source code files

kkrugler
In reply to this post by Dharmalingam

Yes, that's been something we've spent a fair amount of time on...see
http://www.krugle.org (public code search).

As Mathieu noted, the first thing you really want to do is split the
file up into at least comments vs. code. Then you can use a regular
analyzer (or perhaps something more human language-specific, e.g.
with stemming support) on the comment text, and your own custom
tokenizer on the code.
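
A rough sketch of that first split for Java-like sources, using a simple regex (illustrative only; it ignores corner cases such as comment markers inside string literals):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentSplitter {
    // Matches /* ... */ block comments and // line comments.
    private static final Pattern COMMENT =
            Pattern.compile("/\\*.*?\\*/|//[^\\n]*", Pattern.DOTALL);

    /** Returns { commentText, codeText } extracted from a source file. */
    public static String[] split(String source) {
        StringBuilder comments = new StringBuilder();
        Matcher m = COMMENT.matcher(source);
        while (m.find()) {
            comments.append(m.group()).append('\n');
        }
        // replaceAll() resets the matcher, then blanks out every comment.
        String code = m.replaceAll(" ");
        return new String[] { comments.toString(), code };
    }
}

The two resulting strings can then be indexed into separate Lucene fields, each with its own analyzer.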

In the code, you might further want to treat literals (strings, etc)
differently than other terms.

And in "real" code terms, then you want to do essentially synonym
processing, where you turn a single term into multiple terms based on
the automatic splitting of the term using '_', '-', camelCasing,
letter/digit transitions, etc.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"



Re: Indexing source code files

Bill Au
In reply to this post by Dharmalingam
There is an open-source project, OpenGrok, that uses Lucene for indexing and
searching source code:

http://opensolaris.org/os/project/opengrok/

It has Analyzers for different types of source files. It does not link source
code to requirements, but you can take a look at its source code to see how it
does the indexing.

Bill
