[GSoC]About some general information

Han Jiang
Hi All,

I'm Billy, a senior undergraduate student at Peking University. I'm working in the area of Information Retrieval and Web Mining. When going through the idea list, I was quite interested in LUCENE-3892 and LUCENE-3069. I am very proficient in Java, and have been using Lucene for about one year. I am looking forward to making a contribution to this project.

I have a few questions about Lucene:

First of all, which version of Lucene should we use as a starting point: trunk or 3.5?
Is there any demo code that shows the idea of Codecs?
How many postings formats are supposed to be implemented for LUCENE-3892?
Is there any further documentation for LUCENE-3069?

Thank you!

--
Han Jiang

EECS, Peking University, China
Every Effort Creates Smile
 
Senior Student

Re: [GSoC]About some general information

Michael McCandless-2
Hello! Answers below:

On Wed, Mar 21, 2012 at 11:03 AM, Han Jiang <[hidden email]> wrote:
> Hi All,
>
> I'm Billy, a senior undergraduate student at Peking University. I'm working
> in the area of Information Retrieval and Web Mining. When going through the
> idea list, I was quite interested in LUCENE-3892 and LUCENE-3069. I am
> very proficient in Java, and have been using Lucene for about one year. I am
> looking forward to making a contribution to this project.

Awesome.

> I have a few questions about Lucene:
>
> First of all, which version of Lucene should we use as a starting point:
> trunk or 3.5?

I think both of these issues will be trunk-only: both are far
easier to do with the Codec API in 4.0.

> Is there any demo code that shows the idea of Codecs?

Maybe the simplest demo would be to look at the SimpleText codec?  It
tries to have simple source code as well as a simple (text-only,
human-readable) on-disk format.
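
To give a rough feel for what a text-only, human-readable on-disk
format means, here is a toy sketch in plain Java.  It is not the
SimpleText codec's source or its actual file layout, and it uses no
Lucene classes: the class name ToyTextPostings, the write/read methods
and the "term"/"doc"/"END" line layout are all made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: NOT Lucene's SimpleText codec or its real file layout.
final class ToyTextPostings {

    // Renders term -> doc IDs as plain text, one labelled line per entry.
    static String write(Map<String, List<Integer>> postings) {
        StringBuilder out = new StringBuilder();
        // Copying into a TreeMap makes the terms come out in sorted order.
        for (Map.Entry<String, List<Integer>> e : new TreeMap<>(postings).entrySet()) {
            out.append("term ").append(e.getKey()).append('\n');
            for (int doc : e.getValue()) {
                out.append("  doc ").append(doc).append('\n');
            }
        }
        out.append("END\n");
        return out.toString();
    }

    // Parses the text produced by write() back into a sorted map.
    static Map<String, List<Integer>> read(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        List<Integer> current = null;
        for (String line : text.split("\n")) {
            if (line.startsWith("term ")) {
                current = new ArrayList<>();
                postings.put(line.substring("term ".length()), current);
            } else if (line.startsWith("  doc ")) {
                if (current == null) {
                    throw new IllegalStateException("doc line before any term line");
                }
                current.add(Integer.parseInt(line.substring("  doc ".length())));
            } else if (line.equals("END")) {
                break;
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("lucene", Arrays.asList(0, 3));
        postings.put("codec", Arrays.asList(3));
        String text = write(postings);
        System.out.print(text);                  // the "index" can be read by eye
        System.out.println(read(text).equals(postings));
    }
}
```

The point of such a format is that you can open the file in a text
editor and see exactly what was written, which makes it handy for
learning and debugging rather than for performance.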

> How many postings formats are supposed to be implemented for
> LUCENE-3892?

This can be worked out when scoping the project... but I think getting
one postings format working well would be awesome :)  If somehow
that's too easy, then add more!

> Is there any further documentation for LUCENE-3069?

Not that I know of... but I suspect the approach can be very similar
to the MemoryPostingsFormat we already have, just that it'd only be
the terms data stored in the FST, while the postings
(docs/freqs/positions/offsets) are written to a file.

Ideally, it would just act like a different terms dictionary
implementation, i.e., so that we can then plug in any PostingsBaseFormat
(even the one from LUCENE-3892!).
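
The split described above (terms data held in memory, postings
written to a file) can be sketched in plain Java roughly as follows.
This is only a toy under loud assumptions: a TreeMap stands in for the
FST, the file layout (a count followed by doc IDs) is made up, and
ToyTermsDict with its build/docs methods is a hypothetical name, not
Lucene API and not how MemoryPostingsFormat or PostingsBaseFormat are
implemented.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration only: a TreeMap stands in for the FST; none of this is Lucene API.
final class ToyTermsDict {
    // In-memory terms data: term -> byte offset of its postings in the file.
    private final SortedMap<String, Long> termToOffset = new TreeMap<>();
    private final Path postingsFile;

    ToyTermsDict(Path postingsFile) {
        this.postingsFile = postingsFile;
    }

    // Writes every postings list to the file and keeps only its offset in memory.
    void build(SortedMap<String, int[]> postings) throws IOException {
        termToOffset.clear();
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(postingsFile))) {
            for (Map.Entry<String, int[]> e : postings.entrySet()) {
                termToOffset.put(e.getKey(), (long) out.size()); // bytes written so far
                out.writeInt(e.getValue().length);
                for (int doc : e.getValue()) {
                    out.writeInt(doc);
                }
            }
        }
    }

    // Looks the term up in memory, then seeks into the file to read its doc IDs.
    int[] docs(String term) throws IOException {
        Long offset = termToOffset.get(term);
        if (offset == null) {
            return new int[0]; // unknown term: no file access needed
        }
        try (RandomAccessFile in = new RandomAccessFile(postingsFile.toFile(), "r")) {
            in.seek(offset);
            int[] docs = new int[in.readInt()];
            for (int i = 0; i < docs.length; i++) {
                docs[i] = in.readInt();
            }
            return docs;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("toy-postings", ".bin");
        try {
            SortedMap<String, int[]> postings = new TreeMap<>();
            postings.put("codec", new int[] {3});
            postings.put("lucene", new int[] {0, 3});
            ToyTermsDict dict = new ToyTermsDict(file);
            dict.build(postings);
            System.out.println(Arrays.toString(dict.docs("lucene")));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Because the dictionary only hands out an offset, the code that encodes
and decodes the postings could in principle be swapped without
touching the term lookup, which is the pluggability idea mentioned
above.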

> Thank you!

You're welcome, and welcome to Lucene/Solr!

Mike McCandless

http://blog.mikemccandless.com
