clucene-java bindings

Ben van Klinken
Hi Nutch People,

I am a developer of CLucene, which is a full C++ port of Apache
Lucene. I would like to propose something to users of Nutch:

I have been working on some SWIG wrappers for CLucene in various
higher-level languages such as C#, Java and COM. I started working on
the Java wrapper in order to 'steal' Java test suites for testing
CLucene.

I have already managed to run about half of the luceneDotNet tests
successfully using the CLucene-csharp bindings (most of the rest
cannot be run because of the lack of director support in the SWIG C#
module). This has been useful in tracking down bugs, etc.

Without too much effort, I have managed to get the Java bindings
working. I have so far been able to get the IndexFiles demo program to
run with very few changes to the Java code (I had to change the
imports and add a System.loadLibrary call, though these differences
could eventually be removed completely).
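
As a rough illustration of the System.loadLibrary change described
above, here is a minimal sketch of how a Java program might load a
native library before using SWIG-generated classes. The library name
"clucene" and the NativeLoader helper are assumptions made for this
example; the post does not say what the actual library is called.

```java
/** Hypothetical helper; not part of CLucene or its bindings. */
final class NativeLoader {
    private NativeLoader() {}

    /** Tries to load a native library by name; returns false if the JVM cannot find it. */
    static boolean tryLoad(String libraryName) {
        try {
            // The JVM maps the name to a platform file, e.g. lib<name>.so on Linux.
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Could not load native library '" + libraryName
                    + "': " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // "clucene" is an assumed name, used here only for illustration.
        if (!tryLoad("clucene")) {
            System.err.println("Native bindings unavailable.");
        }
    }
}
```

The native library has to be on java.library.path for the load to
succeed. A wrapper jar could hide this call in a static initializer,
which is one way the difference could disappear from application code.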

I only spent a minute looking at speeds, but indexing a directory
took 2.5 seconds with Java Lucene and 1.5 seconds with clucene-java.
Of course a single run is not saying much, but it suggests that
clucene-java *might* be faster.
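
A single timed run like the one above is easily skewed by JIT warm-up
and disk caching. The following is a minimal sketch of a repeatable
timing harness, assuming the indexing step can be wrapped in a
Runnable. TimingHarness and the placeholder workload are hypothetical
names for this example and do not call Lucene or CLucene.

```java
import java.util.Arrays;

/** Hypothetical timing helper; the workload is supplied by the caller. */
final class TimingHarness {
    private TimingHarness() {}

    /** Runs the task warmupRuns times untimed, then returns the (upper) median of measuredRuns timings. */
    static long medianNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be at least 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // let the JIT and caches settle before measuring
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[measuredRuns / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the indexing call under test.
        Runnable placeholderWorkload = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unexpected empty result");
            }
        };
        long median = medianNanos(placeholderWorkload, 3, 9);
        System.out.printf("median: %.3f ms%n", median / 1_000_000.0);
    }
}
```

Running the same harness against both implementations, on the same
directory, would give a fairer comparison than one run of each.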

So what I wanted to propose to users and developers of Nutch is this:
with a bit of effort, clucene-java could be good enough to be 'dropped
into' the Nutch project, thereby speeding up the Nutch indexer. We
could write directors for clucene-java which would pass off some
things, like the analysers, into Java. This would be beneficial to
Nutch because of the added speed. If the clucene-java wrapper were
written well, there would be no need for any code change in Nutch,
aside from changing which Lucene jar file is loaded.
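
To make the director idea above concrete: a SWIG director lets a
target-language subclass override a C++ virtual method, so native code
can call back into Java. The sketch below only imitates that call
shape in pure Java. TokenAnalyzer, IndexerStandIn and DirectorSketch
are hypothetical names, and no JNI or CLucene code is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical callback type; with SWIG this role would be played by a generated proxy class. */
interface TokenAnalyzer {
    List<String> tokenize(String text);
}

/** Pure-Java stand-in for the native indexer; a real binding would cross JNI here. */
final class IndexerStandIn {
    private final TokenAnalyzer analyzer;
    private final List<String> indexedTerms = new ArrayList<>();

    IndexerStandIn(TokenAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    void addDocument(String text) {
        // The "native" side hands analysis back to the Java-side implementation.
        indexedTerms.addAll(analyzer.tokenize(text));
    }

    List<String> terms() {
        return indexedTerms;
    }
}

final class DirectorSketch {
    private DirectorSketch() {}

    /** A simple Java-side analyser: split on whitespace and lowercase each token. */
    static List<String> lowercaseWhitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        IndexerStandIn indexer = new IndexerStandIn(DirectorSketch::lowercaseWhitespaceTokens);
        indexer.addDocument("Hello Nutch People");
        System.out.println(indexer.terms());
    }
}
```

In a real binding every such callback crosses the native boundary, so
how much of the speed gain survives would need to be measured.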

These are just some preliminary thoughts; I'm sure there is still a
lot to think about. But I have shown with the demo files that the
concept could work, and I think that it could give Nutch
indexing/search a reasonable speed boost.

What do people think? I am prepared to nut out this one with whoever
is interested.

cheers,
ben

Re: clucene-java bindings

Piotr Kosiorowski
Hello Ben,
I personally would be interested mainly in the search part of it, if
the speed increase is significant. I am running my indices on
Linux/AMD Opterons, and I hope CLucene will work well in this
environment. I assume CLucene is compatible with the Java Lucene index
format, as we have some tools in Java that manipulate Lucene indices.
If you have anything to integrate with Nutch, I am willing to help
with the integration and test it.
Regards,
Piotr


Re: clucene-java bindings

Ben van Klinken
Hi

On 8/9/05, Piotr Kosiorowski <[hidden email]> wrote:
> Hello Ben,
> I personally would be interested mainly in search part of it if speed
> increase would be significant. I am running my indices on linux/ AMD
> Opterons - I hope CLucene will work well in this environment.
If anyone has some Java benchmark classes, I can try to run them.

CLucene compiles fine on Linux. I have been experimenting with memory
pools, etc. to increase speed even more; the Google one seems to
increase the speed of CLucene quite a bit. I started looking at
building a memory pool into CLucene, so I could choose which objects
would be pooled. I haven't had a chance to work on it much more, but I
did notice some improvements.

There are various optimisations that will increase the speed on Linux
even further. There are often no hash maps available on Linux, so
CLucene must use non-hashed maps. Using the Google densemap code might
speed up this aspect of it too.

So there are many things that can be done on CLucene to improve the
speed. It just needs more smart people to look at it ;)

> I assume CLucene is compatible with Java lucene index format as we do have some
> tools in Java that manipulate Lucene indices.
Yep, it is an exact port, but currently only implementing the 1.4.3 API.

> integrate with nutch I am willing to help with integration and test it.
Great!

> Regards,
> Piotr