Modelling relational data in Lucene Index?


Modelling relational data in Lucene Index?

Rajesh parab
Hi,

As I understand it, Lucene has a flat structure in which you define multiple fields inside a document, with no relationship between the fields.

I would like to enable index-based search for some of the components inside a relational database. For example, let's say there is a "Folder" object. The Folder object can have a relationship with a File object. The File object, in turn, can have attributes like is image, is text file, etc. So, the structure is
   
    Folder -- > File
             |
             ------- > is image, is text file, ......


I would like to enable a search to find a Folder that contains a File of type image. How can we model such relational data inside a Lucene index?

Regards,
Rajesh




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Modelling relational data in Lucene Index?

Mark Miller-3
Lucene is probably not the solution if you are looking for a relational
model. You should be using a database for that. If you want to combine
Lucene with a relational model, check out Hibernate and the new EJB
annotations that it supports...there is a cool little Lucene add-on that
lets you declare fields to be indexed (and how) with annotations.

- Mark


Re: Modelling relational data in Lucene Index?

Rajesh parab
In reply to this post by Rajesh parab
Thanks Mark.

Can you please tell me more about the Lucene add-on you are talking about? Are you talking about Compass?

Regards,
Rajesh


Re: Modelling relational data in Lucene Index?

chrislusf
In reply to this post by Mark Miller-3
For this specific question, you can create an index on files, search
for files of type image, and from the matched files find the unique
directories (this can be done in Lucene, or you can do it in Java).

Of course, this does not scale to deeper relationships. Usually you do
need to flatten the database objects in order to use Lucene. It's
just trading space for speed.

I would prefer a detached approach over the Hibernate or EJB
approach, which ties the index too tightly to one system. How do you
rebuild if the index is corrupted, or you have a new Analyzer, or the
schema evolves? How do you make it multi-thread safe?
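
The first suggestion above (one entry per file, then collapsing the matches to unique folders) can be sketched roughly as follows. This is only an illustration using plain Java collections rather than the Lucene API; the FileDoc class and its folder, name and type fields are hypothetical and stand in for one flattened Lucene document per file.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FolderSearch {

    /** Hypothetical flattened entry: one per file, carrying its parent folder. */
    static final class FileDoc {
        final String folder;
        final String name;
        final String type;

        FileDoc(String folder, String name, String type) {
            this.folder = folder;
            this.name = name;
            this.type = type;
        }
    }

    /** Returns the distinct folders that contain at least one file of the given type. */
    static Set<String> foldersWithFileType(List<FileDoc> index, String type) {
        Set<String> folders = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        for (FileDoc doc : index) {
            if (doc.type.equals(type)) {
                folders.add(doc.folder);
            }
        }
        return folders;
    }

    public static void main(String[] args) {
        List<FileDoc> index = new ArrayList<>();
        index.add(new FileDoc("/photos", "a.png", "image"));
        index.add(new FileDoc("/photos", "b.png", "image"));
        index.add(new FileDoc("/docs", "notes.txt", "text"));
        index.add(new FileDoc("/mixed", "logo.gif", "image"));

        System.out.println(foldersWithFileType(index, "image")); // prints [/photos, /mixed]
    }
}
```

The folder value is repeated on every file entry, which is the space-for-speed trade mentioned above.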

--
Chris Lu
-------------------------
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com


Re: Modelling relational data in Lucene Index?

Rajesh parab
In reply to this post by Rajesh parab
Thanks for the feedback, Chris.

I agree with you. The data set should be flattened out to be stored inside a Lucene index. The Folder-File case was just an example. As you know, in a relational database, we can have more complex relationships. I understand that this model may not work for deeper relationships.

What I am mainly interested in is a relationship just one level deep. But I would like to search on the additional attributes of the related object. For example, in the Folder-File relationship, I would like to use additional file attributes as search criteria along with the file name while searching for folders.

The way I see it is to have a single field for the related object and all its additional attributes, and to use some separator while capturing this data inside the Lucene Field object. For example -
       
            new Field("file", "abc.txt<sep>image");
 
But, I am not quite sure if this model will work.
 
BTW, I did not understand what you meant by the detached approach. Can you please elaborate?
 
Regards,
Rajesh


Re: Modelling relational data in Lucene Index?

chrislusf
Hi, Rajesh,

You can use a space as <sep> by using WhitespaceAnalyzer.

By detached mode, I mean the search function and your Java system
should be logically separated. From the technical side, a
separate search server will be more scalable. From the business side,
searching is more of an add-on than a must-have, and users
will keep coming up with new and different requirements for querying the content.

--
Chris Lu



Re: Modelling relational data in Lucene Index?

Erick Erickson
In reply to this post by Rajesh parab
One thing that took me a while to grasp, and that is not obvious to folks with
significant database backgrounds, is that the fields in a Lucene document are
related to those of any other document only by the meaning that you, as the
programmer, give them. That is, document 1 may have fields a, b, c.
Document 2 may have fields b, e, g. There is no requirement that, in this
example, document 1 have fields e and g, or vice versa. In
other words, Lucene documents don't fit into a table model.

The reason I mention this is that I'm extremely leery of packing data into a
field when the pieces really don't belong together. Plus, your searching becomes more
complicated.

In your example above, what happens if the file name and the type value (e.g. image)
are similar enough to produce false hits? If you store them as separate fields
in a document, you don't have this kind of problem.

So, if you can cleverly de-normalize your data in such a way as to satisfy
all the searches you'll ever want to perform, you can store it all in a
Lucene index and be happy. If you can't, you could use Lucene to search the
parts you *do* care about and store the rest in a database. Or, you could
just use a database. I believe it all hinges on whether you have a fixed set
of queries you can anticipate (and thus reflect in a Lucene index) or not.
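
To make the false-hit concern concrete, here is a small sketch in plain Java (not the Lucene API). It assumes the packed field is split on whitespace, roughly as a whitespace-based analyzer would do, and compares that with keeping the name and type in separate fields. The method names and the "name"/"type" field names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PackedFieldDemo {

    /** Packed model: name and type share one whitespace-tokenized field. */
    static boolean matchesPacked(String packedValue, String term) {
        List<String> tokens = Arrays.asList(packedValue.trim().split("\\s+"));
        return tokens.contains(term);
    }

    /** Separate-field model: the term is checked only against the named field. */
    static boolean matchesField(Map<String, String> doc, String field, String term) {
        return term.equals(doc.get(field));
    }

    public static void main(String[] args) {
        // A text file that happens to be called "image".
        String packed = "image text";

        Map<String, String> separate = new HashMap<>();
        separate.put("name", "image");
        separate.put("type", "text");

        // Searching for files of type image:
        System.out.println(matchesPacked(packed, "image"));          // true  (a false hit)
        System.out.println(matchesField(separate, "type", "image")); // false (correct)
    }
}
```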

Best
Erick


Re: Modelling relational data in Lucene Index?

Emmanuel Bernard
In reply to this post by Rajesh parab
Hi
No, he is talking about
http://www.hibernate.org/hib_docs/annotations/reference/en/html/lucene.html

Also note that I'm about to release a new, much more flexible version:
http://www.mail-archive.com/hibernate-dev%40lists.jboss.org/msg00392.html
and for the future (but flexible)
http://www.mail-archive.com/hibernate-dev%40lists.jboss.org/msg00393.html

Note that Compass is an alternative approach. I haven't really looked at
the project in detail; the main drawbacks for me and some other people
who compared the two were:
 - it requires you to deal with a different API than your ORM
 - it does not give you back a managed (ORM) object on query results
 - it abstracts away quite a lot of Lucene

I guess you need to check for yourself.

Emmanuel


Re: Modelling relational data in Lucene Index?

Emmanuel Bernard
In reply to this post by chrislusf
Hi,
What exactly are your concerns about the "non-detached" approach (see
below)?

Chris Lu wrote:
>
> I would prefer a detached approach instead of Hibernate or EJB's
> approach, which is kind of too tightly coupled with any system. How to
It is probably going to be coupled with yours ;-)
> rebuild if the index is corrupted, or you have a new Analyzer, or
I've introduced a session.index() which forces the (re)indexing of the
document.
> schema evolves? How to make it multi-thread safe?
What do you mean by multithread safe? The indexing?
The indexing is multithread safe in the Hibernate Lucene integration.

The query process?
The query doesn't have to be, since you query on a given session (aka user
conversation), so there is no multithreading threat here.


Re: Modelling relational data in Lucene Index?

chrislusf
I personally like your effort, but technically I would disagree.

The SOLR project, and the project I am working on, DBSight, take a
detached approach that is implementation-agnostic, no matter whether the client is
Java, Ruby, PHP, or .NET. The returned results can be rendered HTML,
JSON, or XML. I don't think you can be more flexible than that. If
creating a new index takes 5 minutes without any coding, you can
create something more creative.

From the business side, you don't need to worry about indexing when
designing a system. New requirements may come. It's very hard to
anticipate all the needs.

Technically, a detached approach gives more flexibility with resources like
CPU, memory, and hard drive. For example, if your index grows large, say
1G, indexing can take hours with merging. I am not sure how Compass or
Hibernate/Lucene handles that. Do you need to rewrite code at that point? I
actually feel it's a dangerous trap.

> I've introduced a session.index() which forces the (re)indexing of the
> document
So does that mean you need to write some code to fix the index if it has crashed?

> What do you mean by multithread safe? The indexing?
> the indexing is multithread safe in the Hibernate Lucene integration
The indexing can be thread-safe. But will it affect the searching? With
many files changing and merging, if you cache the searcher, the
searching will hit "read past EOF" exceptions. If you don't cache
the searcher, you will lose the built-in caching, FieldCacheImpl, in
Lucene.

>
> The query process?
> the query doesn't have to since you query on a give session (aka user
> conversation), so no multithread threat here.
So you are not caching the searcher.

--
Chris Lu


Re: Modelling relational data in Lucene Index?

Emmanuel Bernard
I had a quick look at SOLR and DBSight. They seem to pursue a different
goal than Hibernate Lucene.
The former belong to the project box category: you set up a server that
will handle the search for you. The application then delegates the
work to those servers.
The latter belongs to the framework category: you use it inside your
Hibernate/EJB 3.0 application to enable an index-based search feature.
To a certain extent, it is the same difference as between a Google box and
Lucene.

You can write some code based on the latter to cover the former's
features, especially the platform abstraction (PHP, .NET), but it is probably a
lot of work and that is not really the point.
You can write some code based on the former to enable indexing and
search of your persistent domain model (persisted through Hibernate),
but that is probably more work.

Really it is a matter of easing the pain on one side of the problem or
the other. I don't see much competition between the two approaches;
they cover different goals.

To specifically answer some of your remarks:
 - yes, you need to write some code to recreate an index. Literally, 6
lines of code.
 - no, I do not currently cache the searcher, because Hibernate is
transactional by nature and protects you as much as possible from
read-uncommitted and other data inconsistencies. I guess I could
implement some caching capabilities using reader.isCurrent() or
something equivalent.
 - the ability to split searcher servers from indexer servers is on my
todo list.
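
The caching idea in the second point could look roughly like the sketch below: keep one reader, and reopen it only when an isCurrent()-style check reports that the index has changed. This is an assumption-laden illustration, not the Hibernate Lucene code; the Reader interface and the SearcherCache class are hypothetical stand-ins, and the "index" is simulated with a version counter.

```java
import java.util.function.Supplier;

public class SearcherCache {

    /** Hypothetical minimal view of an index reader; not a Lucene type. */
    interface Reader {
        boolean isCurrent(); // true while the index has not changed since this reader was opened
    }

    private final Supplier<Reader> opener; // opens a fresh reader on the latest index state
    private Reader cached;

    SearcherCache(Supplier<Reader> opener) {
        this.opener = opener;
    }

    /** Returns the cached reader, reopening it only when the index has changed. */
    synchronized Reader get() {
        if (cached == null || !cached.isCurrent()) {
            cached = opener.get();
        }
        return cached;
    }

    public static void main(String[] args) {
        final int[] indexVersion = {1}; // simulated index version
        final int[] opens = {0};        // counts how many readers were opened

        SearcherCache cache = new SearcherCache(() -> {
            opens[0]++;
            final int openedAt = indexVersion[0];
            return () -> openedAt == indexVersion[0];
        });

        cache.get();
        cache.get();        // index unchanged: reuses the cached reader
        indexVersion[0]++;  // simulate a committed index update
        cache.get();        // stale reader detected: a new one is opened

        System.out.println(opens[0]); // prints 2
    }
}
```

A real implementation would also have to close the stale reader once in-flight searches have finished with it, which this sketch leaves out.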

Cheers

Emmanuel


Chris Lu wrote:

> I personally like your effort, but technically I would  disagree.
>
> The SOLR project, and the project I am working on, DBSight, have an
> detached approach which is implementation agnostic, no matter if it's
> java, ruby, php, .net. The return results can be a rendered HTML,
> JSON, XML. I don't think you can be more flexible than that. If
> creating an new index takes 5 minutes without any coding, you can
> create something more creative.
>
>> From business side, you don't need to worry about indexing when
> designing a system. New requirement may come. It's very hard trying to
> anticipate all the needs.
>
> Technically, detached approach gives more flexible on resources like
> CPU, memory, hard drive. For example, if your index grows large, say
> 1G, indexing can take hours with merging, I am not sure how compass or
> hibernate/lucene handles it. Need to re-write code at that time? I
> actually feel it's a dangerous trap.
>
>> I've introduced a session.index() which forces the (re)indexing of the
>> document
> So does it mean you need to write some code to fix the index if it's
> crashed?
>
>> What do you mean by multithread safe? The indexing?
>> the indexing is multithread safe in the Hibernate Lucene integration
> The indexing can be thread-safe. But will it affect the searching? With
> many files changing and merging, if you cache the searcher, searches
> will throw "read past EOF" exceptions. If you don't cache the searcher,
> you will lose the built-in caching, FieldCacheImpl, in Lucene.
>
>>
>> The query process?
>> the query doesn't have to be, since you query on a given session (aka user
>> conversation), so no multithread threat here.
> So you are not caching searcher.
>


Re: Modelling relational data in Lucene Index?

KEGan
Hi,

I am actually doing something like what the original poster mentioned.
Previously, I used Hibernate and Lucene. But I found that for my
particular project my data is quite flat, so in the next version I have
taken out Hibernate entirely (and the complexity with it :)) and use
Lucene as the "main storage".

In this new version, my data is persisted to the filesystem simply using
XMLEncoder. Lucene is used both as a text search index and as a source of
references to the encoded data (.xml) in the filesystem. Every time data is
added, an entry is made in Lucene. I am using RAMDirectory (super
fast), so if the server ever shuts down or restarts, the Lucene index is
built again on startup.
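
A rough sketch of that layout, under some loud assumptions: each record is stored as a Map<String, String> (real code would more likely encode beans), and a plain in-memory HashMap from term to file name stands in for the Lucene index in a RAMDirectory. No Lucene calls are made, and XmlBackedStore and XmlStoreDemo are made-up names. Only java.beans.XMLEncoder/XMLDecoder and the JDK file APIs are real.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** XML files are the storage; the in-memory index is rebuilt from them on startup. */
class XmlBackedStore {
    private final Path dir;
    // term -> names of the .xml files containing it (stand-in for the Lucene index)
    private final Map<String, Set<String>> index = new HashMap<>();

    XmlBackedStore(Path dir) throws IOException {
        this.dir = dir;
        Files.createDirectories(dir);
        rebuildIndex(); // the "startup" cost grows with the number of files
    }

    /** Persists the record as id.xml and adds an index entry for it. */
    void save(String id, Map<String, String> record) throws IOException {
        Path file = dir.resolve(id + ".xml");
        try (XMLEncoder enc = new XMLEncoder(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            enc.writeObject(new HashMap<>(record));
        }
        // Simplification: re-saving an id does not remove its old terms.
        indexRecord(file.getFileName().toString(), record);
    }

    @SuppressWarnings("unchecked")
    Map<String, String> load(String fileName) throws IOException {
        try (XMLDecoder dec = new XMLDecoder(
                new BufferedInputStream(Files.newInputStream(dir.resolve(fileName))))) {
            return (Map<String, String>) dec.readObject();
        }
    }

    /** Returns the names of the .xml files whose values contain the term. */
    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    private void rebuildIndex() throws IOException {
        index.clear();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.xml")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                indexRecord(name, load(name));
            }
        }
    }

    private void indexRecord(String fileName, Map<String, String> record) {
        for (String value : record.values()) {
            for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(fileName);
                }
            }
        }
    }
}

class XmlStoreDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("xml-store");
        Map<String, String> record = new HashMap<>();
        record.put("title", "Holiday photos");
        new XmlBackedStore(dir).save("doc1", record);
        // A second instance simulates a server restart: the index is rebuilt from disk.
        System.out.println(new XmlBackedStore(dir).search("photos"));
    }
}
```

Because the index lives only in memory, the XML files remain the single source of truth, and a restart costs one pass over the directory.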

This works for my case because my data set is small enough (I have read that
1.1 million documents average only about 300 MB of Lucene index, and I have
plenty of RAM); my data probably won't reach anywhere near 0.5 million
documents. The downside is that startup will get slower as the data grows,
but the server shouldn't be down that often.

Is anyone using a similar model? Any pitfalls that I should be aware of?
Thanks.

~KEGan


On 11/6/06, Emmanuel Bernard <[hidden email]> wrote:

>
> I had a quick look at SOLR and DBSight. They seem to achieve a different
> goal than Hibernate Lucene.
> The former belong to the project box category: you set up a server that
> will handle the search for you. The application will then delegate the
> work to those servers.
> The latter belongs to the framework category: you use it inside your
> Hibernate/EJB 3.0 application to enable an index based search feature.
> To a certain extent, it is the same difference as between a Google box and
> Lucene.
>
> You can write some code based on the latter to cover the former's
> features, especially the platform abstraction (PHP, .net), but it is
> probably a lot of work and that is not really the point.
> You can write some code based on the former to enable indexing and
> search of your persistent domain model (persisted through Hibernate),
> but that is probably more work.