What *is* a lucene document?

What *is* a lucene document?

Phillip Rhodes-2
I understand that "Documents are the primary retrievable units from a
Lucene query", but I don't know whether I should have 12 documents in the
Lucene index that all represent the same business object, or whether I
should place 12 different business documents within the Lucene index.

Here is the background:
I want to index a product catalog (some data in database and some data
on the filesystem, I have cross-reference between the two).
Each product is associated with attributes, categories and one or more
PDF/MS Word documents, HTML descriptions, images, etc...
A product could have 12 different files associated to it.

Is it okay if I create as many documents as assets that I want to return
from a search and add information to each document tying it back to the
product that it is associated with?  Is that the right approach?

Thanks, it's keeping me up at night.


BTW, I am working on a release of a professional-grade ecommerce suite
that is open-source (apache license), I wouldn't mind help on the
lucene/search stuff.   There's plenty more for me to do.  120+ tables,
going to prod for a client this weekend (without search;)  Contact me!





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: What *is* a lucene document?

Erik Hatcher

On Jun 5, 2005, at 1:11 AM, Phillip Rhodes wrote:
> I understand that  "Documents are the primary retrievable units  
> from a Lucene query"  But I don't know if I want to have 12  
> documents in the lucene index that represent the same business  
> object, or if I should place 12 different business documents within  
> the lucene index.

Deciding how to slice a domain into Documents is one of the most  
important decisions to make with Lucene usage, and not one that  
Lucene itself gives an answer to.  There are precedents that have  
been set and advice that users here can give, but ultimately how to  
represent your domain in Lucene is up to you.

> Here is the background:
> I want to index a product catalog (some data in database and some  
> data on the filesystem, I have cross-reference between the two).
> Each product is associated to attributes, categories and one or  
> more PDF/MS Word documents, HTML descriptions, images, etc...
> A product could have 12 different files associated to it.
>
> Is it okay if I create as many documents as assets that I want to  
> return from a search and add information to each document tying it  
> back to the product that it is associated with?  Is that the right  
> approach?

Do users of your search system need to know about the PDF/Word/HTML  
documents?  Or should they simply know about "products"?  If all you  
need back is the product, then the simplest approach would be to  
create one Lucene Document per product, parse all the files and data  
associated with it and add it as text to fields.  If the search  
system is simple in that fielded search is not needed, simply create  
two fields per Document: id and text.  Field "id" is the product id,  
and "text" is an aggregation of all the text associated with the  
product regardless of where it came from (careful if you're doing  
string concatenation to put whitespace between so you don't blur  
words together).
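
As a rough illustration of the two-field layout described above, here is a
minimal Python sketch. It does not use the Lucene API; the function name
build_product_document, the product id "P-100", and the sample strings are
all hypothetical. It only shows the aggregation step and why joining with
whitespace matters.

```python
def build_product_document(product_id, text_parts):
    """Return a two-field 'document': the product id plus all text, space-joined."""
    # Drop empty pieces and trim each one before joining.
    cleaned = [part.strip() for part in text_parts if part and part.strip()]
    # Join with a space so the last word of one source and the first word
    # of the next are not fused into a single token.
    return {"id": product_id, "text": " ".join(cleaned)}


# Hypothetical text extracted from a product's database row and its files.
parts = ["Inspiron laptop", "15 inch display\n", "", "Warranty: 1 year"]
doc = build_product_document("P-100", parts)

# Naive concatenation without a separator blurs "laptop" and "15" together.
blurred = "".join(parts)
```

In a real index the "id" value would typically be stored but not tokenized,
while "text" would be tokenized for full-text search; that mapping is left
out here because it depends on the Lucene version in use.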

There are many other ways to approach this and my recommendation is  
just the simplest one based on the description of your needs.

     Erik



Re: What *is* a lucene document?

Chris Hostetter-3

: There are many other ways to approach this and my recommendation is
: just the simplest one based on the description of your needs.

I'd like to add one thing to Erik's excellent advice, something many
people (especially people used to dealing with rigorously structured data)
tend to overlook:

   Documents in a Lucene index can be heterogeneous.

You can have some documents in your index with fields A, B, and C; and you
can have other documents in your index with fields X, Y, and Z.  And in some
cases you can have a search result of docs from that first set of
documents, in other cases you can have a search result with docs from the
second set, or you can return a mixed bag of both -- it all depends on how
you structure your query, and which fields you search on.
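
To make the point about field sets concrete, here is a small Python toy
model. It is not Lucene code; the docs list, the field values, and the
helpers search and search_any are invented for illustration. Each
"document" is a dict, and a document simply never matches a query on a
field it does not have.

```python
# Two documents with completely different field sets in one "index".
docs = [
    {"A": "red", "B": "small", "C": "metal"},
    {"X": "red", "Y": "large", "Z": "wood"},
]


def search(docs, field, term):
    """Return documents that have the field and contain the term in it."""
    return [d for d in docs if term in d.get(field, "").split()]


def search_any(docs, fields, term):
    """Return documents matching the term in at least one of the fields."""
    return [d for d in docs if any(term in d.get(f, "").split() for f in fields)]


first_set_only = search(docs, "A", "red")         # only the A/B/C document
second_set_only = search(docs, "X", "red")        # only the X/Y/Z document
mixed = search_any(docs, ["A", "X"], "red")       # both kinds of document
```

Which of the three result shapes you get depends only on which fields the
query names, which is the behavior described above.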

Consider a simplified example of your original question:  What if you had
products with specs, and reviews of products.  You could have one
document per review, indexing all the text of the review, and the product
Ids of the products mentioned in it.  You can also have one document per
product, indexing all the spec data.  If a user does a search for "dell"
you can return results containing a mix of products and reviews that
contain the word dell.  By making the product Id field stored and indexed
in all documents, you can even provide links next to reviews to see all
the products mentioned in the review, or next to a product to see all
reviews that mention that product.
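
The product/review example can be sketched as a toy model in Python. This
is not Lucene code, and several things are assumptions made for the
sketch: the "type" field used to tell the two kinds of document apart, the
field name "product_ids", the ids "p1" and "p2", and all sample text.

```python
# One document per product and one per review, all sharing a product id field.
index = [
    {"type": "product", "product_ids": ["p1"], "text": "Dell laptop 15 inch display"},
    {"type": "product", "product_ids": ["p2"], "text": "Acme monitor 24 inch"},
    {"type": "review", "product_ids": ["p1", "p2"],
     "text": "The dell laptop pairs well with this monitor"},
    {"type": "review", "product_ids": ["p2"], "text": "Sharp picture for the price"},
]


def search_text(term):
    """Case-insensitive whole-word match on the text field of any document."""
    term = term.lower()
    return [d for d in index if term in d["text"].lower().split()]


def reviews_mentioning(product_id):
    """Follow the shared id field from a product to its reviews."""
    return [d for d in index
            if d["type"] == "review" and product_id in d["product_ids"]]


def products_in_review(review):
    """Follow the shared id field from a review to the products it mentions."""
    return [d for d in index
            if d["type"] == "product" and d["product_ids"][0] in review["product_ids"]]


hits = search_text("dell")             # a mix: one product and one review
p2_reviews = reviews_mentioning("p2")  # both reviews mention p2
```

The two lookup helpers correspond to the "links next to reviews" and
"links next to a product" ideas; in an actual index each would be a second
query against the product id field.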

There's no need to limit yourself and say "based on my data, I will make
one document per X in my data model."  You can make one doc per X, and one
doc per Y, and one doc per Z -- all depending on the desired behavior at
search time.


-Hoss

