Multiple doc types in schema

Multiple doc types in schema

Jack L
As far as I understand, I can put multiple doc types in the same
index, for example, web pages, images, products, etc. In order
to do so, I think I need to do the following:

- have a doctype field (not necessary but nice to have)
- add all possible fields of all doc types in schema
- when querying for a particular doc type, make sure to either
  specify the doctype field, or use fields that are only
  available in that doc type.
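
To make the last point concrete, here is a minimal sketch of a query restricted to one doc type. It assumes a field literally named `doctype` and a made-up local Solr URL; only the `q` and `fq` request parameters are real Solr parameters, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to wherever Solr is actually deployed.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_doctype_query(user_query, doctype, field="doctype"):
    """Return a select URL that restricts results to a single doc type.

    The field name "doctype" is an assumption; use whatever field the
    schema actually defines for this purpose.
    """
    params = [
        ("q", user_query),               # the main query
        ("fq", f'{field}:"{doctype}"'),  # filter query limiting the doc type
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_doctype_query("title:camera", "product"))
```

Putting the restriction in `fq` rather than in `q` keeps it out of relevance scoring, which is probably what one wants for a pure type filter.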

This is based on my understanding that solr/lucene does not
have the concept of document type. It only sees fields.

Is my understanding correct?

It seems a bit unclean to mix fields of all document types
in the same schema though. Or, is there a way to allow multiple
document types in the schema, and specify what type to use
when indexing and searching?

--
Best regards,
Jack

Re: Multiple doc types in schema

Chris Hostetter-3

: This is based on my understanding that solr/lucene does not
: have the concept of document type. It only sees fields.
:
: Is my understanding correct?

it is.

: It seems a bit unclean to mix fields of all document types
: in the same schema though. Or, is there a way to allow multiple
: document types in the schema, and specify what type to use
: when indexing and searching?

it's really just an issue of semantics ... the schema.xml is where you
list all of the fields you need in your index, any notion of doctype is
entirely artificial ... you could group all of the
fields relating to doctypeA in one section of the schema.xml, then have a
big <!-- ##...## --> line and then list the fields in doctypeB, etc... but
what if there are fields you use in both "doctypes" ? .. how much you "mix"
them is entirely up to you.
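
A rough sketch of the shared-field question, under stated assumptions: the doctypes, field names and type names below are invented for illustration, and `merge_fields` is a hypothetical helper, not anything Solr provides. It flattens per-doctype field lists into the single field list a schema.xml needs, and reports which names are shared and which are declared with conflicting types.

```python
def merge_fields(doctype_fields):
    """Merge per-doctype field maps into one flat field map.

    doctype_fields: {doctype: {field_name: field_type}}
    Returns (merged, shared, conflicts):
      merged    - one field_name -> field_type map for the whole index
      shared    - field names used by more than one doctype
      conflicts - field names declared with different types
    """
    merged = {}
    owners = {}
    conflicts = {}
    for doctype, fields in doctype_fields.items():
        for name, ftype in fields.items():
            owners.setdefault(name, []).append(doctype)
            if name in merged and merged[name] != ftype:
                # Same name, different type: first declaration stays in merged.
                conflicts.setdefault(name, {merged[name]}).add(ftype)
            else:
                merged.setdefault(name, ftype)
    shared = {n: o for n, o in owners.items() if len(o) > 1}
    return merged, shared, conflicts


# Hypothetical doctypes and field types, purely for illustration.
example = {
    "book": {"title": "text", "isbn": "string", "length": "integer"},
    "movie": {"title": "text", "length": "string"},
}
merged, shared, conflicts = merge_fields(example)
print(sorted(merged))     # every field the single schema must declare
print(sorted(shared))     # fields that belong to more than one "doctype"
print(sorted(conflicts))  # names that would need renaming or a common type
```

Here `title` is shared harmlessly, while `length` would have to be renamed or given one type before both doctypes could live in the same index.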



-Hoss

Re[2]: Multiple doc types in schema

Jack L
Hello Chris,

Thanks for the reply. I understand that a mixed-type index will work
just fine. Just to bring up a topic for discussion/new features though:
there seem to be downsides of not having a doctype:

- name space conflict when two doctypes are not related. In
  this case the developer will have to be careful with names

- more difficult to maintain the index. If I want to delete
  all docs of a doc type, I can use delete by query, but it's
  always easier to wipe out the whole index directory if doctypes
  are kept separate but maintained by the same solr instance.
  I can run two separate solr instances to achieve this, but then
  this takes more memory/CPU/maintenance effort.
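
For reference, the delete-by-query route would look roughly like the sketch below. It assumes the same hypothetical `doctype` field and only builds the XML update message; it does not send anything. The `<delete><query>` message format is Solr's, the function name is made up, and the doctype value is assumed not to contain double quotes.

```python
from xml.sax.saxutils import escape


def delete_by_doctype_xml(doctype, field="doctype"):
    """Build a Solr XML update message deleting every doc of one type.

    Assumes a field named "doctype" (hypothetical) and a value that
    contains no double quotes.
    """
    query = f'{field}:"{doctype}"'
    # escape() handles &, < and > so the query is safe inside XML text.
    return f"<delete><query>{escape(query)}</query></delete>"


print(delete_by_doctype_xml("product"))
```

The message would be posted to the update handler and followed by a `<commit/>` before the deletion becomes visible to searches.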

One schema file with doctypes defined, and separate index directories
would be perfect, in my opinion :) or even separate schema files :)

--
Best regards,
Jack

Tuesday, June 5, 2007, 9:58:10 PM, you wrote:


> : This is based on my understanding that solr/lucene does not
> : have the concept of document type. It only sees fields.
> :
> : Is my understanding correct?

> it is.

> : It seems a bit unclean to mix fields of all document types
> : in the same schema though. Or, is there a way to allow multiple
> : document types in the schema, and specify what type to use
> : when indexing and searching?

> it's really just an issue of semantics ... the schema.xml is where you
> list all of the fields you need in your index, any notion of doctype is
> entirely artificial ... you could group all of the
> fields relating to doctypeA in one section of the schema.xml, then have a
> big <!-- ##...## --> line and then list the fields in doctypeB, etc... but
> what if there are fields you use in both "doctypes" ? .. how much you "mix"
> them is entirely up to you.



> -Hoss

Re[2]: Multiple doc types in schema

Chris Hostetter-3

Ah .... i was misunderstanding your goal of "doctypes" ... the use case i
was thinking of is that you have "book" documents and "movie" documents
and you frequently only query on one type or the other but sometimes you do
a generic query on all of them using the fields they have in common.

this is clearly not the situation you are describing, since you suggest
storing them in completely separate indexes that can be blown away
independently.

there is a patch in Jira to support multiple SolrCores in a single JVM
"context" ... as i understand it this would achieve your goal (but i
haven't really had a chance to look at it so i can't really speak to it).

in general, running multiple Solr instances is actually quite easy and
not as bad as you make it out to be ... the overhead of running multiple
Solr webapp instances in a single JVM doesn't really take up that much
more memory or CPU ... yes the classes are all loaded twice, but that
typically pales in comparison to the amount of data involved in your index
(unless you've got hundreds of tiny indexes or something like that)

: - more difficult to maintain the index. If I want to delete
:   all docs of a doc type, I can use delete by query, but it's
:   always easier to wipe out the whole index directory if doctypes
:   are kept separate but maintained by the same solr instance.
:   I can run two separate solr instances to achieve this, but then
:   this takes more memory/CPU/maintenance effort.
:
: One schema file with doctypes defined, and separate index directories
: would be perfect, in my opinion :) or even separate schema files :)

-Hoss