Re: Multiple doc types in schema

Otis Gospodnetic-2
This sounds like a potentially good use-case for SOLR-215!
See https://issues.apache.org/jira/browse/SOLR-215

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Chris Hostetter <[hidden email]>
To: [hidden email]; Jack L <[hidden email]>
Sent: Wednesday, June 6, 2007 6:58:10 AM
Subject: Re: Multiple doc types in schema


: This is based on my understanding that solr/lucene does not
: have the concept of document type. It only sees fields.
:
: Is my understanding correct?

it is.

: It seems a bit unclean to mix fields of all document types
: in the same schema though. Or, is there a way to allow multiple
: document types in the schema, and specify what type to use
: when indexing and searching?

it's really just an issue of semantics ... the schema.xml is where you
list all of the fields you need in your index; any notion of doctype is
entirely artificial ... you could group all of the
fields relating to doctypeA in one section of the schema.xml, then have a
big <!-- ##...## --> line and then list the fields in doctypeB, etc... but
what if there are fields you use in both "doctypes"? ... how much you "mix"
them is entirely up to you.
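
To make the "doctype is artificial" point concrete, here is a minimal sketch in plain Java (no Solr or Lucene calls). It assumes a convention the thread does not prescribe: every document carries an ordinary field named "doctype", and the application filters on it. The field names and values are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: a "document type" is nothing more than another field value. */
final class DocTypeFieldSketch {

    /** Builds a flat document from key/value pairs (hypothetical field names). */
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            d.put(keyValues[i], keyValues[i + 1]);
        }
        return d;
    }

    /** Keeps only the documents whose "doctype" field equals the wanted type. */
    static List<Map<String, String>> ofType(List<Map<String, String>> docs, String type) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (type.equals(d.get("doctype"))) {
                out.add(d);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "title" is shared by both types; "pages" and "email" are type-specific.
        List<Map<String, String>> index = Arrays.asList(
            doc("id", "1", "doctype", "book", "title", "A", "pages", "100"),
            doc("id", "2", "doctype", "person", "title", "B", "email", "b@example.org"));
        System.out.println(ofType(index, "book"));
    }
}
```

In a real index the same effect would presumably come from a query clause on that field rather than from filtering in application code.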



-Hoss





Re: Multiple doc types in schema

James liu-2
I found SOLR-215 through this mail.

Does it now really support multiple indexes, and will a search return
merged data?

For example:

I want to search for "aaa", and I have index1, index2, index3, index4, ...
It should return the results from index1, index2, index3 and index4, and
merge them by score, datetime, or some other criterion.

Does it support NFS, and how is its performance?



2007/6/21, Otis Gospodnetic <[hidden email]>:

>
> This sounds like a potentially good use-case for SOLR-215!
> See https://issues.apache.org/jira/browse/SOLR-215
>
> Otis


--
regards
jl

Re: Multiple doc types in schema

Jim Dow-2
In reply to this post by Otis Gospodnetic-2
Ignore the poor segmentation scheme (document types combined with
categorization), but this is working quite well as we get close to going
live with a product.

This static IndexDocKey class contains an enumeration that supplies a Catalog
key for each type of document (POJO / model object) that gets indexed. The
indexing process assigns a Catalog key to each document type, then
extracts the catid (doc-type) and other information and puts them into
the index doc.

Just an idea for you:


import java.io.Serializable;

/**
 * This drives much of the categorization of indexes and the subsequent query
 * filters. Lots of logic is built into these enumerations; they are really
 * rules that may be better injected or looked up in a true rules engine. This
 * is the start of system-generated markup and metadata.
 *
 * @author jdow
 * @version %I%, %G%
 * @since 0.90
 *        <p>
 *
 * <pre>
 * TODO: Review whether to keep this in the index doc, but make the enumeration
 * of Categories, SubCat, etc. more meaningful and order them in the right way
 * to facilitate getting back filtered docs in a controlled sort order.
 * </pre>
 */
public class IndexDocKey implements Serializable
{

    // STATICS
    public static final long serialVersionUID = 1L;

    public static long getSerialVersionUID()
    {
        return serialVersionUID;
    }

    @SuppressWarnings("unused")
    protected Category category;


    public IndexDocKey()
    {
    }

    public IndexDocKey(Category category)
    {
        this.category = category;

    }

    public void setCategory(Category category)
    {
        this.category = category;

    }

    /*
     * public void setCatDoc(CatDoc catdoc) { this.catdoc = catdoc; }
     */

    public enum Category implements Serializable
    {
        SYSTEM("S0000", "System", null, null),
        SYSPING("S0010", "System", null, null),
        APPCNTEXAMPLES("AC01L", "Example", ZExample.class, Person.class),
        APPCNTPEOPLE("AC02P", "Example", ZExample.class, Person.class),
        APPCNTDISCUSS("AC03D", "Example", ZExample.class, Person.class),
        APPCNTIMAGE("AC04I", "Example", ZExample.class, Person.class),
        APPCNTFILES("AC05F", "Example", ZExample.class, Person.class),
        APPCNTEVENT("AC02E", "Example", ZExample.class, Person.class),
        EXAMPLE("L0000", "Example", ZExample.class, Person.class),
        EXAMPLECNTPEOPLE("L00CP", "Example", ZExample.class, Person.class),
        EXAMPLECNTDISCUSS("L00CD", "Example", ZExample.class, Person.class),
        EXAMPLECNTIMAGE("L00CI", "Example", ZExample.class, Person.class),
        EXAMPLECNTFILES("L00CF", "Example", ZExample.class, Person.class),
        EXAMPLECNTEVENT("L00CE", "Example", ZExample.class, Person.class),
        EXAMPLEIDENTITY("L00LI", "Identity", Identity.class, ZExample.class),

        EVENT("LCE10", "Content", Content.class, Content.class),
        EVENTLABEL("LCE11", "ContentLabel", ContentLabel.class, Content.class),
        EVENTCOMM("LCE12", "ContentComment", ContentComment.class, Content.class),
        EVENTPROP("LCE13", "ContentProperty", ContentProperty.class, Content.class),

        DISCUSS("LCD10", "Content", Content.class, Content.class),
        DISCUSSLABEL("LCD11", "ContentLabel", ContentLabel.class, Content.class),
        DISCUSSCOMM("LCD12", "ContentComment", ContentComment.class, Content.class),
        DISCUSSPROP("LCD13", "ContentProperty", ContentProperty.class, Content.class),

        IMAGE("LCI10", "Content", Content.class, Content.class),
        IMAGELABEL("LCI11", "ContentLabel", ContentLabel.class, Content.class),
        IMAGECOMM("LCI12", "ContentComment", ContentComment.class, Content.class),
        IMAGEPROP("LCI13", "ContentProperty", ContentProperty.class, Content.class),

        FILE("LCF10", "Content", Content.class, Content.class),
        FILELABEL("LCF11", "ContentLabel", ContentLabel.class, Content.class),
        FILECOMM("LCF12", "ContentComment", ContentComment.class, Content.class),
        FILEPROP("LCF13", "ContentProperty", ContentProperty.class, Content.class);

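        private final String catid;
        private final String catname;
        private final String catdoc;
        private final Class<?> catclass;
        private final Class<?> catparentclass;

        private Category(String catid, String catdoc, Class<?> catclass,
                Class<?> catparentclass)
        {
            this.catid = catid;
            this.catdoc = catdoc;
            // Store the class metadata so that getCatClass() and
            // getCatParentClass() return the values declared above.
            this.catclass = catclass;
            this.catparentclass = catparentclass;
            this.catname = this.name();
        }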

        public String getCatId()
        {
            return this.catid;
        }

        public String getCatName()
        {
            return this.catname;
        }

        public String getCatDoc()
        {
            return this.catdoc;
        }

        public Class<?> getCatClass()
        {
            return this.catclass;
        }

        public Class<?> getCatParentClass()
        {
            return this.catparentclass;
        }
    }

}
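
A short sketch of how such category keys might be used at query time. This is not part of the class above: the trimmed-down enum `Cat`, the reverse lookup `fromCatId`, and the index field name "catid" are assumptions made for illustration. Only the two id strings are copied from the enumeration. The filter string uses ordinary Lucene query syntax.

```java
import java.util.HashMap;
import java.util.Map;

final class CategoryLookupSketch {

    /** Hypothetical stand-in for the Category enum; ids copied from EVENT and DISCUSS. */
    enum Cat {
        EVENT("LCE10"),
        DISCUSS("LCD10");

        // Reverse index from catid to constant, filled once when the enum loads.
        private static final Map<String, Cat> BY_ID = new HashMap<>();
        static {
            for (Cat c : values()) {
                BY_ID.put(c.catId, c);
            }
        }

        private final String catId;

        Cat(String catId) {
            this.catId = catId;
        }

        String catId() {
            return catId;
        }

        /** Returns the category registered under the given id, or null if unknown. */
        static Cat fromCatId(String id) {
            return BY_ID.get(id);
        }
    }

    /** Builds a clause such as catid:(LCE10 OR LCD10); "catid" is an assumed field name. */
    static String filterFor(Cat... cats) {
        StringBuilder sb = new StringBuilder("catid:(");
        for (int i = 0; i < cats.length; i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append(cats[i].catId());
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(filterFor(Cat.EVENT, Cat.DISCUSS));
        System.out.println(Cat.fromCatId("LCD10"));
    }
}
```

Restricting a search to certain document types then amounts to adding that clause to the query.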


On 6/20/07, Otis Gospodnetic <[hidden email]> wrote:

>
> This sounds like a potentially good use-case for SOLR-215!
> See https://issues.apache.org/jira/browse/SOLR-215
>
> Otis

--
Jim Dow

http://www.incontextnw.com

https://www.linkedin.com/in/jimmydow

mobile: 503-502-0113
[hidden email]
IM: Jim Dow (MSN)

Re: Multiple doc types in schema

Otis Gospodnetic-2
In reply to this post by Otis Gospodnetic-2
SOLR-215 supports multiple indices on a single Solr instance.  It does *not* support searching multiple indices at once (e.g. parallel search) and merging the results.

This has nothing to do with NFS, though.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: James liu <[hidden email]>
To: [hidden email]
Sent: Thursday, June 21, 2007 3:45:06 AM
Subject: Re: Multiple doc types in schema

I found SOLR-215 through this mail.

Does it now really support multiple indexes, and will a search return
merged data?

For example:

I want to search for "aaa", and I have index1, index2, index3, index4, ...
It should return the results from index1, index2, index3 and index4, and
merge them by score, datetime, or some other criterion.

Does it support NFS, and how is its performance?







Re: Multiple doc types in schema

Walter Underwood, Netflix
I used Solr with indexes on NFS and I do not recommend it.
It was either 100 or 1000 times slower than local disc
for indexing, I forget which. Unusable.

This is not a problem with Solr/Lucene; I have seen the
same NFS performance cost with other search engines.

wunder

On 6/21/07 3:22 AM, "Otis Gospodnetic" <[hidden email]> wrote:

> SOLR-215 supports multiple indices on a single Solr instance.  It does *not*
> support searching multiple indices at once (e.g. parallel search) and
> merging the results.
>
> This has nothing to do with NFS, though.


Re: Multiple doc types in schema

Frédéric Glorieux
In reply to this post by Otis Gospodnetic-2

Otis,

Thanks for the link and the work!
I may need this patch around September, if it has not already been
committed to the Solr sources.

I will also need searches across multiple indexes, but I understand that
there is no simple, fast and generic solution in the Solr context. I might
lose Solr caching, but it does not seem impossible to write a custom
request handler that queries different indexes, as Lucene allows.

> SOLR-215 supports multiple indices on a single Solr instance.  It does *not* support searching multiple indices at once (e.g. parallel search) and merging the results.




--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique

Re: Multiple doc types in schema

Yonik Seeley-2
On 6/21/07, Frédéric Glorieux <[hidden email]> wrote:
> I will also need multiple indexes searches,

Do you mean:

1) Multiple unrelated indexes with different schemas, that you will
search separately... but you just want them in the same JVM for some
reason.

2) Multiple indexes with different schemas, search will search across
all or some subset and combine the results (federated search)

3) Multiple indexes with the same schema, each index is a "shard" that
contains part of the total collection.  Search will merge results
across all shards to give appearance of a single large collection
(distributed search).
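
For cases 2 and 3, the merging step can be sketched in plain Java. This is only an illustration under assumptions, not how Solr or SOLR-215 works: `Hit` and `mergeTopK` are hypothetical names, and each index is assumed to have already returned its own scored hits. Note that raw scores from separate indexes are computed from per-index term statistics, so they are not strictly comparable without some normalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ShardMergeSketch {

    /** One search hit as returned by a single index (hypothetical type). */
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }

        @Override
        public String toString() {
            return id + ":" + score;
        }
    }

    /** Concatenates the per-index hit lists and returns the k highest-scoring hits. */
    static List<Hit> mergeTopK(List<List<Hit>> perIndex, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndex) {
            all.addAll(hits);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> index1 = Arrays.asList(new Hit("a", 0.9), new Hit("c", 0.2));
        List<Hit> index2 = Arrays.asList(new Hit("b", 0.5));
        System.out.println(mergeTopK(Arrays.asList(index1, index2), 2));
    }
}
```

Sorting by a date or another field instead of score would only change the comparator.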

-Yonik

Re: Multiple doc types in schema

Frédéric Glorieux

Hi Yonik,

>> I will also need multiple indexes searches,
>
> Do you mean:

> 2) Multiple indexes with different schemas, search will search across
> all or some subset and combine the results (federated search)

Exactly that. I'm coming from a fairly old Lucene-based project called SDX
<http://www.nongnu.org/sdx/docs/html/doc-sdx2/en/presentation/bases.html>.
Sorry for the link; the project is mainly documented in French. The
framework is Cocoon-based, which may feel heavy now. It can host multiple
"applications", each with multiple "bases"; a base was a kind of Solr schema,
back in 2000.

From this experience, I can say that cross-search between different schemas
is possible, and users may find it important. Take a library, for example.
It has different collections, let's say: CSV records obtained from
digitized photos, a light model with no further writes expected; and a
complex librarian model that is updated every day. These collections share
at least a title and an author field, and should be searchable by the
public from the same form; but each one should also have its own
application, according to its information model.

With the SDX framework mentioned above, I know real-life applications with
30 Lucene indexes. It's possible because Lucene allows it (MultiReader)
<http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/MultiReader.html>.


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique





Re: Multiple doc types in schema

Yonik Seeley-2
On 6/21/07, Frédéric Glorieux <[hidden email]> wrote:

> >> I will also need multiple indexes searches,
> >
> > Do you mean:
>
> > 2) Multiple indexes with different schemas, search will search across
> > all or some subset and combine the results (federated search)
>
> Exactly that. I'm coming from a fairly old Lucene-based project called SDX
> <http://www.nongnu.org/sdx/docs/html/doc-sdx2/en/presentation/bases.html>.
> Sorry for the link; the project is mainly documented in French. The
> framework is Cocoon-based, which may feel heavy now. It can host multiple
> "applications", each with multiple "bases"; a base was a kind of Solr schema,
> back in 2000.
>
> From this experience, I can say that cross-search between different schemas
> is possible, and users may find it important. Take a library, for example.
> It has different collections, let's say: CSV records obtained from
> digitized photos, a light model with no further writes expected; and a
> complex librarian model that is updated every day. These collections share
> at least a title and an author field, and should be searchable by the
> public from the same form; but each one should also have its own
> application, according to its information model.

This doesn't sound like true federated search, since you have a number
of fields that are the same in each index that you search across, and
you treat them all the same.  This is functionally equivalent to
having a single schema and a single index.  You can still have
multiple applications that query the single collection differently.

Depending on update patterns and index sizes, you can probably get
better efficiency with multiple indexes, but not really more
functionality (in your case), right?

-Yonik

Re: Multiple doc types in schema

Frédéric Glorieux
Thanks, Yonik, for sharing your thoughts.

> This doesn't sound like true federated search,

I'm afraid I don't understand "federated search"; you seem to have a
precise idea in mind.

> since you have a number
> of fields that are the same in each index that you search across, and
> you treat them all the same.  This is functionally equivalent to
> having a single schema and a single index.  You can still have
> multiple applications that query the single collection differently.

Unless you can give me a pointer or a web example, what you describe
sounds to me like implementing a complete database with a single table
(possible, but not easy to understand or maintain). In my experience, a
collection is a schema with thousands or millions of XML documents and
10, 20 or more fields, and the search configuration is generated from a
kind of data schema (there is no real standard for stating, for example,
that a title or a subject needs one field for exact match and another for
word search). If an index became too big (fortunately I have never hit
that limit with Lucene), I guess there are solutions. My problem is to
maintain different collections, each with its own intellectual logic,
with some shared field names, like Dublin Core or at least "fulltext", but
also fields specific to each one.

> Depending on update patterns and index sizes, you can probably get
> better efficiency with multiple indexes, but not really more
> functionality (in your case), right?

Maybe "keeping it understandable" could count as functionality?
Perhaps less so now, but there was a time when a Lucene index could become
corrupted, so keeping them separate was important.

I guess these specific problems will not be Solr priorities, but
until I am corrected, I still feel that multiple indexes are
useful.


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique

Re: Multiple doc types in schema

Frédéric Glorieux
In reply to this post by Yonik Seeley-2

After further reading, especially
<http://people.apache.org/~hossman/apachecon2006us/faceted-searching-with-solr.pdf>
(Thanks Hoss)

> Depending on update patterns and index sizes, you can probably get
> better efficiency with multiple indexes, but not really more
> functionality (in your case), right?

Maybe I'm coming around to your point of view: "Loose Schema with Dynamic
Fields" is probably my solution. It feels strange to me
to treat a Lucene index as a blob, but if it works for bigger projects than
mine, I should follow. So it means one fieldtype per analyzer, and the
data-model logic lives only on the collection side. I think I have my plan
for September, but I would be very glad if you have something to add.
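
A rough sketch of the "one fieldtype per analyzer" idea in plain Java. It only imitates suffix-style dynamic field patterns and makes no Solr calls. The suffixes ("*_s", "*_t"), the type names and the field names are assumptions chosen for illustration. Each collection keeps its own field names and only agrees on the suffix convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DynamicFieldSketch {

    // Suffix to field type, checked in insertion order.
    private final Map<String, String> suffixToType = new LinkedHashMap<>();

    /** Registers a pattern of the form "*suffix", for example "*_t". */
    DynamicFieldSketch rule(String pattern, String type) {
        if (!pattern.startsWith("*")) {
            throw new IllegalArgumentException("pattern must start with '*': " + pattern);
        }
        suffixToType.put(pattern.substring(1), type);
        return this;
    }

    /** Returns the type of the first rule whose suffix matches, or null if none does. */
    String typeOf(String fieldName) {
        for (Map.Entry<String, String> e : suffixToType.entrySet()) {
            if (fieldName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DynamicFieldSketch schema = new DynamicFieldSketch()
            .rule("*_s", "string")  // exact match
            .rule("*_t", "text");   // word search
        // Two collections with their own field names, sharing only the suffixes.
        System.out.println(schema.typeOf("dc_title_t"));
        System.out.println(schema.typeOf("shelfmark_s"));
    }
}
```

With such a convention, a title that needs both exact match and word search would simply be indexed twice, once under each suffix.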

--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique