Interest in Extending SOLR


Interest in Extending SOLR

Bryzek.Michael
All -

My apologies in advance for a rather long email, especially from a
first-time poster to this list. I'm looking at using SOLR to replace
the custom HTTP/XML infrastructure for Lucene that we built to
integrate tightly with our web apps, which run in an Oracle, non-Java
environment.

Evaluating our migration led me to a few considerations that I would
like to propose to this group, both for feedback and to gauge their
feasibility within SOLR. We would be very happy to contribute effort to
build on top of SOLR to the extent the community finds value in the
work.

BACKGROUND:

United eWay is an application service provider offering highly
customizable web applications for philanthropic purposes. We've
created many database-backed web applications, offering search until
recently via Oracle's interMedia product.

We decided to move all search out of Oracle and into Lucene in late
2005. Our infrastructure is based on AOLserver and a variant of
OpenACS, neither of which offered good integration with Java or one of
the ports of Lucene.

Prior to learning about SOLR, we deployed our own HTTP/XML-based
services to meet the needs we had:

  * Tight database integration - indexing a table in Oracle requires
    the execution of several stored procedures. That is, we provided
    an API in the database to synchronize the database table schema
    with the schema that we used for indexing.

  * Integrated support for partitioning - database tables can be
    partitioned for scalability reasons. The most common scenario for
    us is to partition off data for our largest customers. For
    example, imagine a users table:

     * user_id
     * email_address
     * site_id

    where site_id refers to the customer to whom the user
    belongs. Some sites aggregate data... i.e. one of our customers
    may have 100 sites. When indexing, we create a separate index to
    store only data for a given site. This precomputes one of our more
    expensive computations for search - a filter for all users that
    belong to a given site.

  * Decoupled infrastructure - we wanted the ability to fully scale
    our search application independent of our database application

  * High-speed indexing - we initially moved data from the database to
    Lucene via XML documents. We found that to index even 100k
    documents, it was much faster to move the data in CSV files
    (smaller files, less intensive processing). A rough sketch of
    this kind of loader follows below.

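For concreteness, here is a minimal sketch of that kind of CSV loader
(illustrative only, not our production code; the index path, field
names, and naive comma splitting are assumptions):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // Bulk-load rows of "id,email_address" into a Lucene index.
  public class CsvLoader {
      public static void main(String[] args) throws Exception {
          IndexWriter writer =
              new IndexWriter("/indexes/users", new StandardAnalyzer(), true);
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          String line;
          while ((line = in.readLine()) != null) {
              String[] cols = line.split(",");  // naive: no quote handling
              Document doc = new Document();
              doc.add(new Field("id", cols[0],
                                Field.Store.YES, Field.Index.UN_TOKENIZED));
              doc.add(new Field("email_address", cols[1],
                                Field.Store.YES, Field.Index.TOKENIZED));
              writer.addDocument(doc);
          }
          in.close();
          writer.optimize();
          writer.close();
      }
  }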

IDEAS:

Looking through SOLR, I've identified the following main categories of
change. I would love to hear comments and feedback from this group. My
preference would be to build these changes directly into SOLR rather
than maintain our own application, but that presupposes interest from
the community. The general thought is to introduce the concept of an
objectType into the schema. For example:

 <objectType name="users">
   <fields>
     <field name="id" type="string" indexed="false" stored="true"/>
     <field name="email_address" type="text" indexed="false" stored="true"/>

     <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
     <copyField source="id" dest="text"/>
     <copyField source="email_address" dest="text"/>

     <uniqueKey>id</uniqueKey>
   </fields>        
 </objectType>

 Within one global schema for SOLR, we would provide the ability
 to define which fields are available for which types of objects,
 and how they are analyzed. Each object type would then be stored
 in an independent Lucene index.

 I've dug a bit into the codebase to see what impact this would
 have. The change is a large conceptual one, but I believe it is
 doable given the nicely separated core package:

   * Provide a factory to get a SolrCore instance (i.e. replace
     SolrCore.getSolrCore with SolrCore.getInstance(String objectType));
     a minimal sketch follows this list

   * Modify getInstanceDir, newSearcher, initIndex to accept an objectType

   * Preserve backwards compatibility via a new schema file
     (e.g. schema-typed.xml). Include a 'default' object type for
     folks that would like to preserve the existing treatment of
     schemas in SOLR. Users would provide either the existing
     schema.xml file (resulting in one default object type) or the
     schema-typed.xml file.
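
To make the factory idea concrete, here is a minimal sketch (the
registry class and the SolrCore constructor it calls are hypothetical,
purely for illustration):

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical registry mapping each objectType to its own SolrCore
  // (and therefore its own schema section and physical index).
  public class SolrCoreRegistry {
      private static final Map cores = new HashMap(); // objectType -> SolrCore

      public static synchronized SolrCore getInstance(String objectType) {
          SolrCore core = (SolrCore) cores.get(objectType);
          if (core == null) {
              // assumes a per-type instance dir, e.g. <instanceDir>/users/
              core = new SolrCore(objectType);
              cores.put(objectType, core);
          }
          return core;
      }
  }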



Your comments and thoughts would be much appreciated.

Best,
Michael Bryzek

Re: Interest in Extending SOLR

Yonik Seeley
Welcome Michael,

On 4/12/06, Bryzek.Michael <[hidden email]> wrote:

>   * Integrated support for partitioning - database tables can be
>     partitioned for scalability reasons. The most common scenario for
>     us is to partition off data for our largest customers. For
>     example, imagine a users table:
>
>      * user_id
>      * email_address
>      * site_id
>
>     where site_id refers to the customer to whom the user
>     belongs. Some sites aggregate data... i.e. one of our customers
>     may have 100 sites. When indexing, we create a separate index to
>     store only data for a given site. This precomputes one of our more
>     expensive computations for search - a filter for all users that
>     belong to a given site.

So the number of filters is equal to the number of sites?  How many
sites are there?

>   * Decoupled infrastructure - we wanted the ability to fully scale
>     our search application independent of our database application

That makes total sense... we do the same thing.

>   * High speed indexing - we initially moved data from the database to
>     Lucene via XML documents. We found that to index even a 100k
>     documents, it was much faster to move the data in CSV files
>     (smaller files, less intensive processing).

Support for indexing from CSV files as well as simple pulling from a
database is on our "todo" list: http://wiki.apache.org/solr/TaskList

> IDEAS:
>
> Looking through SOLR, I've identified the following main categories of
> change. I would love to hear comments and feedback from this group.

It would be nice to make any changes as general as possible, while
still solving your particular problem.

I think I understand many of the internal changes you outlined, but
I'm not sure yet exactly what problem you are trying to solve, and how
the multiple indices will be used.
- How would one identify what index (or SolrCore) an update is targeted to?
- What is the relationship between the multiple indices... do queries
ever go across multiple indices, or would there be an "objectType"
parameter passed in as part of the query?
- What is the purpose of multiple indices... is it so search results
are always restricted to a single site, but it's not practical to have
that many Solr instances?  It looks like the indices are partitioned
along the lines of object type, and not site-id though.

-Yonik

RE: Interest in Extending SOLR

Bryzek.Michael
Yonik -

> So the number of filters is equal to the number of sites?  
> How many sites are there?

Today: When new customers join, we generally don't do anything
special. Currently we have roughly 400 customers, most of which have
one site each. Note that a few customers have as many as 50 sites. In
total, we probably filter data in 500 unique ways, before we actually
search on the query string entered by the user. Of the 500 unique ways
in which we filter data, there are approximately 50 for which we would
prefer to use a unique index. I don't have 100% accurate numbers, but
these should be in the ballpark.

Future: We are planning to expand on the concepts we've developed to
integrate Lucene and hopefully SOLR into other applications. One in
particular:

  * Provides a core data set of 100K records

  * Allows each of 1,000 customers to create their own view of that
    data

  * In theory, our overall dataset may contain up to 100K * 1,000
    records (100M), but we know that at any given time, only 100K
    records should be made available.

We did rough tests and found that creating multiple indexes performed
better at run time, especially as the logic to determine what results
should be presented to which customer became more complex.


> Support for indexing from CSV files as well as simple pulling from a
> database is on our "todo" list: http://wiki.apache.org/solr/TaskList

I had seen this on the TODO list. I'm offering to contribute this
piece when we've got an idea of overall fit...


> How would one identify what index (or SolrCore) an update is
> targeted to?

This is a good question. I think the query interface itself would have
to be extended: a new parameter would identify the objectType you would
like to search or update. If omitted, the default object type would be
used. In our current system, we set the objectType to the name of the
database table and thus can issue queries like:

  search.jsp?tableName=users&queryString=email:michael.bryzek


> What is the relationship between the multiple indicies... do queries
> ever go across multiple indicies, or would there be an "objectType"
> parameter passed in as part of the query?

In our case, there is no relationship between the multiple indices,
but I do see value here (more on this below). In our specific case, we
have a one to one mapping between a database table and a Lucene index
and have not needed to search across tables.

I think the value of the objectType is this true independence. If you
are indexing similar data, use a field on your data. If your data sets
are truly different, use a different object type.
 

> What is the purpose of multiple indicies... is it so search results
> are always restricted to a single site, but it's not practical to
> have that many Solr instances?  It looks like the indicies are
> partitioned along the lines of object type, and not site-id though.

Your questions and comments are good. Thinking about them has helped me
clarify exactly what we're trying to accomplish. I think it boils down
to these goals:

  a) Minimize the number of instances of SOLR. If I have 3 web
     applications, each with 12 database tables to index, I don't want
     to run 36 JVMs. I think introducing an objectType would address
     this.

  b) Optimize retrieval when I have some knowledge that I can use to
     define partitions of data. This may actually be more appropriate
     for Lucene itself, but I see SOLR pretty well positioned to
     address it. One approach is to introduce a "partitionField" that
     SOLR would use to figure out if a new index is required. For each
     unique value of the partitionField, we create a separate physical
     index. If the query does NOT contain a term for the
     partitionField, we use a multi reader to search across all
     indexes. If the query DOES contain the term, we only search the
     matching partitions (a rough sketch follows below).

     We have tried using cached bitsets to implement this sort of
     approach, but have found that when we have one large document set
     partitioned into much smaller sets (e.g. 1-10% of the total
     document space), creating separate indexes gives us a much higher
     boost in performance.
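
As a rough sketch of the partition routing (hypothetical class; the
directory layout is made up, and real code would cache the readers
rather than open them per request):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.MultiReader;
  import org.apache.lucene.search.IndexSearcher;

  // One physical index per partitionField value; fall back to a
  // MultiReader when the query names no partition.
  public class PartitionedSearcherFactory {
      private final String baseDir;       // e.g. "/indexes/users"
      private final String[] partitions;  // known partitionField values

      public PartitionedSearcherFactory(String baseDir, String[] partitions) {
          this.baseDir = baseDir;
          this.partitions = partitions;
      }

      // partition == null means the query has no partitionField term
      public IndexSearcher getSearcher(String partition) throws Exception {
          if (partition != null) {
              return new IndexSearcher(
                  IndexReader.open(baseDir + "/" + partition));
          }
          IndexReader[] readers = new IndexReader[partitions.length];
          for (int i = 0; i < partitions.length; i++) {
              readers[i] = IndexReader.open(baseDir + "/" + partitions[i]);
          }
          return new IndexSearcher(new MultiReader(readers));
      }
  }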

-Mike



Re: Interest in Extending SOLR

Vish D.
Mike,

I am currently evaluating different search engine technologies
(especially open-source ones), and this is very interesting to me, for
the following reasons:

Our data is much like yours in that we have different types of data
(abstracts, fulltext, music, etc.), which eventually fall under different
"databases" in our subscription/offering model. So the ability to have
different indexes (at the database level and at the type level) would be
the ideal solution. The one difference from your needs is that we would
require the ability to search across different indexes (searching between
"databases") as well as to search only within types. That is, with your
proposal, objectType could be "type" or "database." The point is that a
second parameter wouldn't merely be nice to have; being able to search
across indexes would be a necessity.

I am truly interested in how this all works out, and hope to get myself
involved in Solr technology.






RE: Interest in Extending SOLR

Chris Hostetter-3
In reply to this post by Bryzek.Michael


The crux of the issue seems to be supporting multiple indexes within a
single JVM. This has come up before, and personally I'm still in favor of
implementing this via multiple webapps in the same servlet container,
rather than a single webapp with many separate configs/schemas/cores that
it chooses between...

        http://www.nabble.com/Re%3A-Multiple-indices-p3540026.html

you mentioned...

: I think the value of the objectType is this true independence. If you
: are indexing similar data, use a field on your data. If your data sets
: are truly different, use a different object type.

...there is still some dependency there, unless you add a similar
objectType sectioning to the solrconfig -- there might be some query
handlers that I only want for one index but not others, or
newSearcher/firstSearcher listeners I only want for one index, etc.
Having truly separate webapps gives you all the benefits of independence,
plus the benefit of a single shared JVM with a shared memory pool.


I'm still confused, however, about why you want multiple indexes in your
specific use case. You mention...

: In our case, there is no relationship between the multiple indices,
: but I do see value here (more on this below). In our specific case, we
: have a one to one mapping between a database table and a Lucene index
: and have not needed to search across tables.

but then you also said..

:      address. One approach is to introduce a "partitionField" that
:      SOLR would use to figure out if a new index is required. For each
:      unique value of the partitionField, we create a separate physical
:      index. If the query does NOT contain a term for the
:      partitionField, we use a multi reader to search across all
:      indexes. If the query DOES contain the term, we only search
:      across those partitions.

...which confuses me.  I also don't see how the partitionField idea could
work cleanly, because I can't think of any way you could safely use a
MultiReader (or even a MultiSearcher) across indexes that had different
schemas ... even if field names were the same, they might have been
analyzed completely differently, or use field types that encode the terms
in non-uniform ways.


Are we discussing two different use cases?

  1) support for multiple indexes with heterogeneous schemas in the same
     JVM for better memory management (but with no interaction at query
     time)
  2) support for multiple indexes sharing the same schema in the same
     JVM for better partitioning, with query-time access to either a
     multi-reader across all partitions or a single reader over an
     individual partition.

...if so, perhaps the two use cases should be solved in different ways:
multiple webapps for #1, and some new config/schema options and a new
SolrQueryRequest.getSearcher(String partition) method for #2 (sketched
below).

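For #2, the request-time API might look something like this sketch
(hypothetical; only the no-argument getSearcher() exists in Solr today):

  // Sketch only: a partition-aware variant of SolrQueryRequest.
  public interface SolrQueryRequest {
      // existing: a searcher over the whole index (all partitions)
      SolrIndexSearcher getSearcher();

      // proposed: a searcher restricted to one named partition
      SolrIndexSearcher getSearcher(String partition);
  }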

-Hoss


Re: Interest in Extending SOLR

Yonik Seeley
In reply to this post by Bryzek.Michael
Michael,

I'm not sure that objectType should be tied to which index something
is stored in.
If Solr does evolve multiple index support, one use case would be
partitioning data based on factors other than objectType
(documentType).

It would seem more flexible for clients (the direct updater or querier
of Solr) to identify which index should be used.  Of course each index
could have its own schema, but it shouldn't be mandatory... it seems
like a new index should be able to be created on-the-fly somehow,
perhaps using an existing index as a template.

On 4/12/06, Bryzek.Michael <[hidden email]> wrote:
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine what results
> should be presented to which customer became more complex.

I would expect searching a small index would be somewhat faster than
searching a large index with the small one embedded in it.  How much
faster though?  Is it really worth the effort to separate things out?
When you did the benchmarks, did you make sure to discount the first
queries (because of first-use norm and FieldCache loading)?  All that
can be done in the background...

I'm not arguing against extending Solr to support multiple indices,
but wondering if you could start using it as-is until such support is
well hashed out.  It seems so, since this appears to be an issue of
performance (an optimization) and not functionality, right?

Another easy optimization you might be able to make external to Solr
is to segment your site data into different Solr collections (on
different boxes).  This assumes that search traffic is naturally
partitioned by siteId (but I may be misunderstanding).

>   a) Minimize the number of instances of SOLR. If I have 3 web
>      applications, each with 12 database tables to index, I don't want
>      to run 36 JVMs. I think introducing an objectType would address
>      this.

Another possible option is to run multiple Solr instances (webapps)
per appserver... I recall someone else going after this solution.

>   b) Optimize retrieval when I have some knowledge that I can use to
>      define partitions of data. This may actually be more appropriate
>      for Lucene itself, but I see SOLR pretty well positioned to
>      address. One approach is to introduce a "partitionField" that
>      SOLR would use to figure out if a new index is required. For each
>      unique value of the partitionField, we create a separate physical
>      index. If the query does NOT contain a term for the
>      partitionField, we use a multi reader to search across all
>      indexes. If the query DOES contain the term, we only search
>      across those partitions.

While that approach might be better w/o caching, it might be worse
with caching... it really depends on the nature of the index and the
queries.
It would really complicate Solr's caching, though, since a cache item
would only be valid for certain combinations of sub-indices.

>      We have tried using cached bitsets to implement this sort of
>      approach, but have found that when we have one large document set
>      partitioned into much smaller sets (e.g. 1-10% of the total
>      document space), creating separate indexes gives us a much higher
>      boost in performance.

I assume this was with Lucene and not Solr?
Solr has better/faster filter representations... (and if I ever get
around to finishing it, a faster BitSet implementation too).

-Yonik

Re: Interest in Extending SOLR

Yonik Seeley
In reply to this post by Chris Hostetter-3
On 4/13/06, Chris Hostetter <[hidden email]> wrote:
> This has come up before, and personally I'm still in favor of
> implementing this via multiple webapps in the same servlet container

That would certainly be easier (and you bring up a good point about
probably wanting different solrconfigs for truly separate indices
also).
I think we are 99% of the way to multiple webapps now... we just need to
set the default directory to look for config based on the webapp name.
-Yonik

RE: Interest in Extending SOLR

Brian Lucas
In reply to this post by Bryzek.Michael
Hello there.  I moved from Lucene to Solr and am thus far impressed with the
speed, even with the added XML transport necessary to send and receive data.


However, one thing I did like about the Lucene implementation was the
ability to specify indices for each invocation.  

I agree with Michael Bryzek's comments about extending Solr to include
multiple indices.  This would be an extremely useful feature for me, as I
tend to stratify data to keep seek times and data sizes small.  I'm
currently looking at running multiple instances of Solr if necessary
to handle this.

Another possible way to implement multiple indices (in addition to Michael's
suggestion through objectTypes) could be simply appending the index name to
the query, similar to the following:

via Indexing:
<add index="products">
...
</add>

via Query:
http://localhost/solr/select?index=products&q=...

And either schema.xml broken into <index name="products">, or
schema-<indexname>.xml could handle that particular schema.

Regardless of how it gets done, it would be extremely valuable to offer this
functionality and afford a better deployment "strategy" for those of us
dealing with different types of data.

Thanks Yonik, Chris, and others for releasing this - it seems to work great
for my needs.

Brian

---
Michael Bryzek wrote:
  a) Minimize the number of instances of SOLR. If I have 3 web
     applications, each with 12 database tables to index, I don't want
     to run 36 JVMs. I think introducing an objectType would address
     this.



RE: Interest in Extending SOLR

Bryzek.Michael
In reply to this post by Bryzek.Michael
I definitely like the idea of support for multiple indexes based on
partitioning data that is NOT tied to a predefined element named
objectType. If we combine this with Chris' mention of completing the
work to support multiple schemas via multiple webapps in the same
servlet container, then I no longer see an immediate need to have more
than one schema per webapp. The concept would be:

  * One schema per webapp; multiple webapps per JVM

  * Partitioning of data into multiple indexes in each webapp,
    based on logic that you provide

For our own applications, my preference is to migrate away from our
homegrown solution to SOLR, prior to investing further in what we
currently have built. I will plan on testing performance a bit more
formally to see if SOLR out of the box would work for us. Note that in
our present environment, performance improved significantly (a factor
of ~10) when we partitioned data into multiple indexes, though our
tests were very rough.

 

I would be very happy to contribute time to expand SOLR to provide
initial support for the partitioning concept as I believe this will
prove critical when we evaluate how our database structure maps to a
query index.

 

One last note: last night, I did spend a bit of time looking into what
exactly it would mean to add support for object types in SOLR. I
modified the code base to support the object type tag in the schema,
providing a working proof of concept (I'm happy to send a sample schema
if anybody is interested). The main changes:

  * Modify IndexSchema to keep an object type

  * Provide a factory in SolrCore that returns the correct
    instance of SolrCore based on object type

  * Modify loading of schema to load one copy per object type

 

I really do like where this conversation has gone; but if the community
does choose to support multiple object types, on the surface (to a
newcomer) it appears highly doable.

 

-Mike

 


RE: Interest in Extending SOLR

Chris Hostetter-3

: One last note: last night, I did spend a bit of time looking into what
: exactly it would mean to add support for object types in SOLR. I
: modified the code base to support the object type tag in the schema,
: providing a working proof of concept (I'm happy to send a sample schema
: if anybody is interested). The main changes:

: *         Modify IndexSchema to keep an object type
: *         Provide a factory in SolrCore that returns the correct
: instance of SolrCore based on object type
: *         Modify loading of schema to load one copy per object type

I'm confused ... once you made these modifications, did you have a
separate index per objectType, each with its own schema? ... the separate
SolrCore instances seem to suggest total isolation, so there was no way to
query across all objectTypes?


-Hoss


RE: Interest in Extending SOLR

Chris Hostetter-3
In reply to this post by Bryzek.Michael

: I definitely like the idea of support for multiple indexes based on
: partitioning data that is NOT tied to a predefined element named
: objectType. If we combine this with Chris' mention of completing the
: work to support multiple schemas via multiple webapps in the same
: servlet container, then I no longer see an immediate need to have more
: than one schema per webapp. The concept would be:

Yonik already added support for multiple webapp instances (with unique
schemas) to the Near Term task list ... I've also added a
brainstorming page to the wiki with some ideas for implementing index
partitioning to the "Ideas for the future" section...

        http://wiki.apache.org/solr/TaskList
        http://wiki.apache.org/solr/IndexPartitioning

...the more I think about it, though, the less I'm convinced this is
absolutely necessary.  I have a feeling that the built-in DocSet caching
Solr does, and the search methods that allow you to filter by a DocSet
(or by a query which is converted to a DocSet), would probably be "fast
enough" most times.

I would encourage you to experiment more with Solr and test out its
performance before assuming you have to get down into the nitty-gritty
stuff and partition the index (just because it improved the performance
of straight Lucene doesn't mean Solr's built-in caching mechanisms
aren't already better).


-Hoss


RE: Interest in Extending SOLR

Bryzek.Michael
In reply to this post by Bryzek.Michael
Yonik already added support for multiple webapp instances (with unique
schemas) to the Near Term task list ... I've also added a
brainstorming page to the wiki with some ideas for implementing index
partitioning to the "Ideas for the future" section...

        http://wiki.apache.org/solr/TaskList
        http://wiki.apache.org/solr/IndexPartitioning

--

Excellent - I've updated the index partitioning page to include one
additional scenario to consider for how this may work, allowing us to
define the partitions in advance rather than dynamically. I believe this
minimizes the impact on Solr while supporting the majority of use cases.
It also follows the conceptual model of how partitioning works in a
database.
 

--

I would encourage you to experiment more with Solr and test out its
performance before assuming you have to get down into the nitty-gritty
stuff and partition the index (just because it improved the performance
of straight Lucene doesn't mean Solr's built-in caching mechanisms
aren't already better).


I am planning to benchmark in our environment, hopefully over the next
1-2 weeks, and if the results are useful, I will post back any data
that we find.


-Mike


RE: Interest in Extending SOLR

Bryzek.Michael
In reply to this post by Bryzek.Michael
I defined objectTypes as follows:

  * Share almost everything in the global schema file (e.g. caching,
    dynamic fields, field types)

  * Each objectType defines its own set of available fields

This allowed me to easily index completely different types of objects,
with NO way to query across the different types. Each object type was
stored in its own physical index.

What it enabled me to do easily was:

  * Translate a table design in the database to an XML document that
    only defines the fields and their types for that table

  * Provide for multiple tables in one physical XML schema in one
    instance of Solr. This gives me a simple way to support indexing
    new tables in the database without needing to do more work with
    Solr than restarting the instance.

  * Run queries against one instance of SOLR for each of the types, e.g.
    /solr/select?ot=users&q=email:mbryzek


Running multiple webapps likely accomplishes the same set of goals, with
more flexibility in customizing the entire schema for each type of
object, and thus seems like a better solution.

-Mike



Re: Interest in Extending SOLR

Yonik Seeley
On 4/15/06, Bryzek.Michael <[hidden email]> wrote:
>   * Translate a table design in the database to an XML document that
> only defined the fields and their types for that table

Just so anyone new to Lucene/Solr isn't misled by this thread...

Lucene (and hence Solr) documents don't need to be homogeneous within
a single Lucene index.  Some documents can have one set of fields, and
other documents in the same index can have a completely different set
of fields.
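
For example, a minimal Lucene sketch (class and field names are
arbitrary):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // Two documents with disjoint field sets living in the same index.
  public class MixedDocs {
      public static void main(String[] args) throws Exception {
          IndexWriter writer =
              new IndexWriter("/tmp/mixed", new StandardAnalyzer(), true);

          Document user = new Document();
          user.add(new Field("email_address", "jane@example.com",
                             Field.Store.YES, Field.Index.TOKENIZED));

          Document product = new Document();
          product.add(new Field("sku", "X-1001",
                                Field.Store.YES, Field.Index.UN_TOKENIZED));

          writer.addDocument(user);     // has email_address, no sku
          writer.addDocument(product);  // has sku, no email_address
          writer.close();
      }
  }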

-Yonik

Re: Interest in Extending SOLR

Vish D.
Yonik/Chris,

Do we have an ETA on "Allow multiple independent Solr *webapps* in the same
app server"?

After silently reading up on the many emails on this topic, I agree with
you that it would be worthwhile to test out the current implementation and
see how it performs. But it makes sense to run a comparison against the
multiple-indexes idea... hence my question above.

Thanks!


Re: Interest in Extending SOLR

Yonik Seeley
On 4/18/06, Vish D. <[hidden email]> wrote:
> Do we have an ETA on "Allow multiple independent Solr *webapps* in the same
> app server"?

We've been discussing it on solr-dev
http://www.mail-archive.com/solr-dev%40lucene.apache.org/msg00292.html

> But, it makes sense to run a comparison against the
> multiple 'indexes' idea...and so, my question above.

I think it will probably be the rare application where it will make
much of a difference.
Taken from the point of view of meeting requirements, an extra 10%
performance for a certain type of query that's already fast enough
will probably not be worth the extra complexity.

-Yonik