[jira] Created: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

[jira] Created: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)
Improve IDF and relevance by separately indexing different entity types sharing a common schema
-----------------------------------------------------------------------------------------------

                 Key: SOLR-1599
                 URL: https://issues.apache.org/jira/browse/SOLR-1599
             Project: Solr
          Issue Type: New Feature
          Components: Schema and Analysis
            Reporter: Graham Poulter


In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in an index.  This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists".   The ranking for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities.  The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities.
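
To make the effect concrete, here is a minimal sketch using the textbook form idf = log(N / df). The counts are invented for illustration, and Lucene's own similarity applies a smoothed variant of this formula, so the absolute numbers in a real index will differ.

```python
import math

def idf(num_docs, doc_freq):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)

# Hypothetical counts for one term in the shared name field.
track_docs, track_df = 1000, 10      # rare among track names
artist_docs, artist_df = 1000, 400   # common among artist names

# One index holding both entity types: the counts are pooled.
idf_combined = idf(track_docs + artist_docs, track_df + artist_df)

# Track-only statistics, as a dedicated track index would see them.
idf_track_only = idf(track_docs, track_df)

print(f"combined index IDF: {idf_combined:.2f}")
print(f"track-only IDF:     {idf_track_only:.2f}")
```

In a track search the term is highly discriminating, but the pooled statistics make it look common, so it contributes far less to the score than it should.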

The current work-around to make the IDF be entity-specific is to use a separate Solr core for each entity type sharing the schema - and repeating the process of copying solrconfig.xml and schema.xml to all the cores.  This would be more complicated with replication, and even more complicated with index distribution, because you must now maintain a core for _artists_ and a core for _tracks_ on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters.  However, _numDocs_ is just the total number of documents: the document frequency (DF) for a query term of a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the IDF_t=log(N/DF_t).  The difficulty is that DF_t needs to be calculated at index time, when Solr has no idea what filters will be applied.
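
A rough sketch of why filtering _numDocs_ alone is not enough, using IDF_t=log(N/DF_t) from the paragraph above. All counts are hypothetical.

```python
import math

# Hypothetical index: 100 tracks and 1000 artists share the name field.
num_tracks = 100
df_term_in_tracks = 5
df_term_in_artists = 400

# numDocs taken after the entity filter, but DF_t still counted index-wide.
n_filtered = num_tracks
df_global = df_term_in_tracks + df_term_in_artists
idf_mixed = math.log(n_filtered / df_global)

# Both N and DF_t restricted to track entities.
idf_tracks = math.log(num_tracks / df_term_in_tracks)

print(f"filtered N, index-wide DF: {idf_mixed:.2f}")
print(f"filtered N, filtered DF:   {idf_tracks:.2f}")
```

With these numbers the mixed statistic goes negative, because the index-wide DF exceeds the filtered document count. N and DF_t have to come from the same set of documents.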

I suggest using a metadata field _entitytype_ to be specified on submitting a batch of documents, with a configured list of allowed values: in the example the document could specify either entitytype="track" or entitytype="artist" (defaulting to _track_).  The document frequency would then be calculated for each entity type during indexing, so for term "foo" there will be two DFs stored: the DF of "foo" for entitytype="artist" and the DF of "foo" for entitytype="track".   This might be implemented by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist" would then be implemented by searching only the _artist_ index.
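
The idea can be sketched as a toy in-memory index that keeps one set of postings and document counts per entity type. This is only an illustration of the proposal, not Solr or Lucene code: the class name `EntityPartitionedIndex` and its methods are hypothetical, and tokenization is a plain whitespace split.

```python
import math
from collections import defaultdict

class EntityPartitionedIndex:
    """Toy index with independent term statistics per entity type."""

    def __init__(self, allowed_types, default_type):
        if default_type not in allowed_types:
            raise ValueError("default_type must be one of allowed_types")
        self.default_type = default_type
        # One postings map (term -> set of doc ids) per entity type.
        self.postings = {t: defaultdict(set) for t in allowed_types}
        self.num_docs = {t: 0 for t in allowed_types}

    def add(self, doc_id, name, entitytype=None):
        etype = entitytype or self.default_type
        if etype not in self.postings:
            raise ValueError(f"unknown entitytype: {etype!r}")
        self.num_docs[etype] += 1
        for term in set(name.lower().split()):
            self.postings[etype][term].add(doc_id)

    def doc_freq(self, term, entitytype):
        return len(self.postings[entitytype].get(term, ()))

    def idf(self, term, entitytype):
        df = self.doc_freq(term, entitytype)
        if df == 0:
            return 0.0  # term absent from this partition
        return math.log(self.num_docs[entitytype] / df)

    def search(self, term, entitytype):
        # Filtering on entitytype means consulting only that partition.
        return sorted(self.postings[entitytype].get(term, ()))

index = EntityPartitionedIndex({"track", "artist"}, default_type="track")
index.add("t1", "Foo Anthem")                    # defaults to track
index.add("t2", "Bar Song")
index.add("a1", "Foo", entitytype="artist")
index.add("a2", "Foo Bar", entitytype="artist")

print(index.doc_freq("foo", "track"), index.doc_freq("foo", "artist"))
print(index.search("foo", "artist"))
```

Each partition carries its own N and DF, so a term that is common among artists does not dilute its IDF in a track search.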

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of having to configure, maintain, replicate and distribute a separate core for every entity type.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Poulter updated SOLR-1599:
---------------------------------

           Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in an index.  This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists".   The ranking for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities.  The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities.

The current work-around to make the IDF be entity-specific is to use a separate Solr core for each entity type sharing the schema - and repeating the process of copying solrconfig.xml and schema.xml to all the cores.  This would be more complicated with replication, and even more complicated with index distribution, because you must now maintain a core for _artists_ and a core for _tracks_ on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters.  However, _numDocs_ is just the total number of documents: the document frequency (DF) for a query term of a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the IDF_t=log(N/DF_t).  However, DF_t needs to be calculated at index time, when Solr has no idea what filters will be applied.

I suggest using a metadata field _entitytype_ to be specified on submitting a batch of documents, with a configured list of allowed values: in the example the document could specify either entitytype="track" or entitytype="artist" (defaulting to _track_).  The document frequency would then be calculated for each entity type during indexing, so for term "foo" there will be two DFs stored: the DF of "foo" for entitytype="artist" and the DF of "foo" for entitytype="track".   This might be implemented by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist" would then be implemented by searching only the _artist_ index.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of having to configure, maintain, replicate and distribute a separate core for every entity type.

    Remaining Estimate: 504h  (was: 672h)
     Original Estimate: 504h  (was: 672h)

[jira] Updated: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Poulter updated SOLR-1599:
---------------------------------

    Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in an index.  This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists".   The ranking for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities.  The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities.

The current work-around to make the IDF be entity-specific is to use a separate Solr core for each entity type sharing the schema - and repeating the process of copying solrconfig.xml and schema.xml to all the cores.  This would be more complicated with replication, and even more complicated with index distribution, because you must now maintain a core for _artists_ and a core for _tracks_ on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters.  He recognises, however, that the document frequency (DF_t) for each query term in a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).   DF_t must be calculated at index time, when Solr does not know what filters will be applied.

I suggest using a metadata field _entitytype_ specified on submitting a batch of documents, where the schema specifies the list of allowed entity types. In the example the document could specify either entitytype="track" or entitytype="artist" (defaulting to _track_).  During indexing each entity type has its own set of document frequencies, so the term "foo" will have a DF for entitytype="artist" and a different DF for entitytype="track".   This might be implemented by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist" would then be implemented by searching only the _artist_ index, analogous to searching only on the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of having to configure, maintain, replicate and distribute a separate core for every entity type.

[jira] Updated: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Poulter updated SOLR-1599:
---------------------------------

    Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in an index.  This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists".   The ranking for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities.  The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities.

The current work-around to make the IDF be entity-specific is to use a separate Solr core for each entity type sharing the schema - and repeating the process of copying solrconfig.xml and schema.xml to all the cores.  This would be more complicated with replication, and even more complicated with index distribution, because you must now maintain a core for _artists_ and a core for _tracks_ on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters.  He recognises, however, that the document frequency (DF_t) for each query term in a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).   DF_t must be calculated at index time, when Solr does not know what filters will be applied.

I suggest having a metadata field _entitytype_ specified on submitting a batch of documents.  The schema would specify a list of allowed entity types and a default entity type.  For example, a document could say either entitytype="track" or entitytype="artist".  Each entity type has an independent set of document frequencies, so the term "foo" will have a DF for entitytype="artist" and a different DF for entitytype="track".   This might be implemented by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist" would be implemented by searching only the _artist_ index, analogous to searching only on the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of having to configure, maintain, replicate and distribute separate Solr cores for every entity type.

[jira] Updated: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Poulter updated SOLR-1599:
---------------------------------

    Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in an index.  This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists".   The ranking for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities.  The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities, or vice versa.

The current work-around to make the IDF be entity-specific is to use a separate Solr core for each entity type sharing the schema - and repeating the process of copying solrconfig.xml and schema.xml to all the cores.  This would be more complicated with replication, and even more complicated with index distribution, because you must now maintain a core for _artists_ and a core for _tracks_ on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters.  He recognises, however, that the document frequency (DF_t) for each query term in a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).   DF_t must be calculated at index time, when Solr does not know what filters will be applied.

I suggest having a metadata field _entitytype_ specified on submitting a batch of documents.  The schema would specify a list of allowed entity types and a default entity type.  For example, a document could say either entitytype="track" or entitytype="artist".  Each entity type has an independent set of document frequencies, so the term "foo" will have a DF for entitytype="artist" and a different DF for entitytype="track".   This might be implemented by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist" would be implemented by searching only the _artist_ index, analogous to searching only on the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of having to configure, maintain, replicate and distribute separate Solr cores for every entity type.


> Improve IDF and relevance by separately indexing different entity types sharing a common schema
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1599
>                 URL: https://issues.apache.org/jira/browse/SOLR-1599
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Graham Poulter
>   Original Estimate: 504h
>  Remaining Estimate: 504h

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Poulter updated SOLR-1599:
---------------------------------

    Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated over all of the documents in an index.  This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists".  The ranking for a search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities.  The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities, or vice versa.

The current work-around to make the IDF entity-specific is to use a separate Solr core for each entity type sharing the schema, repeating the process of copying solrconfig.xml and schema.xml to all the cores.  This would be more complicated with replication, and even more complicated with sharding, because a core for _artists_ and a core for _tracks_ must be maintained on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters.  He recognises, however, that the document frequency (DF_t) for each query term in a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).  DF_t must be calculated at index time, when Solr does not know what filters will be applied.

I suggest having a metadata field _entitytype_ specified when submitting a batch of documents. The schema would specify a list of allowed entity types and a default entity type. For example, a document could say either entitytype="track" or entitytype="artist".  Each entity type has an independent set of document frequencies, so the term "foo" will have one DF for entitytype="artist" and a different DF for entitytype="track".  This might be implemented by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist" would be implemented by searching only the _artist_ index, analogous to searching only on the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of configuring, replicating and sharding a Solr core for every entity type.



[jira] Updated: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Poulter updated SOLR-1599:
---------------------------------

    Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated over all of the documents in an index.  This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists".  The ranking for a search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities.  The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities, or vice versa.

The current work-around to make the IDF entity-specific is to use a separate Solr core for each entity type sharing the schema, repeating the process of copying solrconfig.xml and schema.xml to all the cores.  This would be more complicated with replication, and more so with sharding, because a core for _artists_ and a core for _tracks_ must be maintained on each node.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters.  He recognises, however, that the document frequency (DF_t) for each query term in a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).  DF_t must be calculated at index time, when Solr does not know what filters will be applied.

I suggest having a metadata field _entitytype_ specified when submitting a batch of documents. The schema would specify a list of allowed entity types and a default entity type. For example, a document could say either entitytype="track" or entitytype="artist".  Each entity type has an independent set of document frequencies, so the term "foo" will have one DF for entitytype="artist" and a different DF for entitytype="track".  This might be implemented by instantiating a separate Lucene index for each configured entity type.  Filtering on entitytype="artist" would be implemented by searching only the _artist_ index, analogous to searching only on the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of configuring, replicating and sharding a Solr core for every entity type.



[jira] Commented: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782388#action_12782388 ]

Graham Poulter commented on SOLR-1599:
--------------------------------------

This is what could happen when indexing multiple entity types in the same core, for instance indexing artists and tracks and using a filter to "search for artists". You then search for artists with two dismax terms, _A_ or _B_, on the _name_ field.  Term _A_ is rare among artist _name_ values, so it should have a low docFreq and a relatively high weight compared to term _B_.  However, term _A_ happens to be common in track _name_ values, so its docFreq is higher, making the IDF weight for _A_ lower than it should be relative to term _B_.  The filtered-out track instances are invisibly modifying the weight of query terms in a query for artists, which would not happen with separate indices (and thus separate docFreqs).
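
A small numeric sketch of this scenario, again in Python with the simplified IDF=log(N/docFreq) rather than Lucene's actual scoring. All counts below are invented for illustration.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Simplified IDF = log(N / docFreq); not Lucene's exact formula."""
    return math.log(num_docs / doc_freq)

# Hypothetical docFreq values on the _name_ field (illustration only).
ARTISTS = {"num_docs": 1_000, "A": 5, "B": 50}     # A is rare among artists
TRACKS = {"num_docs": 10_000, "A": 900, "B": 10}   # A is common among tracks

def idf_separate(term: str) -> float:
    """Artist-only index: only artist documents contribute to docFreq."""
    return idf(ARTISTS["num_docs"], ARTISTS[term])

def idf_shared(term: str) -> float:
    """Shared index: filtered-out tracks still contribute to docFreq."""
    return idf(ARTISTS["num_docs"] + TRACKS["num_docs"],
               ARTISTS[term] + TRACKS[term])

for term in ("A", "B"):
    print(term, round(idf_separate(term), 3), round(idf_shared(term), 3))
```

Under these assumed counts, term _A_ outweighs _B_ in the artist-only index but falls below _B_ in the shared index, even though the result set contains only artists.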


[jira] Issue Comment Edited: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782388#action_12782388 ]

Graham Poulter edited comment on SOLR-1599 at 11/25/09 12:28 PM:
-----------------------------------------------------------------

This is what could happen when indexing multiple entity types in the same core, for instance indexing artists and tracks and using a filter to "search for artists". You then search for artists with two dismax terms, _A_ or _B_, on the _name_ field.  Term _A_ is rare among artist _name_ values, so it should have a low docFreq and a relatively high weight compared to term _B_.  However, term _A_ happens to be common in track _name_ values, so its docFreq is higher, making the IDF weight for _A_ lower than it should be relative to term _B_.  The track entities are invisibly altering the term weights in a query for artist entities, which would not happen if they had separate indices and thus separate docFreqs.

