I am looking for advice on how to vertically partition an index (split each document's fields across more than one core/instance).
- Our system stores all document metadata in database tables
- The contents of each document are stored on a filesystem
- Metadata changes frequently, and the index must be updated to match (e.g. a delay of minutes, not hours)
- Contents change infrequently, and are costly to reindex (large files, complex analyzers)
Having the contents stored in the same index as the metadata means that it will be frequently & needlessly reanalyzed. This causes a lot of wasted cycles as there may be a large number of documents that have a single field changed, but the system ends up re-analyzing the gigabytes of text contents for these documents.
One suggested solution was to store the contents field, and copy the field (rather than re-analyze it) each time a document is reindexed. However, this would waste a lot of storage, as we have terabytes of documents.
We are currently looking at a vertical partitioning scheme that uses multiple Solr cores. One core contains the schema for all the metadata; the other core has the schema for the contents. We have successfully made a custom request handler that pushes documents to both cores, effectively producing the split indexes.
The problem now is how to split the queries across both cores. Given that there could be AND/OR/NOT clauses containing both metadata and contents fields, we'll need to find some way to divide a query into two parts that can each be run on one core, and have the hits joined back together afterwards. This is similar to the sharding feature, but requires intersection as well as union of result hits.
Does anyone have any advice on how to go about dividing up the different query clauses, and how we could merge results? Or can anyone suggest a different approach to vertical partitioning?
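To make the split-and-merge question concrete, here is a minimal sketch in Python (not Solr code). It assumes both cores share a unique document key, and that each leaf clause can be sent to exactly one core and returns the full set of matching keys. The names Term, And, Or, Not, evaluate and fake_search are hypothetical, invented for illustration. It ignores scoring and paging, and materializing full key sets may be too expensive for large result sets.

```python
from dataclasses import dataclass
from typing import Callable, Set, Union


@dataclass(frozen=True)
class Term:
    core: str    # which core answers this clause: "metadata" or "contents"
    clause: str  # the query clause sent to that core


@dataclass(frozen=True)
class And:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Or:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Not:
    child: "Node"


Node = Union[Term, And, Or, Not]
Search = Callable[[str, str], Set[str]]


def evaluate(node: Node, search: Search, all_ids: Set[str]) -> Set[str]:
    """Evaluate a boolean query tree whose leaves each run on a single core."""
    if isinstance(node, Term):
        return search(node.core, node.clause)
    if isinstance(node, And):
        # Intersection: the document must match on both sides.
        return evaluate(node.left, search, all_ids) & evaluate(node.right, search, all_ids)
    if isinstance(node, Or):
        # Union: the same merge that sharding already needs.
        return evaluate(node.left, search, all_ids) | evaluate(node.right, search, all_ids)
    if isinstance(node, Not):
        # Complement relative to every known document key.
        return all_ids - evaluate(node.child, search, all_ids)
    raise TypeError(f"unknown node type: {node!r}")


# Toy stand-in for two cores (assumption: a real version would query Solr).
FAKE_CORES = {
    "metadata": {"author:kranz": {"d1", "d2"}, "status:draft": {"d2", "d3"}},
    "contents": {"text:lucene": {"d1", "d3"}},
}


def fake_search(core: str, clause: str) -> Set[str]:
    return set(FAKE_CORES[core].get(clause, set()))


ALL_IDS = {"d1", "d2", "d3", "d4"}

# author:kranz AND (text:lucene OR NOT status:draft)
query = And(
    Term("metadata", "author:kranz"),
    Or(Term("contents", "text:lucene"), Not(Term("metadata", "status:draft"))),
)
result = evaluate(query, fake_search, ALL_IDS)
```

In this toy data only d1 satisfies the whole expression. Clauses that touch a single core could be grouped and sent as one sub-query before merging, which would reduce the number of key sets that have to be transferred.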
Just an update on my own research:
I have discovered the 'ParallelReader' class (a subclass of IndexReader) in Lucene, which is designed for searching across multiple indexes.
This appears to suit our needs, and I do not expect it will be too difficult to integrate into Solr.
ParallelReader is definitely out there on the Lucene landscape. See http://www.lucidimagination.com/search/page:2?q=ParallelReader
for some background discussion, including Doug's original post on it
and some others' views of the use case. The key is that the small index
has to be rebuilt in exactly the same order as the large index, which
seems particularly onerous in high-update environments. I will add
that it is definitely one of those areas most people do not use, so
getting help on it may be difficult.
I've often thought about an AsynchronousParallelReader that maintained
a mapping between the two indexes such that you could let the indexes
get out of sync, but have never implemented it or gone far enough down
the path to know whether it would even work or not. The devil is
likely in the details what with Lucene's merging, etc.
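As a rough illustration of the mapping idea (a toy sketch in Python, not Lucene code): if both indexes carry the same unique key, a table from document numbers in one index to document numbers in the other can be built from the key order of each. The function names and sample keys are hypothetical. Real Lucene document numbers change when segments merge, so a table like this would have to be rebuilt or maintained across merges, which is exactly where the difficulty lies.

```python
from typing import Dict, List, Optional


def build_key_to_docnum(keys_in_index_order: List[str]) -> Dict[str, int]:
    """Map each unique key to its position (document number) in one index."""
    return {key: docnum for docnum, key in enumerate(keys_in_index_order)}


def map_docnums(source_keys: List[str], target_keys: List[str]) -> List[Optional[int]]:
    """For each document number in the source index, find the document number
    of the same key in the target index, or None if the target lacks it."""
    target_lookup = build_key_to_docnum(target_keys)
    return [target_lookup.get(key) for key in source_keys]


# The large contents index was built once, in this order.
contents_order = ["d1", "d2", "d3"]
# The small metadata index was rebuilt later, in a different order,
# and already contains a document the contents index has not seen.
metadata_order = ["d3", "d1", "d4"]

mapping = map_docnums(metadata_order, contents_order)
```

Here metadata document 0 (d3) maps to contents document 2, metadata document 1 (d1) maps to contents document 0, and d4 has no counterpart yet.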
Thinking out loud, you might also try a custom component (or some
changes to the QueryComponent) that uses the MultiSearcher or maybe
some lower level Solr changes. The MultiSearcher is also designed to
search across multiple indexes.
On Feb 9, 2009, at 8:37 PM, Mark Kranz wrote:
> Just an update on my own research:
> I have discovered the 'ParallelReader' class (subclass of
> IndexReader) in
> lucene, which is designed for searching across multiple indexes.
> This appears to suit our needs - and I do not expect it will be too
> difficult to integrate into Solr.
I ended up pursuing the ParallelWriter (http://issues.apache.org/jira/browse/LUCENE-600), so that we can map different fields to different indexes. This appears to keep the indexes in sync, although I still need to do more testing.
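For readers following along, the field-to-index mapping can be pictured with a small Python sketch. This does not reproduce the LUCENE-600 patch or any Solr API. The field names, the CONTENT_FIELDS set and the split_document function are assumptions made up for illustration, and the sketch copies a shared "id" key into both halves so they can be matched up again later.

```python
from typing import Any, Dict, Set, Tuple

# Assumption: the fields that are expensive to analyze and rarely change.
CONTENT_FIELDS: Set[str] = {"contents"}


def split_document(
    doc: Dict[str, Any], content_fields: Set[str], key_field: str = "id"
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one logical document into a metadata part and a contents part.

    Both parts keep the unique key so that hits can be joined afterwards."""
    if key_field not in doc:
        raise ValueError(f"document is missing key field {key_field!r}")
    metadata = {k: v for k, v in doc.items() if k not in content_fields}
    contents = {k: v for k, v in doc.items() if k in content_fields}
    contents[key_field] = doc[key_field]
    return metadata, contents


doc = {"id": "d1", "author": "kranz", "status": "draft", "contents": "full text here"}
metadata_part, contents_part = split_document(doc, CONTENT_FIELDS)
```

With a split like this, a metadata-only change means resubmitting just metadata_part, which leaves the contents index untouched.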
However, some ugly hackery was needed to get it to extend SolrIndexWriter, so it could be dropped in as a replacement for the existing writers. The writer gets created by a custom UpdateHandler, which overrides createMainIndexWriter.
Most of this can be done with extensions/plugins to Solr, but a few parts require patching Solr directly (e.g. SolrCore directly creating Searchers and Writers, the need for more than one index directory, etc.).
Thanks for the comments.