[jira] Created: (SOLR-1308) Cache docsets and docs at the SegmentReader level

Cache docsets and docs at the SegmentReader level
-------------------------------------------------

                 Key: SOLR-1308
                 URL: https://issues.apache.org/jira/browse/SOLR-1308
             Project: Solr
          Issue Type: Improvement
    Affects Versions: 1.4
            Reporter: Jason Rutherglen
            Priority: Minor
             Fix For: 1.5


Solr caches docsets and documents at the top level Multi*Reader
level. After a commit, the caches are flushed. Reloading the
caches in near realtime (i.e. commits every 1s - 2min)
unnecessarily consumes IO resources, especially for largish
indexes.

We can cache docsets and documents at the SegmentReader level.
The cache settings in SolrConfig can be applied to the
individual SR caches.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level


    [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735159#action_12735159 ]

Yonik Seeley commented on SOLR-1308:
------------------------------------

Absolutely!  We need to get 1.4 out of the way first of course.

One interesting question is the structure of the cache and how to size caches.

One way: if someone specifies a document cache of 128 docs, and we have a cache per segment, how big should each segment cache be?
One answer is that if a segment represents 10% of the total index, then it should get 10% of the cache.  There are downsides to that though - it fails to take into account non-uniform access in the index (hotspots).
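The proportional-allocation idea can be sketched like this (illustrative Python, not Solr's actual Java code; the function and parameter names are hypothetical):

```python
def per_segment_cache_sizes(total_size, segment_doc_counts, min_size=8):
    """Split a global cache budget across segments in proportion to
    each segment's share of the total document count, with a floor so
    tiny segments still get a usable cache."""
    total_docs = sum(segment_doc_counts)
    return [max(min_size, round(total_size * n / total_docs))
            for n in segment_doc_counts]

# A 128-entry budget over segments holding 90%, 9%, and 1% of the docs:
print(per_segment_cache_sizes(128, [900_000, 90_000, 10_000]))
# prints [115, 12, 8]
```

Note the hotspot problem mentioned above: the smallest segment may hold the newest, most frequently queried documents, yet this scheme gives it the smallest cache.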

> Cache docsets and docs at the SegmentReader level
> -------------------------------------------------
>
>                 Key: SOLR-1308
>                 URL: https://issues.apache.org/jira/browse/SOLR-1308
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h



[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level


    [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735466#action_12735466 ]

Hoss Man commented on SOLR-1308:
--------------------------------

bq. One interesting question is the structure of the cache and how to size caches.

i feel like i'm missing something here ... wouldn't the simplest approach still be the best?

if i currently have a single filterCache of size=1024, and 1million docs then that uses up some quantity of memory =~ func(1024,1mil) (based on sparseness of each query)

if i start having per segment caches, and there are 22 segments each with a filterCache of size=1024, then the amount of memory used by all the caches will be ~22*func(1024,(1mil/22)) ... which should wind up being roughly the same as before.

smaller segments will wind up using less ram for their caches, even if the "size" of the cache is the same for each segment.
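That back-of-the-envelope math checks out if each cached filter is modeled as a worst-case bitset over its reader's documents (a rough model, ignoring per-entry overhead and sparse sets; these names are illustrative Python, not Solr code):

```python
def filter_cache_bits(entries, num_docs):
    # Worst case: every cached entry is a full bitset, one bit per doc.
    return entries * num_docs

# One top-level cache of 1024 entries over a 1M-doc index:
global_bits = filter_cache_bits(1024, 1_000_000)

# 22 per-segment caches, each 1024 entries over 1/22 of the docs:
per_segment_bits = sum(filter_cache_bits(1024, 1_000_000 // 22)
                       for _ in range(22))

# The two totals differ only by integer-division rounding.
print(per_segment_bits / global_bits)
```

Under this model the per-segment layout costs essentially the same RAM while letting unchanged segments keep their entries across commits.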



[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level


    [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735679#action_12735679 ]

Jason Rutherglen commented on SOLR-1308:
----------------------------------------

Perhaps in another issue we can implement a cache that is RAM
usage aware. Implement sizeof(bitset), and keep the cache below
a predefined limit?

Do we need to have a cache per reader, or can the cache key
include the reader? If segments are created rapidly, we may not
want the overhead of creating a new cache and managing its size,
especially if we move to a RAM usage model.




[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level


    [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784668#action_12784668 ]

Jason Rutherglen commented on SOLR-1308:
----------------------------------------

I'm taking a look at this. It's straightforward to cache and
reuse docsets per reader in SolrIndexSearcher; however, we're
passing docsets all over the place (i.e. UnInvertedField). We
can't exactly rip out DocSet without breaking most unit tests
and writing a bunch of facet merging code. We'd likely lose
functionality?

Would the MultiDocSet concept from SOLR-568 work as an easy way
to get something up and running? Then we can benchmark and see
whether we've lost performance.



[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level


    [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785433#action_12785433 ]

Jason Rutherglen commented on SOLR-1308:
----------------------------------------

I realized that, because of UnInvertedField, we'll need to merge
facet results from UIF per reader, so using a MultiDocSet won't
help. Can we leverage the distributed merging that FacetComponent
implements (i.e. reuse and/or change the code to work in both the
distributed and local cases)? Ah well, I was hoping for an easy
solution for realtime facets.



[jira] Commented: (SOLR-1308) Cache docsets and docs at the SegmentReader level


    [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785435#action_12785435 ]

Yonik Seeley commented on SOLR-1308:
------------------------------------

bq. we'll need to merge facet results from UIF per reader

Yeah... that's a pain.
We could easily do per-segment faceting for non-string types though (int, long, etc) since they don't need to be merged.



[jira] Updated: (SOLR-1308) Cache docsets at the SegmentReader level


     [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated SOLR-1308:
-----------------------------------

    Description:
Solr caches docsets at the top level Multi*Reader level. After a
commit, the filter/docset caches are flushed. Reloading the
cache in near realtime (i.e. commits every 1s - 2min)
unnecessarily consumes IO resources when reloading the filters,
especially for largish indexes.

We'll cache docsets at the SegmentReader level. The cache key
will include the reader.

  was:
Solr caches docsets and documents at the top level Multi*Reader
level. After a commit, the caches are flushed. Reloading the
caches in near realtime (i.e. commits every 1s - 2min)
unnecessarily consumes IO resources, especially for largish
indexes.

We can cache docsets and documents at the SegmentReader level.
The cache settings in SolrConfig can be applied to the
individual SR caches.

        Summary: Cache docsets at the SegmentReader level  (was: Cache docsets and docs at the SegmentReader level)

I changed the title because we're not going to cache docs in
this issue (though I think it's possible to cache docs by the
internal id, rather than the doc id).

Per-segment facet caching and merging per segment can go into a
different issue.



[jira] Commented: (SOLR-1308) Cache docsets at the SegmentReader level


    [ https://issues.apache.org/jira/browse/SOLR-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786240#action_12786240 ]

Jason Rutherglen commented on SOLR-1308:
----------------------------------------

{quote} Yeah... that's a pain. We could easily do per-segment
faceting for non-string types though (int, long, etc) since they
don't need to be merged. {quote}

I opened SOLR-1617 for this. I think doc sets can be handled
with a multi doc set (hopefully). Facets, however... argh,
FacetComponent is really hairy, though I think it boils down to
simply adding up counts for the same field value? Then there
seem to be edge cases which I'm scared of. At least it's easy to
test whether we're preserving today's functionality by randomly
unit testing per-segment and multi-segment side by side (i.e. if
the results of one differ from the results of the other, we know
there's something to fix).
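The merge step being described (summing counts for the same field value across segments) could be as simple as this illustrative Python sketch (not FacetComponent's actual Java; names are hypothetical):

```python
from collections import Counter

def merge_segment_facets(per_segment_counts, limit=10):
    """Sum facet counts for the same field value across segments,
    then return the global top values."""
    merged = Counter()
    for counts in per_segment_counts:
        merged.update(counts)
    return merged.most_common(limit)

# Two segments faceting on the same field:
print(merge_segment_facets([{"red": 3, "blue": 1},
                            {"red": 2, "green": 4}], limit=3))
# prints [('red', 5), ('green', 4), ('blue', 1)]
```

The scary edge cases appear once each segment returns only its local top N: a value just below the cutoff in every segment can be missed globally, which is the same refinement problem distributed faceting already deals with.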

Perhaps we can initially add up field values and test that
(which is enough for my project), and move on from there. I'd
still like to genericize all of the distributed processes to
work over multiple segments (like Lucene distributed search uses
a MultiSearcher, which also works locally), so that local and
distributed are the same API-wise. However, I've had trouble
figuring out the existing distributed code (SOLR-1477 ran into a
wall). Maybe as part of SolrCloud
(http://wiki.apache.org/solr/SolrCloud) we can rework the
distributed APIs to be more user friendly (i.e. *MultiSearcher
is really easy to understand). If Solr's going to work well in
the cloud, distributed search probably needs to be easy to
multi-tier for scaling (i.e. if we have 1 proxy server and 100
nodes, we could have 1 top proxy, and 1 proxy per 10 nodes, etc).
