Metrics API - Documentation

Richard Goodman
Hi there,

I'm currently working on using the prometheus exporter to provide some detailed insights for our Solr Cloud clusters.

Using the provided template killed our Prometheus server, as well as the exporter, because of the size of our clusters: each cluster is around 96 nodes and ~300 collections, with 3-way replication and 16 shards. You can imagine how much data comes through /admin/metrics when it isn't filtered down first.

I've begun writing my own template to reduce the amount of data being requested. It's working fine, and I'm starting to build some nice graphs in Grafana.
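
For readers following along, here is a minimal sketch of one way to narrow what /admin/metrics returns before it reaches an exporter. It only builds a request URL using the `group`, `prefix` and `type` query parameters described in the Ref Guide's Metrics API page. The base URL and the helper name `metrics_url` are assumptions made for illustration; this is not the exporter's own template mechanism.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of a real Solr node.
SOLR_BASE = "http://localhost:8983/solr"


def metrics_url(base, group=None, prefix=None, metric_type=None):
    """Build a /admin/metrics URL that asks for only a subset of metrics.

    metrics_url is an illustrative helper, not part of Solr or the exporter.
    """
    params = [("wt", "json")]
    if group:
        params.append(("group", group))       # e.g. jvm, node, core
    if prefix:
        params.append(("prefix", prefix))     # match on the metric name prefix
    if metric_type:
        params.append(("type", metric_type))  # e.g. counter, gauge, timer
    return f"{base}/admin/metrics?{urlencode(params)}"


if __name__ == "__main__":
    # Request only the GC metrics from the JVM group.
    print(metrics_url(SOLR_BASE, group="jvm", prefix="gc."))
```

Requesting one group and prefix at a time keeps each response small, which is the same idea as trimming the exporter template.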

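The only difficulty is that I'm struggling to find decent documentation on the metrics themselves. I was using the resources metrics reporting - metrics-api <https://lucene.apache.org/solr/guide/7_7/metrics-reporting.html#metrics-api> and monitoring solr with prometheus and grafana <https://lucene.apache.org/solr/guide/7_7/monitoring-solr-with-prometheus-and-grafana.html>, but they lack information on most metrics.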

For example:
"ADMIN./admin/collections.totalTime":6715327903,
I understand this is a counter; however, I'm not sure what unit it is in when displaying it. For example:

[graph attachment stripped by the mailing list]

The graph shows a latency of "1mil", and I'm not sure whether that means milliseconds, million, etc.
Another example would be the GC metrics:
      "gc.ConcurrentMarkSweep.count":7,
      "gc.ConcurrentMarkSweep.time":1247,
      "gc.ParNew.count":16759,
      "gc.ParNew.time":884173,
When displayed, these don't give a clear insight into what the unit is either:

[graph attachment stripped by the mailing list]

If anyone has any advice or guidance, that would be greatly appreciated. If there isn't documentation for the API, this is also something I'd look into helping to contribute.

Thanks,
--

Richard Goodman

Re: Metrics API - Documentation

Emir Arnautović
Hi Richard,
We collect metrics via JMX rather than the API, but I believe the two expose the same data (I did not verify this in the code). You can see how we turned those metrics into reports and charts, or even use our agent to send data to Prometheus: https://github.com/sematext/sematext-agent-integrations/tree/master/solr

You can also see some links to Solr metric related blog posts in this repo. If you find out that managing your own monitoring stack is overwhelming, you can try our Solr integration.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Oct 2019, at 12:40, Richard Goodman <[hidden email]> wrote:
>
> Hi there,
>
> I'm currently working on using the prometheus exporter to provide some detailed insights for our Solr Cloud clusters.
>
> Using the provided template killed our prometheus server, as well as the exporter due to the size of our clusters (each cluster is around 96 nodes, ~300 collections with 3way replication and 16 shards), so you can imagine the amount of data that comes through /admin/metrics and not filtering it down first.
>
> I've began working on writing my own template to reduce the amount of data being requested and it's working fine, and I'm starting to build some nice graphs in Grafana.
>
> The only difficulty I'm having with this, is I'm struggling to find decent documentation on the metrics themselves. I was using the resources metrics reporting - metrics-api <https://lucene.apache.org/solr/guide/7_7/metrics-reporting.html#metrics-api> and monitoring solr with prometheus and grafana <https://lucene.apache.org/solr/guide/7_7/monitoring-solr-with-prometheus-and-grafana.html> but there is a lack of information on most metrics.
>
> For example:
> "ADMIN./admin/collections.totalTime":6715327903,
> I understand this is a counter, however, I'm not sure what unit this would be represented when displaying it, for example:
>
>
>
> A latency of 1mil, not sure if this means milliseconds, million, etc.,
> Another example would be the GC metrics:
>       "gc.ConcurrentMarkSweep.count":7,
>       "gc.ConcurrentMarkSweep.time":1247,
>       "gc.ParNew.count":16759,
>       "gc.ParNew.time":884173,
> Which when displayed, doesn't give the clearest insight as to what the unit is:
>
>
> If anyone has any advice / guidance, that would be greatly appreciated. If there isn't documentation for the API, then this would also be something I'll look into help contributing with too.
>
> Thanks,
> --
> Richard Goodman

Re: Metrics API - Documentation

Andrzej Białecki-2
Hi,

Starting with Solr 7.0 all JMX metrics are actually internally driven by the metrics API - JMX (or Prometheus) is just a way of exposing them.

I agree that we need more documentation on metrics - contributions are welcome :)

Regarding your specific examples (btw. our mailing lists aggressively strip all attachments - your graphs didn’t make it):

* time-based counters are in nanoseconds. This is just the unit of the value, not necessarily its precision. In this specific example, `ADMIN./admin/collections.totalTime` (and similarly named metrics for all other request handlers) represents the total elapsed time spent processing requests.
* time-based histograms are expressed in milliseconds, as indicated by the "_ms" suffix.
* 1-, 5- and 15-min rates represent an exponentially weighted moving average over that time window, expressed in events/second.
* handlerStart is initialised with System.currentTimeMillis() when this instance of the request handler is first created.
* details on GC, memory buffer pools, and similar JVM metrics are documented in the JDK documentation on Management Beans. For example:
https://docs.oracle.com/javase/7/docs/api/java/lang/management/GarbageCollectorMXBean.html?is-external=true
* "A latency of 1mil" - no idea what that is, I don't think the Solr API uses this abbreviation anywhere.
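
As a rough illustration of the units above, the sketch below converts the values quoted in the original question into graph-friendly units. It assumes `totalTime` is in nanoseconds, as stated above, and that the `gc.*.time` values are in milliseconds, which is the unit the linked GarbageCollectorMXBean documentation gives for accumulated collection time. The helper names are made up for this example.

```python
NANOS_PER_SECOND = 1_000_000_000
MILLIS_PER_SECOND = 1_000


def nanos_to_seconds(nanos):
    """Convert a nanosecond counter value (e.g. totalTime) to seconds."""
    return nanos / NANOS_PER_SECOND


def mean_gc_pause_ms(total_time_ms, count):
    """Average time per collection in milliseconds; 0.0 if nothing ran."""
    return total_time_ms / count if count else 0.0


# Values copied from the original question.
total_time_ns = 6715327903
gc_stats = {
    "ParNew": (16759, 884173),           # (count, accumulated time in ms)
    "ConcurrentMarkSweep": (7, 1247),
}

print(f"totalTime: {nanos_to_seconds(total_time_ns):.3f} s")
for name, (count, time_ms) in gc_stats.items():
    total_s = time_ms / MILLIS_PER_SECOND
    avg_ms = mean_gc_pause_ms(time_ms, count)
    print(f"{name}: {total_s:.3f} s total, {avg_ms:.1f} ms per collection")
```

Because these are cumulative values since startup, a dashboard would normally plot their rate of change rather than the raw number.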

Hope this helps.



Andrzej Białecki
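
To make the 1-, 5- and 15-minute rates mentioned above more concrete, here is a small sketch of an exponentially weighted moving average expressed in events/second. It assumes the scheme used by the Dropwizard Metrics library (a tick every 5 seconds, with alpha = 1 - exp(-tick / window)); whether a given Solr version matches this exactly is an assumption, and the function names are invented for the example.

```python
import math

TICK_SECONDS = 5  # assumed tick interval, as in Dropwizard Metrics


def ewma_alpha(window_minutes, tick_seconds=TICK_SECONDS):
    """Smoothing factor for a moving average over the given window."""
    return 1.0 - math.exp(-tick_seconds / (window_minutes * 60.0))


def ewma_rates(events_per_tick, window_minutes, tick_seconds=TICK_SECONDS):
    """Return the smoothed rate (events/second) after each tick."""
    alpha = ewma_alpha(window_minutes, tick_seconds)
    rate = None
    rates = []
    for events in events_per_tick:
        instant = events / tick_seconds  # rate observed during this tick
        # The first tick seeds the average; later ticks move it towards
        # the instantaneous rate by a fraction alpha.
        rate = instant if rate is None else rate + alpha * (instant - rate)
        rates.append(rate)
    return rates


if __name__ == "__main__":
    # 50 events per 5-second tick (10 events/s), then traffic stops.
    for r in ewma_rates([50, 50, 0, 0, 0], window_minutes=1):
        print(f"{r:.3f} events/s")
```

The shorter the window, the faster the reported rate reacts to a change in traffic, which is why the 1-minute rate looks spikier on a graph than the 15-minute one.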

> On 7 Oct 2019, at 13:41, Emir Arnautović <[hidden email]> wrote:
>
> Hi Richard,
> We do not use API to collect metrics but JMX, but I believe that those are the same (did not verify it in code). You can see how we handled those metrics into reports/charts or even use our agent to send data to Prometheus: https://github.com/sematext/sematext-agent-integrations/tree/master/solr <https://github.com/sematext/sematext-agent-integrations/tree/master/solr>
>
> You can also see some links to Solr metric related blog posts in this repo. If you find out that managing your own monitoring stack is overwhelming, you can try our Solr integration.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/

Re: Metrics API - Documentation

Richard Goodman
Many thanks both for your responses, they've been helpful.

@Andrzej - Sorry I wasn't clear about the "A latency of 1mil"; I wasn't
aware the image wouldn't come through. Following your bullet points
helped me present a better unit of measurement on the axis.

Regarding contributing, I'd absolutely love to help, but I'm not sure
what the correct direction is. Is the source for the web pages kept in
the apache-lucene repository?

Thanks,


On Tue, 8 Oct 2019 at 11:04, Andrzej Białecki <[hidden email]> wrote:

> Hi,
>
> Starting with Solr 7.0 all JMX metrics are actually internally driven by
> the metrics API - JMX (or Prometheus) is just a way of exposing them.
>
> I agree that we need more documentation on metrics - contributions are
> welcome :)
>
> Regarding your specific examples (btw. our mailing lists aggressively
> strip all attachments - your graphs didn’t make it):
>
> * time units in time-based counters are in nanoseconds. This is just a
> unit of value, not necessarily precision. In this specific example
> `ADMIN./admin/collections.totalTime` (and similarly named metrics for all
> other request handlers) represents the total elapsed time spent processing
> requests.
> * time-based histograms are expressed in milliseconds, where it is
> indicated by the “_ms” suffix.
> * 1-, 5- and 15-min rates represent an exponentially weighted moving
> average over that time window, expressed in events/second.
> * handlerStart is initialised with System.currentTimeMillis() when this
> instance of request handler is first created.
> * details on GC, memory buffer pools, and similar JVM metrics are
> documented in JDK documentation on Management Beans. For example:
>
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/GarbageCollectorMXBean.html?is-external=true
> <
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/GarbageCollectorMXBean.html?is-external=true
> >
> * "A latency of 1mil” - no idea what that is, I don’t think Solr API uses
> this abbreviation anywhere.
>
> Hope this helps.
>
> —
>
> Andrzej Białecki

--

Richard Goodman    |    Data Infrastructure engineer


Re: Metrics API - Documentation

Andrzej Białecki-2
We keep all essential user documentation (and some dev docs) in the Ref Guide.

The source for the Ref Guide is checked in under solr/solr-ref-guide. It uses simple AsciiDoc markup, so adding some content should be easy. You should follow the same workflow as with code (create a JIRA issue, then either attach a patch or create a PR).

> On 15 Oct 2019, at 17:33, Richard Goodman <[hidden email]> wrote:
>
> Many thanks both for your responses, they've been helpful.
>
> @Andrzej - Sorry I wasn't clear on the "A latency of 1mil" as I wasn't
> aware the image wouldn't come through. But following your bullet points
> helped me present a better unit for measurement in the axis.
>
> In regards to contributing, would absolutely love to help there, just not
> sure what the correct direction is? I wasn't sure if the web page source
> code / contributions are in the apache-lucene repository?
>
> Thanks,