Monitoring DataHub
Monitoring DataHub's system components is critical for operating and improving DataHub. This doc explains how to add tracing and metrics measurements in the DataHub containers.
Tracing
Traces let us track the life of a request across multiple components. Each trace consists of multiple spans, which are units of work that carry context about the work being done as well as the time taken to finish it. By looking at a trace, we can more easily identify performance bottlenecks.
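To make the relationship between a trace and its spans concrete, here is a minimal sketch in plain Java. It is not the OpenTelemetry API: `SpanSketch`, `Span`, `slowestChild`, and the span names and timings in `exampleTrace` are hypothetical and exist only for illustration. Each span records a name, an optional parent, and start and end timestamps, and the slowest child span is the first place to look for a bottleneck.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative model only: these types are NOT part of OpenTelemetry or DataHub.
class SpanSketch {

    // A span is one unit of work: a name, an optional parent, and timing.
    static final class Span {
        final String name;
        final Span parent;      // null for the root span of the trace
        final long startNanos;
        final long endNanos;

        Span(String name, Span parent, long startNanos, long endNanos) {
            this.name = name;
            this.parent = parent;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
        }

        long durationNanos() {
            return endNanos - startNanos;
        }
    }

    // Returns the non-root span with the longest duration.
    static Span slowestChild(List<Span> trace) {
        return trace.stream()
                .filter(s -> s.parent != null)
                .max(Comparator.comparingLong(Span::durationNanos))
                .orElseThrow(() -> new IllegalArgumentException("trace has no child spans"));
    }

    // Builds a made-up trace for one request: a root span plus two child spans.
    static List<Span> exampleTrace() {
        List<Span> trace = new ArrayList<>();
        Span root = new Span("GET /entities", null, 0L, 120_000_000L);
        trace.add(root);
        trace.add(new Span("elasticsearch.search", root, 5_000_000L, 95_000_000L));
        trace.add(new Span("jdbc.query", root, 96_000_000L, 110_000_000L));
        return trace;
    }

    public static void main(String[] args) {
        Span slowest = slowestChild(exampleTrace());
        System.out.println(slowest.name + " took " + slowest.durationNanos() / 1_000_000 + " ms");
    }
}
```

With the made-up timings above, `main` reports `elasticsearch.search` as the slowest child span. A real tracing backend such as Jaeger presents the same information as a timeline.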
We enable tracing by using the OpenTelemetry Java instrumentation library. This project provides a Java agent JAR that is attached to Java applications. The agent injects bytecode to capture telemetry from popular libraries.
Using the agent, we are able to:
- Plug and play different tracing tools based on the user's setup: Jaeger, Zipkin, or other tools
- Get traces for Kafka, JDBC, and Elasticsearch without any additional code
- Track traces of any function with a simple @WithSpan annotation
You can enable the agent by setting the env variable ENABLE_OTEL to true for GMS and the MAE/MCE consumers. In our example docker-compose, we export traces to a local Jaeger instance by setting the env variable OTEL_TRACES_EXPORTER to jaeger and OTEL_EXPORTER_JAEGER_ENDPOINT to http://jaeger-all-in-one:14250, but you can easily change this behavior by setting the corresponding env variables. Refer to this doc for all configs.
Once the above is set up, you should be able to see a detailed trace as a request is sent to GMS. We added the @WithSpan annotation in various places to make the trace more readable. You should start to see traces in the tracing collector of your choice. Our example docker-compose deploys an instance of Jaeger on port 16686. The traces should be available at http://localhost:16686.
Metrics
With tracing, we can observe how a request flows through our system into the persistence layer. However, for a more holistic picture, we need to be able to export metrics and measure them over time. Unfortunately, OpenTelemetry's Java metrics library is still in active development.
As such, we decided to use Dropwizard Metrics to export custom metrics to JMX, and then use the Prometheus JMX exporter to export all JMX metrics to Prometheus. This allows our code base to be independent of the metrics collection tool, making it easy for people to use their tool of choice. You can enable the agent by setting the env variable ENABLE_PROMETHEUS to true for GMS and the MAE/MCE consumers. Refer to this example docker-compose for setting the variables.
In our example docker-compose, we have configured Prometheus to scrape port 4318 of each container, which the JMX exporter uses to export metrics. We also configured Grafana to read from Prometheus and create useful dashboards. By default, we provide two dashboards: a JVM dashboard and a DataHub dashboard.
In the JVM dashboard, you can find detailed charts based on JVM metrics like CPU, memory, and disk usage. In the DataHub dashboard, you can find charts to monitor each endpoint and the Kafka topics. Using the example implementation, go to http://localhost:3001 to find the Grafana dashboards. (Username: admin, Password: admin)
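When a dashboard looks empty, it can help to inspect what the JMX exporter exposes on the scrape port before debugging Grafana. The sketch below parses a sample of the Prometheus text exposition format into a map using plain Java. `ScrapeSketch` and `parse` are hypothetical names, and the metric names in the sample are illustrative rather than taken from DataHub. The parser is deliberately simplified: it ignores timestamps and does not handle escaped characters in label values or the special values `+Inf` and `-Inf`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper, not part of DataHub or the Prometheus client libraries.
class ScrapeSketch {

    // Parses Prometheus text-format lines into "name{labels}" -> value.
    static Map<String, Double> parse(String body) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String rawLine : body.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line, or a "# HELP" / "# TYPE" comment
            }
            // The series key ends at the closing brace when labels are present,
            // otherwise at the first whitespace character.
            int brace = line.lastIndexOf('}');
            int keyEnd = brace >= 0 ? brace + 1 : indexOfWhitespace(line);
            if (keyEnd <= 0 || keyEnd >= line.length()) {
                continue; // malformed line: no value after the key
            }
            String key = line.substring(0, keyEnd);
            // First token after the key is the value; a trailing timestamp is ignored.
            String[] rest = line.substring(keyEnd).trim().split("\\s+");
            samples.put(key, Double.parseDouble(rest[0]));
        }
        return samples;
    }

    // Returns the index of the first whitespace character, or -1 if there is none.
    static int indexOfWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample payload with made-up metric names.
        String sample = "# HELP jvm_threads_current Current thread count\n"
                + "# TYPE jvm_threads_current gauge\n"
                + "jvm_threads_current 42.0\n"
                + "example_requests_total{endpoint=\"/entities\"} 7.0\n";
        parse(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In practice the input would be the response body fetched from a container's scrape port rather than a hard-coded string.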
To make it easy to track various metrics within the code base, we created the MetricUtils class. This util class creates a central metric registry, sets up the JMX reporter, and provides convenient functions for setting up counters and timers. You can run the following to create a counter and increment it.
MetricUtils.counter(this.getClass(), "metricName").inc();
You can run the following to time a block of code.
try (Timer.Context ignored = MetricUtils.timer(this.getClass(), "timerName").time()) {
    // ...block of code
}
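The timer snippet relies on try-with-resources: the context object is closed automatically when the block exits, and closing it records the elapsed time. The sketch below reproduces that pattern with only the Java standard library, so it runs without DataHub or Dropwizard on the classpath. `TimerSketch`, `TimingContext`, and the map-based registries are hypothetical stand-ins, not the MetricUtils implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Stand-in for the counter/timer pattern; NOT the real MetricUtils or Dropwizard classes.
class TimerSketch {

    // Hypothetical in-memory registries: metric name -> accumulated value.
    static final Map<String, LongAdder> COUNTS = new ConcurrentHashMap<>();
    static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();

    // Builds a metric name scoped to the owning class.
    static String name(Class<?> owner, String metric) {
        return owner.getName() + "." + metric;
    }

    // Increments a named counter, creating it on first use.
    static void count(Class<?> owner, String metric) {
        COUNTS.computeIfAbsent(name(owner, metric), k -> new LongAdder()).increment();
    }

    // Starts timing; the elapsed time is recorded when the returned context is closed.
    static TimingContext time(Class<?> owner, String metric) {
        return new TimingContext(name(owner, metric));
    }

    static final class TimingContext implements AutoCloseable {
        private final String key;
        private final long startNanos = System.nanoTime();

        TimingContext(String key) {
            this.key = key;
        }

        @Override
        public void close() { // declares no checked exception, so callers need no catch block
            long elapsed = System.nanoTime() - startNanos;
            TOTAL_NANOS.computeIfAbsent(key, k -> new LongAdder()).add(elapsed);
        }
    }

    // Example unit of work: counts the call and times the body.
    static long doWork() {
        count(TimerSketch.class, "calls");
        try (TimingContext ignored = time(TimerSketch.class, "workTime")) {
            long sum = 0;
            for (int i = 0; i < 100_000; i++) { // placeholder for the block being measured
                sum += i;
            }
            return sum; // close() still runs after the return value is computed
        }
    }

    public static void main(String[] args) {
        doWork();
        System.out.println("counts: " + COUNTS);
        System.out.println("total nanos: " + TOTAL_NANOS);
    }
}
```

Because the context is closed even when the block throws or returns early, the time is recorded on every exit path. This is the main reason to prefer this form over manual start and stop calls.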
Enable monitoring through docker-compose
We provide some example configuration for enabling monitoring in this directory. Take a look at the docker-compose files, which add the necessary env variables to existing containers and spawn new containers (Jaeger, Prometheus, Grafana).
You can include the above docker-compose files by passing -f <<path-to-compose-file>> when running docker-compose commands.
For instance,
docker-compose \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
pull && \
docker-compose -p datahub \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
up
We set up quickstart.sh, dev.sh, and dev-without-neo4j.sh to add the above docker-compose when MONITORING=true. For instance, MONITORING=true ./docker/quickstart.sh will add the correct env variables to start collecting traces and metrics, and also deploy Jaeger, Prometheus, and Grafana. We will soon support this as a flag during quickstart.