
Status

Current state: Draft
Proof of concept demo available here: https://github.com/yukim/cassandra-opentelemetry-demo

Discussion thread: -

JIRA: -

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).


...

Scope

  • Provide a unified way to export tracing, metrics, and logs to external monitoring systems with simple configuration.
    • Exporting repair tracing is out of scope for this proposal.

Goals

The goal of this proposal is to integrate OpenTelemetry into Apache Cassandra so that exporting telemetry and setting up a monitoring system become much easier.

This feature is opt-in and does not remove the currently available telemetry or the existing ways to export it.


Approach

While the goal of this CEP is to integrate all the OpenTelemetry features needed to export tracing, metrics and logs, the proposal is separated into several steps.

The OpenTelemetry spec and libraries are still evolving. The OpenTelemetry Java library provides stable support for Tracing and Metrics; however, Logging support is experimental at the time of writing (May 2023).

See https://opentelemetry.io/status/ for the up-to-date status.

The implementation will be separated into three parts:

...

  • How to enable OpenTelemetry export
  1. Configuration and OpenTelemetry Tracing integration

...

  2. OpenTelemetry Metrics integration

...

  3. OpenTelemetry Logging integration

...

Timeline

todo

Mailing list / Slack channels

Mailing list: 

Slack channel: 

Discussion threads: 

Related JIRA tickets

JIRA(s): 

  • -

...

Motivation

...

Troubleshooting Apache Cassandra can be time-consuming and challenging when faced with failures or performance issues. Without a proper observability system in place, it becomes difficult to identify the root cause of these problems.

Apache Cassandra already implements its own methods or relies on external libraries to provide operators and administrators with deep insights into the complex distributed database using the three pillars of observability: Tracing, Metrics, and Logging.

  • For tracing, Apache Cassandra has its own Tracing API.
  • For metrics collection and reporting, Apache Cassandra uses Codahale’s Metrics library to collect metrics and expose them through JMX.
  • For logging, Apache Cassandra uses the Slf4J/Logback logging library.

...

However, these features become significantly more valuable when available within an observability system that can correlate these telemetries together. Otherwise, operators and admins are left manually pulling out individual telemetries and assembling information by hand to make assumptions about the root cause of a problem.

...

To implement observability in Apache Cassandra, operators must devise their own methods to extract these telemetries and establish a monitoring stack. This often involves using open-source software like Prometheus/Grafana or commercial services like Datadog. The process of setting up the stack varies depending on the software used, making it complex and oftentimes overlooked by operators.

OpenTelemetry is a project hosted at CNCF that provides "A single, vendor-agnostic instrumentation library per language with support for both automatic and manual instrumentation" (https://opentelemetry.io/docs/concepts/what-is-opentelemetry/). It specifies APIs to collect tracing, metrics, and logging, and protocols to export them to external observability software.

Introducing OpenTelemetry makes exporting telemetries and integrating observability software much easier, requiring less configuration. It also enables observability software projects, vendors, and service providers to develop Cassandra-specific solutions based on the standardized semantics provided by this CEP and the OpenTelemetry spec.

Audience

  • Cassandra contributors / DevOps / DBAs
  • APM and observability OSS projects / software vendors / service providers

Proposed Changes

This CEP aims to integrate OpenTelemetry Tracing, Metrics and Logging with the existing implementations (Cassandra’s own Tracing API, Codahale/Dropwizard metrics, and Slf4J/Logback), while maintaining backward compatibility.

Configuration

Apache Cassandra uses the OpenTelemetry SDK to manually instrument itself to export tracing, metrics and logs. Operators can configure how this telemetry is exported to the external system by changing the configuration in the cassandra.yaml file and either the jvm-server.options file or environment variables.

Apache Cassandra uses OpenTelemetry SDK Autoconfigure to configure OpenTelemetry exporters.

By default, only the OTLP exporter jars are included in the Apache Cassandra distribution. When enabled, Apache Cassandra starts exporting telemetry to an OTLP collector running at localhost:4317 through gRPC.

Operators can provide the necessary jars and configuration to use other exporters (e.g. Jaeger for tracing) as well. For example, to export tracing to Jaeger, add the opentelemetry-exporter-jaeger.jar file to the classpath and configure the exporter through jvm-server.options:

# jvm-server.options

# Configure tracing export to Jaeger
-Dotel.traces.exporter=jaeger
-Dotel.exporter.jaeger.endpoint=http://jaeger:14250

OpenTelemetry can be configured using environment variables as well. This is useful for containerized environments like Kubernetes.

Exporting Trace through OpenTelemetry

...

The OpenTelemetry Tracing root Span will be created when a coordinator node receives a message from the client (the same point at which Cassandra Tracing starts).

To avoid re-instrumenting code with the OpenTelemetry API, the Tracing object will produce an OpenTelemetry tracing event when its trace method is called.

...

Note that OpenTelemetry Tracing will not produce anything when it is not enabled.

Context propagation

From the client to Cassandra node through Native Protocol

The CEP proposes a standardized way to propagate the OpenTelemetry Context from applications through the Native Protocol’s custom payload, rather than introducing new header definitions.

...

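// Carrier that will be sent as the Native Protocol custom payload
Map<String, ByteBuffer> payload = new HashMap<>();
// Inject the current Context as W3C Trace Context entries, encoded as UTF-8 bytes
W3CTraceContextPropagator.getInstance().inject(Context.current(), payload, (carrier, key, value) -> {
    if (carrier != null) {
        carrier.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }
});
// Attach the payload to the statement and execute the returned copy
Statement<?> injected = statement.setCustomPayload(payload);
session.execute(injected);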
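
On the receiving side, the coordinator has to read the propagated context back out of the custom payload. The sketch below is not taken from the CEP or the proof of concept. It uses only the JDK and assumes the payload carries a W3C traceparent entry (version 00) written by client code like the example above. The names TraceParentDecoder, TraceParent, and decode are hypothetical; a real implementation would more likely delegate to W3CTraceContextPropagator's extract method.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: decodes a W3C traceparent entry from a custom payload. */
final class TraceParentDecoder {

    /** Parsed fields of a version-00 traceparent value. */
    static final class TraceParent {
        final String traceId;
        final String spanId;
        final boolean sampled;

        TraceParent(String traceId, String spanId, boolean sampled) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.sampled = sampled;
        }
    }

    static Optional<TraceParent> decode(Map<String, ByteBuffer> payload) {
        ByteBuffer raw = payload == null ? null : payload.get("traceparent");
        if (raw == null) {
            return Optional.empty();
        }
        // duplicate() leaves the position of the caller's buffer untouched
        String header = StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
        // Expected layout: version-traceid-parentid-flags
        String[] parts = header.split("-");
        boolean wellFormed = parts.length == 4
                && parts[0].equals("00")
                && parts[1].matches("[0-9a-f]{32}")
                && parts[2].matches("[0-9a-f]{16}")
                && parts[3].matches("[0-9a-f]{2}");
        // All-zero trace or span ids are invalid per the W3C specification
        if (!wellFormed || parts[1].matches("0+") || parts[2].matches("0+")) {
            return Optional.empty();
        }
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) != 0;
        return Optional.of(new TraceParent(parts[1], parts[2], sampled));
    }
}
```

Returning an empty Optional for a missing or malformed entry lets the node fall back to starting a new root span instead of failing the request.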

Between nodes through inter-node messaging protocol

To propagate OpenTelemetry Context between the nodes, the sender adds the new ParamType to the message only when:

...

Upon receiving a message containing the Context ParamType, the receiver constructs the remote span context, which becomes the parent of the span on this node.

Between threads through ExecutorLocals

Like Cassandra TracingState, OpenTelemetry Context will be held in ExecutorLocals to propagate Context between threads. Alternatively, Context can be directly passed to Runnables to propagate context between threads.
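
The capture-and-restore pattern behind both options can be illustrated with plain JDK types. This is a minimal sketch only: ContextHolder stands in for ExecutorLocals, the context is a plain String rather than an OpenTelemetry Context, and none of these names come from the Cassandra code base.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a thread-local context holder such as ExecutorLocals. */
final class ContextHolder {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static String current() {
        return CURRENT.get();
    }

    static void set(String context) {
        if (context == null) {
            CURRENT.remove();
        } else {
            CURRENT.set(context);
        }
    }

    /** Captures the submitting thread's context and restores it around the task. */
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = current();
        return () -> {
            String previous = current();
            set(captured);
            try {
                return task.call();
            } finally {
                // Do not leak the context into the next task on this pooled thread
                set(previous);
            }
        };
    }

    /** Demo: returns the context that a worker thread observes. */
    static String runOnWorker(String context) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            set(context);
            Callable<String> task = wrap(ContextHolder::current);
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            set(null);
            pool.shutdown();
        }
    }
}
```

Restoring the previous value in the finally block matters because executor threads are reused, so a context left behind would be attributed to an unrelated task.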

List of Attributes associated in Tracing

The following Attributes are associated with Spans and events. Attribute names should follow the Attribute Naming rules.

Name | Description | Example value
cassandra.query.message_type | Message type of the query (QUERY / EXECUTE / BATCH / PREPARE) | QUERY
cassandra.query.client.ip | Client IP address |
cassandra.query.coordinator.ip, cassandra.query.coordinator.port | Coordinator node’s broadcast rpc address and native transport port | 10.0.0.1, 9042
cassandra.query.page_size | | 5000
cassandra.query.consistency_level | |
cassandra.query.serial_consistency_level | |
cassandra.net.verb | |
net.peer.ip, net.peer.port | |
thread.name | Name of the thread when the event is recorded | MutationStage-1
thread.id | ID of the thread when the event is recorded | 105

Exporting Metrics through OpenTelemetry

Cassandra is already instrumented with the Dropwizard Metrics library. Instead of rewriting every metric with the OpenTelemetry SDK, this CEP proposes creating an OpenTelemetry adapter using MetricRegistryListener.

The OpenTelemetry MetricRegistryListener is registered only when the OpenTelemetry feature is enabled in cassandra.yaml.

This adapter uses the asynchronous instrument variants of the OpenTelemetry Metrics SDK to record values from the original Dropwizard metrics upon collection.

https://opentelemetry.io/docs/instrumentation/java/manual/#metrics
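
To make the "record upon collection" idea concrete, here is a self-contained model of such an adapter written against the JDK only. It deliberately does not use the real Dropwizard or OpenTelemetry classes: GaugeListener, CallbackRegistry, and collect are hypothetical stand-ins for MetricRegistryListener, a Meter with asynchronous instruments, and the SDK's collection cycle, respectively.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Stand-in for the listener side: notified when a source gauge appears or disappears. */
interface GaugeListener {
    void onGaugeAdded(String name, LongSupplier gauge);

    void onGaugeRemoved(String name);
}

/** Stand-in for the asynchronous-instrument side: values are read only in collect(). */
final class CallbackRegistry implements GaugeListener {
    private final Map<String, LongSupplier> callbacks = new ConcurrentHashMap<>();

    @Override
    public void onGaugeAdded(String name, LongSupplier gauge) {
        // Register a callback instead of copying the value at registration time
        callbacks.put(name, gauge);
    }

    @Override
    public void onGaugeRemoved(String name) {
        callbacks.remove(name);
    }

    /** Called once per (modelled) collection cycle; reads every source metric. */
    Map<String, Long> collect() {
        Map<String, Long> snapshot = new TreeMap<>();
        callbacks.forEach((name, gauge) -> snapshot.put(name, gauge.getAsLong()));
        return snapshot;
    }
}
```

Because the adapter stores only a callback, the existing metric remains the single source of truth and no second copy of the value has to be kept in sync.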


Exporting Logging through OpenTelemetry

Using the Logback appender library, Apache Cassandra can export logs directly to the collector without changing its current logging code.

https://opentelemetry.io/docs/instrumentation/java/manual/#logs

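Operators need to activate the OpenTelemetry appender in logback.xml:

<appender name="OpenTelemetry"
          class="io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender">
</appender>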

New or Changed Public Interfaces

New org.apache.cassandra.telemetry.Telemetry class

  • Holds the global io.opentelemetry.api.OpenTelemetry instance.
  • Initializes the OpenTelemetry instance using OpenTelemetry SDK Autoconfigure when the opentelemetry configuration in cassandra.yaml has enabled: true.
    • By using OpenTelemetry SDK Autoconfigure, users can configure OpenTelemetry exporters through Java system properties and environmental variables.
  • By default, OpenTelemetry is disabled.
    • When OpenTelemetry is disabled, No-Op OpenTelemetry related instances are created through OpenTelemetry SDK.
  • Sets up OpenTelemetry tracing, metrics and logs providers.

New cassandra.yaml entry

# Export Apache Cassandra telemetry (tracing, metrics and logs)
# through OpenTelemetry.
#
# Use jvm-server.options file to configure OpenTelemetry SDK with
# system properties.
# <https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure>
# Apache Cassandra only comes with the libraries to export telemetry
# through OTLP.
# You need appropriate jar files if you plan to use other exporters.
opentelemetry:
    enabled: false # default: false

Necessary config change

-Xss needs to be increased (from 256k to 512k) for OTLP export (OkHttp3-based gRPC call) to work.

New library dependencies

The following dependencies will be added to the distribution.

OpenTelemetry instrumentation and configuration

  • io.opentelemetry:opentelemetry-api
  • io.opentelemetry:opentelemetry-sdk
  • io.opentelemetry:opentelemetry-sdk-extension-autoconfigure

Runtime dependencies for exporting via OTLP

  • io.opentelemetry:opentelemetry-exporter-otlp

Extra dependencies to consider

  • io.opentelemetry.instrumentation:opentelemetry-runtime-metrics
    • The library to enable JVM metrics instrumentation
  • io.opentelemetry:opentelemetry-exporter-otlp-logs
    • Necessary if Apache Cassandra decides to adopt the unstable OpenTelemetry Log export. The library is separate at the moment, with -alpha in its version.
  • io.opentelemetry.instrumentation:opentelemetry-logback-appender-1.0
    • The library to enable exporting logs through Logback appender


...

TODO

Semantic convention

OpenTelemetry Resources are immutable attributes (key-value pairs) that describe the entity producing telemetry, in this case the Apache Cassandra server itself. Resources are created once when configuring OpenTelemetry and are associated with all telemetry produced by Apache Cassandra.

See Attribute Naming for the specification of naming of Resources.

Namespacing

Every Cassandra-specific Attribute name (including Resource names) begins with cassandra. For example, to describe the Cassandra version, the Resource name will be cassandra.version.

Note that this namespace is already used by OpenTelemetry’s JMX metrics gatherer. However, since the source entity of the telemetry is Apache Cassandra itself, conflicting with the JMX metrics gatherer should not be a problem.

List of Resources provided by Apache Cassandra

The following Resources are proposed to be configured when OpenTelemetry is enabled. Both OpenTelemetry standard Resources and Cassandra-specific Resources are configured.

Resource name | Value | Description | Example
service.name | cassandra | This Resource is required by OpenTelemetry. It is possible to override this value through https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure#opentelemetry-resource or https://opentelemetry.io/docs/reference/specification/sdk-environment-variables/#general-sdk-configuration. |
service.namespace | Apache Cassandra cluster name | | Test Cluster
service.instance.id | Apache Cassandra Host ID | | 058321f5-29b3-4f12-9766-2ad793adb3a0
service.version | Apache Cassandra version in use | | 5.0.0
net.host.ip / net.host.port | Apache Cassandra node’s endpoint address (listen_address or broadcast_address) and port | | 192.168.1.1 and 7000
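
As an illustration of how these Resources might be assembled, the sketch below builds the attribute set from node metadata and renders it in the comma-separated key=value form accepted by the autoconfigure otel.resource.attributes property. ResourceAttributes, forNode, and render are hypothetical names, not part of this proposal's API. Values are assumed not to contain a comma or an equals sign, which would need percent-encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper that assembles the Resources listed in the table above. */
final class ResourceAttributes {

    static Map<String, String> forNode(String clusterName, String hostId,
                                       String version, String ip, int port) {
        // LinkedHashMap keeps the attributes in the order they are listed
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("service.name", "cassandra");
        attrs.put("service.namespace", clusterName);
        attrs.put("service.instance.id", hostId);
        attrs.put("service.version", version);
        attrs.put("net.host.ip", ip);
        attrs.put("net.host.port", Integer.toString(port));
        return attrs;
    }

    /** Renders attributes as key=value pairs joined by commas. */
    static String render(Map<String, String> attrs) {
        return attrs.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
    }
}
```

Keeping the attribute set in one place makes it easier to switch between the standard names and Cassandra-specific ones once the open questions in the TODO list are settled.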

TODO

  • Should we use Cassandra-specific Resources such as cassandra.cluster.name instead of standard conventions like service.namespace?
    • Using the standard conventions may work better with existing telemetry services.
  • Do we need anything else?

Compatibility, Deprecation, and Migration Plan

OpenTelemetry integration adds the capability to export existing Tracing/Metrics/Logs through the OpenTelemetry Protocol. This feature will not prevent existing telemetry collection. Users will still be able to use existing Cassandra tracing, export Metrics through JMX, and output logs to files.

Test Plan

The OpenTelemetry Java library provides in-memory telemetry sinks for testing. These will be used to verify the exported telemetry.

Rejected Alternatives

  • The Apache SkyWalking project provides a similar protocol to export observability data; however, there seem to be no APMs that support the Apache SkyWalking protocol other than Apache SkyWalking itself.

...