dutrevis / spark-resources-metrics-plugin   0.1

Apache License 2.0 GitHub

Spark plugin to retrieve metrics from a variety of cluster resources

Scala versions: 2.13 2.12

Spark Resources Metrics Plugin



Spark Resources Metrics plugin is an Apache Spark plugin that registers metrics in the Apache Spark metrics system and reports values collected from operating system resources. It aims to cover metrics that the Spark metrics system does not provide, such as those offered by the Ganglia monitoring system.

Releases are published to Maven Central and Sonatype Nexus for Scala 2.12 and Scala 2.13. Snapshots are published to Sonatype Nexus for both Scala versions.

User Guide

Preparation

The Spark Resources Metrics plugin is intended to be used together with the native Spark metrics system. To properly show the metric values collected by this plugin, the Spark metrics system has to be set to report metrics for the Spark components the plugin supports, which currently are:

  • driver
  • executor

This is done in the sink configuration properties of the Spark metrics system. To cover all supported components, you can usually set the component part of the property name to "all components" (*), e.g.: spark.metrics.conf.*.sink.graphite.class
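
As a minimal sketch of how these property names are composed, the hypothetical helper below builds the Graphite sink properties for a chosen component. The function name and its default values are assumptions made for illustration; only the property names and the sink class come from this guide.

```python
def graphite_sink_properties(host, port, instance="*", period=5, unit="seconds"):
    """Build spark.metrics.conf.* properties for a Graphite sink.

    `instance` is the Spark component the sink applies to: "*" for all
    components, or a single one such as "driver" or "executor".
    This helper is hypothetical and not part of Spark or of the plugin.
    """
    prefix = f"spark.metrics.conf.{instance}.sink.graphite"
    return {
        f"{prefix}.class": "org.apache.spark.metrics.sink.GraphiteSink",
        f"{prefix}.host": host,
        f"{prefix}.port": str(port),
        f"{prefix}.period": str(period),
        f"{prefix}.unit": unit,
    }


# Print the properties in "name value" form, as used in a Spark configuration file.
for name, value in graphite_sink_properties("your-graphite-host.com", 2003).items():
    print(name, value)
```

Passing instance="driver" or instance="executor" restricts the sink to a single supported component instead of all of them.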

Installation

Package

Choose your preferred method to include Spark Resources Metrics plugin package in your Spark environment:

⚠️ Attention: switch to the desired Scala version (2.12 or 2.13) when using the links or JAR names.

1. Adding a new property to the Spark configuration file

You may choose to edit the spark-defaults.conf file, adding the following property and its value, which will download the package from Maven Central:

spark.jars.packages io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1

2. Using a local package manager to download and copy

You may opt to use a local package manager like Maven to download and copy the package and its dependencies into your local Spark JARs folder:

mvn dependency:get -Dartifact="io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1"
mvn dependency:copy -Dartifact="io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1" -DoutputDirectory="$SPARK_HOME/jars"

3. Adding a flag to your CLI Spark call

You may choose to add one of these flags with property-value pairs to your CLI Spark call:

  • External download - choose one (Internet access required):

    --packages io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1

    --jars https://s01.oss.sonatype.org/content/repositories/releases/io/github/dutrevis/spark-resources-metrics-plugin_2.12/0.1/spark-resources-metrics-plugin_2.12-0.1.jar

  • Local deployments - choose one (prior JAR download required):

    --conf spark.driver.extraClassPath=/path/to/spark-resources-metrics-plugin_2.12-0.1.jar

    --jars /path/to/spark-resources-metrics-plugin_2.12-0.1.jar

💡 Tip: the --jars CLI option can be used with the YARN backend to make the plugin JAR available to both executors and cluster-mode drivers.

Classes

The plugin is composed of classes that, once activated in Apache Spark, register a group of metrics related to their distinct resources into the native Spark metrics system.

After the package is installed, these classes may be activated by being declared in the spark.plugins property.

1. You may add the property and its value to the "spark-defaults.conf" file

spark.plugins io.github.dutrevis.CPUMetrics,io.github.dutrevis.MemoryMetrics

2. You may add a flag (with its property-value pair) to your CLI Spark call

--conf "spark.plugins"="io.github.dutrevis.CPUMetrics,io.github.dutrevis.MemoryMetrics"

3. You may add a config to the Spark session created with PySpark or Scala

.config("spark.plugins", "io.github.dutrevis.CPUMetrics,io.github.dutrevis.MemoryMetrics")

📝 Note: as stated in the Spark docs, properties set programmatically (directly on the SparkConf) take highest precedence, then flags passed through CLI calls like spark-submit or spark-shell, then options in the spark-defaults.conf file.
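
The precedence order can be pictured as a simple layered merge. The sketch below is only an illustration of that ordering, not Spark's implementation; the function name and the sample dictionaries are assumptions.

```python
def effective_conf(defaults_file, cli_flags, programmatic):
    """Merge three property sources, lowest precedence first.

    Illustrative only: this mimics the documented ordering and is not
    how Spark resolves its configuration internally.
    """
    merged = dict(defaults_file)   # options from spark-defaults.conf
    merged.update(cli_flags)       # CLI flags override the defaults file
    merged.update(programmatic)    # programmatic settings override both
    return merged


conf = effective_conf(
    defaults_file={"spark.plugins": "io.github.dutrevis.CPUMetrics"},
    cli_flags={"spark.plugins": "io.github.dutrevis.MemoryMetrics"},
    programmatic={},
)
print(conf["spark.plugins"])
```

In this sample the CLI value is the one that takes effect, because nothing was set programmatically.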

Usage

Once the plugin is configured and its metrics are registered in the Spark metrics system, the reported metrics will be found in the following naming format:

{SparkComponent}.plugin.io.github.dutrevis.{ClassName}.{Metric}
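
To make the naming format concrete, here is a small sketch that builds and splits names of this shape. The helper names are hypothetical. If your deployment prepends extra segments to metric names (for example through a namespace setting), they would end up in the first element returned by the parser.

```python
PLUGIN_NAMESPACE = "plugin.io.github.dutrevis"


def build_metric_name(component, class_name, metric):
    """Compose a name in the {SparkComponent}.plugin.io.github.dutrevis.{ClassName}.{Metric} format."""
    return f"{component}.{PLUGIN_NAMESPACE}.{class_name}.{metric}"


def parse_metric_name(name):
    """Split a metric name into (component, class_name, metric), or return None if it does not match."""
    component, sep, rest = name.partition(f".{PLUGIN_NAMESPACE}.")
    if not sep or not component:
        return None
    class_name, sep, metric = rest.partition(".")
    if not sep or not class_name or not metric:
        return None
    return component, class_name, metric


name = build_metric_name("driver", "MemoryMetrics", "FreeMemory")
print(name)
print(parse_metric_name(name))
```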

Current plugin classes

Each class registers a group of collected metrics. For details on each metric, consult its specific collect method inside its class. The currently available plugin classes are meant to be used when Spark is running in clusters with standalone, Mesos or YARN resource managers.

MemoryMetrics: collects memory resource metrics from a Unix-based operating system. Memory metrics are obtained from the numbers on each line of the /proc/meminfo file, available in the proc pseudo-filesystem of Unix-based operating systems. The file holds statistics about memory usage on the system, arranged in lines consisting of a parameter name, followed by a colon, the value of the parameter, and an optional unit of measurement.

📝 Notes:

  • While the /proc/meminfo file shows kilobytes (kB; 1 kB equals 1000 B), its unit is actually kibibytes (KiB; 1 KiB equals 1024 B). This imprecision is known, but is not corrected due to legacy concerns.
  • Many fields have been present since at least Linux 2.6.0, but most of the other fields are available at specific Linux versions (as noted in each method docstring) or are displayed only if the kernel was configured with specific options. If these fields are not found, the metrics won't be registered onto Dropwizard's metric system.
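
The line layout described above can be parsed with a few lines of code. The following is a minimal Python sketch, not the plugin's Scala implementation; the function name and the sample values are made up for illustration, and the "kB" values are converted to bytes with a factor of 1024, following the note on kibibytes.

```python
# Illustrative content in the /proc/meminfo layout; the numbers are made up.
SAMPLE_MEMINFO = """\
MemTotal:       16384256 kB
MemFree:         8123456 kB
Buffers:          204800 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
HugePages_Total:       0
"""


def parse_meminfo(content):
    """Return {parameter name: value}, in bytes when the line carries a kB unit."""
    values = {}
    for line in content.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # skip lines that do not look like "Name: value [unit]"
        number = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            number *= 1024  # the file says kB, but the unit is actually KiB
        values[name.strip()] = number
    return values


meminfo = parse_meminfo(SAMPLE_MEMINFO)
# Fields absent from the file are simply missing from the result.
print(meminfo.get("MemFree"), meminfo.get("NotPresent"))
```

On a Linux host, the same function could be fed with the text read from /proc/meminfo instead of the sample string.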

Metrics registered:

  • TotalMemory
  • FreeMemory
  • UsedMemory
  • SharedMemory
  • BufferMemory
  • TotalSwapMemory
  • FreeSwapMemory
  • CachedSwapMemory
  • UsedSwapMemory

CPUMetrics: collects CPU resource metrics from a Unix-based operating system. CPU metrics are obtained from the numbers on the first line of the /proc/stat file, available in the proc pseudo-filesystem of Unix-based operating systems. These numbers identify the amount of time the CPU has spent performing different kinds of work, arranged in columns in the following order: "cpu_user", "cpu_nice", "cpu_system", "cpu_idle", "cpu_iowait", "cpu_irq" and "cpu_softirq".

📝 Notes:

  • All of the numbers retrieved are aggregates since the system first booted.
  • Time units are in USER_HZ or Jiffies (typically hundredths of a second).
  • Values for "cpu_steal", "cpu_guest" and "cpu_guest_nice", available in specific Linux versions, are not parsed from the file.
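
As a rough Python sketch of the column layout described above (not the plugin's Scala code), the first line of /proc/stat can be mapped to the seven column names like this. The sample numbers are made up, and the default of 100 for USER_HZ is an assumption based on the "typically hundredths of a second" note.

```python
CPU_COLUMNS = (
    "cpu_user", "cpu_nice", "cpu_system", "cpu_idle",
    "cpu_iowait", "cpu_irq", "cpu_softirq",
)

# Illustrative content in the /proc/stat layout; the numbers are made up.
SAMPLE_STAT = """\
cpu  4705 150 1120 16250 520 30 45 0 0 0
cpu0 2350 75 560 8125 260 15 22 0 0 0
"""


def parse_cpu_line(content):
    """Map the aggregate "cpu" line to the seven documented column names."""
    lines = content.splitlines()
    fields = lines[0].split() if lines else []
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the first line to start with 'cpu'")
    # Trailing columns (steal, guest, guest_nice) are ignored, as in the note above.
    numbers = [int(field) for field in fields[1:1 + len(CPU_COLUMNS)]]
    return dict(zip(CPU_COLUMNS, numbers))


def jiffies_to_seconds(jiffies, user_hz=100):
    """Convert a USER_HZ tick count to seconds; user_hz=100 is an assumed default."""
    return jiffies / user_hz


cpu = parse_cpu_line(SAMPLE_STAT)
print(cpu["cpu_idle"], jiffies_to_seconds(cpu["cpu_idle"]))
```

Because the values are aggregates since boot, a usage rate over an interval would come from the difference between two successive readings.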

Metrics registered:

  • UserCPU
  • NiceCPU
  • SystemCPU
  • IdleCPU
  • WaitCPU

Example instrumentation: Graphite sink

Say you will use the Graphite sink to report metrics and you wish to monitor the memory usage of your Spark cluster. In that case, you can instrument your Spark deployment with the MemoryMetrics class of the Spark Resources Metrics plugin.

First, you have to decide how to install Spark Resources Metrics plugin.

For environments with customizable configurations, like container images, you may consider adding an installation step with Maven or adding a custom Spark configuration file, such as the example below:

Spark configuration file example
spark.jars.packages                       io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1
spark.plugins.defaultList                 io.github.dutrevis.MemoryMetrics
spark.metrics.conf.*.sink.graphite.class  org.apache.spark.metrics.sink.GraphiteSink
spark.metrics.conf.*.sink.graphite.host   your-graphite-host.com
spark.metrics.conf.*.sink.graphite.port   2003
spark.metrics.conf.*.sink.graphite.period 5
spark.metrics.conf.*.sink.graphite.unit   seconds

For environments with static configurations, like long-lived or shared clusters, consider adding the properties to the Spark CLI call or to the Spark job itself, as shown in the examples below:

CLI example
bin/spark-shell  --master yarn \
  --packages io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1 \
  --conf "spark.plugins"="io.github.dutrevis.MemoryMetrics" \
  --conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"   \
  --conf "spark.metrics.conf.*.sink.graphite.host"="your-graphite-host.com" \
  --conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
  --conf "spark.metrics.conf.*.sink.graphite.period"=5 \
  --conf "spark.metrics.conf.*.sink.graphite.unit"=seconds

PySpark example
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Instrumented app")
    .master("yarn")
    .config("spark.jars.packages", "io.github.dutrevis:spark-resources-metrics-plugin_2.12:0.1")
    .config("spark.plugins", "io.github.dutrevis.MemoryMetrics")
    .config("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
    .config("spark.metrics.conf.*.sink.graphite.host", "your-graphite-host.com")
    .config("spark.metrics.conf.*.sink.graphite.port", 2003)
    .config("spark.metrics.conf.*.sink.graphite.period", 5)
    .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
    .getOrCreate()
)

Then run your Spark job and check the output of a Graphite plaintext protocol reader or exporter to find the metrics.
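
When inspecting that output, each line of the Graphite plaintext protocol has the form "path value timestamp". The sketch below is one way to pick the plugin's metrics out of such lines; the function names are hypothetical and the sample lines, including their values and timestamps, are made up for illustration.

```python
def parse_graphite_line(line):
    """Split a plaintext protocol line into (path, value, timestamp), or return None if malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    path, value, timestamp = parts
    try:
        return path, float(value), int(timestamp)
    except ValueError:
        return None


def plugin_metrics(lines, class_name="MemoryMetrics"):
    """Yield only the parsed lines whose path belongs to the given plugin class."""
    marker = f".plugin.io.github.dutrevis.{class_name}."
    for line in lines:
        parsed = parse_graphite_line(line)
        if parsed is not None and marker in parsed[0]:
            yield parsed


# Made-up sample of what a plaintext reader might capture.
sample_lines = [
    "driver.plugin.io.github.dutrevis.MemoryMetrics.FreeMemory 8318418944 1700000000",
    "driver.some.other.metric 42 1700000000",
    "not a valid line at all",
]
found = list(plugin_metrics(sample_lines))
for path, value, timestamp in found:
    print(path, value, timestamp)
```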

Developer Guide

Stack used

The plugin was developed using the technologies below, but feel free to try other solutions when contributing!

Developed for: Apache Spark
Language: Scala
CI: GitHub Actions
Public distribution: Sonatype

The source code compiles and runs in these Scala versions:

Scala version   Compiles   Runs
Scala 2.12      ✅         ✅
Scala 2.13      ✅         ✅
Scala 3         ❌         ✅

Current class diagram

classDiagram
    direction RL

    class ProcFileMetricCollector{
      #String procFilePath

      +getMetricValue(String procFileContent,String originalMetricName) Any
      +defaultPathGetter(String s1, String s2) Path
      +getProcFileContent(String path_getter, String file_reader) String
    }

    class MeminfoMetricCollector{
      #String procFilePath

      +getMetricValue(String procFileContent,String originalMetricName) Long
    }
    MeminfoMetricCollector --|> ProcFileMetricCollector

    class StatMetricCollector{
      #String procFilePath

      +getMetricValue(String procFileContent,String originalMetricName) Long
    }
    StatMetricCollector --|> ProcFileMetricCollector

    class MemoryMetrics{
      +Map metricMapping

      +registerMetric(MetricRegistry metricRegistry, String metricName, Metric metricInstance ) Unit
    }
    MemoryMetrics --|> MeminfoMetricCollector

    class CPUMetrics{
      +Map metricMapping

      +registerMetric(MetricRegistry metricRegistry, String metricName, Metric metricInstance ) Unit
    }
    CPUMetrics --|> StatMetricCollector

Prior art

Spark Resources Metrics plugin is inspired by cerndb/SparkPlugins and LucaCanali/sparkMeasure.

Conceptually, Spark Resources Metrics plugin is very similar to SparkPlugins, but:

  1. It aims to collect metrics that less supported metric systems, like Ganglia, used to cover;
  2. It is modular for each resource, being more complete and flexible to use;
  3. It is more easily extensible, with its development considering design patterns and unit tests.

Next steps

Use it on different Spark environments

Support