Spark Validator

A library you can include in your Spark job to validate the counters and perform operations on success.

This software should be considered pre-alpha.

Why you should validate counters

Maybe you are really lucky and you never have intermittent outages or bugs in your code.

If you have accumulators for things like records processed or number of errors, it's easy to write bounds for them. Even if you don't have custom counters, you can use Spark's built-in metrics (bytes read, run time, etc.), and by looking at historic values you can establish reasonable bounds. This can help catch jobs which silently fail to process some of your records. It is not a replacement for unit or integration testing.

How Spark Validator works

We store all of the metrics from each run along with all of the accumulators you pass in.

If a run is successful, the validator will run your on-success handler. If you just want to mark the run as a success, you can specify a file for Spark Validator to touch.
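
If you would rather touch a marker yourself, a minimal sketch of an on-success check is below. It assumes validate() returns a Boolean indicating success (the usage in this README does not show the return type) and uses the standard Hadoop FileSystem API; the path is illustrative, and validator is a Validation instance constructed as in the "How to use" section below.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Touch a _SUCCESS marker file only if all validation rules pass.
// Assumes validate() returns a Boolean; check the Validation API for the exact contract.
if (validator.validate()) {
  val fs = FileSystem.get(new Configuration())
  fs.createNewFile(new Path("hdfs:///jobs/my-job/_SUCCESS"))
}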

How to write your validation rules

Absolute
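
An absolute rule checks a counter against fixed bounds. A minimal sketch using the AbsoluteValueRule shown in the usage section below (the counter name and bounds are illustrative):

// Fail the run if fewer than 1000 or more than 1000000 records were read.
val rule = new AbsoluteValueRule(counter = "recordsRead", min = Some(1000), max = Some(1000000))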

Relative
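
A relative rule bounds a counter relative to values from previous runs. The exact rule class for this isn't shown in this README, so as an illustration the sketch below derives fixed bounds from historic values by hand and feeds them to an AbsoluteValueRule; historicValues is a hypothetical sequence of past counter readings:

// Allow the counter to drift at most 25% from the historic average.
val historicValues = Seq(9500, 10200, 9900)
val avg = historicValues.sum / historicValues.size
val rule = new AbsoluteValueRule(counter = "recordsRead",
  min = Some((avg * 0.75).toInt), max = Some((avg * 1.25).toInt))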

How to build

sbt - Remember when it was called the simple build tool?

sbt/sbt compile
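
To run the tests or build a jar you can include in your job, the standard sbt tasks should work with the bundled launcher:

sbt/sbt test
sbt/sbt package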

How to use

Scala

At the start of your Spark program, once you have constructed your SparkContext, call:

import com.holdenkarau.spark_validator._
...
val rules = List(
    new AbsoluteValueRule(counter = "recordsRead", min = Some(1000), max = None),
    ...)
val vc = new ValidationConf(counterPath, jobName, firstTime, rules)
val validator = new Validation(vc)
...
validator.validate()
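
Putting it together, a fuller sketch of a job that validates Spark's built-in recordsRead metric is below. It follows the constructor shown above; the application name, paths, and bounds are illustrative, and firstTime = true is assumed to mean there is no counter history yet:

import com.holdenkarau.spark_validator._
import org.apache.spark.{SparkConf, SparkContext}

object MyValidatedJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("my-validated-job"))
    // Where Spark Validator keeps counter history between runs (illustrative path).
    val counterPath = "hdfs:///tmp/my-validated-job/validator"
    // Expect at least 1000 records read; no upper bound.
    val rules = List(
      new AbsoluteValueRule(counter = "recordsRead", min = Some(1000), max = None))
    val vc = new ValidationConf(counterPath, "my-validated-job", true, rules)
    val validator = new Validation(vc)

    // The job itself: any Spark work that reads records.
    sc.textFile("hdfs:///data/input").count()

    validator.validate()
    sc.stop()
  }
}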

Java

vNext

Python

vNext+1

License

Apache License 2.0