A library that provides an easier way to describe DataFrame schemas for Spark and MLSQL.
This library requires Spark 2.3+ or 2.4+ (tested against both).
You can link against this library in your program at the following coordinates:
groupId: tech.mlsql
artifactId: simple-schema_2.11
version: 0.2.0
import org.apache.spark.sql.types._

val s = SparkSimpleSchemaParser.parse("st(field(column1,string),field(column2,string),field(column3,string))")
assert(s == StructType(Seq(StructField("column1", StringType), StructField("column2", StringType), StructField("column3", StringType))))
A Spark DataFrame schema is normally represented as JSON, but JSON is verbose and awkward to write by hand, especially as a quoted plain-text string. Simple schema introduces a new format that makes this easy.
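For comparison, here is a sketch (assuming spark-sql is on the classpath) of the JSON that Spark itself emits for a one-column schema, next to the equivalent simple-schema string:

```scala
import org.apache.spark.sql.types._

// One string column, built with Spark's own API.
val schema = StructType(Seq(StructField("column1", StringType)))

// Spark's JSON representation is verbose, along the lines of:
// {"type":"struct","fields":[{"name":"column1","type":"string","nullable":true,"metadata":{}}]}
println(schema.json)

// The same schema in simple-schema notation:
// st(field(column1,string))
```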
st means StructType and field means StructField; the first value in field is the column name, and the second is the type. For now, simple schema supports the following types:
- st
- field
- string
- float
- double
- integer
- short
- date
- binary
- map
- array
- long
- boolean
- byte
- decimal
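To illustrate how this grammar composes, here is a minimal, dependency-free parser sketch. This is NOT the library's implementation: it builds a small AST instead of Spark DataTypes so it runs without Spark, and all names in it (SType, SimpleSchemaSketch, etc.) are hypothetical.

```scala
// Tiny AST mirroring the simple-schema grammar above (illustrative only).
sealed trait SType
case class SPrimitive(name: String) extends SType              // string, integer, ...
case class SMap(key: SType, value: SType) extends SType        // map(k,v)
case class SArray(elem: SType) extends SType                   // array(t)
case class SStruct(fields: Seq[(String, SType)]) extends SType // st(field(name,t),...)

object SimpleSchemaSketch {
  // Parse one type expression starting at `pos`; return (type, position after it).
  private def parseType(s: String, pos: Int): (SType, Int) = {
    var i = pos
    while (i < s.length && s(i) != '(' && s(i) != ',' && s(i) != ')') i += 1
    s.substring(pos, i) match {
      case "st" =>
        var j = i + 1 // skip '('
        val fields = scala.collection.mutable.ArrayBuffer[(String, SType)]()
        while (s(j) != ')') {
          if (s(j) == ',') j += 1
          j += "field(".length              // each entry is field(name,type)
          val nameEnd = s.indexOf(',', j)
          val name = s.substring(j, nameEnd)
          val (t, after) = parseType(s, nameEnd + 1)
          fields += (name -> t)
          j = after + 1                     // skip field's closing ')'
        }
        (SStruct(fields.toSeq), j + 1)      // skip st's closing ')'
      case "map" =>
        val (k, afterK) = parseType(s, i + 1)
        val (v, afterV) = parseType(s, afterK + 1) // skip ','
        (SMap(k, v), afterV + 1)                   // skip ')'
      case "array" =>
        val (e, afterE) = parseType(s, i + 1)
        (SArray(e), afterE + 1)                    // skip ')'
      case prim =>
        (SPrimitive(prim), i)
    }
  }

  def parse(s: String): SType = parseType(s, 0)._1
}
```

The real library maps each leaf to the corresponding Spark DataType instead of an AST node, but the recursive structure of the format is the same.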
Suppose you have JSON data like this:
{"column1":{"key":"value"}}
You can describe its schema like this:
st(field(column1,map(string,string)))
st also supports nesting:
st(field(column1,map(string,array(st(field(columnx,string))))))
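For reference, here is a hand-built sketch (assuming spark-sql is on the classpath) of the Spark StructType that the nested example above describes:

```scala
import org.apache.spark.sql.types._

// Spark equivalent of:
//   st(field(column1,map(string,array(st(field(columnx,string))))))
// Nullability flags are left at Spark's defaults here.
val expected = StructType(Seq(
  StructField("column1",
    MapType(StringType,
      ArrayType(StructType(Seq(StructField("columnx", StringType))))))
))
```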