The library provides a Spark Source that efficiently handles streaming from an object store when files are saved under dynamically changing paths, e.g. `s3://my_bucket/my-data/2022/06/09/`, which can be described by the template `s3://my_bucket/my-data/{YYYY}/{MM}/{DD}/`.
The library is available on Maven Central. If you use SBT, add it to your `build.sbt`:

```scala
libraryDependencies += "ai.hunters" %% "spark-adaptive-file-connector" % "1.0.0"
```
For Maven:

```xml
<dependency>
  <groupId>ai.hunters</groupId>
  <artifactId>spark-adaptive-file-connector_2.12</artifactId>
  <version>1.0.0</version>
</dependency>
```

Use the equivalent coordinates with any other dependency manager.
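To try the connector interactively, you can also pull the artifact in with Spark's standard `--packages` flag; the coordinates below are taken from the dependency snippets above:

```
spark-shell --packages ai.hunters:spark-adaptive-file-connector_2.12:1.0.0
```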
This is the simplest example of using the connector in a Scala application:
```scala
// Read a streaming DataFrame of JSON files from paths that follow the
// {YYYY}/{MM}/{DD} template. `yourSchema` is the schema of your JSON records.
val readFileJsonStream = spark.readStream
  .format("dynamic-paths-file")
  .option("fileFormat", "json")
  .schema(yourSchema)
  .load("s3://my-bucket/prefix/{YYYY}/{MM}/{DD}")
```
sbt "clean;compile"
sbt "clean;test"
The easiest way to start playing with this library is the integration test `DynamicPathsFileStreamSourceProviderITest`.
It shows how to:
- pass the parameters Spark needs in order to use this source connector
- check which paths have been read, and how fast
- recover from a checkpoint
- see the output
When you use this library in your application, remember to register `DynamicPathsFileStreamSourceProvider` as a `DataSourceRegister` service provider. You can see what was needed for the tests in `src/test/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`.
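The service file is a plain Java `ServiceLoader` registration: it lists the fully-qualified name of the provider class. A sketch, assuming a hypothetical package name; check the actual fully-qualified class name in this repository:

```
# src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
# The package below is hypothetical; verify it against the sources.
ai.hunters.spark.DynamicPathsFileStreamSourceProvider
```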
For more background, see the talk "An Advanced S3 Connector for Spark to Hunt for Cyber Attacks" (Data + AI Summit, San Francisco 2022).