modakanalytics / bigquery.almaren   0.0.1

License: Apache License 2.0

BigQuery Connector For Almaren Framework

Scala versions: 2.12 2.11

BigQuery Connector

To add the BigQuery dependency to your sbt build:

libraryDependencies += "com.github.music-of-the-ainur" %% "bigquery-almaren" % "0.0.8-3.4"

To run in spark-shell:

For Scala 2.12:

spark-shell --packages "com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.4,com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.4"

For Scala 2.13:

spark-shell --packages "com.github.music-of-the-ainur:bigquery-almaren_2.13:0.0.8-3.4,com.github.music-of-the-ainur:almaren-framework_2.13:0.9.10-3.4"

The BigQuery Connector is implemented on top of https://github.com/GoogleCloudDataproc/spark-bigquery-connector. See that repository for more details.

spark-shell --master "local[*]" --packages "com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.4,com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.4"

Maven / Ivy Package Usage

The connector is also available from the Maven Central repository. It can be loaded with the --packages option or the spark.jars.packages configuration property. Use the value that matches your Spark and Scala versions:

| Version                    | Connector Artifact                                           |
|----------------------------|--------------------------------------------------------------|
| Spark 3.4.x and Scala 2.13 | com.github.music-of-the-ainur:bigquery-almaren_2.13:0.0.8-3.4 |
| Spark 3.4.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.4 |
| Spark 3.3.x and Scala 2.13 | com.github.music-of-the-ainur:bigquery-almaren_2.13:0.0.8-3.3 |
| Spark 3.3.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.3 |
| Spark 3.2.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.2 |
| Spark 3.1.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.1 |
| Spark 2.4.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-2.4 |
| Spark 2.4.x and Scala 2.11 | com.github.music-of-the-ainur:bigquery-almaren_2.11:0.0.8-2.4 |
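
The artifacts above follow a regular naming pattern: the Scala binary version is appended to the artifact name and the Spark minor version is appended to the connector version. A minimal sketch of that pattern, assuming connector version 0.0.8 as listed; ConnectorCoordinates and bigQueryAlmaren are hypothetical names used only for illustration and are not part of the connector:

```scala
// Hypothetical helper (not part of bigquery-almaren): derives the Maven
// coordinate from the naming pattern visible in the artifact list.
object ConnectorCoordinates {
  private val Group = "com.github.music-of-the-ainur"
  private val ConnectorVersion = "0.0.8" // version taken from the list above

  /** scalaBinary is e.g. "2.12"; sparkMinor is e.g. "3.4". */
  def bigQueryAlmaren(scalaBinary: String, sparkMinor: String): String =
    s"${Group}:bigquery-almaren_${scalaBinary}:${ConnectorVersion}-${sparkMinor}"
}
```

The returned string is the value to pass to --packages or to spark.jars.packages. The helper does not check that a given Scala and Spark combination is actually published, so only combinations that appear in the list should be used.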

Source and Target

Source

Parameters

| Parameters | Description                                                     |
|------------|-----------------------------------------------------------------|
| table      | The BigQuery table, in the format [[project:]dataset.]table     |

| Options       | Description |
|---------------|-------------|
| parentProject | The Google Cloud Project ID of the table. The Google Cloud resource hierarchy resembles a file system and manages entities hierarchically. |
| project       | The Google Cloud Project ID of the table. A project organizes all your Google Cloud resources. For example, all of your Cloud Storage buckets and objects, along with user permissions for accessing them, reside in a project. |
| dataset       | A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. |
| query         | A Standard SQL SELECT query. Table names must be enclosed in backticks (grave accents). |

Example 1

import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit

val almaren = Almaren("App Name")

spark.conf.set("gcpAccessToken","token")

val df =  almaren
         .builder
         .sourceBigQuery("dataset.table",Map("parentProject"->"project_name","project"->"project_name"))
         .batch

df.show(false)

You can run any Standard SQL SELECT query on BigQuery and fetch its results directly into a Spark DataFrame.
To use this feature, the following configurations MUST be set:

  • viewsEnabled must be set to true.
  • materializationDataset must be set to a dataset where the GCP user has table creation permission.

Example 2

import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit

val almaren = Almaren("App Name")

spark.conf.set("gcpAccessToken","token")
spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")

val df =  almaren
         .builder
         .sourceBigQuery("query",Map("parentProject"->"project_name","project"->"project_name"))
         .batch

df.show(false)
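
Example 2 uses the placeholder "query" for the SQL text. As a sketch of what a real query string could look like, the helper below wraps a fully qualified table name in backticks, as the options table requires. BigQuerySql, quoted, and selectAll are hypothetical names introduced for illustration, and the project, dataset, and table names are placeholders:

```scala
// Hypothetical helper (not part of bigquery-almaren): builds a Standard SQL
// SELECT statement with the table name enclosed in backticks.
object BigQuerySql {
  /** Returns the fully qualified table name wrapped in backticks. */
  def quoted(project: String, dataset: String, table: String): String =
    s"`${project}.${dataset}.${table}`"

  /** Returns a SELECT * query over the backtick-quoted table. */
  def selectAll(project: String, dataset: String, table: String): String =
    s"SELECT * FROM ${quoted(project, dataset, table)}"
}
```

Following the call shape in Example 2, the result of BigQuerySql.selectAll("project_name", "dataset", "table") would be passed as the first argument of sourceBigQuery in place of "query".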

Target

Parameters

| Parameters | Description                                                     |
|------------|-----------------------------------------------------------------|
| table      | The BigQuery table, in the format [[project:]dataset.]table     |

| Options            | Description |
|--------------------|-------------|
| parentProject      | The Google Cloud Project ID of the table. The Google Cloud resource hierarchy resembles a file system and manages entities hierarchically. |
| project            | The Google Cloud Project ID of the table. A project organizes all your Google Cloud resources. For example, all of your Cloud Storage buckets and objects, along with user permissions for accessing them, reside in a project. |
| dataset            | A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. |
| temporaryGcsBucket | The GCS bucket that temporarily holds the data before it is loaded into BigQuery. Required unless set in the Spark configuration (spark.conf.set(...)). |

Example

import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit
import org.apache.spark.sql.SaveMode

val almaren = Almaren("App Name")

spark.conf.set("gcpAccessToken","token")

almaren.builder
    .sourceSql("""SELECT sha2(concat_ws("",array(*)),256) as id,*,current_timestamp from deputies""")
    .coalesce(30)
    .targetBigQuery("dataset.table",Map("parentProject"->"project_name","project"->"project_name","temporaryGcsBucket"->"bucket"),SaveMode.Overwrite)
    .batch
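
The temporaryGcsBucket option may be omitted from the options Map when it is already set in the Spark configuration. A minimal sketch of building the Map either way; TargetOptions and build are hypothetical names introduced here and are not part of the connector:

```scala
// Hypothetical helper (not part of bigquery-almaren): builds the options Map
// passed as the second argument of targetBigQuery.
object TargetOptions {
  def build(
      parentProject: String,
      project: String,
      temporaryGcsBucket: Option[String] = None
  ): Map[String, String] = {
    val base = Map("parentProject" -> parentProject, "project" -> project)
    // Include the bucket only when given explicitly. With None, the bucket is
    // expected to be set already via spark.conf.set("temporaryGcsBucket", ...).
    temporaryGcsBucket.fold(base)(bucket => base + ("temporaryGcsBucket" -> bucket))
  }
}
```

With spark.conf.set("temporaryGcsBucket","bucket") in place, the call in the example above could then read .targetBigQuery("dataset.table", TargetOptions.build("project_name", "project_name"), SaveMode.Overwrite). Passing Some("bucket") as the third argument reproduces the Map used in the example.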