To add the BigQuery dependency to your sbt build:
libraryDependencies += "com.github.music-of-the-ainur" %% "bigquery-almaren" % "0.0.8-3.4"
To run in spark-shell:
For Scala 2.12:
spark-shell --packages "com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.4,com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.4"
For Scala 2.13:
spark-shell --packages "com.github.music-of-the-ainur:bigquery-almaren_2.13:0.0.8-3.4,com.github.music-of-the-ainur:almaren-framework_2.13:0.9.10-3.4"
The BigQuery connector is implemented on top of https://github.com/GoogleCloudDataproc/spark-bigquery-connector. See that repository for more details.
spark-shell --master "local[*]" --packages "com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.4,com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.4"
The connector is also available from the Maven Central repository. It can be used with the --packages option or the spark.jars.packages configuration property. Use one of the following values:
Version | Connector Artifact |
---|---|
Spark 3.4.x and Scala 2.13 | com.github.music-of-the-ainur:bigquery-almaren_2.13:0.0.8-3.4 |
Spark 3.4.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.4 |
Spark 3.3.x and Scala 2.13 | com.github.music-of-the-ainur:bigquery-almaren_2.13:0.0.8-3.3 |
Spark 3.3.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.3 |
Spark 3.2.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.2 |
Spark 3.1.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-3.1 |
Spark 2.4.x and Scala 2.12 | com.github.music-of-the-ainur:bigquery-almaren_2.12:0.0.8-2.4 |
Spark 2.4.x and Scala 2.11 | com.github.music-of-the-ainur:bigquery-almaren_2.11:0.0.8-2.4 |
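
The artifact names in the table above follow a regular pattern: the Scala binary version is appended to the artifact ID, and the Spark release line is appended to the connector version. The sketch below only illustrates that pattern. `ConnectorCoordinates` and `bigQueryAlmaren` are hypothetical names, not part of bigquery-almaren, and the helper does not check that a given combination was actually published (for example, Scala 2.11 is listed only for Spark 2.4.x).

```scala
// Hypothetical helper (not part of bigquery-almaren): builds a Maven
// coordinate that follows the naming pattern of the table above.
object ConnectorCoordinates {
  val Group = "com.github.music-of-the-ainur"
  val ConnectorVersion = "0.0.8"

  // scalaBinary: "2.11", "2.12" or "2.13"; sparkLine: "2.4", "3.1", ..., "3.4"
  def bigQueryAlmaren(scalaBinary: String, sparkLine: String): String =
    s"${Group}:bigquery-almaren_${scalaBinary}:${ConnectorVersion}-${sparkLine}"
}

// Value to pass to --packages or spark.jars.packages
// for Spark 3.3.x on Scala 2.12.
val coordinate = ConnectorCoordinates.bigQueryAlmaren("2.12", "3.3")
println(coordinate)
```

The resulting string can be used wherever the table's "Connector Artifact" value is expected.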
Parameters | Description |
---|---|
table | The BigQuery table which is present in a dataset in the format [[project:]dataset.]table |
options | Description |
------------- | ------------- |
parentProject | The Google Cloud Project ID of the table. The Google Cloud resource hierarchy resembles a file system, managing entities hierarchically. |
project | The Google Cloud Project ID of the table. A project organizes all your Google Cloud resources. For example, all of your Cloud Storage buckets and objects, along with the user permissions for accessing them, reside in a project. |
dataset | A dataset is contained within a specific project. Datasets are top-level containers used to organize and control access to your tables and views. |
query | A Standard SQL SELECT query. Table names must be enclosed in backticks (grave accents). |
import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit
val almaren = Almaren("App Name")
spark.conf.set("gcpAccessToken","token")
val df = almaren
.builder
.sourceBigQuery("dataset.table",Map("parentProject"->"project_name","project"->"project_name"))
.batch
df.show(false)
You can run any Standard SQL SELECT query on BigQuery and fetch its results directly into a Spark DataFrame.
To use this feature, the following configurations MUST be set:
- viewsEnabled must be set to true.
- materializationDataset must be set to a dataset where the GCP user has table creation permission.
import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit
val almaren = Almaren("App Name")
spark.conf.set("gcpAccessToken","token")
spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")
val df = almaren
.builder
.sourceBigQuery("query",Map("parentProject"->"project_name","project"->"project_name"))
.batch
df.show(false)
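
The example above passes the placeholder string "query". As a sketch of what a concrete query might look like, the snippet below wraps the table name in backticks, as the query option requires. The table name `project_name.dataset.table` and the columns `id` and `name` are hypothetical; replace them with your own. It assumes a spark-shell session started with the packages listed earlier, so `spark` is already defined.

```scala
import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit

val almaren = Almaren("App Name")

// Both settings are required when reading from a query.
spark.conf.set("gcpAccessToken", "token")
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<dataset>")

// Hypothetical table and column names; the table is enclosed in backticks.
val table = "project_name.dataset.table"
val query = s"SELECT id, name FROM `${table}` WHERE id IS NOT NULL"

// Same call as in the example above, with the query in place of a table name.
val df = almaren.builder.sourceBigQuery(query, Map("parentProject" -> "project_name", "project" -> "project_name")).batch

df.show(false)
```

The call chain is kept on one line so the snippet can be pasted into spark-shell without switching to :paste mode.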
Parameters | Description |
---|---|
table | The BigQuery table which is present in a dataset in the format [[project:]dataset.]table |
options | Description |
------------- | ------------- |
parentProject | The Google Cloud Project ID of the table. The Google Cloud resource hierarchy resembles a file system, managing entities hierarchically. |
project | The Google Cloud Project ID of the table. A project organizes all your Google Cloud resources. For example, all of your Cloud Storage buckets and objects, along with the user permissions for accessing them, reside in a project. |
dataset | A dataset is contained within a specific project. Datasets are top-level containers used to organize and control access to your tables and views. |
temporaryGcsBucket | The GCS bucket that temporarily holds the data before it is loaded to BigQuery. Required unless set in the Spark configuration (spark.conf.set(...)). |
import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit
import org.apache.spark.sql.SaveMode
val almaren = Almaren("App Name")
spark.conf.set("gcpAccessToken","token")
almaren.builder
.sourceSql("""SELECT sha2(concat_ws("",array(*)),256) as id,*,current_timestamp from deputies""")
.coalesce(30)
.targetBigQuery("dataset.table",Map("parentProject"->"project_name","project"->"project_name","temporaryGcsBucket"->"bucket"),SaveMode.Overwrite)
.batch
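
The options table notes that temporaryGcsBucket may instead be set in the Spark configuration. A minimal sketch of that variant follows, assuming a spark-shell session where `spark` is already defined and that the session-level setting is picked up as the table describes. The bucket name "bucket" and the table "deputies" are placeholders carried over from the example above.

```scala
import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.bigquery.BigQuery.BigQueryImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit
import org.apache.spark.sql.SaveMode

val almaren = Almaren("App Name")

spark.conf.set("gcpAccessToken", "token")
// Set the staging bucket once for the session instead of per write.
spark.conf.set("temporaryGcsBucket", "bucket")

// temporaryGcsBucket is omitted from the options map because it is set above.
almaren.builder.sourceSql("SELECT * FROM deputies").targetBigQuery("dataset.table", Map("parentProject" -> "project_name", "project" -> "project_name"), SaveMode.Overwrite).batch
```

Setting the bucket at the session level can be convenient when several writes in the same job share one staging bucket.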