
Functional Google Clients

Pure Scala implementations of clients for working with Google Cloud Platform.

This repository contains pure Scala implementations of clients for interacting with a handful of Google Cloud services.

gcp-types

Common GCP refined types, such as ProjectId.

Dependency

"com.permutive" %% "gcp-types" % "<VERSION>"

google-auth

Communicates with the Google OAuth 2.0 API.

This module also includes an HTTP client wrapper, AuthedClient, that automatically adds the OAuth token produced by a given TokenProvider to the headers of each outgoing request.
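For illustration, the sketch below shows the general shape of such a wrapper as a hand-rolled http4s client middleware. This is a conceptual sketch, not the library's own code: fetchToken merely stands in for the token-producing effect of a TokenProvider.

import cats.effect.{Resource, Sync}
import org.http4s.client.Client
import org.http4s.headers.Authorization
import org.http4s.{AuthScheme, Credentials}

// Conceptual sketch only: `fetchToken` stands in for a TokenProvider's
// token-producing effect; the real AuthedClient API may differ.
def withAuth[F[_]: Sync](fetchToken: F[String])(client: Client[F]): Client[F] =
  Client[F] { req =>
    // Fetch a token and attach it as a Bearer credential before running the request
    Resource.eval(fetchToken).flatMap { token =>
      client.run(req.putHeaders(Authorization(Credentials.Token(AuthScheme.Bearer, token))))
    }
  }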

Dependency

"com.permutive" %% "google-auth" % "<VERSION>"

google-project-id

Utility for resolving the project ID from instance metadata.

Dependency

"com.permutive" %% "google-project-id" % "<VERSION>"

PureConfig configuration decoders for either specifying the project ID statically or looking it up from instance metadata. For example, given the config class:

case class Config(projectId: ProjectIdConfig)

Your application.conf file should look like:

project-id {
  type = static
  value = "foo-123"
}

or:

project-id {
  type = gcp
}
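As a sketch, loading this configuration might look as follows. This assumes standard PureConfig usage; the import path shown for ProjectIdConfig is an assumption, not necessarily the real one.

import pureconfig._
import pureconfig.generic.auto._

// Assumed import path; ProjectIdConfig and its ConfigReader are provided by this module
import com.permutive.google.gcp.types.ProjectIdConfig

case class Config(projectId: ProjectIdConfig)

// PureConfig's default kebab-case mapping reads the `project-id` block shown above
val config: Config = ConfigSource.default.loadOrThrow[Config]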

Dependency

"com.permutive" %% "google-project-id-pureconfig" % "<VERSION>"

google-bigquery

Methods to interact with Google BigQuery over HTTP.

Provides interfaces over the BigQuery REST API and the BigQuery Data Transfer API.

You probably want to look at Google's own BigQuery Java libraries unless this library happens to meet your requirements. Our functional-gax library may help ease the pain of these libraries exposing Google Java "effect" types (e.g. ApiFuture).

Intentions of this library and when to use it

This library was originally written for an internal monolith at Permutive. At the time that service included dependencies for multiple different Google products (e.g. BigQuery, DFP, PubSub), all of which frequently led to dependency hell with shared dependencies like gRPC and GAX. This library was written to interact with BigQuery using purely the Typelevel stack, to avoid more of that hell.

This library is not a complete implementation of the BigQuery HTTP API. It has been (mostly) feature-frozen since it was originally written (late 2018, early 2019). If it fits your use case and you're happy with that, great! There are Google libraries available, though, that may be more suitable. Our functional-gax library may help ease the pain of these libraries exposing Google Java "effect" types (e.g. ApiFuture).

Dependency

"com.permutive" %% "google-auth" % "<VERSION>"
"com.permutive" %% "google-bigquery" % "<VERSION>"

Modules

The two main modules available are datatransfer (communicates with the Data Transfer API) and rest (communicates with the REST API).

Retrying is possible in both of these modules and can be configured per implementation; see the section on retrying for details.

datatransfer

Communicates with the BigQuery Data Transfer API to:

  • Retrieve scheduled queries
  • Create scheduled queries which load data into a table

Interface: BigQueryDataTransfer
Implementation: HttpBigQueryDataTransfer

rest

Communicates with the BigQuery REST API.

Split into separate modules based on behaviour: job and schema.

When results may run over many pages they are returned as an fs2 Stream; see the section on pagination for details.

job

Contains interfaces and implementations to create BigQuery jobs and retrieve results.

Interfaces and implementations:

  • BigQueryJob, implementation HttpBigQueryJob:
    • Create query jobs (can create and poll to completion as well)
      • e.g. SELECT statements or DML jobs
    • Create query jobs which load data into a table
    • Retrieve job status
    • Poll until job completion
    • Dry-run jobs (shows the cost of running the actual job)
    • Jobs can include query parameters
  • BigQueryDmlJob, implementation in companion object
    • Create, run and retrieve results of DML jobs
  • BigQuerySelectJob, implementation in companion object
    • Create, run and retrieve results of SELECT jobs

Query Parameters

Jobs can include query parameters to avoid SQL injection and make it easier to insert variables into queries. These must be named, rather than positional, query parameters.

Jobs accept parameters represented by the QueryParameter class, but this can be difficult and confusing to define for anything more complicated than a simple type. QueryParameterEncoder is provided to derive encoders to QueryParameter for some generic types (e.g. case classes and lists of values).

As an example:

import cats.data.NonEmptyList
import com.permutive.google.bigquery.rest.models.job.queryparameters._

val simpleParameters: List[Int] = List(1, 2, 3, 4, 5)
// This encoder does not need to be created explicitly, but it may be useful to do so if it will be reused (to avoid re-deriving it)
val simpleParametersEncoder: QueryParameterEncoder[List[Int]] = QueryParameterEncoder.deriveEncoder
val encodedSimpleParameters: QueryParameter = simpleParametersEncoder.encode("simple_parameters_name", simpleParameters)

case class CaseClassParameter(foo: Double, bar: List[String], baz: Boolean)
val caseClassParameter = CaseClassParameter(1.5, List("a", "b", "c"), baz = false)
// In this case the encoder has not been created explicitly
val encodedCaseClass: QueryParameter = QueryParameterEncoder[CaseClassParameter].encode("case_class_parameter_name", caseClassParameter)

// This can be provided to methods creating jobs now
val queryParameters: NonEmptyList[QueryParameter] = NonEmptyList.of(encodedSimpleParameters, encodedCaseClass)

schema

Contains a single interface and implementation to:

  • Create tables
  • Create views
  • List tables and views in a dataset
  • Create datasets

Interface: BigQuerySchema
Implementation: HttpBigQuerySchema

Pagination

In all cases methods are available which hide the details of pagination: they unroll all results and return an fs2 Stream. Here PaginationSettings controls how pages are unrolled. prefetchPages controls how many pages of results are prefetched; this is useful to prevent fetching pages from blocking downstream processing. maxResults sets an optional upper limit on the number of results returned by each request; if it is not set then the BigQuery limit of 10 MB per response applies.

More generally, many methods in job and schema take, and return, an optional PageToken; this means the results may be paginated by BigQuery. A returned PageToken indicates that further results are available, which can be retrieved by supplying that token to a subsequent fetch. A corresponding parameter (maxResults) controls the maximum number of results returned per request. For methods returning a Stream a PageToken can still be provided to start unrolling all results from a specified point.
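As a conceptual sketch (not the library's own code), the unrolling described above amounts to the following loop, where Page and PageToken are placeholder stand-ins for the library's models:

import cats.Functor
import cats.syntax.functor._
import fs2.Stream

// Placeholder stand-ins for the library's models
final case class PageToken(value: String)
final case class Page[A](results: List[A], nextPage: Option[PageToken])

// Fetch pages, starting with no token, until no further PageToken is returned
def unrollAll[F[_]: Functor, A](fetch: Option[PageToken] => F[Page[A]]): Stream[F, A] =
  Stream
    .unfoldLoopEval(Option.empty[PageToken]) { token =>
      fetch(token).map(page => (page.results, page.nextPage.map(Option(_))))
    }
    .flatMap(Stream.emits)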

Retrying

It is possible to configure the library to retry requests if they fail. This is passed in as a cats-retry RetryPolicy to many implementations in both datatransfer and rest.

These settings are actually used to control the behaviour of an internal interface, HttpMethods, which is used by all Http* implementations in datatransfer and rest. It is not necessary to create an HttpMethods to use any public interface in the library, but it may be beneficial if you create many implementations (e.g. BigQuerySchema and BigQuerySelectJob) that share retry logic; there are overloads on implementations that accept an HttpMethods and no RetryPolicy.
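As a sketch, a typical cats-retry policy might look like the following; how it is supplied depends on the overloads of the implementation you are constructing.

import cats.effect.IO
import retry.RetryPolicies._
import retry.RetryPolicy

import scala.concurrent.duration._

// Retry up to five times, with exponential backoff between attempts
val retryPolicy: RetryPolicy[IO] =
  limitRetries[IO](5) join exponentialBackoff[IO](100.millis)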

To provide a real-world example: some timeouts were noticed in project-bigquery-sync-job running with the http4s Blaze client; these manifested as java.util.concurrent.TimeoutException. Configuring the library to retry only these failures fixed the issue.