This repository contains pure Scala implementations of clients for interacting with a handful of Google Cloud services.
Common GCP refined types, such as ProjectId.
"com.permutive" %% "gcp-types" % "<VERSION>"
Communicates with the Google OAuth 2.0 API.
- ServiceAccountTokenProvider - get a Google Service Account Token
- UserAccountTokenProvider - get a Google User Account Token
- InstanceMetadataTokenProvider - get a Google Service Account Token via the instance metadata API
- CachedTokenProvider - caches each token generated by a TokenProvider for the lifespan of that token, then generates a new one (see InstanceMetadataOAuthSafetyPeriod for use with the instance metadata token provider)
This module also includes an HTTP client wrapper, AuthedClient, that automatically adds the OAuth token produced by a given TokenProvider to the headers of each outgoing request.
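As a rough sketch of how this fits together with an http4s client (the constructors for InstanceMetadataTokenProvider and AuthedClient shown here are illustrative names inferred from this README, not verified signatures):

import cats.effect.{IO, Resource}
import org.http4s.ember.client.EmberClientBuilder

// Sketch only: the token provider and wrapper constructors below are assumptions.
val authedClientResource =
  for {
    underlying    <- EmberClientBuilder.default[IO].build
    tokenProvider <- Resource.eval(InstanceMetadataTokenProvider[IO](underlying)) // assumed constructor
    authed         = AuthedClient(underlying, tokenProvider)                      // assumed constructor
  } yield authed // can now be used anywhere a Client[IO] is expected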
"com.permutive" %% "google-auth" % "<VERSION>"
Utility for resolving the project ID from instance metadata.
"com.permutive" %% "google-project-id" % "<VERSION>"
PureConfig configuration decoders for either specifying the project ID or looking it up from instance metadata. For example, given the config class:
case class Config(projectId: ProjectIdConfig)
Your application.conf file should look like:
project-id {
type = static
value = "foo-123"
}
or:
project-id {
type = gcp
}
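A minimal sketch of loading this configuration with PureConfig follows; the import that brings the ProjectIdConfig decoder into scope is an assumption, so check this module's package for its exact location:

import pureconfig._
import pureconfig.generic.auto._ // derives a ConfigReader for the Config case class
// NOTE: the implicit ConfigReader[ProjectIdConfig] is assumed to be provided by this module

case class Config(projectId: ProjectIdConfig)

// Fails fast at startup if `project-id` is missing or malformed
val config: Config = ConfigSource.default.loadOrThrow[Config]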
"com.permutive" %% "google-project-id-pureconfig" % "<VERSION>"
Methods to interact with Google BigQuery over HTTP.
Provides interfaces over the BigQuery Rest API and the BigQuery Data Transfer API.
You probably want to look at Google's own BigQuery Java libraries unless this library happens to meet your requirements:
- java-bigquery - Provides all BigQuery features
- java-bigquerystorage - Provides much higher-performance reads, implementation of the BigQuery Storage Read API
Our functional-gax library may help ease the pain of these libraries exposing Google Java "effect" types (e.g. ApiFuture).
This library was originally written for an internal monolith at Permutive. At the time the service included dependencies for multiple different Google products (e.g. BigQuery, DFP, PubSub), all of which frequently led to dependency hell with shared dependencies like gRPC and GAX. This library was written to interact with BigQuery using purely the Typelevel stack to avoid more of that hell.
This library is not a complete implementation of the BigQuery HTTP API. It has been (mostly) feature-frozen since it was originally written (late 2018, early 2019). If it fits your use-case and you're happy with this, great! There are Google libraries available though that may be more suitable:
- java-bigquery - For feature-completeness
- java-bigquerystorage - For high-performance reads, implementation of the BigQuery Storage Read API
Our functional-gax library may help ease the pain of these libraries exposing Google Java "effect" types (e.g. ApiFuture).
"com.permutive" %% "google-auth" % "<VERSION>"
"com.permutive" %% "google-bigquery" % "<VERSION>"
The two main modules available are datatransfer (communicates with the Data Transfer API) and rest (communicates with the Rest API).
Retrying is possible in both of these modules and can be configured for implementations; see the section on retrying for details.
Communicates with the BigQuery Data Transfer API to:
- Retrieve scheduled queries
- Create scheduled queries which load data into a table
Interface: BigQueryDataTransfer
Implementation: HttpBigQueryDataTransfer
Communicates with the BigQuery Rest API.
Split into separate modules based on behaviour: job and schema.
When results may run over many pages they are returned as an fs2 Stream; see the section on pagination for details.
Contains interfaces and implementations to create BigQuery jobs and retrieve results.
Interfaces and implementations:
- BigQueryJob, implementation HttpBigQueryJob:
  - Create query jobs (can create and poll to completion as well)
    - e.g. SELECT statements or DML jobs
  - Create query jobs which load data into a table
  - Retrieve job status
  - Poll until job completion
  - Dry run jobs (shows cost of running the actual job)
  - Jobs can include query parameters
- BigQueryDmlJob, implementation in companion object:
  - Create, run and retrieve results of DML jobs
- BigQuerySelectJob, implementation in companion object:
  - Create, run and retrieve results of SELECT jobs
Jobs can include query parameters to avoid SQL injection and make it easier to insert variables into queries. These must be named, rather than positional, query parameters.
Jobs accept parameters represented by the QueryParameter class, but this can be difficult, and confusing, to define for anything more complicated than a simple type. QueryParameterEncoder is provided to derive encoders to QueryParameter for some generic types (e.g. case classes and lists of values).
As an example:
import cats.data.NonEmptyList
import com.permutive.google.bigquery.rest.models.job.queryparameters._
val simpleParameters: List[Int] = List(1, 2, 3, 4, 5)
// This encoder does not need to be created explicitly, but it may be useful to do so if it will be reused (to prevent re-deriving it)
val simpleParametersEncoder: QueryParameterEncoder[List[Int]] = QueryParameterEncoder.deriveEncoder
val encodedSimpleParameters = simpleParametersEncoder.encode("simple_parameters_name", simpleParameters)
case class CaseClassParameter(foo: Double, bar: List[String], baz: Boolean)
val caseClassParameter = CaseClassParameter(1.5, List("a", "b", "c"), baz = false)
// In this case the encoder has not been created explicitly
val encodedCaseClass: QueryParameter = QueryParameterEncoder[CaseClassParameter].encode("case_class_parameter_name", caseClassParameter)
// This can be provided to methods creating jobs now
val queryParameters: NonEmptyList[QueryParameter] = NonEmptyList.of(encodedSimpleParameters, encodedCaseClass)
Contains a single interface and implementation to:
- Create tables
- Create views
- List tables and views in a dataset
- Create datasets
Interface: BigQuerySchema
Implementation: HttpBigQuerySchema
Methods are available in all cases which avoid the details of pagination. They unroll all results and return an fs2 Stream. In these cases PaginationSettings controls how pages should be unrolled: prefetchPages controls how many pages of results should be prefetched, which is useful to prevent fetching pages from blocking downstream processing; maxResults sets an optional upper limit on the number of results returned by each request. If this is not set then the BigQuery limit of 10 MB per response applies.
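As an illustration (assuming PaginationSettings is a case class with a default instance; the field names below mirror the description above but are not verified against the source):

// Prefetch two pages ahead and cap each request at 1,000 results
val settings = PaginationSettings.default.copy(
  prefetchPages = 2,
  maxResults = Some(1000)
)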
More generally, many methods in job and schema take, and return, an optional PageToken; this means the results may be paginated from BigQuery. If a PageToken is returned it means there are further results available that can be retrieved by supplying the returned page token to a subsequent fetch. A corresponding parameter (maxResults) controls the maximum number of results that should be returned in a request. In the case of results returning a Stream, a PageToken can still be provided to start unrolling all results from a specified point.
It is possible to configure the library to retry requests if they fail. This is passed in as a cats-retry RetryPolicy to many implementations in both datatransfer and rest.
These settings are actually used to control the behaviour of an internal interface, HttpMethods. This interface is used by all Http* implementations in datatransfer and rest. It is not necessary to create an HttpMethods to use any public interface in the library, but it may be beneficial if you create many implementations (e.g. BigQuerySchema and BigQuerySelectJob) sharing retry logic (there are overloads on implementations that accept HttpMethods and no RetryPolicy).
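For example, a cats-retry policy retrying up to five times with exponential backoff can be built as follows and supplied to any of the implementations mentioned above that accept a RetryPolicy:

import scala.concurrent.duration._
import cats.effect.IO
import retry.RetryPolicies

// Retry at most 5 times, backing off exponentially from a 100 millisecond base delay
val retryPolicy = RetryPolicies.limitRetries[IO](5) join RetryPolicies.exponentialBackoff[IO](100.millis)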
To provide a real-world example: some timeouts were noticed in project-bigquery-sync-job running with the http4s Blaze client; these manifested as java.util.concurrent.TimeoutException. Configuring the library to retry only these failures fixed the issue.