This is the core library containing the domain logic for the ComPass application. For more information, please refer to the repository for the web interface.
This version is built for Spark 2.4.7.
In this incremental version we refined the perturbation information fields to store multiple similar perturbation measurements in a single perturbation data structure. This is necessary because we can now use additionally preprocessed data that combines these measurements, which reduces duplicate hits when doing analysis on the data.
While the perturbation data structure has remained largely the same, the information field was adapted to contain a sequence of measurement structures:
```
Perturbation
  id            id of the perturbation [as before]
  info          information of the perturbation
    processing_level  extra identifier for how the data has been processed
    details           sequence of perturbation measurement information
      cell        [as before]
      batch       [as before]
      plate       [as before]
      well        [as before]
      year        [as before]
      extra       [as before]
  profiles      a list of profiles for future compatibility [as before]
  trtType       trt_cp / trt_lig / ... [as before]
  trt           treatment information [as before]
  filters       sequence of filters [as before]
  meta          sequence of meta information, a discrete representation of a dictionary as key/value pairs
    key
    value
```
A meta field was added to the perturbation data structure as well. It can contain additional information, such as when the data was processed. Since it consists of key/value pairs, additional information can be added and extended without having to update the model. This is especially useful for information that won't be analysed or filtered on.
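To make the change concrete, here is a minimal sketch of what the adapted structures could look like as Scala case classes. The names `Information`, `InformationDetail`, and `MetaInformation` are illustrative assumptions; the authoritative definitions live in the library itself.

```scala
// Illustrative sketch only; names and exact field types are assumptions.
// Field names mirror the schematic above so they map onto Parquet columns.
case class InformationDetail(
  cell:  Option[String] = None, // [as before]
  batch: Option[String] = None,
  plate: Option[String] = None,
  well:  Option[String] = None,
  year:  Option[String] = None,
  extra: Option[String] = None
)

case class Information(
  processing_level: Option[String] = None, // how the data has been processed
  details: Seq[InformationDetail] = Nil    // one entry per combined measurement
)

// Discrete key/value representation of a dictionary: new metadata can be
// added without updating the model or the Parquet schema.
case class MetaInformation(key: String, value: String)
```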
The previous API calls tied to the v4 model were moved to api/v4. New API calls tied to the v4.1 model were created under api/v4_1.
For version 4, we wanted to include different types of perturbagens. As in the earlier versions, we want to be able to express the data in a typesafe model in the Scala world. While this can be achieved in many ways using the typing features available in Scala, compatibility with the Parquet storage format, and especially Spark's ability to load and query it, is crucial.
After extensive experimentation, we found that the best way to model different subtypes in Scala in line with Parquet is to create a union type that combines all subtypes into one supertype. The Treatment type therefore contains different slots, one per type of treatment. There is also a generic slot that is available for convenience purposes.
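As an illustration of this union-type pattern (the payload classes are stubs here; the real ones carry the type-specific fields):

```scala
// Stub payload classes; in the real model each carries type-specific fields.
case class TRT_GENERIC(id: String)
case class TRT_CP(id: String)
case class TRT_LIG(id: String)
case class TRT_SH(id: String)
case class CTL_VECTOR(id: String)

// Union-type encoding: one supertype with an optional slot per subtype.
// For a given record only one slot is populated; Parquet stores the empty
// slots as null nested structs, which Spark handles transparently.
case class Treatment(
  trt_generic: Option[TRT_GENERIC] = None, // generic representation, for convenience
  trt_cp:      Option[TRT_CP]      = None, // compound perturbagens
  trt_lig:     Option[TRT_LIG]     = None, // ligand perturbagens
  trt_sh:      Option[TRT_SH]      = None, // shRNA perturbagens
  ctl_vector:  Option[CTL_VECTOR]  = None  // control vectors
)
```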
We refer to the tests for examples of how to use the data model for loading.
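For orientation, loading a Parquet dataset into the typed model might look roughly like this; the path is a placeholder and the filter assumes the trt_cp slot is an Option, so treat it as a sketch rather than the authoritative approach from the tests.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("compass-loading-sketch")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Read the Parquet files and map them onto the library's typed
// Perturbation model.
val perturbations = spark.read
  .parquet("/path/to/perturbations.parquet") // placeholder path
  .as[Perturbation]

// Example: keep only compound treatments.
perturbations.filter(_.trt.trt_cp.isDefined).show(5)
```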
Accessing fields in the model is best achieved using the available lenses. Combined lenses are provided so that deep access can easily be performed.
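The lens definitions ship with the library; purely as an illustration of the pattern, here is a Monocle-style sketch (Monocle, the lens names, and the assumption that `info.cell` is an `Option[String]` on an `Information` case class are not the library's actual API):

```scala
import monocle.Lens
import monocle.macros.GenLens

// Hypothetical lenses over the Perturbation model sketched in the overview below.
val infoL: Lens[Perturbation, Information]  = GenLens[Perturbation](_.info)
val cellL: Lens[Information, Option[String]] = GenLens[Information](_.cell)

// A combined lens composes the two, so deep access is a single step.
val perturbationCellL: Lens[Perturbation, Option[String]] =
  infoL composeLens cellL

// Reading and updating through the combined lens:
// val cell    = perturbationCellL.get(perturbation)
// val updated = perturbationCellL.set(Some("MCF7"))(perturbation)
```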
A schematic overview of the model:
```
Perturbation
  id
  info          information about the experiment
    cell
    batch
    plate
    well
    year
    extra
  profiles      a list of profiles for future compatibility
    pType
    length
    t
    p
    r
    logFc
  trtType       trt_cp / trt_lig / ...
  trt           treatment information
    trt_generic   a generic representation
    trt_cp        slot for trt_cp type perturbagens
    trt_lig       slot for trt_lig type perturbagens
    trt_sh        ...
    ctl_vector    ...
  filters
```
A schematic overview of the DbRow model; field access is again best performed by means of the available lenses:
```
DbRow
  id                  Option[String]
  sampleAnnotations
    sample
      id              Option[String]
      batch           Option[String]
      plateid         Option[String]
      well            Option[String]
      protocolname    Option[String]
      concentration   Option[String]
      year            Option[String]
      time            Option[String]
    t                 Option[Array[Double]]
    p                 Option[Array[Double]]
    r                 Option[Array[Double]]
  compoundAnnotations
    compound
      jnjs            Option[String]
      jnjb            Option[String]
      smiles          Option[String]
      inchikey        Option[String]
      name            Option[String]
      ctype           Option[String]
    knownTargets      Option[Seq[Gene]]
    predictedTargets  Option[Seq[Gene]]
  filters             Seq[Filter]
```
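For reference, a sketch of how this overview could map onto Scala case classes; `Gene` and `Filter` are stubbed here, and the authoritative definitions are in the library itself.

```scala
// Stubs for referenced types; the real classes carry more fields.
case class Gene(symbol: String)
case class Filter(key: String, value: String)

case class Sample(
  id: Option[String] = None,
  batch: Option[String] = None,
  plateid: Option[String] = None,
  well: Option[String] = None,
  protocolname: Option[String] = None,
  concentration: Option[String] = None,
  year: Option[String] = None,
  time: Option[String] = None
)

case class SampleAnnotations(
  sample: Sample,
  t: Option[Array[Double]] = None,
  p: Option[Array[Double]] = None,
  r: Option[Array[Double]] = None
)

case class Compound(
  jnjs: Option[String] = None,
  jnjb: Option[String] = None,
  smiles: Option[String] = None,
  inchikey: Option[String] = None,
  name: Option[String] = None,
  ctype: Option[String] = None
)

case class CompoundAnnotations(
  compound: Compound,
  knownTargets: Option[Seq[Gene]] = None,
  predictedTargets: Option[Seq[Gene]] = None
)

case class DbRow(
  id: Option[String] = None,
  sampleAnnotations: SampleAnnotations,
  compoundAnnotations: CompoundAnnotations,
  filters: Seq[Filter] = Nil
)
```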