Note: This repository is no longer being maintained. Please see https://github.com/combust-ml/mleap for the active project.
Easily put your Spark ML Pipelines into action with MLeap. Train your feature and regression/classification pipelines with Spark, then convert them to MLeap pipelines and deploy them anywhere. Take your pipelines to an API server, Hadoop, or even back to Spark to execute on a DataFrame.
MLeap allows for easy serialization of your estimator and transformer pipelines so you can save them for later reuse. Executing an MLeap pipeline does not require a SparkContext or DataFrame, so there is very little overhead for real-time, one-off predictions. You don't have to worry about Spark dependencies for executing your models; just add the lightweight MLeap runtime library instead.
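To illustrate why scoring without a SparkContext can be cheap, here is a minimal plain-Scala sketch. It is not MLeap's API: `LinearModel` and `ScoringSketch` are hypothetical names, and the coefficients are made-up values. The point is only that a trained linear regression reduces to a few numbers, so a one-off prediction is a dot product with no cluster involved.

```scala
// Hypothetical illustration only; these names are not part of MLeap.
// A trained linear regression is just coefficients plus an intercept.
final case class LinearModel(coefficients: Vector[Double], intercept: Double) {
  // Score one feature vector: dot product plus intercept.
  def predict(features: Vector[Double]): Double = {
    require(features.length == coefficients.length, "feature/coefficient size mismatch")
    coefficients.zip(features).map { case (c, x) => c * x }.sum + intercept
  }
}

object ScoringSketch {
  // Made-up parameters standing in for a model trained elsewhere.
  val model: LinearModel = LinearModel(Vector(2.0, -1.0), 0.5)

  def main(args: Array[String]): Unit = {
    // 2.0 * 3.0 + (-1.0) * 4.0 + 0.5
    println(model.predict(Vector(3.0, 4.0)))
  }
}
```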
MLeap makes deploying your Spark ML pipelines easy with 3 core functions:
- Release: Deploy your entire ML pipeline without a SparkContext or any dependency on Spark libraries.
- Reuse: Export your ML pipeline to easy-to-read JSON files so you can reuse pipelines.
- Recycle: Export your training pipelines to easy-to-read JSON files so you can easily modify your training pipelines.
MLeap is cross-compiled for Scala 2.10 and 2.11. If you use a POM file for dependency management and are running Scala 2.11, replace 2.10 with 2.11 wherever you see it. If you use SBT, use the %% operator and the correct Scala version will be selected for you.
libraryDependencies += "com.truecar.mleap" %% "mleap-runtime" % "0.1.3"
<dependency>
<groupId>com.truecar.mleap</groupId>
<artifactId>mleap-runtime_2.10</artifactId>
<version>0.1.3</version>
</dependency>
libraryDependencies += "com.truecar.mleap" %% "mleap-spark" % "0.1.3"
<dependency>
<groupId>com.truecar.mleap</groupId>
<artifactId>mleap-spark_2.10</artifactId>
<version>0.1.3</version>
</dependency>
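To make the behaviour of `%%` concrete, here is a small `build.sbt` sketch. The `scalaVersion` value is an assumption for illustration; use whatever your project runs. With `%%`, SBT appends the Scala binary version to the artifact name, which is the same suffix you would write by hand in a POM.

```scala
// build.sbt (sketch). The scalaVersion below is an assumed example value.
scalaVersion := "2.11.8"

// With %%, SBT appends the Scala binary version to the artifact name,
// so under Scala 2.11 this resolves the artifact mleap-runtime_2.11.
libraryDependencies += "com.truecar.mleap" %% "mleap-runtime" % "0.1.3"

// Equivalent explicit form with a single %, mirroring a POM artifactId:
// libraryDependencies += "com.truecar.mleap" % "mleap-runtime_2.11" % "0.1.3"
```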
MLeap is now a Spark Package. The package includes mleap-spark and mleap-serialization, so you should have full functionality with it. Here is how you can run a Spark shell with MLeap loaded:
$ bin/spark-shell --packages com.truecar.mleap:mleap-package_2.10:0.1.3
MLeap is broken into 4 modules:
- mleap-core - Core execution building blocks, including the runtime for executing linear regressions, random forest models, logistic regressions, assembling feature vectors, string indexing, one hot encoding, etc. It provides a core linear algebra system for all of these tasks.
- mleap-runtime - Provides LeapFrame, which is essentially a lightweight DataFrame without any dependencies on the Spark libraries. LeapFrames support 3 data types: double, string, and vector. Also provides MLeap Transformers for executing ML pipelines on LeapFrames. Spark ML pipelines are converted to the MLeap pipelines provided by this library.
- mleap-spark - Provides Spark/MLeap integration. SparkLeapFrame is an implementation of LeapFrame with a Spark RDD backing the data so you can execute MLeap transformers on a Spark cluster. Provides conversion from Spark Transformers to MLeap Transformers. Provides conversion from MLeap Estimators to Spark Estimators. This allows a very intuitive usage of MLeap without worrying about how Spark is being used under the hood: MLeap Estimator -> Spark Estimator -> Spark Transformer -> MLeap Transformer.
- mleap-serialization - Provides serialization for MLeap and Spark models to a common JSON/Protobuf format.
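The idea behind mleap-runtime, a frame with three value types and transformers that run on it without Spark, can be sketched in plain Scala. Everything below is a hypothetical illustration: `MiniFrame`, `MiniStringIndexer`, `MiniAssembler`, and `MiniPipeline` are invented names, not MLeap's actual classes, and the real LeapFrame may be structured quite differently.

```scala
// Hypothetical sketch of a Spark-free frame; not MLeap's real API.
// The three supported value types: double, string, and vector.
sealed trait Value
final case class DoubleValue(value: Double) extends Value
final case class StringValue(value: String) extends Value
final case class VectorValue(values: Vector[Double]) extends Value

// A frame is a list of column names plus rows of values in the same order.
final case class MiniFrame(columns: Vector[String], rows: Vector[Vector[Value]]) {
  // Append a column computed from each row (exposed as name -> value).
  def withColumn(name: String)(f: Map[String, Value] => Value): MiniFrame =
    MiniFrame(columns :+ name, rows.map(row => row :+ f(columns.zip(row).toMap)))
}

trait MiniTransformer {
  def transform(frame: MiniFrame): MiniFrame
}

// Maps each known label to its position; unknown labels throw NoSuchElementException.
final case class MiniStringIndexer(input: String, output: String, labels: Vector[String])
    extends MiniTransformer {
  private val index: Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

  def transform(frame: MiniFrame): MiniFrame = frame.withColumn(output) { row =>
    row(input) match {
      case StringValue(s) => DoubleValue(index(s))
      case other => throw new IllegalArgumentException(s"expected a string in $input, got $other")
    }
  }
}

// Concatenates double and vector columns into a single feature vector.
final case class MiniAssembler(inputs: Vector[String], output: String) extends MiniTransformer {
  def transform(frame: MiniFrame): MiniFrame = frame.withColumn(output) { row =>
    VectorValue(inputs.flatMap { name =>
      row(name) match {
        case DoubleValue(d) => Vector(d)
        case VectorValue(v) => v
        case StringValue(_) => throw new IllegalArgumentException(s"cannot assemble string column $name")
      }
    })
  }
}

// A pipeline applies its stages in order.
final case class MiniPipeline(stages: Vector[MiniTransformer]) extends MiniTransformer {
  def transform(frame: MiniFrame): MiniFrame =
    stages.foldLeft(frame)((current, stage) => stage.transform(current))
}

object MiniFrameDemo {
  // Column names and data are made up for the example.
  def run(): MiniFrame = {
    val frame = MiniFrame(
      Vector("color", "size"),
      Vector(
        Vector(StringValue("red"), DoubleValue(1.5)),
        Vector(StringValue("blue"), DoubleValue(2.0))))
    val pipeline = MiniPipeline(Vector(
      MiniStringIndexer("color", "color_index", Vector("blue", "red")),
      MiniAssembler(Vector("color_index", "size"), "features")))
    pipeline.transform(frame)
  }

  def main(args: Array[String]): Unit = println(run().rows.map(_.last))
}
```

Because the frame is an ordinary in-memory value, the same pipeline object could run inside an API server request handler, which is the deployment style the module list describes.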
Please see the mleap-demo project for an example of building and using a pipeline with MLeap.
Currently MLeap only supports a select set of estimators/transformers in Spark as a proof of concept.
- StringIndexer
- Tokenizer
- HashingTF
- VectorAssembler
- StandardScaler
- LinearRegression
- RandomForestRegressor
- RandomForestClassifier
- SupportVectorMachine (Must Use Estimator Provided with mleap-spark)
- Pipeline
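As an example of a pipeline built only from stages in the supported list, here is a sketch using the standard Spark ML API. The column names (`city`, `sqft`, `price`) and the object name `SupportedPipelineSketch` are hypothetical. The step that converts the fitted Spark model to an MLeap pipeline is deliberately left out, since that API is not described here; the mleap-demo project is the place to look for it.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

object SupportedPipelineSketch {
  // Builds an unfitted Spark ML pipeline; all column names are assumptions.
  def build(): Pipeline = {
    // Turn a categorical string column into a numeric index.
    val cityIndexer = new StringIndexer()
      .setInputCol("city")
      .setOutputCol("city_index")

    // Combine the numeric columns into one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("city_index", "sqft"))
      .setOutputCol("raw_features")

    // Scale features to unit standard deviation.
    val scaler = new StandardScaler()
      .setInputCol("raw_features")
      .setOutputCol("features")
      .setWithMean(false)
      .setWithStd(true)

    // Final estimator: predict the label from the scaled features.
    val regression = new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol("price")

    new Pipeline().setStages(Array[PipelineStage](cityIndexer, assembler, scaler, regression))
  }
}
```

Calling `SupportedPipelineSketch.build().fit(trainingData)` on a DataFrame with those columns would return a Spark `PipelineModel`, which is the kind of object mleap-spark is described as converting into an MLeap transformer.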
Planned future work:
- Provide Python/R bindings
- Unify linear algebra and core ML models library with Spark
- Deploy outside of the JVM to embedded systems
- Full support for all Spark transformers
There are a few ways to contribute to MLeap.
- Write documentation. As you can see looking through the source code, there is very little.
- Contribute an Estimator/Transformer from Spark.
- Use MLeap at your company and tell us what you think.
- Make a feature request or report a bug on GitHub.
- Make a pull request for an existing feature request or bug report.
- Join the discussion of how to get MLeap into Spark as a dependency.
- Hollin Wilkins ([email protected])
- Mikhail Semeniuk ([email protected])
- Ram Sriharsha ([email protected])
See the LICENSE and NOTICE files in this repository.
Copyright 2016 TrueCar, inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.