Deploying machine learning data pipelines and algorithms should not be a time-consuming or difficult task. MLeap allows data scientists and engineers to deploy machine learning pipelines from Spark and Scikit-learn to a portable format and execution engine (MLeap documentation).
MLeap helps you serialize and deserialize your model or pipeline. It exists for Spark, Scikit-Learn and TensorFlow. You can use it with Scikit-Learn, but the documentation is quite sparse for the moment. This is why I decided to write this small tutorial.
Prepare some data
I will use NumPy and pandas to generate some data.
Suppose you want to train a classifier. You need a dataframe with a feature column and a label column.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 1), columns=['a'])
df["y"] = (df['a'] > 0.5).astype(int)
df.head()
You might get a result like this:
a | y |
---|---|
0.61 | 1 |
0.26 | 0 |
0.20 | 0 |
A dataframe with variable a as a feature column and y as a label column for a classification.
Serialize with MLeap and Scikit-Learn
I will use a logistic regression.
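import os
from mleap.sklearn.logistic import LogisticRegression

bundle_dir = "/dbfs/FileStore/tables/mleaptestmodel"
# Precaution: create the target folder if it is missing (harmless if it already exists).
os.makedirs(bundle_dir, exist_ok=True)

logistic_regression = LogisticRegression(fit_intercept=True)
# mlinit is added by MLeap: it records the input feature and the output column name.
logistic_regression.mlinit(input_features='a',
                           prediction_column='e_binary')
# Pass the label as a 1-D Series to avoid scikit-learn's column-vector warning.
logistic_regression.fit(df[['a']], df['y'])
# Write the fitted model to the bundle folder under the name "model.json".
logistic_regression.serialize_to_bundle(bundle_dir, "model.json")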
The mleaptestmodel folder now contains several files. Together they represent your model in a serialized format.
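If you want to see what was written, a small helper like the one below can walk the folder and print every file it finds. This is only a sketch: `list_bundle_files` is a name I made up for illustration, it is not part of MLeap, and it relies on nothing but the Python standard library. It makes no assumption about which files MLeap produces; it simply reports whatever is there.

```python
import os


def list_bundle_files(bundle_dir):
    """Return the sorted relative paths of all files found under bundle_dir."""
    found = []
    # os.walk visits bundle_dir and every subfolder below it.
    for root, _dirs, files in os.walk(bundle_dir):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, bundle_dir))
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the serialization step; adjust it to your environment.
    for rel_path in list_bundle_files("/dbfs/FileStore/tables/mleaptestmodel"):
        print(rel_path)
```

If the folder does not exist, the helper returns an empty list instead of raising an error, so an empty output is a hint that the serialization did not write where you expected.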
Deserialize with MLeap and Scikit-Learn
To get the model back and use it for a prediction, you can do this:
from mleap.sklearn.logistic import LogisticRegression

# The node name is the model name used at serialization time plus ".node".
node_name = "{}.node".format("model.json")
logistic_regression_tf = LogisticRegression()
model = logistic_regression_tf.deserialize_from_bundle(
    "/dbfs/FileStore/tables/mleaptestmodel",
    node_name
)
# Predict with the restored model on the same feature column.
expected = model.predict(df[["a"]])
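A quick way to gain confidence in the round trip is to check that the restored model predicts the same labels as the one you trained. The helper below is a minimal sketch, not an MLeap feature: `same_predictions` is a hypothetical name, and it only assumes that both objects expose a Scikit-Learn style `predict` method returning a 1-D sequence of labels.

```python
def same_predictions(original, restored, features):
    """Return True when both estimators predict identical labels for `features`."""
    # Convert to plain lists so the comparison yields a single bool,
    # whether predict() returns a list or a 1-D NumPy array.
    before = list(original.predict(features))
    after = list(restored.predict(features))
    return before == after
```

In the same session as the code above, you could call `same_predictions(logistic_regression, model, df[["a"]])`. It should return `True` if the serialized model was restored faithfully; a `False` would be a reason to inspect the bundle files.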
Conclusion
MLeap seems to be a nice project, but it is not very active, and I'm not sure I would recommend it. At least you now have a starting point for logistic regression if you have to work with it.
For the moment, the documentation is quite sparse. For instance, I found nothing about deserializing with Scikit-Learn.
The project's tests helped me understand this part.
I plan to open a pull request to add some documentation.
Thank you for reading.