Machine Learning Design Patterns
In this newsletter, I would like to focus on a book: Machine Learning Design Patterns. I strongly recommend it. It deals with common problems and possible solutions that you can encounter in your ML journey.
From data representation to responsible AI, it is quite uplifting.
It consists of 30 patterns. To give you a taste, I will give you a brief overview of some of them.
Data Representation - Feature Cross
Sometimes, when the features x_1 and x_2 are used separately, no linear boundary can separate the classes. To solve the problem, we would have to make the model more complex, perhaps by adding more layers. However, a simpler solution exists.
The solution is to cross the two features, that is, to combine them into a single new feature. For instance, you can concatenate hour_of_day and day_of_week into one categorical feature.
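As a minimal sketch of that idea, the snippet below crosses hour_of_day and day_of_week by concatenating them, then one-hot encodes the result. The helper name feature_cross, the sample rows, and the "Mon_08" string format are assumptions made for illustration; they do not come from the book.

```python
def feature_cross(hour_of_day: int, day_of_week: str) -> str:
    """Combine two categorical values into one crossed category."""
    return f"{day_of_week}_{hour_of_day:02d}"


# Hypothetical sample rows, invented for this example
rows = [
    {"hour_of_day": 8, "day_of_week": "Mon"},
    {"hour_of_day": 8, "day_of_week": "Sat"},
    {"hour_of_day": 17, "day_of_week": "Mon"},
]

crossed = [feature_cross(r["hour_of_day"], r["day_of_week"]) for r in rows]

# Give every distinct crossed value its own index, then one-hot encode it
vocabulary = {value: index for index, value in enumerate(sorted(set(crossed)))}
one_hot = [
    [1 if vocabulary[value] == index else 0 for index in range(len(vocabulary))]
    for value in crossed
]

print(crossed)
print(one_hot)
```

With the crossed feature, a linear model can learn a separate weight for "Monday at 8" and "Saturday at 8", which the two features taken separately do not allow.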
Problem Representation - Reframing
You can reframe your problem from regression to classification.
For example, instead of predicting the exact time needed to do a task, we can bucketize the outputs and predict the bucket.
We can also do the opposite. For example, we can treat ratings as a continuous value and predict them with a regression.
The wider the distribution of the target, the more accurate the model tends to be when the problem is framed as a classification. The narrower the distribution, the more accurate the model tends to be when it is framed as a regression.
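To make the regression-to-classification direction concrete, here is a small sketch that turns a task duration in minutes into a bucket label that a classifier could predict. The boundaries, the labels, and the function name bucketize are hypothetical choices for this example.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries, in minutes
BOUNDARIES = [15, 30, 60, 120]
LABELS = ["<15", "15-30", "30-60", "60-120", ">=120"]


def bucketize(minutes: float) -> str:
    """Map a continuous duration to the label of the bucket it falls into."""
    # bisect_right counts how many boundaries are <= minutes,
    # which is exactly the index of the matching label
    return LABELS[bisect_right(BOUNDARIES, minutes)]


durations = [12.5, 15, 45, 300]
print([bucketize(d) for d in durations])
```

The model is then trained on these labels rather than on the raw durations, so it outputs a probability per bucket instead of a single number.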
Model Training - Transfer learning
Transfer learning works because it lets us stand on the shoulders of giants, utilizing models that have already been trained on a large dataset to do image classification.
Transfer learning works well for text and images. It is much harder to apply to tabular data, because tabular datasets are too specific to the problem they come from.
Reproducibility - Workflow Pipeline
In the Workflow Pipeline design pattern, we address the problem of creating an end-to-end reproducible pipeline by containerizing and orchestrating the steps in our machine learning process. The containerization might be done explicitly, or using a framework that simplifies the process.
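The orchestration half of this pattern can be sketched in plain Python. The snippet below is only an in-process stand-in: each step is a function that receives the artifacts produced so far and returns them with its own output added, and run_pipeline executes the steps in a fixed order. It does no containerization, and the step names and toy data are assumptions for illustration; in a real setup each step would typically run in its own container under a pipeline framework.

```python
from typing import Any, Callable, Dict, List

Artifacts = Dict[str, Any]
Step = Callable[[Artifacts], Artifacts]


def ingest(artifacts: Artifacts) -> Artifacts:
    # Toy data standing in for a real data source
    return {**artifacts, "raw": [3.0, 1.0, 2.0]}


def preprocess(artifacts: Artifacts) -> Artifacts:
    raw = artifacts["raw"]
    top = max(raw)
    return {**artifacts, "scaled": [value / top for value in raw]}


def train(artifacts: Artifacts) -> Artifacts:
    scaled = artifacts["scaled"]
    # A trivial "model": the mean of the scaled values
    return {**artifacts, "model": {"mean": sum(scaled) / len(scaled)}}


def run_pipeline(steps: List[Step]) -> Artifacts:
    """Run the steps in order, passing the accumulated artifacts along."""
    artifacts: Artifacts = {}
    for step in steps:
        artifacts = step(artifacts)
    return artifacts


result = run_pipeline([ingest, preprocess, train])
print(result["model"])
```

Because every step only reads from and writes to the artifacts it is given, the same list of steps can be rerun from scratch and should give the same result, which is the reproducibility the pattern is after.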
Responsible AI - Heuristic Benchmark
The Heuristic Benchmark pattern compares an ML model against a simple, easy-to-understand heuristic in order to explain the model’s performance to business decision makers.
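A minimal sketch of such a comparison, assuming a regression problem where the heuristic is "always predict the mean of the training targets". All numbers, including the model predictions, are hypothetical and only serve to show the computation.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical data, invented for this example
train_targets = [10.0, 12.0, 14.0, 16.0]
test_targets = [11.0, 15.0]
model_predictions = [11.5, 14.0]

# Heuristic benchmark: always predict the training mean
baseline_value = mean(train_targets)
baseline_predictions = [baseline_value] * len(test_targets)

baseline_mae = mean_absolute_error(test_targets, baseline_predictions)
model_mae = mean_absolute_error(test_targets, model_predictions)
improvement = 1 - model_mae / baseline_mae

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE: {model_mae:.2f}")
print(f"error reduction vs. heuristic: {improvement:.1%}")
```

Saying that the model cuts the error of the obvious heuristic by some percentage is usually easier for a decision maker to interpret than a raw metric on its own.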
Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available
With the Apache Spark 3.1 release in March 2021, the Spark on Kubernetes project is now officially declared as production-ready and Generally Available. This is the achievement of 3 years of booming community contribution and adoption of the project - since initial support for Spark-on-Kubernetes was added in Spark 2.3 (February 2018).
Holidays \O/
I’m going to take a break (2 weeks). As a consequence, there will be no newsletter during this time.
See you soon!
Thank you for reading. Feel free to contact me on Twitter if you want to discuss any of this.