Comparison of different tools to do unit tests for data

Jan 08, 2021

TL;DR:

I recently compared several tools for unit testing data. Here are the results.

	Pydeequ	Drunken Data Quality	Dataframes rules engine	Great Expectations	Apache Griffin
Language	Python	Python	Scala	Python	JSON (configuration outside the code)
Spark 3 compatible	No	No	Not tested	Yes	X
Databricks compatible	Unknown error (Pydeequ has never been tested in Databricks)	Not tested	Not tested	Yes	Not tested (needs Hive to work)
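
To make "unit tests for data" concrete, here is a minimal sketch in plain Python of the kind of check these tools automate. It does not use the API of any tool in the table. The helper names (`not_null`, `unique`, `between`, `run_checks`), the column names, and the sample rows are all hypothetical, made up for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]  # hypothetical row type: one dict per record


def not_null(rows: List[Row], column: str) -> bool:
    """True when no row has a missing value in `column`."""
    return all(row.get(column) is not None for row in rows)


def unique(rows: List[Row], column: str) -> bool:
    """True when every value in `column` appears only once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def between(rows: List[Row], column: str, low: float, high: float) -> bool:
    """True when every value in `column` is present and within [low, high]."""
    for row in rows:
        value = row.get(column)
        if value is None or not (low <= value <= high):
            return False
    return True


def run_checks(rows: List[Row]) -> Dict[str, bool]:
    """Run a small, hard-coded suite of checks and report pass/fail per check."""
    return {
        "id is not null": not_null(rows, "id"),
        "id is unique": unique(rows, "id"),
        "age is between 0 and 120": between(rows, "age", 0, 120),
    }


# Made-up sample data: the last row repeats an id and has an implausible age.
sample = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 51},
    {"id": 2, "age": 140},
]

results = run_checks(sample)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

The tools compared here express the same idea declaratively and run it on Spark dataframes rather than on Python lists, which is why Spark 3 and Databricks compatibility matter so much in the table above.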

Pydeequ

I must say that I’m disappointed. I like Deequ, the version for Scala users, very much, but I was unable to make Pydeequ work in Databricks. There’s an open issue about it. Besides, Pydeequ is not compatible with Spark 3.

Apache Griffin

Apache Griffin seems mature, but it’s a bit daunting: you have to dedicate special infrastructure and a separate project to work with it.

Dataframes rules engine

This project is developed by Databricks, but it is only for Scala users.

Note that Databricks is supposed to release a new project for unit testing data, named Delta expectations, soon.

Drunken Data Quality

As this project is not compatible with Spark 3, I didn’t test it.

Great Expectations

Great Expectations checks all my boxes: Python, Spark 3, and Databricks. Besides, the support is excellent.

My only problem is that some helpful features (profiling) are slow to run.

Thank you for reading. Feel free to contact me on Twitter if you want to discuss this.

Machine learning in real life

Nastasia Saby