# plugins-general
p
Is anyone currently using Great Expectations or looking to start using it soon? I'm testing it out with Meltano since we're planning to support it as a plugin, but I'm having trouble finding use cases that aren't easily solved with dbt tests or dbt_expectations. I'd love some feedback from people in the community using it!
@monika_rajput I saw your thread and wondered if you ever got it working and what your use case was
l
@pat_nadolny I implemented Great Expectations with Meltano/Airflow. We only used it for null checks after the transform pipeline was done, so nothing too fancy. The way I implemented it was through a custom DAG that Airflow ran separately from our Meltano ELT pipeline. But recently I switched over to dbt tests: I needed some Conditional Expectations, and they didn't work for our use case since that feature is still experimental. Hope that helps.
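For reference, a rough sketch of that pattern, not the actual DAG: a standalone Airflow DAG that only runs the quality checks, separate from the Meltano ELT DAG. The DAG id, project path, and checkpoint name are placeholders, and it assumes Airflow 2.x plus a Great Expectations v3-style project with a checkpoint already configured.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Standalone quality-check DAG, run separately from the Meltano ELT DAG.
with DAG(
    dag_id="ge_post_transform_checks",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_null_checks = BashOperator(
        task_id="run_ge_checkpoint",
        # Assumes a GE project at this (hypothetical) path with a checkpoint
        # wrapping a suite of expect_column_values_to_not_be_null checks.
        bash_command=(
            "cd /opt/great_expectations && "
            "great_expectations checkpoint run post_transform_null_checks"
        ),
    )
```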
b
@pat_nadolny I'm not using it and haven't worked with it yet, but it's on our roadmap. I have some use cases in mind: I would like to verify that the information in my warehouse is up to date against the source for my incremental syncs. 1. The source, filtered by the max replication key value used for the incremental sync, should return the same amount of data. 2. The columns that have to be updated should have the same counts. Context for point 2: I'm running two pipelines with the same source and destination, and I only change the incremental key to get my data updated. We see that one of the pipelines doesn't update some fields. I don't know if that could be done in Great Expectations; right now I validate it manually and am improving that into a Python script. Update: we use dbt tests to check freshness.
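A minimal sketch of what that script might codify, with hypothetical connection strings, table, and column names: count the source rows at or below the max replication key already loaded, then compare against the warehouse count. In Great Expectations the final comparison would map to expect_table_row_count_to_equal with the source count as the expected value.

```python
import sqlalchemy as sa

# Hypothetical connection strings and table/column names.
SOURCE_URI = "postgresql://user:pass@source-db:5432/app"
WAREHOUSE_URI = "snowflake://user:pass@account/db/schema"

source = sa.create_engine(SOURCE_URI)
warehouse = sa.create_engine(WAREHOUSE_URI)

with warehouse.connect() as wh:
    # Max replication key value and row count already loaded in the warehouse.
    max_key = wh.execute(sa.text("SELECT MAX(updated_at) FROM orders")).scalar()
    wh_count = wh.execute(sa.text("SELECT COUNT(*) FROM orders")).scalar()

with source.connect() as src:
    # Rows in the source that should have been captured by the incremental sync.
    src_count = src.execute(
        sa.text("SELECT COUNT(*) FROM orders WHERE updated_at <= :max_key"),
        {"max_key": max_key},
    ).scalar()

# Plain assertion here; in GE this is expect_table_row_count_to_equal(src_count).
assert src_count == wh_count, f"source={src_count} warehouse={wh_count}"
```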
p
@lars thanks for that! So it sounds like you're now preferring dbt tests over GE, but GE gives more fine-grained control, so you use it for the subset of tests that can't be done in dbt test, do I have that right? Since you've used both, do you have any insights on limitations of either, or things that could be improved by Meltano wrapping it? I've found it pretty slow to get a set of expectations together vs dbt tests.
@boggdan_barrientos thanks for your note! That's a good use case. I had a similar situation in the past where we were syncing a Postgres DB to Snowflake and wanted to periodically verify that all the records exist in Snowflake (using counts, aggregates, etc.); we ended up with a Python script that wasn't the best. I do think GE could solve this "source to target" comparison use case, based on a thread I found in their Slack community and a linked online forum thread.
l
@pat_nadolny Yes, dbt tests are easier to use and set up in my opinion. We didn't need the Jupyter notebooks that Great Expectations created, so it had features we didn't use, and our use case is currently very small. We completely moved to dbt tests and got rid of Great Expectations. What I can say is that working with Great Expectations and implementing it into our Meltano/Airflow setup was a bit complicated at first, while dbt test was a breeze to set up. I do find the use case described above interesting and could see us needing that in the future as well.
v
The most solid use case I've heard is a team that doesn't want to use dbt because they are a Python shop and don't know SQL that well. It makes some sense to stay away from dbt in that case and stick to what your team knows. The conversation I had wasn't about Great Expectations, but I think that's the core of the issue (new language/framework or not)!
p
@lars good to know, this is awesome insight. I also haven't really loved the Jupyter notebook workflow, but I'm sure some people do. The profiling feature for helping build expectations is relatively useful because it seems to just go off and run a bunch of queries against your dataset, so even if the results aren't useful as expectations, it's still good for understanding your data during development. Unfortunately my profiling was constantly erroring out, but I'm using Athena, which isn't always the best-supported platform, so maybe something like Redshift or Snowflake would have an easier time.
@visch thanks for sharing. Yeah, that sounds like a good use case for sure. It seems like most people using the Meltano stack are in SQL land, though someone using target-s3-csv or target-s3-avro (these don't look to be on MeltanoHub for some reason though 🤔) might find value in it.
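To illustrate the profiling workflow, a minimal sketch that profiles a local sample rather than the live Athena table. The file name is hypothetical, and this assumes the legacy pandas-backed API (roughly GE 0.15), since the profiler entry points have moved around between versions.

```python
import great_expectations as ge
import pandas as pd
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

# Profile a local sample extract instead of the warehouse table.
df = pd.read_csv("orders_sample.csv")  # hypothetical sample file
gdf = ge.from_pandas(df)

profiler = UserConfigurableProfiler(profile_dataset=gdf)
suite = profiler.build_suite()

# The generated suite is a starting point: review the proposed expectations
# and keep only the ones that encode real rules about your data.
for exp in suite.expectations:
    print(exp.expectation_type, exp.kwargs)
```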
c
I am super new to Meltano and Great Expectations, but I am sure they are part of the toolset to move forward with my use case. I am using open-data datasets from my city to do some basic analysis. So far, I have found errors in most of the datasets and have been manually fixing those using Pandas. As I plan to keep working on this topic, it makes no sense to keep doing those manual fixes, so I plan to move to a system with pipelines to load and verify the datasets and notify the city helpdesk of issues with them. In return, this will improve the quality for everyone involved. Great Expectations seems like a good fit, as I can share the expected values and tests. This also has the potential to attract additional collaborators, since the expectations could be simpler to read, in plain language, compared to SQL or Python manipulations.
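A minimal sketch of what those shared, readable checks could look like. The dataset, column names, and allowed values are invented, and it uses the legacy pandas API (pre-1.0 GE); the point is that each expectation reads almost like plain language, which is what makes it easy to share with non-programmers.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical open-data file and column names.
df = pd.read_csv("city_open_data/street_trees.csv")
trees = ge.from_pandas(df)

# Checks that read close to plain language.
trees.expect_column_values_to_not_be_null("tree_id")
trees.expect_column_values_to_be_unique("tree_id")
trees.expect_column_values_to_be_between("height_m", min_value=0, max_value=60)
trees.expect_column_values_to_be_in_set("status", ["alive", "removed", "stump"])

results = trees.validate()
if not results.success:
    # Placeholder for the "notify the city helpdesk" step described above.
    failed = [r for r in results.results if not r.success]
    print("Dataset issues found:", len(failed))
```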