# singer-tap-development
s
Hey team; very off-topic, so please direct me towards the best channel. I'm just hard-pressed to find a better team of Data Engineers 😛 After creating a solid basis for our EL pipeline using Meltano, our next big issue as a company is applying our T directly in our warehouse (GBQ). The goal here is to keep a clear separation between the raw data imported directly from the source and the data that ends up in our dashboards. Would anyone have insights on best practices for implementing the pipeline between your warehouse and your dashboards? This may include:
• Creating additional views with dbt
• Manipulating complicated data using Python
• Applying auditing
• Using tools to distribute load
Thank you so much!
d
Not sure I can dictate best practices, but I can tell you at a high level what we do 😅 We use Airflow (Composer) to orchestrate our pipelines. In short, a datasource-specific DAG would:
• Run the Meltano ELT job
• Run the resulting downstream dbt models
(see the sketch below)
For auditing, we export all query logs to BQ. Users only have read access, so any DDL has to be done via git/CI-CD. Pretty much all GCP infra is managed via Terraform (including table schemas). We also use Data Catalog's policy tags to limit access to certain columns.
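Roughly, a datasource-specific DAG like that could look something like the minimal sketch below. This assumes Airflow 2.x with Meltano and dbt available on the worker; the DAG id, project paths, tap/target names, and dbt selector are all hypothetical placeholders, not our actual setup.
```python
# Minimal sketch of a datasource-specific DAG: Meltano EL first, then downstream dbt models.
# Paths, tap/target names, and the dbt selector are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_source_elt",      # one DAG per data source (hypothetical name)
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 1: run the Meltano EL job for this source into BigQuery
    meltano_elt = BashOperator(
        task_id="meltano_elt",
        bash_command=(
            "cd /opt/meltano_project && "
            "meltano elt tap-example target-bigquery"
        ),
    )

    # Step 2: rebuild only the dbt models downstream of this source's raw tables
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=(
            "cd /opt/dbt_project && "
            "dbt run --select source:example_source+"
        ),
    )

    meltano_elt >> dbt_run
```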
p
dbt's style guide can be a good starting point @Stéphane Burwash https://github.com/dbt-labs/corp/blob/main/dbt_style_guide.md
Also more generally their best practices page is a good reference too https://docs.getdbt.com/docs/guides/best-practices
If possible, try to avoid the temptation of using Python to manipulate data; usually, with some thought, you can do all the manipulation in SQL, and then you have a single unified place for all your transformations
A notable exception is obviously Machine Learning
s
Thank you so much 😄