hey folks! we’re trying out meltano’s environments feature... | Meltano #best-practices

prratek_ramchandani

01/04/2022, 5:13 PM

hey folks! we’re trying out meltano’s environments feature and i’d love to be able to do something like set the start date for an extractor to a relative date (like yesterday) for a dev environment. is there a way to achieve this? i looked at the squared project, which seems to just hard-code dates, but if the objective is to limit API calls or run time then you end up having to update that config every now and then.

visch

01/04/2022, 5:16 PM

I think the reason this isn't out there is that it's kind of an anti-pattern. State files track dates for you, and Meltano stores those dates. There's a whole host of issues you run into when you say "run everything since yesterday". A simple one off the top of my head is failures (what happens if yesterday's run failed? now you want the last two days, right?)

prratek_ramchandani

01/04/2022, 5:20 PM

fair point - i imagine there are cases however where you don’t care about state between runs and just want to run an elt command for a short time window to ensure it runs without errors. do you have thoughts on how you’d approach something like that?

visch

01/04/2022, 5:22 PM

Can I flip this on you? What's the problem you're trying to solve? Is it testing to be sure the tap runs properly? The target runs properly?

prratek_ramchandani

01/04/2022, 5:27 PM

there are two problems i’m trying to solve currently:
1. test to ensure i’m able to load data from the tap to a bigquery target - these can sometimes fail if a column data type is inconsistent or if the schema for a stream has an object with unspecified types for nested fields
2. run and test downstream dbt models
and i guess more generally i often want to just run an entire ELT pipeline before pushing to prod to ensure there aren’t other issues i haven’t anticipated

visch

01/04/2022, 5:29 PM

So today, I'm assuming you have a full pipeline that runs meltano with dbt. Today sometimes the run fails when column data types are inconsistent? And then you have to go manually fix something, and run the job again?

visch

01/04/2022, 5:30 PM

Is the issue that the time it takes you to fix the problem is too long right now? Really a meltano run should "just" work every run. You're implying that it doesn't right now, so what do you mean? Does your source schema change a lot due to object json types?

visch

01/04/2022, 5:31 PM

and i guess more generally i often want to just run an entire ELT pipeline before pushing to prod to ensure there aren’t other issues i haven’t anticipated

My thought here is a dev environment

visch

01/04/2022, 5:34 PM

these can sometimes fail if a column data type is inconsistent or if the schema for a stream has an object with unspecified types for nested fields

hmm, what do you do to fix this right now?

nigel_vining

01/04/2022, 6:43 PM

I'm struggling with this challenge as well … Schema inconsistencies (BQ is really particular). Generally the state pattern works well and I just drop the persisted state file occasionally to trigger a full extract back to the start date.

aaronsteers

01/04/2022, 6:44 PM

Hi, @prratek_ramchandani. I've run into this also - but specifically for dev/test and CI pipelines where we want to limit the data to an arbitrary and smaller time window. To @visch’s point, this would very likely be an anti-pattern in any prod-like environments. Somewhat related is the discussion around "builds" here, where the concept of a build would be an end-to-end run on a full meltano project, including EL, T, etc. Can you confirm the use cases you had in mind are matching with the concept of a build/test step? Or are you also looking for prod-like relative dates support?

aaronsteers

01/04/2022, 6:47 PM

When we are running a build/test cycle (such as in a CI/CD pipeline), we:
1. Don't want a pre-captured state reference to cause a stream to entirely not run. (For instance, if there are zero new records, or not enough new records to trigger a representative test.)
2. Don't want to constantly have to push forward a hardcoded start_date. (What starts out as a "fast" test of a month of data eventually becomes a much longer-running test covering multiple months.)
Does that cover it?

prratek_ramchandani

01/04/2022, 6:59 PM

yep those two requirements cover it! and yes my use case for relative dates is limited to CI and dev environments

visch

01/04/2022, 7:01 PM

Quick solution for the date thing @prratek_ramchandani: https://stackoverflow.com/questions/49173988/how-to-get-commit-date-and-time-on-gitlab-ci. Something like that in a CI action would do the trick if it's for a dev environment. There's still more that would be good to dive into here 😄

visch

01/04/2022, 7:02 PM

Could parse something like CI_JOB_STARTED_AT from https://docs.gitlab.com/ee/ci/variables/predefined_variables.html
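
A rough sketch of the idea above, not something posted in the thread: it assumes GitLab's CI_JOB_STARTED_AT is an ISO 8601 timestamp (as the GitLab docs describe) and that the tap accepts an ISO 8601 start date. The function name `relative_start_date` and the `days_back` parameter are made up for illustration.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Optional


def relative_start_date(job_started_at: Optional[str] = None, days_back: int = 1) -> str:
    """Return a UTC timestamp `days_back` days before the reference time.

    The reference time is the CI job start when one is given, else "now".
    """
    if job_started_at:
        # Older Python versions reject a trailing "Z" in fromisoformat,
        # so swap it for an explicit UTC offset first.
        reference = datetime.fromisoformat(job_started_at.replace("Z", "+00:00"))
    else:
        reference = datetime.now(timezone.utc)
    start = reference.astimezone(timezone.utc) - timedelta(days=days_back)
    return start.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Outside GitLab CI the variable is unset and the current time is used.
    print(relative_start_date(os.environ.get("CI_JOB_STARTED_AT")))
```

In a dev or CI job, the printed value could be exported as the extractor's start date setting before calling meltano. The exact environment variable name depends on the plugin, so check the plugin's settings rather than taking a name from this sketch.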

prratek_ramchandani

01/04/2022, 7:03 PM

oh that’s neat

aaronsteers

01/04/2022, 8:12 PM

Feels like this is a common enough use case to warrant a SO question: "continuous integration - How to provide a relative start date for Meltano CI/CD pipelines on Singer" (Stack Overflow). cc @amanda.folson

pat_nadolny

01/04/2022, 9:56 PM

+1 on this thread. For the squared repo it would be nice to have a relative start_date set for testing but also for the user_dev local development environment. Sometimes I just want 1 day of data but my local db state or start_date is weeks ago so I need to manually update the start_date in the yaml to not load tons of data
