# best-practices
p
hey folks! we’re trying out meltano’s environments feature and i’d love to be able to set the start date for an extractor to a relative date (like yesterday) for a dev environment. is there a way to achieve this? i looked at the squared project, which seems to just hard-code dates, but if the objective is to limit API calls or run time then you end up having to update that config every now and then.
v
I think the reason this isn't out there is that it's kind of an anti-pattern. State files track dates for you, and meltano stores those dates. There's a whole host of issues you run into when you say "run everything since yesterday". A simple one off the top of my head is failures: what happens if yesterday's run failed? Now you want the last two days, right?
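To make the failure scenario concrete, here is a minimal sketch in plain Python. It does not use Meltano's API; `missed_interval` and the example dates are hypothetical and only illustrate the arithmetic behind "now you want the last two days".

```python
from datetime import datetime, timedelta, timezone


def relative_window_start(now, days_back=1):
    """Start of a fixed 'since N days ago' window."""
    return now - timedelta(days=days_back)


def missed_interval(last_success, now, days_back=1):
    """Return the (start, end) span a relative window would skip, or None.

    A state bookmark would resume from last_success, so nothing is skipped.
    A relative window always starts days_back before now, whatever happened
    on earlier runs.
    """
    window_start = relative_window_start(now, days_back)
    if last_success < window_start:
        return (last_success, window_start)
    return None


# Hypothetical timeline: the run on March 9 failed, so the last good run is March 8.
now = datetime(2022, 3, 10, tzinfo=timezone.utc)
last_success = datetime(2022, 3, 8, tzinfo=timezone.utc)
gap = missed_interval(last_success, now)
```

With these dates, `gap` covers March 8 to March 9: the day a "since yesterday" run would silently drop, and which a stored bookmark would pick up.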
p
fair point - i imagine there are cases however where you don’t care about state between runs and just want to run an elt command for a short time window to ensure it runs without errors. do you have thoughts on how you’d approach something like that?
v
Can I flip this on you? What's the problem you're trying to solve? Is it testing to be sure the tap runs properly? The target runs properly?
p
there are two problems i’m trying to solve currently:
1. test to ensure i’m able to load data from the tap to a bigquery target - these can sometimes fail if a column data type is inconsistent or if the schema for a stream has an object with unspecified types for nested fields
2. run and test downstream dbt models
and i guess more generally i often want to just run an entire ELT pipeline before pushing to prod to ensure there aren’t other issues i haven’t anticipated
v
So today, I'm assuming you have a full pipeline that runs meltano with dbt. Today sometimes the run fails when column data types are inconsistent? And then you have to go manually fix something, and run the job again?
Is the issue that it takes too long to fix the problem right now? Really, a meltano run should "just" work every run. You're implying that it doesn't right now, so what do you mean? Does your source schema change a lot due to object json types?
"and i guess more generally i often want to just run an entire ELT pipeline before pushing to prod to ensure there aren’t other issues i haven’t anticipated"
My thought here is a dev environment
"these can sometimes fail if a column data type is inconsistent or if the schema for a stream has an object with unspecified types for nested fields"
hmm, what do you do to fix this right now?
n
I'm struggling with this challenge as well … Schema inconsistencies (BQ is really particular). Generally the state pattern works well and I just drop the persisted state file occasionally to trigger a full extract back to the start date.
a
Hi, @prratek_ramchandani. I've run into this also, but specifically for dev/test and CI pipelines where we want to limit the data to an arbitrary and smaller time window. To @visch’s point, this would very likely be an anti-pattern in any prod-like environment. Somewhat related is the discussion around "builds" here, where the concept of a build would be an end-to-end run on a full meltano project, including EL, T, etc. Can you confirm the use cases you had in mind match the concept of a build/test step? Or are you also looking for prod-like relative dates support?
When we are running a build/test cycle (such as in a CI/CD pipeline), we:
1. Don't want a pre-captured state reference to cause a stream to entirely not run. (For instance, if there are zero new records, or not enough new records to trigger a representative test.)
2. Don't want to constantly have to push forward a hardcoded start_date. (What starts out as a "fast" test of a month of data eventually becomes a much longer-running test covering multiple months.)
Does that cover it?
p
yep those two requirements cover it! and yes my use case for relative dates is limited to CI and dev environments
v
Quick solution for the date thing @prratek_ramchandani https://stackoverflow.com/questions/49173988/how-to-get-commit-date-and-time-on-gitlab-ci , something like that on an action would do the trick if it's for a dev environment. There's still more that would be good to dive into here 😄
Could parse something like https://docs.gitlab.com/ee/ci/variables/predefined_variables.html CI_JOB_STARTED_AT
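A minimal sketch of that parsing step, assuming `CI_JOB_STARTED_AT` arrives as an ISO 8601 UTC timestamp such as `2022-03-10T12:00:00Z`. The helper name `start_date_from_ci` and the `days_back` parameter are hypothetical, not part of GitLab or Meltano.

```python
from datetime import datetime, timedelta, timezone


def start_date_from_ci(job_started_at: str, days_back: int = 1) -> str:
    """Turn a CI job timestamp into a start date days_back earlier.

    Assumes an ISO 8601 input; a trailing 'Z' is rewritten to '+00:00'
    because datetime.fromisoformat rejects 'Z' on Python versions before 3.11.
    """
    started = datetime.fromisoformat(job_started_at.replace("Z", "+00:00"))
    start = (started - timedelta(days=days_back)).astimezone(timezone.utc)
    return start.strftime("%Y-%m-%dT%H:%M:%SZ")


# Example with a fixed timestamp instead of the real CI variable.
example = start_date_from_ci("2022-03-10T12:00:00Z")
```

In a CI job, the input would come from `os.environ["CI_JOB_STARTED_AT"]`, and the result could be exported as the extractor's start date before the pipeline runs. Because the window is recomputed on every job, it stays the same length over time instead of growing.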
p
oh that’s neat
a
Feels like this is a common enough use case to warrant an SO question: "How to provide a relative start date for Meltano CI/CD pipelines on Singer" (Stack Overflow, continuous integration). cc @amanda.folson
p
+1 on this thread. For the squared repo it would be nice to have a relative start_date set for testing, but also for the user_dev local development environment. Sometimes I just want 1 day of data, but my local db state or start_date is weeks ago, so I need to manually update the start_date in the yaml to avoid loading tons of data.
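One way to avoid editing the yaml by hand is to override the setting through an environment variable for a single local run. The sketch below assumes Meltano's convention of reading plugin settings from `<PLUGIN_NAME>_<SETTING_NAME>` variables; the plugin name `tap-example` and both helper functions are hypothetical.

```python
import os
from datetime import datetime, timedelta, timezone


def setting_env_var(plugin_name, setting_name):
    """Build an env var name like TAP_EXAMPLE_START_DATE (assumed convention)."""
    return f"{plugin_name}_{setting_name}".replace("-", "_").upper()


def dev_run_env(plugin_name, days_back=1, now=None, base_env=None):
    """Copy the environment and add a start_date override days_back before now."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days_back)
    env = dict(os.environ if base_env is None else base_env)
    env[setting_env_var(plugin_name, "start_date")] = start.strftime("%Y-%m-%dT%H:%M:%SZ")
    return env


# Fixed 'now' and an empty base environment, so the result is reproducible.
env = dev_run_env(
    "tap-example",
    now=datetime(2022, 3, 10, tzinfo=timezone.utc),
    base_env={},
)
```

The returned mapping could be passed as `env=` to `subprocess.run(["meltano", "--environment=dev", "run", "tap-example", "target-example"], ...)`, where both plugin names are placeholders. If a saved state bookmark exists, many taps resume from it rather than from start_date, so the override likely only helps when state is absent or has been cleared.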