# random
a
Hello! Is there any SQL LIMIT equivalent in meltano? How to extract sampled data?
c
The closest thing I can think of is the `--test` switch for Meltano SDK taps. Using that switch, at most one record will be produced by the tap for each of its streams.
a
Should it be used with `run`, `invoke`, or `config`?
c
It definitely works for `invoke`. I don't actually know if it would be possible to make it work for `run`.
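For instance, a minimal sketch of the `invoke` route (assuming a hypothetical SDK-based extractor named `tap-my-source` is already installed in the project):
```
# Runs the tap's built-in connection test, emitting at most
# one record per stream (tap-my-source is a placeholder name)
meltano invoke tap-my-source --test
```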
Do you have a quick description of what your use case is for this "sample data" functionality?
a
For example, I have a CSV file with 100,000 rows and I want to test the pipeline and see a sample of the data on the target. But debugging on the whole dataset would be inefficient. Other solutions offer a 'sample data' flag or a 'row number' parameter.
c
I see. That makes a lot of sense. I wonder if an inline stream_map could be used to achieve this functionality.
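As a rough illustration of that idea, an inline stream map with a `__filter__` expression can drop records based on a field value. This is only a sketch: the tap name, the stream name `my_stream`, and the sequential `id` column are all assumptions, and it only approximates a row limit when ids happen to be sequential:
```
plugins:
  extractors:
  - name: tap-my-source   # hypothetical SDK-based tap
    config:
      stream_maps:
        my_stream:
          # keep only records whose (assumed sequential) id is small
          __filter__: record['id'] <= 100
```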
a
I haven't tried maps and transforms yet, but it makes sense to encapsulate this function in one module instead of adding it to every existing tap. It would be great to learn about this from the tutorial; for me it is a basic feature I expect in every ELT/ETL solution.
t
I suggest `cat sourceData.csv | head -10 > smallSample.csv` 😛
a
@thomas_briggs okay, what about remote files and API?
t
I was mostly being facetious. 😉 It's Monday morning, I couldn't resist. But to your point, sources like that are why I would argue that this needs to be defined per-tap and not managed by meltano. If you're paying per-record for an API or running an expensive query against a DB you don't want the tap to pull all the data and then have meltano throw most of it away... you want the tap to impose that limit. The fact that taps and targets are independent of everything else (including meltano) is one of the great strengths of the Singer spec but the need to duplicate things like this is one of the weaknesses. 😕 I think the Meltano SDK will help with a lot of that eventually but even then we'll never see all taps and targets based on that.
a
@aleksei_razvodov - Internally we had some discussions last week around how/if we should make a user config interface for the internal record limit used by connection tests in the SDK. One challenge of doing so was that we cannot ensure that an arbitrary record limit per stream will not break referential integrity and the ability to do incremental sync from that arbitrary stop/pause point. I logged some thoughts here if anyone wants to provide input: • Consider exposing config like `dry_run_record_limit` · Issue #1366 · meltano/sdk (github.com)
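If that proposal landed, the user-facing side might look something like this in meltano.yml. This is entirely hypothetical: `dry_run_record_limit` is only a proposed setting from issue #1366, not an existing SDK option:
```
plugins:
  extractors:
  - name: tap-my-source
    config:
      # proposed, NOT yet implemented (see issue #1366)
      dry_run_record_limit: 100
```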
Of course, to @thomas_briggs' point, this wouldn't solve for all taps, although in theory it could solve for the ones built on the SDK.
@christoph - I like your stream maps idea, combined with @aleksei_razvodov's "row count" limit suggestion. Logged for discussion: • Consider exposing `nth_record` or similar in stream maps · Issue #1367 · meltano/sdk (github.com) This, in theory, could work for non-SDK taps, since the limit would exist between tap and target. (Downside though is that the tap is still iterating through all the records.)
c
> I like your stream maps idea,
I couldn't immediately think of a way to write a filter expression that only drops records after a certain count has been reached. I'm sure it's possible somehow. 😁
a
Yeah - I don't know if it'd be possible today, but adding some tally/counter into the evaluation context could in theory make that possible. (I just fixed the broken link in my message above, which explores this.)
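To make that idea concrete, a counter-based filter might look roughly like this. Note this is purely hypothetical: no record-index variable exists in the stream map evaluation context today, and exposing one is exactly what issue #1367 discusses (the `__record_index__` name is invented for illustration):
```
stream_maps:
  my_stream:
    # HYPOTHETICAL: __record_index__ is not a real SDK variable;
    # it stands in for the tally/counter proposed in issue #1367
    __filter__: __record_index__ < 100
```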