# random
a
Hello! Is there any SQL LIMIT equivalent in meltano? How to extract sampled data?
c
The closest thing I can think of is the `--test` switch for Meltano SDK taps. Using that switch, at most one record will be produced by the tap for each of its streams.
a
Should it be used with `run`, `invoke`, or `config`?
c
It definitely works for `invoke`. I don't actually know if it would be possible to make it work for `run`.
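For instance, a minimal sketch of the `invoke` route (assuming a hypothetical SDK-based extractor named `tap-my-source` is already installed in the project):
```
# Runs the tap's built-in connection test, emitting at most
# one record per stream (tap-my-source is a placeholder name)
meltano invoke tap-my-source --test
```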
Do you have a quick description of what your use case is for this "sample data" functionality?
a
For example, I have a CSV file with 100,000 rows and I want to test the pipeline and see a sample of the data on the target. But debugging on the whole dataset would be inefficient. Other solutions offer a 'sample data' flag or a 'row number' parameter.
c
I see. That makes a lot of sense. I wonder if an inline stream_map could be used to achieve this functionality.
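As a rough illustration of that idea, an inline stream map with a `__filter__` expression can drop records based on a field value. This is only a sketch: the tap name, the stream name `my_stream`, and the sequential `id` column are all assumptions, and it only approximates a row limit when ids happen to be sequential:
```
plugins:
  extractors:
  - name: tap-my-source   # hypothetical SDK-based tap
    config:
      stream_maps:
        my_stream:
          # keep only records whose (assumed sequential) id is small
          __filter__: record['id'] <= 100
```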
a
I haven't tried maps and transforms yet, but it makes sense to encapsulate this function in one module instead of adding it to every existing tap. It would be great to learn about this from the tutorial; for me it is a basic feature I expect in every ELT/ETL solution.
t
I suggest `cat sourceData.csv | head -10 > smallSample.csv` 😛
a
@thomas_briggs okay, what about remote files and API?
t
I was mostly being facetious. 😉 It's Monday morning, I couldn't resist. But to your point, sources like that are why I would argue that this needs to be defined per-tap and not managed by meltano. If you're paying per-record for an API or running an expensive query against a DB you don't want the tap to pull all the data and then have meltano throw most of it away... you want the tap to impose that limit. The fact that taps and targets are independent of everything else (including meltano) is one of the great strengths of the Singer spec but the need to duplicate things like this is one of the weaknesses. 😕 I think the Meltano SDK will help with a lot of that eventually but even then we'll never see all taps and targets based on that.
a
@aleksei_razvodov - Internally we had some discussions last week around how/if we should make a user config interface for the internal record limit used by connection tests in the SDK. One challenge of doing so was that we cannot ensure that an arbitrary record limit per stream will not break referential integrity and the ability to do incremental sync from that arbitrary stop/pause point. I logged some thoughts here if anyone wants to provide input: • Consider exposing config like `dry_run_record_limit` · Issue #1366 · meltano/sdk (github.com)
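If that proposal landed, the user-facing side might look something like this in meltano.yml. This is entirely hypothetical: `dry_run_record_limit` is only a proposed setting from issue #1366, not an existing SDK option:
```
plugins:
  extractors:
  - name: tap-my-source
    config:
      # proposed, NOT yet implemented (see issue #1366)
      dry_run_record_limit: 100
```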
Of course, to @thomas_briggs' point, this wouldn't solve for all taps, although in theory it could solve for the ones built on the SDK.
@christoph - I like your stream maps idea, combined with @aleksei_razvodov's "row count" limit suggestion. Logged for discussion: • Consider exposing `nth_record` or similar in stream maps · Issue #1367 · meltano/sdk (github.com) This, in theory, could work for non-SDK taps, since the limit would exist between tap and target. (Downside though is that the tap is still iterating through all the records.)
c
> I like your stream maps idea,
I couldn't immediately think of a way to write a filter expression that only drops records after a certain count has been reached. I'm sure it's possible somehow. 😁
a
Yeah - I don't know if it'd be possible today, but adding some tally/counter into the evaluation context could in theory make that possible. (I just fixed the broken link in my message above, which explores this.)
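To make that idea concrete, a counter-based filter might look roughly like this. Note this is purely hypothetical: no record-index variable exists in the stream map evaluation context today, and exposing one is exactly what issue #1367 discusses (the `__record_index__` name is invented for illustration):
```
stream_maps:
  my_stream:
    # HYPOTHETICAL: __record_index__ is not a real SDK variable;
    # it stands in for the tally/counter proposed in issue #1367
    __filter__: __record_index__ < 100
```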