# troubleshooting
d
Hey y’all 👋 I’m noticing that the meltano variant of `target-snowflake` executes a single `INSERT` statement per record when using parent-child streams. The tap in question has thousands of parent items, so this is leading to really long and inefficient load times. Is it possible that I have something misconfigured? Has anyone else noticed this issue?
I think I see why this happens. It looks like it’s because when you use parent/child streams, each context value supplied to the children is considered its own stream. The snowflake target seems to flush batches based on individual streams, so we’re ending up with thousands of batches that contain single records.
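To illustrate the pattern being described here, a hypothetical sketch (not the actual target-snowflake code) of a naive Singer target loop: if the target flushes its buffers on every STATE message, a tap that emits STATE after each child context produces thousands of one-record batches.

```python
# Hypothetical sketch of a naive Singer target loop -- NOT the actual
# meltano target-snowflake code. Flushing on every STATE message means a
# tap emitting STATE per parent context yields single-record batches.
import json

def run_naive_target(lines, load_batch):
    buffers = {}  # stream name -> list of buffered records
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            buffers.setdefault(msg["stream"], []).append(msg["record"])
        elif msg["type"] == "STATE":
            # Flush everything on every STATE message: with thousands of
            # parent contexts, each flush holds only a record or two.
            for stream, records in buffers.items():
                if records:
                    load_batch(stream, records)  # e.g. one INSERT per call
            buffers = {}
```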
a
Since Singer doesn't (yet) have a concept of end-of-stream, what's probably happening here is that your target is interpreting the STATE message as a signal to flush records to the target table. We had a similar case with tap-github, where we actually had 5000 state messages for only ~2000 records, which caused the target to load very inefficiently, leading to https://gitlab.com/meltano/sdk/-/merge_requests/137 being implemented in the SDK targets. What we did on the SDK side was to change the logic to not drain unless either a minimum record count was met or the age of the records passed some time threshold. (We started with 30 minutes, but even 4 or 5 minutes would give a huge benefit. You are probably seeing flushes multiple times per second.)
Unfortunately, nothing immediately comes to mind in terms of a mitigation here. The spec does not mandate that targets flush per STATE message, but it's reasonable to do so in most cases, and it's how we started off with the SDK also.
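The count/age drain logic described above could be sketched roughly like this. This is an assumption of how that logic works, with made-up names and thresholds, not the SDK's actual implementation; see the linked merge request for the real change.

```python
# Rough sketch of count/age-based draining -- illustrative only; the
# class name, constants, and thresholds here are made up.
import time

MIN_BATCH_SIZE = 10_000        # drain once this many records are buffered
MAX_BATCH_AGE_SECONDS = 300    # ...or once the oldest record is this old

class StreamBuffer:
    def __init__(self):
        self.records = []
        self.oldest_ts = None  # monotonic timestamp of first buffered record

    def add(self, record):
        if not self.records:
            self.oldest_ts = time.monotonic()
        self.records.append(record)

    def should_drain(self):
        # Instead of flushing on every STATE message, drain only when the
        # batch is big enough or old enough.
        if len(self.records) >= MIN_BATCH_SIZE:
            return True
        if self.oldest_ts is not None:
            return time.monotonic() - self.oldest_ts >= MAX_BATCH_AGE_SECONDS
        return False
```

With thresholds like these, a stream of thousands of single-record contexts accumulates into a handful of large loads instead of thousands of one-row inserts.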
What's the tap out of curiosity?
d
its a private one for the Lessonly API
happy to share code snippets though. its very straightforward
@aaronsteers here is a snippet of what the logs look like when syncing the child streams:
```
time=2021-09-16 12:31:58 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_assignments' with context: {'id': 412829}...
time=2021-09-16 12:31:58 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.10486, 'tags': {'endpoint': '/lessons/{id}/assignments', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412829}}}
time=2021-09-16 12:31:58 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 0, 'tags': {'stream': 'lesson_assignments', 'context': {'id': 412829}}}
time=2021-09-16 12:32:03 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_details' with context: {'id': 412830}...
time=2021-09-16 12:32:03 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.111176, 'tags': {'endpoint': '/lessons/{id}', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412830}}}
time=2021-09-16 12:32:03 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 1, 'tags': {'stream': 'lesson_details', 'context': {'id': 412830}}}
time=2021-09-16 12:32:04 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_assignments' with context: {'id': 412830}...
time=2021-09-16 12:32:04 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.109025, 'tags': {'endpoint': '/lessons/{id}/assignments', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412830}}}
time=2021-09-16 12:32:04 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 0, 'tags': {'stream': 'lesson_assignments', 'context': {'id': 412830}}}
time=2021-09-16 12:32:08 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_details' with context: {'id': 412753}...
time=2021-09-16 12:32:08 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.128393, 'tags': {'endpoint': '/lessons/{id}', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412753}}}
time=2021-09-16 12:32:10 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 1, 'tags': {'stream': 'lesson_details', 'context': {'id': 412753}}}
```
a
Yeah, that makes sense. This is common with parent-child relationships when the number of children is low and/or the number of parents is relatively very large. I think you make a strong case for us to move our meltano fork of snowflake over to the SDK. 🙂 I don't know when we'll have a chance to do so though. You might check other snowflake variants to see if any other handle this case in a more performant manner.
d
okay so it does seem like this is actually related to the snowflake target then, yeah?
and yeah for this API parent:child ratio is 1:1 unfortunately
a
It's both - high number of transitions between parent-child and a target that just isn't optimized (yet) for that usage pattern.
d
got it okay
a
I really can't say if Datamill or Pipelinewise variants handle it better.
d
I did a tiny bit of digging and it didn't look like they did, and we've also had rough experiences with those in the past
so I think we’ll stick with meltano variant
is there an open issue that I can thumbs up for moving the fork over to the SDK?
a
Okay. Out of curiosity, are we talking minutes, hours, tens of hours?
d
hours
it's not only single `INSERT` statements, it seems to be an additional 5-10 queries surrounding each one as well
a
Yeah, that's painful. Our github tap went from timing out after 3 hours to now finishing in about 15 minutes.
d
it adds up when you're syncing tens of thousands of records
a
Fewer flushes is certainly better in these cases. And although I don't know when we'll get to prioritize it, do you think you could open an issue in our target-snowflake with a quick description of this issue?
d
yeah
a
The best solution would be to port the target to the SDK but perhaps there's a short-term solution that's easier/quicker.
d
I was just gonna ask that - the SDK fixes this?
I may take a stab at that if I have free time
a
That would be amazing. And yes, the SDK fixes this.
Because the SDK will only drain records if the min record count is met or a minimum elapsed time is passed.
Completely eliminates the "drain every record and/or multiple times per second" problem
d
perfect
okay I will admit I have very little free time 😅 but maybe I'll see if I can string something together
thanks for the help here 👍
a
No prob. It's also possible you could adapt the MR I linked above in this thread, without doing a full port.