# troubleshooting
d
Hey y’all 👋 I’m noticing that the meltano variant of `target-snowflake` executes a single `INSERT` statement per record when using parent-child streams. The tap in question has thousands of parent items, so this is leading to really long and inefficient load times. Is it possible that I have something misconfigured? Has anyone else noticed this issue?
I think I see why this happens. It looks like it’s because when you use parent/child streams, each context value supplied to the children is considered its own stream. The snowflake target seems to flush batches based on individual streams, so we’re ending up with thousands of batches that contain single records.
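To illustrate the pattern being described here, a hypothetical sketch (not the actual target-snowflake code) of a naive Singer target loop: if the target flushes its buffers on every STATE message, a tap that emits STATE after each child context produces thousands of one-record batches.

```python
# Hypothetical sketch of a naive Singer target loop -- NOT the actual
# meltano target-snowflake code. Flushing on every STATE message means a
# tap emitting STATE per parent context yields single-record batches.
import json

def run_naive_target(lines, load_batch):
    buffers = {}  # stream name -> list of buffered records
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            buffers.setdefault(msg["stream"], []).append(msg["record"])
        elif msg["type"] == "STATE":
            # Flush everything on every STATE message: with thousands of
            # parent contexts, each flush holds only a record or two.
            for stream, records in buffers.items():
                if records:
                    load_batch(stream, records)  # e.g. one INSERT per call
            buffers = {}
```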
a
Since Singer doesn't (yet) have a concept of end-of-stream, what's probably happening here is that your target is interpreting the STATE message as a signal to flush records to the target table. We had a similar case with tap-github, where we actually had 5000 state messages for only ~2000 records, which caused the target to load very inefficiently, leading to https://gitlab.com/meltano/sdk/-/merge_requests/137 being implemented in the SDK targets. What we did on the SDK side was to change the logic to not drain unless either a minimum record count was met or the age of the records passed some time threshold. (We started with 30 minutes, but even 4 or 5 minutes would give a huge benefit. You are probably seeing flushes multiple times per second.)
Unfortunately, nothing immediately comes to mind in terms of a mitigation here. The spec does not mandate that targets flush per STATE message, but it's reasonable to do so in most cases, and it's how we started off with the SDK also.
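The count/age drain logic described above could be sketched roughly like this. This is an assumption of how that logic works, with made-up names and thresholds, not the SDK's actual implementation; see the linked merge request for the real change.

```python
# Rough sketch of count/age-based draining -- illustrative only; the
# class name, constants, and thresholds here are made up.
import time

MIN_BATCH_SIZE = 10_000        # drain once this many records are buffered
MAX_BATCH_AGE_SECONDS = 300    # ...or once the oldest record is this old

class StreamBuffer:
    def __init__(self):
        self.records = []
        self.oldest_ts = None  # monotonic timestamp of first buffered record

    def add(self, record):
        if not self.records:
            self.oldest_ts = time.monotonic()
        self.records.append(record)

    def should_drain(self):
        # Instead of flushing on every STATE message, drain only when the
        # batch is big enough or old enough.
        if len(self.records) >= MIN_BATCH_SIZE:
            return True
        if self.oldest_ts is not None:
            return time.monotonic() - self.oldest_ts >= MAX_BATCH_AGE_SECONDS
        return False
```

With thresholds like these, a stream of thousands of single-record contexts accumulates into a handful of large loads instead of thousands of one-row inserts.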
What's the tap out of curiosity?
d
its a private one for the Lessonly API
happy to share code snippets though. its very straightforward
@aaronsteers here is a snippet of what the logs look like when syncing the child streams:
```
time=2021-09-16 12:31:58 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_assignments' with context: {'id': 412829}...
time=2021-09-16 12:31:58 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.10486, 'tags': {'endpoint': '/lessons/{id}/assignments', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412829}}}
time=2021-09-16 12:31:58 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 0, 'tags': {'stream': 'lesson_assignments', 'context': {'id': 412829}}}
time=2021-09-16 12:32:03 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_details' with context: {'id': 412830}...
time=2021-09-16 12:32:03 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.111176, 'tags': {'endpoint': '/lessons/{id}', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412830}}}
time=2021-09-16 12:32:03 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 1, 'tags': {'stream': 'lesson_details', 'context': {'id': 412830}}}
time=2021-09-16 12:32:04 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_assignments' with context: {'id': 412830}...
time=2021-09-16 12:32:04 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.109025, 'tags': {'endpoint': '/lessons/{id}/assignments', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412830}}}
time=2021-09-16 12:32:04 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 0, 'tags': {'stream': 'lesson_assignments', 'context': {'id': 412830}}}
time=2021-09-16 12:32:08 name=tap-lessonly level=INFO message=Beginning full_table sync of 'lesson_details' with context: {'id': 412753}...
time=2021-09-16 12:32:08 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'timer', 'metric': 'http_request_duration', 'value': 0.128393, 'tags': {'endpoint': '/lessons/{id}', 'http_status_code': 200, 'status': 'succeeded', 'context': {'id': 412753}}}
time=2021-09-16 12:32:10 name=tap-lessonly level=INFO message=INFO METRIC: {'type': 'counter', 'metric': 'record_count', 'value': 1, 'tags': {'stream': 'lesson_details', 'context': {'id': 412753}}}
```
a
Yeah, that makes sense. This is common with parent-child relationships when the number of children is low and/or the number of parents is relatively very large. I think you make a strong case for us to move our meltano fork of snowflake over to the SDK. 🙂 I don't know when we'll have a chance to do so though. You might check other snowflake variants to see if any other handle this case in a more performant manner.
d
okay so it does seem like this is actually related to the snowflake target then, yeah?
and yeah for this API parent:child ratio is 1:1 unfortunately
a
It's both - high number of transitions between parent-child and a target that just isn't optimized (yet) for that usage pattern.
d
got it okay
a
I really can't say if Datamill or Pipelinewise variants handle it better.
d
I did a tiny bit of digging and it didn't look like they did, and we've also had rough experiences with those in the past
so I think we’ll stick with meltano variant
is there an open issue that I can thumbs up for moving the fork over to the SDK?
a
Okay. Out of curiosity, are we talking minutes, hours, tens of hours?
d
hours
it's not only single `INSERT` statements, it seems to be an additional 5-10 queries surrounding each one as well
a
Yeah, that's painful. Our github tap went from timing out after 3 hours to now finishing in about 15 minutes.
d
it adds up when you're syncing tens of thousands of records
a
Fewer flushes is certainly better in these cases. And although I don't know when we'll get to prioritize it, do you think you could open an issue in our target-snowflake with a quick description of this issue?
d
yeah
a
The best solution would be to port the target to the SDK but perhaps there's a short-term solution that's easier/quicker.
d
I was just gonna ask that - the SDK fixes this?
I may take a stab at that if I have free time
a
That would be amazing. And yes, the SDK fixes this.
Because the SDK will only drain records if the min record count is met or a minimum elapsed time is passed.
Completely eliminates the "drain every record and/or multiple times per second" problem
d
perfect
okay I will admit I have very little free time 😅 but maybe I'll see if I can string something together
thanks for the help here 👍
a
No prob. It's also possible you could adapt the MR I linked above in this thread, without doing a full port.