# singer-tap-development
h
Hey everyone! I'm finding that my pipelines that involve parent-child streams loading to a DB are incredibly slow (2+ hours for a couple thousand entities per day), and I believe this is because each entity and its children are being written to the database one by one rather than being batched. Anyone know where I should start to address this? I'm seeing this with both target-redshift and target-duckdb (though DuckDB is not as slow as Redshift).
e
Hi @hawkar_mahmod! Are you using https://github.com/TicketSwap/target-redshift?
(and this may be #1025 rearing its head)
h
@Edgar Ramírez (Arch.dev) I am using our own fork of the pipelinewise target-redshift - https://github.com/hrm13/pipelinewise-target-redshift/tree/sso-credential-provider-support
I didn't realise it could be down to the SCHEMA messages. Do you know if the SDK-built target would perform better? I had a cursory look and it does similar batching. So perhaps it's a tap issue?
e
> So perhaps it's a tap issue?
It might be, since I see your target flushes streams whenever a new SCHEMA message is received. There's a comment in that issue:
> I believe we dealt with this recently by deduping SCHEMA messages. Will try and find the exact PR.
but I don't think that PR ever got merged. The target could also be updated to check whether the schema has actually changed, i.e. by storing a mapping of known stream -> schema and adding that as another condition here: https://github.com/hrm13/pipelinewise-target-redshift/blob/9e141f1d33df784c70593c652909e6b1273aa03c/target_redshift/__init__.py#L209-L213
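A minimal sketch of that dedup idea (this is not the actual pipelinewise-target-redshift code; the function names `should_flush_on_schema` and `count_schema_flushes` are hypothetical, assuming Singer messages have already been parsed into dicts):

```python
def should_flush_on_schema(known_schemas, stream, schema):
    """Return True only if this SCHEMA message actually changes the
    stream's schema. known_schemas maps stream name -> last-seen schema.
    Duplicate SCHEMA messages (same schema, same stream) do not flush."""
    if known_schemas.get(stream) == schema:
        return False  # schema unchanged: skip the flush
    known_schemas[stream] = schema
    return True


def count_schema_flushes(messages):
    """Count how many SCHEMA messages in a parsed Singer message
    stream would trigger a flush under the dedup rule above."""
    known = {}
    flushes = 0
    for msg in messages:
        if msg.get("type") == "SCHEMA":
            if should_flush_on_schema(known, msg["stream"], msg["schema"]):
                flushes += 1
    return flushes
```

With a parent-child tap that re-emits the same SCHEMA before every child batch, this turns N flushes into one per genuine schema change, which is where the batching should come back.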
h
OK, I guess the easiest thing to do is switch to the SDK-based target and see if that helps.
Will report back
e
Cool, thanks!