Hi. I'm new to data pipeline concepts, so these questions are probably a piece of cake for most of you. Any help would be much appreciated.
As shown in the attachment, I am building an EL pipeline that extracts data with tap-amazon-sp and loads it into target-clickhouse. I've done some configuration and the data flow seems OK now (after countless daily tests, actually :)). What I did was execute the "... run tap-amazon-sp target-clickhouse" command, wait 3-4 hours, and then run it again. When I check my local ClickHouse, the orders-related tables (orders, orderitems, orderaddress, etc.) look like they worked as expected thanks to incremental replication. However, the product-related tables (product_details, etc.) seem to have duplicated all their rows: for example, they had 5000 rows after the first run and 10000 after the second. I don't think that's expected. Am I missing something in the tap-amazon-sp configuration (I checked tap.properties.json; these tables have no replication keys, and their replication method is full-table after each task run), or should target-clickhouse handle this by dropping the previous table and creating a new one, via a single configuration setting?
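From the docs, Meltano-SDK-based targets seem to expose a load_method setting (I haven't confirmed that my target-clickhouse variant supports it; I assume `meltano config target-clickhouse list` would show it). Something like this is what I had in mind, though I suspect setting it globally would also affect the incremental tables, which leads to my second question:

```yaml
# meltano.yml (sketch; assumes my target-clickhouse variant exposes
# the Meltano SDK's load_method setting, which I haven't confirmed)
plugins:
  loaders:
    - name: target-clickhouse
      config:
        # "overwrite" should replace the table contents instead of
        # appending, but applied globally like this it would also
        # hit the incrementally replicated tables
        load_method: overwrite
```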
I also wonder how I can configure tap-amazon-sp to run as separately scheduled EL pipelines. The orders-related tables change frequently, so in my case they need to run at least daily. But the product-related tables' rows don't change often, so the EL task for those tables doesn't need to run daily; weekly or even monthly would be enough. What is the best practice for configuring these Meltano tasks to run separately? I don't know whether I should do this with different projects based on data needs. In addition, the solution should not break the pipelines that already work properly: for example, if I set the target-clickhouse load method to "overwrite", it would delete all existing rows for the incremental pipelines, which it shouldn't. In short, I need a best-practice approach to make all of this happen, ideally in a single Meltano project. The sketch below shows the kind of layout I was imagining.
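After reading the docs on plugin inheritance, jobs, and schedules, I think something like the following might keep everything in one project, but I'm not sure. The stream names in the select rules are guesses based on my table names, and the load_method override on the inherited loader is again an assumption about my target variant:

```yaml
# meltano.yml (sketch; stream names in the select rules are guesses
# based on my tables, and load_method support is an assumption)
plugins:
  extractors:
    - name: tap-amazon-sp--orders
      inherit_from: tap-amazon-sp
      select:
        - orders.*
        - orderitems.*
        - orderaddress.*
    - name: tap-amazon-sp--products
      inherit_from: tap-amazon-sp
      select:
        - product_details.*
  loaders:
    # inherited copy so "overwrite" applies only to the products job,
    # leaving the incremental orders pipeline untouched
    - name: target-clickhouse--overwrite
      inherit_from: target-clickhouse
      config:
        load_method: overwrite

jobs:
  - name: orders-el
    tasks:
      - tap-amazon-sp--orders target-clickhouse
  - name: products-el
    tasks:
      - tap-amazon-sp--products target-clickhouse--overwrite

schedules:
  - name: orders-daily
    interval: "@daily"
    job: orders-el
  - name: products-weekly
    interval: "0 4 * * 1"   # Mondays at 04:00
    job: products-el
```

Would this kind of inheritance-based split be the right way to keep everything in a single project, or is there a better pattern?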