# getting-started
m
Hello everyone, I would like to know if anyone else has faced a similar situation. We have a pipeline that extracts data from a MySQL database using `tap-mysql`. As most of these tables do not have a way to identify changes, we decided to use the FULL_TABLE replication method. The data is loaded into Redshift using `target-redshift`.

The problem is that, since we are using full table extraction, I expected the target to simply truncate the table and load everything. Instead it loads the data into a staging table (using COPY, reading from S3), then compares the new data with the existing table and runs upserts (updating records that already exist and inserting new ones). This is fine for small tables, but for large tables it becomes a problem: we have cases where the amount of data that was updated or created is small compared to the total number of records, yet the process still tries to update everything every day. Updates are expensive operations we should avoid if possible. Do you guys have a solution for this situation?
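For context, a `meltano.yml` along these lines would produce the behaviour described above. This is only a sketch: the stream name `mydb-orders` and the `transferwise` (pipelinewise) variant are placeholders/assumptions.

```yaml
plugins:
  extractors:
    - name: tap-mysql
      variant: transferwise            # assumption: the pipelinewise variant discussed below
      metadata:
        mydb-orders:                   # placeholder stream name (database-table)
          replication-method: FULL_TABLE
  loaders:
    - name: target-redshift
```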
t
I think the meltano variant of tap-mysql might include a solution to this (namely, ACTIVATE_VERSION messages), but if you're using the pipelinewise variant of tap-mysql you'll have to set up something to truncate the tables in the destination yourself before running meltano. 😕
That aside, I assume using LOG_BASED replication isn't an option for you? That would solve this too.
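If the binlog route is open, switching a stream over is just a metadata change in `meltano.yml`. A rough sketch, again with a placeholder stream name (MySQL binary logging in ROW format has to be enabled on the server side):

```yaml
plugins:
  extractors:
    - name: tap-mysql
      metadata:
        mydb-orders:                     # placeholder stream name
          replication-method: LOG_BASED  # read changes from the MySQL binlog instead of re-scanning the table
```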
p
Yeah, I think this is a common request, and there are several features being discussed to try to address it using the SDK, which this variant isn't using, so that's not very helpful unfortunately 😢. More practically, there's a setting in the target that allows you to use inserts instead of updates by not requiring a primary key: https://hub.meltano.com/loaders/target-redshift#primary_key_required-setting. If none of your streams have PKs, then the full table of data will be inserted, avoiding your update bottleneck, and then you'd probably want a transformation step afterwards to only select the newest set of data or delete the old set.
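A rough sketch of that approach in `meltano.yml`, assuming the variant honours a `table-key-properties` override to strip the PKs from the streams (the stream name is again a placeholder):

```yaml
plugins:
  extractors:
    - name: tap-mysql
      metadata:
        mydb-orders:
          table-key-properties: []     # assumption: clear the key properties so the stream arrives PK-less
  loaders:
    - name: target-redshift
      config:
        primary_key_required: false    # allow streams without a primary key to be loaded
```

With inserts only, each run appends a fresh full copy of the table, so the downstream step that keeps only the latest load (or deletes the previous one) is still needed, as mentioned above.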