# troubleshooting
i
Has anyone set up LOG_BASED replication on top of an RDS Postgres replica? I'm stuck on setting wal_level = logical: the replica reports 'replica' instead, so I can't create a replication slot. Not sure what I'm doing wrong.
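For context, roughly what I'm seeing against the replica (endpoint, user, and slot name are placeholders):
```bash
# Check the WAL level on the read replica -- prints 'replica', not 'logical'
psql -h my-replica.xxxx.us-east-1.rds.amazonaws.com -U myuser -d mydb \
  -c "SHOW wal_level;"

# Trying to create a logical slot then fails with:
#   ERROR:  logical decoding requires wal_level >= logical
psql -h my-replica.xxxx.us-east-1.rds.amazonaws.com -U myuser -d mydb \
  -c "SELECT pg_create_logical_replication_slot('meltano_slot', 'pgoutput');"
```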
p
I have not done this with Postgres, but I think you might have to point to the master instance, since that is the one that has the log available, or make sure the replica also exposes the log somehow (not sure if that's possible)
i
AWS has allowed replicas of replicas since engine version 14.something, so it should be possible to do.
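E.g. something like this, if I'm reading the docs right (both identifiers made up):
```bash
# Create a read replica whose source is itself a read replica (cascading replica)
aws rds create-db-instance-read-replica \
  --db-instance-identifier my-replica-of-replica \
  --source-db-instance-identifier my-replica
```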
p
But I think it doesn't expose the log unless you actually create the replica of a replica; I might be wrong.
i
Yeah, you're right.
It's possible to do this on the primary without a replica being involved, so I was hoping it would be possible on the replica too.
I don't see a way to do this in AWS RDS. It looks like they support "replicas of replicas", but not configuring a replica so that it exposes the logs for other use.
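(On the primary, for reference, my understanding is the knob is the parameter group rather than wal_level directly; the group name here is made up:)
```bash
# On RDS you don't set wal_level yourself; setting rds.logical_replication = 1
# in the instance's parameter group flips wal_level to 'logical' after a
# reboot (it's a static parameter). This only works on the primary.
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-pg-params \
  --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"
```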
d
RDS will always require a master instance for CDC. We went through this same scenario last year with a Postgres instance on RDS; we ended up using Kafka with Debezium on our master instance to stream data instead of Meltano. You can also filter which logs to make available (i.e. which tables to send CDC logs from).
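Roughly what registering that connector looks like -- everything here (hosts, credentials, table names) is illustrative, and property names are per Debezium 2.x:
```bash
# Register a Debezium Postgres connector against the master instance via the
# Kafka Connect REST API. table.include.list is the per-table CDC filter.
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "rds-pg-cdc",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "my-primary.xxxx.us-east-1.rds.amazonaws.com",
      "database.port": "5432",
      "database.user": "cdc_user",
      "database.password": "********",
      "database.dbname": "mydb",
      "topic.prefix": "rds",
      "plugin.name": "pgoutput",
      "slot.name": "debezium_slot",
      "table.include.list": "public.orders,public.customers"
    }
  }'
```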
i
Nice, thanks. I'll look into this. How satisfied have you been? What made you choose Debezium/Kafka over Meltano? I'd like to keep our set of tools to a minimum, but CDC in itself is such a complicated thing that I can see it deserving a purpose-made tool of its own.
d
Hi, we haven't had any issues. We considered self-hosting but ended up going hosted via Confluent. It's pricier, but we've done some optimization to get that down, which would work in a self-hosted environment as well. We're considering transitioning to self-hosted (we self-host Meltano and are looking to self-host observability tools as well). We haven't had any issues for a full year.

We went with Kafka primarily for the multiple-consumer route, as we had S3/Snowflake and other destinations to push to. Additionally, you may see a lower backend cost if self-hosting (i.e. your hosting cost), but depending on your destination you may see elevated compute cost. For us, with the amount of data we'd be writing into Snowflake as an example, we needed a streaming-based connector that doesn't use a virtual warehouse for inserts. I assume all Snowflake connectors in Meltano need a virtual warehouse, which makes that a potentially pricier alternative, especially if batching and inserting happen simultaneously. We had that issue when we considered Stitch for CDC (the cost on the Stitch side wasn't bad, but the Snowflake cost was very high). Depending on your use case (i.e. your destination) this may not be as relevant.

Of course, we're connecting to a master instance, so disk utilization is important, but as long as you configure things correctly (e.g. implementing heartbeats) you can be confident in the streaming pipeline while keeping your master instance online.
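Roughly what the heartbeat part looks like on that same sketched connector (PUT replaces the full config, so the earlier properties are repeated; all values are illustrative, and it assumes a public.debezium_heartbeat table exists):
```bash
# The periodic heartbeat write keeps the replication slot's LSN advancing even
# when the watched tables are idle, so WAL can't pile up on the master. The
# heartbeat table is added to table.include.list so the connector consumes it.
curl -X PUT http://connect:8083/connectors/rds-pg-cdc/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "my-primary.xxxx.us-east-1.rds.amazonaws.com",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "mydb",
    "topic.prefix": "rds",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "table.include.list": "public.orders,public.customers,public.debezium_heartbeat",
    "heartbeat.interval.ms": "60000",
    "heartbeat.action.query": "INSERT INTO public.debezium_heartbeat (ts) VALUES (now())"
  }'
```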