Hi @millions-toddler-72102. This is an area I’ve also been focusing on for a little while. As of today, many of the “big data” targets like Redshift and Snowflake already land their data in S3 prior to ingesting it into the target DB. So for those cases, an emerging pattern is to simply retain those files in S3 after the load is complete. This essentially builds out the data lake while also populating the target DB.
However, this is not (at least not yet) part of the Postgres target, and there are challenges you’d run into if you landed your data in S3 and then ingested it again into the downstream target. First, if you use CSV as the file type in S3, you’d lose the ability to confidently detect data types after landing the CSV files. A good solution to that challenge would be to land the S3 data in a type-aware format such as Parquet, but to my knowledge we don’t yet have a stable S3-parquet target.
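Just to illustrate the type issue, here’s a quick sketch (assuming pandas + pyarrow; nothing target-specific, and the column names are made up):

```python
import pandas as pd

# Toy data with an int, a float, and a timestamp column.
df = pd.DataFrame({
    "id": [1, 2],
    "amount": [9.99, 12.50],
    "created_at": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})

# CSV round trip: types have to be re-inferred from text, and the timestamp
# comes back as a plain string (object dtype) unless you re-parse it.
df.to_csv("orders.csv", index=False)
print(pd.read_csv("orders.csv").dtypes)

# Parquet round trip: the schema (int64, float64, timestamp) is stored in
# the file itself, so nothing downstream has to guess.
df.to_parquet("orders.parquet", index=False)
print(pd.read_parquet("orders.parquet").dtypes)
```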
Another option is that you could manually or programmatically create a catalog file for the S3 CSV files which matches your upstream data types. This should be possible, but in practice I haven’t seen it done before, and it would likely take some trial and error with the catalog JSON. Depending on how many sources you have, I’m not sure how well this would scale.
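For example, something along these lines could generate a catalog from the upstream schema (rough sketch only — the stream and column names are made up, and the exact catalog shape your S3 CSV tap expects may differ a bit):

```python
import json

# Hypothetical sketch: build a Singer catalog entry whose schema mirrors the
# upstream column types, so the S3 CSV tap doesn't have to guess types from
# the CSV text. All stream/column names below are placeholders.
upstream_columns = {
    "id": {"type": ["integer"]},
    "email": {"type": ["null", "string"]},
    "created_at": {"type": ["null", "string"], "format": "date-time"},
    "amount": {"type": ["null", "number"]},
}

catalog = {
    "streams": [
        {
            "tap_stream_id": "orders",
            "stream": "orders",
            "schema": {"type": "object", "properties": upstream_columns},
            "metadata": [
                {
                    "breadcrumb": [],
                    "metadata": {
                        "selected": True,
                        "table-key-properties": ["id"],
                    },
                }
            ],
        }
    ]
}

with open("catalog.json", "w") as f:
    json.dump(catalog, f, indent=2)
```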
Can you confirm whether Postgres is your intended target? If so, I think one attractive option might be to fork an existing repo to work like the Redshift/Snowflake targets: land the data in S3 and then ingest it with the Postgres extension function `aws_s3.table_import_from_s3()`, documented here. (It’s also possible someone is already working on this, or has already built it under their own target-postgres fork which I’m not yet aware of.)
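For reference, the load step could look roughly like this once the target table exists (sketch only: the connection details, table, bucket, and key names are all placeholders, and it assumes RDS/Aurora Postgres with the aws_s3 extension already installed via `CREATE EXTENSION aws_s3 CASCADE`):

```python
import psycopg2

# Rough sketch: import a CSV the pipeline landed in S3 into an existing
# Postgres table using the RDS/Aurora aws_s3 extension.
conn = psycopg2.connect("dbname=analytics user=loader host=my-rds-host")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT aws_s3.table_import_from_s3(
            'public.orders',                 -- target table (must exist)
            '',                              -- empty column list = all columns
            '(format csv, header true)',     -- COPY-style options
            aws_commons.create_s3_uri(
                'my-data-lake-bucket',       -- S3 bucket
                'raw/orders/2021-01-01.csv', -- object key
                'us-east-1'                  -- region
            )
        );
        """
    )
conn.close()
```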
Does this help at all?