# singer-target-development
brandon_isom
How am I supposed to determine in an SDK target whether a stream is `FULL_TABLE` or `INCREMENTAL` replication? And if it's `INCREMENTAL`, how do I determine if a `--full-refresh` flag has been passed in to `meltano elt`?
douwe_maan
Targets don't have a way to know that; `--full-refresh` and the `replication-method` setting only affect the tap's behavior. The tap does not communicate to the target whether it's doing a full refresh or an incremental sync. This is Singer behavior that we could change if the need is high enough. What are you trying to accomplish that would require the target to know this?
brandon_isom
When pulling a tap incrementally, I would like to replace the target table on a full refresh, vs. append to it.
I figured I could "infer" whether or not a table is incremental from the presence of a `replication_key` in the schema, though maybe that's out of spec.
Sorry, in the Singer `SCHEMA` message.
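(As an illustration of that heuristic, a minimal sketch that peeks at incoming `SCHEMA` messages and guesses the replication mode from `bookmark_properties`, which some, but not all, taps populate with the replication key; this is an assumption and a best-effort guess, not something the Singer spec guarantees.)

```python
# Illustrative heuristic only: guess a stream's replication mode from the
# optional "bookmark_properties" field of incoming SCHEMA messages.
import json
import sys

for line in sys.stdin:
    message = json.loads(line)
    if message.get("type") != "SCHEMA":
        continue
    stream = message["stream"]
    bookmark_properties = message.get("bookmark_properties") or []
    # If a bookmark/replication key is advertised, the stream is *probably*
    # incremental; its absence does not prove it's FULL_TABLE.
    mode = "INCREMENTAL (guessed)" if bookmark_properties else "FULL_TABLE (guessed)"
    print(f"{stream}: {mode}", file=sys.stderr)
```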
douwe_maan
Truncating the target table before appending new rows is what the `ACTIVATE_VERSION` message was invented for: https://gitlab.com/meltano/meltano/-/issues/2508
Some but not all targets implement support for it, and some but not all taps output it when running in Full Table mode.
The SDK doesn't currently support it, but it's on the roadmap: https://gitlab.com/meltano/sdk/-/issues/18
Maybe this is an opportunity for us to work together to add this to the tap and target SDKs, since you have a need for this now? cc @aaronsteers
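(For illustration, a minimal sketch of the `ACTIVATE_VERSION` flow as implemented by targets such as target-postgres; the exact fields are an assumption based on those implementations, since the message was never formally written into the spec.)

```python
# Sketch of the ACTIVATE_VERSION convention used by existing taps/targets;
# field usage can vary, so treat this as illustrative rather than normative.
import json
import time

version = int(time.time() * 1000)  # taps commonly use a timestamp as the version

messages = [
    # Announce the new version at the start of a full-table sync.
    {"type": "ACTIVATE_VERSION", "stream": "users", "version": version},
    # Records are tagged with the same version number.
    {"type": "RECORD", "stream": "users", "version": version,
     "record": {"id": 1, "name": "Ada"}},
    # The trailing ACTIVATE_VERSION tells the target the new version is complete,
    # so it can drop rows from older versions (truncate-and-replace semantics).
    {"type": "ACTIVATE_VERSION", "stream": "users", "version": version},
]

for message in messages:
    print(json.dumps(message))
```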
aaronsteers
@brandon_isom - Fully agree with what @douwe_maan suggested above. The `ACTIVATE_VERSION` extension to the spec is specifically for this purpose. You can actually find the placeholder in the SDK today where we would handle that processing.
Probably would want to add a generic interface like `Sink.remove_old_records(prior_to_version_num)`, and Sink implementations could optionally add logic to remove records prior to that specified version.
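(A rough, purely hypothetical sketch of the interface proposed above; `remove_old_records` is not part of the `singer_sdk` API today, and the class and method names here are made up for illustration.)

```python
# Hypothetical sketch only: nothing below exists in singer_sdk today.
from abc import ABC, abstractmethod


class VersionedSink(ABC):
    """Interface a target Sink could implement to support ACTIVATE_VERSION."""

    @abstractmethod
    def remove_old_records(self, prior_to_version_num: int) -> None:
        """Delete records written under a version older than the given one."""


class ParquetS3Sink(VersionedSink):
    def remove_old_records(self, prior_to_version_num: int) -> None:
        # e.g. drop S3 objects / Glue partitions tagged with an older version.
        print(f"Would remove records with version < {prior_to_version_num}")


ParquetS3Sink().remove_old_records(prior_to_version_num=1622505600000)
```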
brandon_isom
👀
Is there a draft of the spec for that message floating around anywhere?
So, a bit of context that may help: I'm putting together a target that will write partitioned (or sometimes not) parquet out to S3, with Glue catalog tables/partitions on top of it, so they're queryable via Spectrum. For `FULL_TABLE` streams, we'd want to overwrite. For `INCREMENTAL` streams, we'd want to append, normally, but we'd like to support `--full-refresh` functionality without having to do any out-of-band janitoring. In some cases in the past, we've also run into issues where some source system switched from integer ids to UUIDs, which ends up breaking Spectrum queries, as the table now expects a string but the older partitions are integers. So, gracefully rewriting partitions on a schema change like that would be a nice-to-have.
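(For illustration, a minimal sketch of how overwrite-vs-append could map onto a parquet/Glue target using the `awswrangler` library; the bucket, database, table, and partition column names are placeholders, and this is not how any existing target is implemented.)

```python
# Sketch: write a parquet dataset to S3 and register it in the Glue catalog,
# choosing overwrite (full table / full refresh) vs. append (incremental).
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "updated_at": ["2021-06-01", "2021-06-02"]})

is_full_refresh = True  # the signal the target can't currently get from the tap

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/users/",  # placeholder bucket/prefix
    dataset=True,
    database="my_glue_db",         # placeholder Glue database
    table="users",                 # placeholder Glue table
    partition_cols=["updated_at"],
    mode="overwrite" if is_full_refresh else "append",
)
```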
douwe_maan
I don't know if there's been an attempt to write it into the spec. The closest thing is https://github.com/singer-io/getting-started/issues/71, and https://gitlab.com/meltano/meltano/-/issues/2508 links to some taps/targets that implement it, which can serve as reference implementations.
@aaronsteers Thoughts on the use case?
aaronsteers
> So, a bit of context that may help: I'm putting together a target that will write partitioned (or sometimes not) parquet out to S3, with Glue catalog tables/partitions on top of it, so they're queryable via Spectrum.
First, just as a quick aside, @brandon_isom, have you seen our new/WIP target-athena as discussed in #C01ASPH8GSX? While the name has "Athena", technically we also register the tables in the Glue catalog, and it could be used with Spark and other Glue-compatible services. We're adding Parquet support and partitioning there as well. Would love to pool resources with you if that's something you wanted to work together on. (cc @andrew_stewart)
re:
> In some cases in the past, we've also run into issues where some source system switched from integer ids to UUIDs, which ends up breaking Spectrum queries, as the table now expects a string but the older partitions are integers. So, gracefully rewriting partitions on a schema change like that would be a nice-to-have.
The pattern I've seen work well in these cases is: when a column type is modified to an incompatible type (such as uuid->int or str->date), rename the old column with a suffix and create a new column with the original name and new data type.
These are generally operations to perform on a database layer, but with file-based solutions like parquet you'd need to either (1) modify and republish all prior files/partitions, or (2) add an abstraction layer on top, like a view, that can smartly use the best from each file/partition without physically rewriting past files.
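(A small illustrative sketch of that rename-with-suffix pattern; column names and type strings are made up, and a real target would apply this to the Glue/catalog schema rather than a plain dict.)

```python
# Sketch: when an incoming column type is incompatible with the existing one,
# keep the old data under a suffixed column and add a fresh column with the
# original name and the new type.
existing_columns = {"id": "bigint", "name": "string"}
incoming_columns = {"id": "string", "name": "string"}  # id switched int -> uuid

COMPATIBLE = {("bigint", "bigint"), ("string", "string"), ("bigint", "double")}

target_columns = dict(existing_columns)
for column, new_type in incoming_columns.items():
    old_type = existing_columns.get(column)
    if old_type and (old_type, new_type) not in COMPATIBLE:
        target_columns[f"{column}__{old_type}"] = old_type  # preserve old data
        target_columns[column] = new_type                   # new column, new type

print(target_columns)
# {'id': 'string', 'name': 'string', 'id__bigint': 'bigint'}
```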
> For `FULL_TABLE` streams, we'd want to overwrite. For `INCREMENTAL` streams, we'd want to append, normally, but we'd like to support `--full-refresh` functionality without having to do any out-of-band janitoring.
I think this ties back to the `ACTIVATE_VERSION` message type we discussed above - OR some kind of override of the target table name, so your automated processes can just delete the older version.
I should call out that not all taps are going to send the `ACTIVATE_VERSION` message, so you'd want to check the specific taps in question before investing too much time in the target's implementation.
Also - I logged this to explore whether Redshift Spectrum can be supported natively with the `target-athena` implementation, or if additional effort would be needed.
brandon_isom
> have you seen our new/WIP target-athena as discussed in #C01ASPH8GSX?
Yeah, I've taken a peek. We've got some processes already that we're basically wrapping with this target, and I don't think we want to add a dependency on Athena at the moment.