# singer-target-development
brandon_isom
How am I supposed to determine in an SDK target whether a stream is `FULL_TABLE` or `INCREMENTAL` replication? And if it's `INCREMENTAL`, how do I determine if a `--full-refresh` flag has been passed in to `meltano elt`?
douwe_maan
Targets don't have a way to know that; `--full-refresh` and the `replication-method` setting only affect the tap's behavior. The tap does not communicate to the target whether it's doing a full refresh or an incremental sync. This is Singer behavior that we could change if the need is high enough. What are you trying to accomplish that would require the target to know this?
brandon_isom
When pulling a tap incrementally, I would like to replace the target table on a full refresh, vs. append to it.
I figured I could "infer" whether or not a table is incremental from the presence of a `replication_key` in the schema, though maybe that's out of spec.
Sorry, in the Singer `SCHEMA` message.
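(As an illustration of that heuristic, a minimal sketch that peeks at incoming `SCHEMA` messages and guesses the replication mode from `bookmark_properties`, which some, but not all, taps populate with the replication key; this is an assumption and a best-effort guess, not something the Singer spec guarantees.)

```python
# Illustrative heuristic only: guess a stream's replication mode from the
# optional "bookmark_properties" field of incoming SCHEMA messages.
import json
import sys

for line in sys.stdin:
    message = json.loads(line)
    if message.get("type") != "SCHEMA":
        continue
    stream = message["stream"]
    bookmark_properties = message.get("bookmark_properties") or []
    # If a bookmark/replication key is advertised, the stream is *probably*
    # incremental; its absence does not prove it's FULL_TABLE.
    mode = "INCREMENTAL (guessed)" if bookmark_properties else "FULL_TABLE (guessed)"
    print(f"{stream}: {mode}", file=sys.stderr)
```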
douwe_maan
Truncating the target table before appending new rows is what the `ACTIVATE_VERSION` message was invented for: https://gitlab.com/meltano/meltano/-/issues/2508
Some but not all targets implement support for it, and some but not all taps output it when running in Full Table mode.
The SDK doesn't currently support it, but it's on the roadmap: https://gitlab.com/meltano/sdk/-/issues/18
Maybe this is an opportunity for us to work together to add this to the tap and target SDKs, since you have a need for this now? cc @aaronsteers
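(For illustration, a minimal sketch of the `ACTIVATE_VERSION` flow as implemented by targets such as target-postgres; the exact fields are an assumption based on those implementations, since the message was never formally written into the spec.)

```python
# Sketch of the ACTIVATE_VERSION convention used by existing taps/targets;
# field usage can vary, so treat this as illustrative rather than normative.
import json
import time

version = int(time.time() * 1000)  # taps commonly use a timestamp as the version

messages = [
    # Announce the new version at the start of a full-table sync.
    {"type": "ACTIVATE_VERSION", "stream": "users", "version": version},
    # Records are tagged with the same version number.
    {"type": "RECORD", "stream": "users", "version": version,
     "record": {"id": 1, "name": "Ada"}},
    # The trailing ACTIVATE_VERSION tells the target the new version is complete,
    # so it can drop rows from older versions (truncate-and-replace semantics).
    {"type": "ACTIVATE_VERSION", "stream": "users", "version": version},
]

for message in messages:
    print(json.dumps(message))
```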
aaronsteers
@brandon_isom - Fully agree with what @douwe_maan suggested above. The `ACTIVATE_VERSION` extension to the spec is specifically for this purpose. You can actually find the placeholder in the SDK today where we would handle that processing.
Probably would want to add a generic interface like `Sink.remove_old_records(prior_to_version_num)`, and Sink implementations could optionally add logic to remove records prior to that specified version.
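(A rough, purely hypothetical sketch of the interface proposed above; `remove_old_records` is not part of the `singer_sdk` API today, and the class and method names here are made up for illustration.)

```python
# Hypothetical sketch only: nothing below exists in singer_sdk today.
from abc import ABC, abstractmethod


class VersionedSink(ABC):
    """Interface a target Sink could implement to support ACTIVATE_VERSION."""

    @abstractmethod
    def remove_old_records(self, prior_to_version_num: int) -> None:
        """Delete records written under a version older than the given one."""


class ParquetS3Sink(VersionedSink):
    def remove_old_records(self, prior_to_version_num: int) -> None:
        # e.g. drop S3 objects / Glue partitions tagged with an older version.
        print(f"Would remove records with version < {prior_to_version_num}")


ParquetS3Sink().remove_old_records(prior_to_version_num=1622505600000)
```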
brandon_isom
👀
Is there a draft of the spec for that message floating around anywhere?
So, a bit of context that may help: I'm putting together a target that will write partitioned (or sometimes not) parquet out to S3, with Glue catalog tables/partitions on top of it, so they're queryable via Spectrum. For `FULL_TABLE` streams, we'd want to overwrite. For `INCREMENTAL` streams, we'd want to append, normally, but we'd like to support `--full-refresh` functionality without having to do any out-of-band janitoring. In some cases in the past, we've also run into issues where some source system switched from integer ids to UUIDs, which ends up breaking Spectrum queries, as the table now expects a string but the older partitions are integers. So, gracefully rewriting partitions on a schema change like that would be a nice-to-have.
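(For illustration, a minimal sketch of how overwrite-vs-append could map onto a parquet/Glue target using the `awswrangler` library; the bucket, database, table, and partition column names are placeholders, and this is not how any existing target is implemented.)

```python
# Sketch: write a parquet dataset to S3 and register it in the Glue catalog,
# choosing overwrite (full table / full refresh) vs. append (incremental).
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "updated_at": ["2021-06-01", "2021-06-02"]})

is_full_refresh = True  # the signal the target can't currently get from the tap

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/users/",  # placeholder bucket/prefix
    dataset=True,
    database="my_glue_db",         # placeholder Glue database
    table="users",                 # placeholder Glue table
    partition_cols=["updated_at"],
    mode="overwrite" if is_full_refresh else "append",
)
```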
douwe_maan
I don't know if there's been an attempt to write it into the spec. The closest thing is https://github.com/singer-io/getting-started/issues/71, and https://gitlab.com/meltano/meltano/-/issues/2508 links to some taps/targets that implement it, which can serve as reference implementations.
@aaronsteers Thoughts on the use case?
aaronsteers
> So, a bit of context that may help: I'm putting together a target that will write partitioned (or sometimes not) parquet out to S3, with Glue catalog tables/partitions on top of it, so they're queryable via Spectrum.
First, just as a quick aside, @brandon_isom, have you seen our new/WIP target-athena as discussed in #C01ASPH8GSX? While the name has "Athena", technically we also register the tables in the Glue catalog, and it could be used with Spark and other Glue-compatible services. We're adding Parquet support and partitioning there as well. Would love to pool resources with you if that's something you wanted to work together on. (cc @andrew_stewart)
re:
> In some cases in the past, we've also run into issues where some source system switched from integer ids to UUIDs, which ends up breaking Spectrum queries, as the table now expects a string but the older partitions are integers. So, gracefully rewriting partitions on a schema change like that would be a nice-to-have.
The pattern I've seen work well in these cases is: when a column type is modified to an incompatible type (such as uuid->int or str->date), rename the old column with a suffix and create a new column with the original name and new data type.
These are generally operations to perform on a database layer, but with file-based solutions like parquet you'd need to either (1) modify and republish all prior files/partitions, or (2) add an abstraction layer on top, like a view, that can smartly use the best from each file/partition without physically rewriting past files.
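(A small illustrative sketch of that rename-with-suffix pattern; column names and type strings are made up, and a real target would apply this to the Glue/catalog schema rather than a plain dict.)

```python
# Sketch: when an incoming column type is incompatible with the existing one,
# keep the old data under a suffixed column and add a fresh column with the
# original name and the new type.
existing_columns = {"id": "bigint", "name": "string"}
incoming_columns = {"id": "string", "name": "string"}  # id switched int -> uuid

COMPATIBLE = {("bigint", "bigint"), ("string", "string"), ("bigint", "double")}

target_columns = dict(existing_columns)
for column, new_type in incoming_columns.items():
    old_type = existing_columns.get(column)
    if old_type and (old_type, new_type) not in COMPATIBLE:
        target_columns[f"{column}__{old_type}"] = old_type  # preserve old data
        target_columns[column] = new_type                   # new column, new type

print(target_columns)
# {'id': 'string', 'name': 'string', 'id__bigint': 'bigint'}
```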
> For `FULL_TABLE` streams, we'd want to overwrite. For `INCREMENTAL` streams, we'd want to append, normally, but we'd like to support `--full-refresh` functionality without having to do any out-of-band janitoring.
I think this ties back to the `ACTIVATE_VERSION` message type we discussed above - OR some kind of override of the target table name, so your automated processes can just delete the older version.
I should call out that not all taps are going to send the `ACTIVATE_VERSION` message, so you'd want to check the specific taps in question before investing too much time in the target's implementation.
Also - I logged this to explore whether Redshift Spectrum can be supported natively with the `target-athena` implementation, or if additional effort would be needed.
brandon_isom
> have you seen our new/WIP target-athena as discussed in #C01ASPH8GSX?
Yeah, I've taken a peek. We've got some processes already that we're basically wrapping with this target, and I don't think we want to add a dependency on Athena at the moment.