# getting-started
gary_lucas
Hi Everyone, had a question regarding the Meltano SDK and schemas. The source I'm trying to ingest from ships data columnwise, and the number of columns will change over time (each column is named for a month, `mm/yyyy`, so the header looks like `id`, `02/2020`, and so on). As we run this in the days and months ahead, seeing more columns is expected, and I'm wondering if we can generate the schema in sequence with hitting the source API, i.e.: 1. Call the API, get a response. 2. From the response, generate the schema. 3. Yield rows that have been post-processed. The API we're calling claims to be RESTful; it's absolutely not, so I'm using a custom client for this. Right now it's working fine, but with a hardcoded schema, i.e.:
from singer_sdk import typing as th  # JSON Schema helpers from the Singer SDK


class VersionsStream(StreamyStream):  # StreamyStream: the tap's custom client base class
    """Define custom stream."""

    name = "Versions"
    primary_keys = [
        "Field_a",
        "Field_b",
        "Field_c",
        "Field_d",
        "Field_e",
    ]
    replication_key = None

    # Hardcoded schema: the fixed string fields plus one numeric property per month column.
    schema = th.PropertiesList(
        th.Property("Field_a", th.StringType),
        th.Property("Field_b", th.StringType),
        th.Property("Field_c", th.StringType),
        th.Property("Field_d", th.StringType),
        th.Property("Field_e", th.StringType),
        th.Property("02/2014", th.NumberType),
        th.Property("03/2014", th.NumberType),
        th.Property("04/2014", th.NumberType),
        # ... etc.
    ).to_dict()
Thanks a bunch!
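For step 3 above (yielding rows that have been post-processed), the SDK's `Stream.post_process` hook is one natural place for the per-row cleanup. A minimal sketch, reusing the class from the snippet above and assuming that month values arrive as strings and that every column outside the fixed `Field_*` set is a numeric `mm/yyyy` column:

from typing import Optional


class VersionsStream(StreamyStream):
    """Same stream as above, with per-row post-processing added."""

    def post_process(self, row: dict, context: Optional[dict] = None) -> Optional[dict]:
        """Coerce month columns (e.g. "02/2014") to numbers before the row is emitted."""
        for key, value in row.items():
            # Assumption: every non-"Field_*" column is a mm/yyyy measure column.
            if not key.startswith("Field_") and value is not None:
                row[key] = float(value)
        return row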
aaronsteers
Hi, @gary_lucas - and welcome! Yes, you can absolutely create this schema on the fly. Where you currently have `schema` defined as a static attribute, you can instead provide a dynamic `@property` method, as in the sample here: Code Samples — Meltano SDK 0.2.0 documentation. A couple of things you'd have to tackle, though:
1. We don't yet have an easy entrypoint for sending supplemental requests using the same auth/request rails. That's certainly doable, but we don't have existing patterns or docs for that approach. (Tracked here: Streamline complementary REST requests (#93).) The workaround would just be to call `requests` directly, optionally piggy-backing on `Stream.http_headers` and/or `Stream.authenticator.auth_headers`.
2. The fully adaptive schema use case isn't yet supported - wherein any change to the stream schema could happen anywhere during the stream. Essentially, you'd need to dynamically update `Stream.schema` and make extra calls to `Stream._write_schema_message()` - but that is probably overkill for the use case you describe. The unfortunate downsides of implementing a fully adaptive schema are that (1) the `--discover` output, which documents the best/latest known schema, would presumably not know about the adaptive changes, and (2) downstream targets don't always deal gracefully with a schema that is modified mid-stream (although the spec generally has no problem with it).
For either approach - streamlining complementary REST requests or adding adaptive schema capabilities - we'd welcome contributions. The first is much easier, and may even be doable today. Is this helpful at all?
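To make the `@property` suggestion and the `requests` workaround above concrete, here is a rough sketch. The `/versions/columns` endpoint, the column-name pattern, and the assumption that the base class exposes `url_base`, `http_headers`, and `authenticator` are all illustrative, not part of the real tap:

import re
from typing import List

import requests
from singer_sdk import typing as th


class VersionsStream(StreamyStream):
    """Versions stream with a schema built from a live API response."""

    name = "Versions"
    primary_keys = ["Field_a", "Field_b", "Field_c", "Field_d", "Field_e"]
    replication_key = None

    @property
    def schema(self) -> dict:
        """Dynamically build the JSON schema from the columns the API currently returns."""
        # Workaround from point 1 above: call `requests` directly, piggy-backing
        # on the stream's `http_headers` and `authenticator.auth_headers`.
        # `columns_url` is a hypothetical endpoint that returns the column names.
        columns_url = f"{self.url_base}/versions/columns"
        headers = {**self.http_headers, **self.authenticator.auth_headers}
        response = requests.get(columns_url, headers=headers)
        response.raise_for_status()

        properties: List[th.Property] = [
            th.Property("Field_a", th.StringType),
            th.Property("Field_b", th.StringType),
            th.Property("Field_c", th.StringType),
            th.Property("Field_d", th.StringType),
            th.Property("Field_e", th.StringType),
        ]
        for column in response.json():
            # Assumption: month columns are named mm/yyyy and hold numeric values.
            if re.fullmatch(r"\d{2}/\d{4}", column):
                properties.append(th.Property(column, th.NumberType))
        return th.PropertiesList(*properties).to_dict()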
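And purely to illustrate the adaptive path in point 2 (which, as noted, is probably overkill here and carries the downsides described above), a sketch of updating the schema and re-emitting it mid-stream. It builds on the dynamic-schema sketch just above; the caching property and the numeric-column assumption are illustrative, while `_write_schema_message()` is the SDK-internal call named in the reply:

from typing import Optional


class AdaptiveVersionsStream(VersionsStream):
    """Illustration only: re-emit the schema when an unseen column shows up mid-stream."""

    @property
    def schema(self) -> dict:
        # Cache the dynamically built schema so mid-stream additions persist.
        if not hasattr(self, "_cached_schema"):
            self._cached_schema = super().schema
        return self._cached_schema

    def post_process(self, row: dict, context: Optional[dict] = None) -> Optional[dict]:
        unseen = [key for key in row if key not in self.schema["properties"]]
        if unseen:
            for key in unseen:
                # Assumption: any unseen column is another numeric mm/yyyy column.
                self.schema["properties"][key] = {"type": ["number", "null"]}
            # Re-emit the SCHEMA message so downstream targets see the new columns.
            self._write_schema_message()
        return row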
gary_lucas
Hi @aaronsteers, thanks! It's helpful. My current thought is that we can manually stub the schema out to `12/2023`, and that should be fine for a while. I will create a ticket for our team to investigate dynamically creating the schema and try to get that scheduled in the next quarter or two. My theory is that by the time we do that work, we will have written several Meltano taps/targets and will just generally be more competent with the framework.
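If it's useful in the meantime, that manual stub through `12/2023` can also be generated with a small loop rather than typed out by hand; the 2014 start year and the numeric type are carried over from the earlier snippet, and the helper name is just for illustration:

from singer_sdk import typing as th


def month_properties(start_year: int = 2014, end_year: int = 2023):
    """Yield one th.Property per mm/yyyy column, from 01/<start_year> through 12/<end_year>."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield th.Property(f"{month:02d}/{year}", th.NumberType)


schema = th.PropertiesList(
    th.Property("Field_a", th.StringType),
    th.Property("Field_b", th.StringType),
    th.Property("Field_c", th.StringType),
    th.Property("Field_d", th.StringType),
    th.Property("Field_e", th.StringType),
    *month_properties(),  # stubs every month column through 12/2023
).to_dict()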