Hey all. Getting confused here. I have a mariadb s...
# plugins-general
p
Hey all. Getting confused here. I have a mariadb source database that I'm extracting data from and pushing to a snowflake warehouse. I have around 45 databases in the mariadb instance and those are each pushed into different schemas. All good so far. All databases have the same tables inside of them, so I'm using wildcards to select the tables and columns that I want to replicate. There are also some columns that I don't want to extract as they contain sensitive information, so those are de-selected again using negated wildcard selects. It looks something like this:
Copy code
plugins:
  extractors:
    - name: source-db
      inherit_from: tap-mysql
      variant: transferwise
      pip_url: pipelinewise-tap-mysql
      config:
        host: localhost
        user: meltano
        engine: mariadb
        use_gtid: true
      select:
        - "*-Orders.*"
        - "*-Payments.*"
        - "!*-Payments.CreditCardNumber"
        - "*-Customers.*"
        - "!*-Customers.owner*"
        - "!*-Customers.secondOwner*"
        - "*-EventDataKey.*"
        - "*-Export.*"
        - "*-Note.*"
        - "*-LogItem.*"
        - "!*-LogItem.data"
      metadata:
        "*":
          replication-method: LOG_BASED
Problem is now that the de-selected columns show up from time to time in the snowflake warehouse. The pipelines all run orchetrated by airflow, and the de-selected columns shows up only for some of the databases (seemingly random) and only at certain syncs. So two things; A. Is the above not the correct way to select "all but a few" columns out of a large number of databases? B. Is there a better way? Perhaps using a stream mapping to transform away the unwanted columns?
After reading around it seems that a stream mapper (i.e. meltano-map-transformer) could do the trick if; • it supports wildcards in the stream and column selection • it works when I run the pipeline through airflow
After experimentation it seems that meltano-map-transformer does not support wildcards in the stream selection. 😢
...which I'd need as I have so many databases and the tap-mysql constructs the streams as <database-name>-<table-name>
I would in essence like to handle PII removal for every table with the same name in all of the databases.
That is the stream map would be similar to *-<table-name>
After a painful process of resyncing all my databases over a very low bandwidth connection the problem persists. There seem to be a bug when selecting tables using wildcards and using airflow to orchestrate. Some exclusions are not taken into account. I will proceed with selecting every column explicitly and remove the exclusions as a test.
d
@par_degerman I think this may be the issue you're seeing: https://github.com/transferwise/pipelinewise-tap-mysql/issues/50
t
We didn’t have an open issue for wildcard stream selection .I just made that issue on https://github.com/MeltanoLabs/meltano-map-transform/issues/148