# troubleshooting
p
Hi all, I'm evaluating Meltano for an ETL project. The objective is to partially copy a table from Snowflake to S3. I'd like to control which table and columns are selected, and the start date of the range. I'm using the meltanolabs variant of tap-snowflake. A few questions:
• Currently the tap is scanning all the db schemas and tables. Is there a way to prevent this if the entities and attributes are already selected?
• What is the difference between `select` and `select_filter`?
• Is there a way to filter records by date range? Some taps have a `start_date` config; is that a general attribute or only supported by some taps?
Thank you
t
Hey @peter_pezon! This tap is under active development from @ken_payne and we've been using #C04BG3978NT for feedback from early users. For your first question, https://github.com/MeltanoLabs/tap-snowflake/issues/10 is a relevant issue that we'll be looking at soon. 2nd question - those basically do the opposite of each other - positive selection vs negative selection - and they can be stacked if needed: https://docs.meltano.com/concepts/plugins#select_filter-extra 3rd question - if the tap supports the `start_date` config then yes, but not all taps (APIs / files) would support incremental replication in that way.
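Going back to the 2nd question, for illustration, a minimal sketch of how the two extras could be combined in meltano.yml (the table and column names are placeholders):
Copy code
plugins:
  extractors:
  - name: tap-snowflake
    select:                 # positive/negative selection of entity.attribute patterns
    - schema1-table1.a
    - schema1-table1.date
    - "!*.*"                # "!" negates: deselect everything else
    select_filter:          # further narrows the already-selected streams for a run/environment
    - schema1-table1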
p
Thank you so much for the quick response.
Is it fair to assume that the SDK-based tap-snowflake does not support `start_date`? It's not listed as a plugin setting. Otherwise, is the approach to use inline stream maps with `__filter__`?
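(For reference, inline stream maps with `__filter__` would look roughly like this in meltano.yml - a sketch only, assuming the tap exposes the SDK's stream_maps setting; the stream and column names are placeholders:)
Copy code
plugins:
  extractors:
  - name: tap-snowflake
    config:
      stream_maps:
        schema1-table1:
          # keep only records on or after this date; evaluated per record
          __filter__: record['date'] >= '2022-12-01'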
t
Correct - it doesn't seem to support `start_date` currently, but I'm looking at the code and there seems to be a replication_key; it's not clear to me how it's set. @ken_payne is in South Africa so won't be able to respond until tomorrow. Would you mind opening an issue on the repo for this?
You should also be able to set the replication type using the metadata extra: https://docs.meltano.com/concepts/plugins#metadata-extra I think that will actually do what you need for incremental loads.
p
Just to confirm, that should look like this?
Copy code
config:
    plugins:
      extractors:
      - name: tap-snowflake
        select:
        - schema1-table1.a
        - schema1-table1.b
        - schema1-table1.c
        - schema1-table1.date
        - "!*.*"
        metadata:
          schema1-table1.*:
            replication-method: INCREMENTAL
            replication-key: date
            date:
              is-replication-key: true
t
@peter_pezon that looks right I think. The best way would be to run `meltano invoke tap-snowflake` and see what the output looks like.
p
Alternatively: can I set the state manually s.t. the bookmark refers to a particular date?
p
I'd also add that it's helpful to get some people testing the new meltanolabs tap-snowflake, but the default is still the pipelinewise variant on the hub: https://hub.meltano.com/extractors/tap-snowflake. So if you have trouble you can try that variant instead. I use it here
You can set the state using the state command but it corresponds to a tap/target pair.
Or you could manually set it for invoke with `meltano invoke tap-snowflake --state state.json` (I have to test that syntax).
p
How should a state.json file look like?
p
I tested that syntax and it works - for mine it looked like this:
Copy code
{
  "currently_syncing": null,
  "bookmarks": {
    "USERDEV_PROD-PNADOLNY_MELTANO_HUB-HUB_METRICS_AUDIT": {
      "replication_key": "UPDATED_DATE",
      "version": 1670361200499,
      "replication_key_value": "2022-11-30T07:18:13.943169+00:00"
    }
  }
}
I just ran it without the `--state` flag (my table has 2 rows), then copied the last state message into my state.json and edited from there.
My metadata and select criteria:
Copy code
metadata:
      '*':
        replication-method: INCREMENTAL
        replication-key: UPDATED_DATE
    select:
    - '*HUB_METRICS_AUDIT.UPDATED_DATE'
    - '*HUB_METRICS_AUDIT.METRIC_TYPE'
Also this is the pipelinewise variant so it could be slightly different for you. I think meltanolabs takes lower case column names
p
Ran `meltano run tap-snowflake target-jsonl`. The plugin loaded >50,000 records. The source table has >500 million records, so I cancelled it. I don't see a state.json file generated.
Nothing in db state:
Copy code
% meltano state list
2022-12-06T21:46:30.166180Z [info     ] Environment 'dev' is active
2022-12-06T21:46:30.233004Z [info     ] No state IDs found.
The [docs](https://docs.meltano.com/guide/integration#incremental-replication-state) suggest running with `--state-id`, but it seems that's an invalid flag:
Copy code
meltano run tap-snowflake target-jsonl --state-id=tap-snowflake-target-jsonl-state-1
2022-12-06T21:45:00.094722Z [info     ] Environment 'dev' is active
Usage: meltano run [OPTIONS] [BLOCKS]...
Try 'meltano run --help' for help.

Error: No such option: --state-id (Possible options: --no-state-update, --state-id-suffix)
p
@cody_hanson you might be able to help here - how would we set a state before ever having a successful sync? Can Peter just run the following?
meltano state set dev:tap-snowflake-to-target-jsonl '{"singer_state": { "currently_syncing": null, "bookmarks": { "USERDEV_PROD-PNADOLNY_MELTANO_HUB-HUB_METRICS_AUDIT": { "replication_key": "UPDATED_DATE", "version": 1670361200499, "replication_key_value": "2022-11-30T07:18:13.943169+00:00" } } }}'
@peter_pezon yeah, `state-id` is only for `meltano elt`. Do your logs have any STATE messages in them when you ran `meltano invoke tap-snowflake`? Like, for example, mine looks like:
c
Yes, you can set state for the job to any value using `meltano state set`. If you have the state in a file, that command also accepts file input.
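For reference, the file-input form looks roughly like this (the state ID and filename here are placeholders):
Copy code
meltano state set dev:tap-snowflake-to-target-jsonl --input-file state.json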
p
What is the difference between `meltano run`, `meltano elt`, and `meltano job`?
No, I don't believe I see STATE messages.
I see a state message in an elt.log:
Copy code
2022-12-06T22:00:38.223765Z [debug    ] {"type": "STATE", "value": {"bookmarks": {"sap-sales_invoice_v": {"starting_replication_value": null}}}} cmd_type=extractor name=tap-snowflake (out) run_id=6ab45b32-aaa7-4050-a2c7-9db30502ce0d state_id=2022-12-06T215744--tap-snowflake--target-jsonl stdio=stdout
it's output a few times but the value is the same.
p
run is a newer (added in the last 6 months), more flexible version of elt (which has been around since day 1). Job is just a way to alias a bunch of plugins you want to run together, mostly for scheduling purposes. These docs might be helpful.
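Roughly, side by side (a sketch - the plugin and job names here are placeholders):
Copy code
# run: newer, composable blocks; state IDs are auto-generated per extract/load pair
meltano run tap-snowflake target-jsonl

# elt: the original command; the state ID is passed explicitly
meltano elt tap-snowflake target-jsonl --state-id=my-state-id

# job: alias a sequence of blocks (mostly for scheduling), then run the alias
meltano job add snowflake-to-jsonl --tasks "tap-snowflake target-jsonl"
meltano run snowflake-to-jsonl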
I suspect this has something to do with the fact that this tap is in development; your replication key should show up in that state message.
p
I will try the transferwise variant.
Do I need to issue any command after changing meltano.yml to point to the new variant?
p
Using `meltano remove` then `meltano add` will probably be the most surefire way to get everything (python package, metadata, etc.) updated. It will remove your configs too, so you might want to make a copy of those first.
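For example (a sketch, assuming you want the transferwise variant from the hub):
Copy code
meltano remove extractor tap-snowflake
meltano add extractor tap-snowflake --variant transferwise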
p
Switched over to the transferwise variant and updated the plugin config. Made a plugin for a table, as that seems to be the pattern. Running `meltano run tap-snowflake-x-y-z target-jsonl --state-id-suffix xyz`. It stalled a bit. Looks like it's attempting to basically do `select * from table order by date asc` - on a table that's 500M rows. It's a bit mad.
Whereas it looked like the meltanolabs variant was streaming in data. I think that's much more preferable.
Never mind, it's streaming in now. Guess it just takes a minute.
It does seem to produce a STATE message:
Copy code
{"type": "STATE", "value": {"currently_syncing": "XXX-YYY-ZZZ", "bookmarks": {"XXX-YYY-ZZZ":
 {"replication_key": "DATE", "version": 1670366638536, "replication_key_value": "2016-08-12T00:00:00+00:00"}}}} cmd_type=elb consumer=False name=tap-snowflake-xxx-yyy-zzz producer=True stdio=stdout string_id=tap-snowflake-xxx-yyy-zzz
So, I can set the state with `meltano state set tap-snowflake-xxx-yyy-zzz "{...}"` after changing the value of `replication_key_value`. Can the date be arbitrary, or should it match a date value in the source table? Can the version be an arbitrary number?
p
Copy code
$ meltano state list
2022-12-06T23:33:06.916782Z [info     ] Environment 'dev' is active
2022-12-06T23:33:06.985596Z [info     ] No state IDs found.
$ meltano state set tap-snowflake-xxx-yyy-zzz-to-target-jsonlnum2 --input-file xxx-yyy-zzz-state.json
2022-12-06T23:33:11.169833Z [info     ] Environment 'dev' is active
This will overwrite the state's current value. Continue? [y/N]: y
Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to
join our friendly Slack community.

singer_state not found in top level of provided state
p
Yeah, from the command I posted earlier in the thread: `meltano state set dev:tap-snowflake-to-target-jsonl '{"singer_state": <INSERT_YOUR_STATE>}'` - see if that works.
And I shared https://docs.meltano.com/concepts/environments#state-id-suffix because that state ID, `dev:tap-snowflake-to-target-jsonl`, is built from the plugin names you're running, so if you use tap-snowflake-xyz you'll need to include that in the state ID.
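So with the suffixed plugin name, setting state would look something like this (a sketch - the filename is a placeholder):
Copy code
meltano state set dev:tap-snowflake-xyz-to-target-jsonl --input-file xyz-state.json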
p
Got it. I will return to this tomorrow. Thanks for your help.
p
Awesome! Glad you got it working
p
BTW, I was reading through the code for tap-snowflake and singer_sdk. Looks like if the replication method is incremental, it will call `get_starting_timestamp`, and singer_sdk suggests that `start_date` is used if there is no state: https://github.com/meltano/sdk/blob/main/singer_sdk/streams/core.py#L227 I don't see where `start_date` is actually used. Is the context populated by the config block?
Don't quite have it working yet 🙂 but I need to shift to something else.
I tried the meltanolabs variant again to try and set start_date, and got this error: Run invocation could not be completed as block failed: Extractor failed
Copy code
ValueError: Could not detect type from empty type_dict. Did you forget to define a property in the stream schema?
p
@ken_payne or @edgar_ramirez_mondragon will know more than me about the internals of the SDK and tap-snowflake specifically.
@peter_pezon you can also create an issue in the tap repo to request that `start_date` be supported, if you'd like. I'm not sure if that's already a planned feature or not.
p
Looks like I was using the wrong field for the replication key. start_date is being picked up, but there's an error with how the SQL query is executed:
Copy code
- name: tap-snowflake-xxx-yyy
  inherit_from: tap-snowflake
  config:
    start_date: "2022-12-01T00:00:00+00:00"
  select:
  - xxx-yyy.*
  metadata:
    "*":
      replication-method: INCREMENTAL
      replication-key: z_date
      z_date:
        is-replication-key: true
Copy code
File "/x/y/z/.meltano/extractors/tap-snowflake/venv/lib/python3.10/site-packages/snowflake/connector/errors.py", line 209, in default_errorhandler 
sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 252004: 252004: Failed processing pyformat-parameters: 255001: 255001: Binding data in type (column) is not supported.
[SQL: SELECT ...
FROM xxx.yyy
WHERE %(replication_key)s >= %(start_val)s ORDER BY xxx.yyy.z_date]
[parameters: {'replication_key': Column('z_date', DATE(), table=<yyy>), 'start_val': '2022-12-01T00:00:00+00:00'}]
(Background on this error at: <https://sqlalche.me/e/14/f405>)
Update: tried setting the state for the transferwise variant of tap-snowflake. Seems to be working! It's now bringing in recently updated records. Set state:
Copy code
meltano state set dev:tap-snowflake-xyz-to-target-jsonl '{"singer_state": { "currently_syncing": null, "bookmarks": { "XYZ": { "replication_key": "Z_DATE", "version": 1670361200499, "replication_key_value": "2022-11-30T07:18:13.943169+00:00" } } }}'
Run ELT:
Copy code
meltano --log-level=debug elt tap-snowflake-xyz target-jsonl --state-id dev:tap-snowflake-xyz-to-target-jsonl
I feel like this workflow is a bit batty though. Not sure how to automate this at all.
p
Nice! I'd also recommend considering using `run` instead of `elt` eventually. Yeah, that workflow wasn't very clean; having a start_date would have solved it in a much simpler way.
p
How do I run the same command with `run`? Instead of `--state-id` there is `--state-id-suffix`.
Ah, it's right in the CLI help.
Copy code
When running within an active environment, meltano run activates
incremental job support. State ID's are autogenerated using the format
`{active_environment.name}:{extractor_name}-to-{loader_name}(:{state-id-suffix})`
for each extract/load pair found:

    `meltano --environment=prod run tap-gitlab target-postgres tap-salesforce target-mysql`
I'd like to use the meltanolabs variant and take advantage of BATCH. I can try this same thing again. It looks like `start_date` is somewhat supported - at least, there is an attempt to fill in `replication_key_value` with `start_date`, but there's an error on query compile.
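(For what it's worth, a rough sketch of what a BATCH setup could look like for an SDK-based tap - this assumes the tap exposes the SDK's batch_config setting, and the bucket path is a placeholder:)
Copy code
config:
  batch_config:
    encoding:
      format: jsonl                    # emit batch files as gzipped JSONL
      compression: gzip
    storage:
      root: "s3://my-bucket/batches"   # placeholder bucket/prefix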
p
Yeah, definitely raise an issue in that repo about it!