# getting-started
d
Hello, just got started and ran into a couple of issues:
1. What's the correct syntax for the environment variable `TAP_MSSQL__SELECT_FILTER`? I have tried specifying the `tap_stream_id` as output by dumping the tap's catalog, which is in the format `schema-tablename`. However, when running `meltano run tap-mssql target-snowflake` with `set TAP_MSSQL__SELECT_FILTER='["schema-tablename"]'`, it doesn't detect anything to sync. If I specify `schema-tablename.*` instead, then it runs everything that I have selected in my `meltano.yml` file, instead of just the single table. I cannot run `meltano el --select` since I'm on Windows, apparently.
2. Is there a way to keep the same column order as the source when the destination table gets created? It's not very convenient when everything is jumbled up, since many source tables have their columns ordered deliberately so that a `select * from table` gives easily readable data.
3. Is it possible to output the `.gz` archives to a different directory?
e
1. Can you try reinstalling the tap before running with `TAP_MSSQL__SELECT_FILTER='["schema-tablename"]'`? `meltano install extractor tap-mssql --clean` should do the trick (see the shell sketch after this message). If that works, it means you're running into a cached catalog. We have an upcoming CLI option to help with that: https://github.com/meltano/meltano/pull/8580
2. Not currently, but targets usually process the columns in the order they appear in the `SCHEMA` message, so you may be able to affect it by providing a custom order via the `schema` extra, with the inconvenience that you'd have to be exhaustive:
```yaml
plugins:
  extractors:
  - name: tap-mssql
    schema:
      schema-tablename:   # the schema extra is keyed by stream, then by column
        my_column_1: {}
        my_column_2: {}
```
I'm not sure whether SQLAlchemy (as used in the SDK) can guarantee that columns are discovered in the same order as in the source database, but I invite you to test it out; if that's the case, we can work on a PR to preserve that order when generating the catalog.
3. Which `.gz` file do you mean?
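For item 1, a minimal sketch of the suggested sequence, shown for a POSIX shell (the stream ID is the illustrative `schema-tablename` from above):

```sh
# Discard the cached catalog by reinstalling the tap from scratch
meltano install extractor tap-mssql --clean

# Then run with the select filter applied via the environment variable
export TAP_MSSQL__SELECT_FILTER='["schema-tablename"]'
meltano run tap-mssql target-snowflake
```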
d
1. Unfortunately, it still didn't fix the issue. Does it matter that the select in the yml file is `schema-tablename.*`? Is there a particular reason why `el` isn't supported on Windows, or why `run` can't have a filter parameter instead of using the env var? Running `el --select schema-tablename` has the expected behavior on Linux, but it's not convenient for development as our developers are working on Windows.
2. I can't say for sure, but for what it's worth, I have used dlt, which also has a SQLAlchemy backend, and it didn't have the problem with the column ordering; I created the table multiple times through it. I've since moved on because it didn't support log-based replication.
3. I have a bunch of files in my root directory with the format `target-snowflake--schema-tablename-guid-1.json.gz`. They contain the data that I'm presuming is being inserted.
e
> Does it matter that the select in the yml file is `schema-tablename.*`?

I'd need to confirm, but you could try commenting it out for testing purposes.

> Is there a particular reason why `el` isn't supported on Windows?

https://github.com/meltano/meltano/issues/2841

> ...or why `run` can't have a filter parameter instead of using the env var?

Since `run` can run an arbitrary number of taps, it'd be impossible to determine which tap the parameter applies to 🫤 For column order I've logged https://github.com/meltano/sdk/issues/2489.

> I have a bunch of files in my root directory with the format `target-snowflake--schema-tablename-guid-1.json.gz`. They contain the data that I'm presuming is being inserted.

You might be able to change that by passing a custom batch config. The default is defined in https://github.com/MeltanoLabs/target-snowflake/blob/0dbfd522347affdd35547a13c40cf20f29bd26d1/target_snowflake/sinks.py#L26-L29.
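A hedged sketch of what that might look like in `meltano.yml`, assuming the target accepts an SDK-style `batch_config` setting; the `file://output/batches` root and `batch-` prefix are illustrative values, not defaults:

```yaml
plugins:
  loaders:
  - name: target-snowflake
    config:
      batch_config:
        encoding:
          format: jsonl
          compression: gzip
        storage:
          root: "file://output/batches"   # hypothetical directory for the .json.gz batch files
          prefix: "batch-"
```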
d
Tried commenting it out and running the install with `--clean`, and still nothing gets run.
e
Meaning no stream is selected?
d
Correct, no stream is selected.
e
Oh, I think the tap may not be respecting the top-level `"selected": true` generated by `select_filter`? I know other taps do. For example:
```sh
$ TAP_GETPOCKET__SELECT_FILTER='["items"]' meltano --log-level=warning invoke --dump=catalog tap-getpocket --about | jq '.streams[0].selected'
true
$ TAP_GETPOCKET__SELECT_FILTER='["items"]' meltano --log-level=warning invoke tap-getpocket | wc -l
752

$ TAP_GETPOCKET__SELECT_FILTER='["items2"]' meltano --log-level=warning invoke --dump=catalog tap-getpocket --about | jq '.streams[0].selected'
false
$ TAP_GETPOCKET__SELECT_FILTER='["items2"]' meltano --log-level=warning invoke tap-getpocket | wc -l
1
```
Can you try `TAP_MSSQL__SELECT='["schema-tablename.*"]'`?
d
Hmm, I'm having some trouble with the syntax. If I do it as you mentioned above, it gives something like this, as if it's trying to read each character as a separate stream:
```
2024-06-20T17:46:50.712865Z [warning  ] Stream `[` was not found in the catalog
2024-06-20T17:46:50.714869Z [warning  ] Stream `"` was not found in the catalog
2024-06-20T17:46:50.714869Z [warning  ] Stream `d` was not found in the catalog
2024-06-20T17:46:50.715874Z [warning  ] Stream `i` was not found in the catalog
2024-06-20T17:46:50.715874Z [warning  ] Stream `m` was not found in the catalog
```
I've also tried several variations with/without single/double quotes, or the square brackets, but haven't figured it out yet.
Got it. This worked for me: `set TAP_MSSQL__SELECT=["schema-TableName"]`
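The difference appears to be that `cmd.exe` keeps quotes as part of the value, so the quoted form is no longer valid JSON; a quick sketch of the two forms, assuming a Windows `cmd` shell:

```bat
:: With quotes, cmd stores them literally, so the value is '["schema-TableName"]'
:: and the list cannot be parsed as JSON (each character ends up treated as a stream):
set TAP_MSSQL__SELECT='["schema-TableName"]'

:: Without quotes, the value is the JSON array itself:
set TAP_MSSQL__SELECT=["schema-TableName"]
meltano run tap-mssql target-snowflake
```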
Thanks for checking on this. What's the difference between `SELECT` and `SELECT_FILTER`? I'm surprised there are two that do what should be the same thing.
Also, I noticed someone had a comment in another thread about using `--merge-state`. If I'm planning on running filtered runs, should I always be passing that arg?
e
> What's the difference between SELECT and SELECT_FILTER? Surprised there are two that do what should be the same thing

Yeah. A bit before my time, but some spelunking brought back the commit where it was added in the GitLab MR:

> Add `select_filter` extractor extra to allow extracting a subset of selected entities

So, not entirely sure 😅 but it does seem to manage the catalog in a subtly different way that may or may not be supported by all taps.
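For orientation, a hedged sketch of how the two extras sit in `meltano.yml` (stream names are illustrative): `select` defines which streams and properties are selected when the catalog is generated, while `select_filter` narrows an already-selected set down to a subset of streams for a given run.

```yaml
plugins:
  extractors:
  - name: tap-mssql
    select:            # applied when the catalog is generated
    - schema-tablename.*
    - schema-othertable.*
    select_filter:     # narrows the selected streams at run time
    - schema-tablename
```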
d
I created an issue to update the docs or fix the behavior. What about `--merge-state`? The documentation isn't completely clear to me, though that may also be my limited understanding of how Meltano works behind the scenes. It sounds like I should always be setting this, though I'm not sure what happens to the state tracking if I don't specify it. Does that filtered run just maintain its own state somewhere outside of the centralized location? If I only run it filtered, then would it not matter? Is it only important if I run a mix of everything and a filtered subset?
e
You've probably seen https://github.com/meltano/meltano/blob/main/integration/example-library/meltano-run-merge-states/index.md. The main use case of `--merge-state` is ensuring that pipelines that run with non-overlapping select criteria can share a single state without overwriting each other's bookmarks. If you always run with different select criteria, then simply using a `--state-id-suffix` is enough to keep their states separate. On the other hand, if you want to, for example, run a backfill for a subset of the streams using a select rule, then in the end you want the states to be merged. `--merge-state` was added because some extractors read the current state but write a totally new one at the end of the run, without merging the dictionaries, so the older bookmarks would be lost.
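A rough sketch of the two approaches (stream and suffix names are illustrative):

```sh
# Keep a filtered pipeline's state entirely separate from the main one:
TAP_MSSQL__SELECT='["schema-tablename.*"]' \
  meltano run tap-mssql target-snowflake --state-id-suffix=tablename-only

# Or fold a filtered backfill's bookmarks back into the existing state
# instead of letting them overwrite it:
TAP_MSSQL__SELECT='["schema-tablename.*"]' \
  meltano run tap-mssql target-snowflake --merge-state
```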
Is that helpful?
d
Not sure if I'm understanding correctly, but it sounds like there are two parts to it:
1. If you don't use `--merge-state`, it's possible for two selectors to select the same stream twice?
2. Some extractors override the state with what's in the current filtered run, ignoring the rest of the streams that weren't run and effectively removing them.
If I'm just running single streams, would it make sense to set `--state-id-suffix={tap_stream_id}`? Is merging state an expensive operation? Does it make sense to always specify the flag so that we maintain a single state regardless of which streams we run?
I guess in my head, what I'm imagining the state to be, in a simplified way, is just a collection of the statuses of the different streams. When we execute `meltano run`, I imagine it looks at the default state (if no `--state-id-suffix` is provided) and then updates the default state or the provided state (and this would be where the `--merge-state` flag comes into play). If `--state-id-suffix` is provided, then it would look only at that particular state and update that specific one. I'm guessing there might be issues if I run everything with a unique `--state-id-suffix` and then decide I want to run multiple streams together? Is that correct? And what happens if the schema of the source data changes and no longer matches the destination?
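That mental model roughly matches the Singer convention, where a state payload is a per-stream collection of bookmarks; a hedged sketch (stream names, keys, and values are illustrative):

```json
{
  "bookmarks": {
    "schema-tablename": {
      "replication_key": "updated_at",
      "replication_key_value": "2024-06-20T17:46:50Z"
    },
    "schema-othertable": {
      "replication_key": "id",
      "replication_key_value": 41235
    }
  }
}
```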