# getting-started
d
Hello, just got started and ran into a couple of issues:
1. What's the correct syntax for the environment variable `TAP_MSSQL__SELECT_FILTER`? I have tried specifying the `tap_stream_id` as output by dumping the tap's catalog, which is in the format `schema-tablename`. However, when running `meltano run tap-mssql target-snowflake` with `set TAP_MSSQL__SELECT_FILTER='["schema-tablename"]'`, it doesn't detect anything to sync. If I specify `schema-tablename.*` instead, then it runs everything that I have selected in my `meltano.yml` file, instead of just the single table. I cannot run `meltano el --select` since I'm on Windows, apparently.
2. Is there a way to keep the same column order as the source when the destination table gets created? It's not very convenient when everything is jumbled up, since many source tables have their columns ordered deliberately so that a `select * from table` gives easily readable data.
3. Is it possible to output the `.gz` archives to a different directory?
e
1. Can you try reinstalling the tap before running with `TAP_MSSQL__SELECT_FILTER='["schema-tablename"]'`? `meltano install extractor tap-mssql --clean` should do the trick (see the shell sketch after this message). If that works, it means you're running into a cached catalog. We have an upcoming CLI option to help with that: https://github.com/meltano/meltano/pull/8580
2. Not currently, but targets usually process the columns in the order they appear in the `SCHEMA` message, so you may be able to affect it by providing a custom order via the `schema` extra, with the inconvenience that you'd have to be exhaustive:
```yaml
plugins:
  extractors:
  - name: tap-mssql
    schema:
      schema-tablename:   # the schema extra is keyed by stream, then by column
        my_column_1: {}
        my_column_2: {}
```
I'm not sure whether SQLAlchemy (as used in the SDK) can guarantee that columns are discovered in the same order as in the source database, but I invite you to test it out; if that's the case, we can work on a PR to preserve that order when generating the catalog.
3. Which `.gz` file do you mean?
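For item 1, a minimal sketch of the suggested sequence, shown for a POSIX shell (the stream ID is the illustrative `schema-tablename` from above):

```sh
# Discard the cached catalog by reinstalling the tap from scratch
meltano install extractor tap-mssql --clean

# Then run with the select filter applied via the environment variable
export TAP_MSSQL__SELECT_FILTER='["schema-tablename"]'
meltano run tap-mssql target-snowflake
```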
d
1. Unfortunately, it still didn't fix the issue. Does it matter that the select in the yml file is `schema-tablename.*`? Is there a particular reason why `el` isn't supported on Windows, or why `run` can't have a filter parameter instead of using the env var? Running `el --select schema-tablename` has the expected behavior on Linux, but it's not convenient for development as our developers are working on Windows.
2. I can't say for sure, but for what it's worth, I have used dlt, which also has a SQLAlchemy backend, and it didn't have the problem with the column ordering; I created the table multiple times through it. I've since moved on because it didn't support log-based replication.
3. I have a bunch of files in my root directory with the format `target-snowflake--schema-tablename-guid-1.json.gz`. They contain the data that I'm presuming is being inserted.
e
> Does it matter that the select in the yml file is `schema-tablename.*`?

I'd need to confirm, but you could try commenting it out for testing purposes.

> Is there a particular reason why `el` isn't supported on Windows?

https://github.com/meltano/meltano/issues/2841

> ...or why `run` can't have a filter parameter instead of using the env var?

Since `run` can run an arbitrary number of taps, it'd be impossible to determine which tap the parameter applies to 🫤 For column order I've logged https://github.com/meltano/sdk/issues/2489.

> I have a bunch of files in my root directory with the format `target-snowflake--schema-tablename-guid-1.json.gz`. They contain the data that I'm presuming is being inserted.

You might be able to change that by passing a custom batch config. The default is defined in https://github.com/MeltanoLabs/target-snowflake/blob/0dbfd522347affdd35547a13c40cf20f29bd26d1/target_snowflake/sinks.py#L26-L29.
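A hedged sketch of what that might look like in `meltano.yml`, assuming the target accepts an SDK-style `batch_config` setting; the `file://output/batches` root and `batch-` prefix are illustrative values, not defaults:

```yaml
plugins:
  loaders:
  - name: target-snowflake
    config:
      batch_config:
        encoding:
          format: jsonl
          compression: gzip
        storage:
          root: "file://output/batches"   # hypothetical directory for the .json.gz batch files
          prefix: "batch-"
```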
d
Tried commenting it out and running the install with `--clean`, and still nothing gets run.
e
Meaning no stream is selected?
d
Correct, no stream is selected.
e
Oh, I think the tap may not be respecting the top-level `"selected": true` generated by `select_filter`? I know other taps do. For example:
```sh
$ TAP_GETPOCKET__SELECT_FILTER='["items"]' meltano --log-level=warning invoke --dump=catalog tap-getpocket --about | jq '.streams[0].selected'
true
$ TAP_GETPOCKET__SELECT_FILTER='["items"]' meltano --log-level=warning invoke tap-getpocket | wc -l
752

$ TAP_GETPOCKET__SELECT_FILTER='["items2"]' meltano --log-level=warning invoke --dump=catalog tap-getpocket --about | jq '.streams[0].selected'
false
$ TAP_GETPOCKET__SELECT_FILTER='["items2"]' meltano --log-level=warning invoke tap-getpocket | wc -l
1
```
Can you try `TAP_MSSQL__SELECT='["schema-tablename.*"]'`?
d
Hmm, I'm having some trouble with the syntax. If I do it as you mentioned above, it gives something like this, as if it's trying to read each character as a separate stream:
```
2024-06-20T17:46:50.712865Z [warning  ] Stream `[` was not found in the catalog
2024-06-20T17:46:50.714869Z [warning  ] Stream `"` was not found in the catalog
2024-06-20T17:46:50.714869Z [warning  ] Stream `d` was not found in the catalog
2024-06-20T17:46:50.715874Z [warning  ] Stream `i` was not found in the catalog
2024-06-20T17:46:50.715874Z [warning  ] Stream `m` was not found in the catalog
```
I've also tried several variations with/without single/double quotes, or the square brackets, but haven't figured it out yet.
Got it. This worked for me: `set TAP_MSSQL__SELECT=["schema-TableName"]`
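The difference appears to be that `cmd.exe` keeps quotes as part of the value, so the quoted form is no longer valid JSON; a quick sketch of the two forms, assuming a Windows `cmd` shell:

```bat
:: With quotes, cmd stores them literally, so the value is '["schema-TableName"]'
:: and the list cannot be parsed as JSON (each character ends up treated as a stream):
set TAP_MSSQL__SELECT='["schema-TableName"]'

:: Without quotes, the value is the JSON array itself:
set TAP_MSSQL__SELECT=["schema-TableName"]
meltano run tap-mssql target-snowflake
```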
Thanks for checking on this. What's the difference between `SELECT` and `SELECT_FILTER`? I'm surprised there are two that do what should be the same thing.
Also, I noticed someone had a comment in another thread about using `--merge-state`. If I'm planning on running filtered runs, should I always be passing that arg?
e
> What's the difference between SELECT and SELECT_FILTER? Surprised there are two that do what should be the same thing

Yeah. A bit before my time, but some spelunking brought back the commit where it was added in the GitLab MR:

> Add `select_filter` extractor extra to allow extracting a subset of selected entities

So, not entirely sure 😅 but it does seem to manage the catalog in a subtly different way that may or may not be supported by all taps.
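For orientation, a hedged sketch of how the two extras sit in `meltano.yml` (stream names are illustrative): `select` defines which streams and properties are selected when the catalog is generated, while `select_filter` narrows an already-selected set down to a subset of streams for a given run.

```yaml
plugins:
  extractors:
  - name: tap-mssql
    select:            # applied when the catalog is generated
    - schema-tablename.*
    - schema-othertable.*
    select_filter:     # narrows the selected streams at run time
    - schema-tablename
```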
d
I created an issue to update the docs or fix the behavior. What about `--merge-state`? The documentation isn't completely clear to me, though that may also be my limited understanding of how Meltano works behind the scenes. It sounds like I should always be setting this, though I'm not sure what happens to the state tracking if I don't specify it. Does that filtered run just maintain its own state somewhere outside of the centralized location? If I only run it filtered, then would it not matter? Is it only important if I run a mix of everything and a filtered subset?
e
You've probably seen https://github.com/meltano/meltano/blob/main/integration/example-library/meltano-run-merge-states/index.md. The main use case of `--merge-state` is ensuring that pipelines that run with non-overlapping select criteria can share a single state without overwriting each other's bookmarks. If you always run with different select criteria, then simply using a `--state-id-suffix` is enough to keep their states separate. On the other hand, if you want to, for example, run a backfill for a subset of the streams using a select rule, then in the end you want the states to be merged. `--merge-state` was added because some extractors read the current state but write a totally new one at the end of the run, without merging the dictionaries, so the older bookmarks would be lost.
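A rough sketch of the two approaches (stream and suffix names are illustrative):

```sh
# Keep a filtered pipeline's state entirely separate from the main one:
TAP_MSSQL__SELECT='["schema-tablename.*"]' \
  meltano run tap-mssql target-snowflake --state-id-suffix=tablename-only

# Or fold a filtered backfill's bookmarks back into the existing state
# instead of letting them overwrite it:
TAP_MSSQL__SELECT='["schema-tablename.*"]' \
  meltano run tap-mssql target-snowflake --merge-state
```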
Is that helpful?
d
Not sure if I'm understanding correctly, but it sounds like there are two parts to it:
1. If you don't use `--merge-state`, it's possible for two selectors to select the same stream twice?
2. Some extractors override the state with what's in the current filtered run, ignoring the rest of the streams that weren't run and effectively removing them.
If I'm just running single streams, would it make sense to set `--state-id-suffix={tap_stream_id}`? Is merging state an expensive operation? Does it make sense to always specify the flag so that we maintain a single state regardless of which streams we run?
I guess in my head, what I'm imagining the state to be, in a simplified way, is just a collection of the statuses of the different streams. When we execute `meltano run`, I imagine it looks at the default state (if no `--state-id-suffix` is provided) and then updates the default state or the provided state (and this would be where the `--merge-state` flag comes into play). If `--state-id-suffix` is provided, then it would look only at that particular state and update that specific one. I'm guessing there might be issues if I run everything with a unique `--state-id-suffix` and then decide I want to run multiple streams together? Is that correct? And what happens if the schema of the source data changes and no longer matches the destination?
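That mental model roughly matches the Singer convention, where a state payload is a per-stream collection of bookmarks; a hedged sketch (stream names, keys, and values are illustrative):

```json
{
  "bookmarks": {
    "schema-tablename": {
      "replication_key": "updated_at",
      "replication_key_value": "2024-06-20T17:46:50Z"
    },
    "schema-othertable": {
      "replication_key": "id",
      "replication_key_value": 41235
    }
  }
}
```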