# plugins-general
h
Hi everyone, I was just wondering: 1. Is there an extractor/loader plugin that loads data from GCS into BigQuery as an external BQ table? 2. Is there an extractor plugin that allows me to run bash scripts? I have been doing a lot of searching but could not find anything, or the documentation is not clear. Thanks
v
1. I don't know, you'd have to look at the implementations of target-bigquery here https://hub.meltano.com/loaders/target-bigquery to see if one of them loads it as an external BQ table. I'd assume not, as why wouldn't we just load the data into BigQuery if we have all the data? Maybe you could expand on your use case more?
2. No there isn't. If you really wanted to do it in Meltano then I'd say write a utility plugin https://hub.meltano.com/. What I do is, however I'm running Meltano (GitHub Actions, GitLab CI, SQL Agent runs (PowerShell, Batch, etc.), Windows Task Scheduler), I tend to write some bash to get things to happen the way I want, and then I put those definitions in ./orchestrate
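For what it's worth, a bare-bones sketch of what a bash-wrapping utility could look like in meltano.yml (the plugin name and script path are made up, not an existing Hub plugin, and depending on your Meltano version you may also need a pip_url):
```
plugins:
  utilities:
    - name: my-scripts              # hypothetical custom utility
      namespace: my_scripts
      executable: bash
      commands:
        refresh:
          args: ./orchestrate/refresh.sh   # hypothetical script
```
Then you could call it with something like meltano invoke my-scripts:refresh and chain it in meltano run.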
a
I gotcha, I use external stages for this.
First: use https://github.com/dbt-labs/dbt-external-tables
Second: add these 2 commands to dbt-bigquery in your meltano.yml
plugins:
  transformers:
    - name: dbt-bigquery
      pip_url: dbt-core~=1.1.0 dbt-bigquery~=1.1.0
      commands:
        stage-gcs:
          args: run-operation stage_external_sources
          description: Stage external sources in Google Cloud Storage
        stage-gcs-refresh:
          args: "run-operation stage_external_sources --vars 'ext_full_refresh: true'"
          description: Stage external sources in Google Cloud Storage
Third: Add a job with dbt-bigquery:stage-gcs
Take this pattern and adjust how you need to.
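For example, a job wiring that command in might look roughly like this in meltano.yml (job name is made up; check the Meltano docs for the exact jobs syntax on your version):
```
jobs:
  - name: stage-and-transform          # hypothetical name
    tasks:
      - dbt-bigquery:stage-gcs dbt-bigquery:run
```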
h
@alexander_butler Thanks for that. I will certainly look at this.
Hi @visch, I have also looked at https://hub.meltano.com/loaders/target-bigquery but there is no option for extracting from GCS -> BQ. Forgive me as I am new to Meltano, so I might be missing something. I have seen the example tap-s3-csv -> target-bigquery, but nothing for GCS -> target-bigquery, which seems odd.
v
Probably need a better way to discover this on the hub: tap-spreadsheets-anywhere basically handles a ton of those cases (s3-csv, s3-json, gcs-csv, gcs-json, etc.).
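Roughly, the config shape in meltano.yml looks like this for a GCS bucket (the bucket, stream name and key below are placeholders; check the tap's README for the full option list):
```
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      config:
        tables:
          - path: gs://my-bucket/exports        # placeholder bucket/prefix
            name: user_entity                   # placeholder stream name
            pattern: ".*\\.csv"
            start_date: "2022-01-01T00:00:00Z"
            key_properties: [id]
            format: csv
```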
h
Yes, the name sounds a bit misleading, as I was thinking it's for spreadsheets only lol.. thanks again.
Sorry @visch, one last question: how do you pass GCS JSON key credentials to tap-spreadsheets-anywhere?
I can use the GOOGLE_APPLICATION_CREDENTIALS environment variable.
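In meltano.yml that could look something like this (the key file path is a placeholder):
```
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      env:
        GOOGLE_APPLICATION_CREDENTIALS: ${MELTANO_PROJECT_ROOT}/secrets/gcs-key.json   # placeholder path
```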
Hi @alexander_butler, I am trying your method first. Got stuck when trying to install the dependencies. I created a file $MELTANO_ROOT/transform/dbt_packages/packages.yml and added this
packages:
- package: dbt-labs/dbt_external_tables
  version: 0.8.0
ran meltano invoke dbt-bigquery:deps but got no packages found.
a
Yo, the packages.yml should be in the root of the dbt dir, same as dbt_project.yml. The install_path is where dbt puts the installed packages, not where you need to put the yaml.
h
Thanks, where do I put the source.yml file?
c
Your dbt schema yml files (e.g. sources.yml) can go anywhere under transform/models
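Putting the last few answers together, the layout ends up roughly like this (assuming the default Meltano transform directory):
```
transform/
├── dbt_project.yml
├── packages.yml        # packages.yml lives here, next to dbt_project.yml
├── models/
│   └── sources.yml     # schema/source yml files anywhere under models/
└── dbt_packages/       # install_path: dbt puts installed packages here
```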
h
I put the sources.yml in transform/models and added this to the meltano.yml
plugins:
  transformers:
    - name: dbt-bigquery
      pip_url: dbt-core~=1.1.0 dbt-bigquery~=1.1.0
      commands:
        stage-gcs:
          args: run-operation stage_external_sources
          description: Stage external sources in Google Cloud Storage
        stage-gcs-refresh:
          args: "run-operation stage_external_sources --vars 'ext_full_refresh: true'"
          description: Stage external sources in Google Cloud Storage
I ran meltano --environment=dev run dbt-bigquery:run but it seems nothing is picked up. My other transform models are called.
a
Did you run the stage command meltano --environment=dev run dbt-bigquery:stage-gcs??
h
@alexander_butler Thanks, I just tried it and got an error with my config: the project_id is wrong. Let me fix this and see what happens.
The strange error is: Invalid resource name projects/pre_arc_keycloak_is; Project id: pre_arc_keycloak_is cmd_type=command name=dbt-bigquery stdio=stderr
a
Paste your source yaml maybe?
h
Where do you set the project_id?
a
It's set in profiles.yml
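A sketch of the bigquery output in transform/profiles/bigquery/profiles.yml (values are placeholders; the auth method depends on your setup):
```
meltano:
  target: bigquery
  outputs:
    bigquery:
      type: bigquery
      method: oauth              # or service-account + keyfile
      project: my-gcp-project    # <- the project_id ("database" in dbt terms)
      dataset: my_dataset
      threads: 4
```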
h
source.yml:
version: 2

sources:
  - name: keycloak
    database: pre_arc_keycloak_is
    loader: gcloud storage
  
    tables:
      - name: user_entity_gcs
        description: "External table of keycloak user_entity, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://quester-d-lan-keycloak-is/kubernetes/prod/quester-gke-prod/postgres-backup-jobs/postgres-prod/cluster-postgres-prod/user_entity_refresh_220622T04_00_05.csv'
          options:
            format: csv
            skip_leading_rows: 1
          
            # if you want a partitioned table, file paths MUST be Hive-style: 
            #   'gs://bucket/path/collector_hour=2020-01-01/'
            #   'gs://bucket/path/collector_hour=2020-01-02/' (etc)
            # hive_partition_uri_prefix: 'gs://bucket/path/'
          # partitions:               
          #   - name: collector_date
          #     data_type: date
        
        columns:
          - name: id
            data_type: varchar(255)
            description: "ID"
          - name: username
            data_type: varchar(255)
            description: "Username"
          - name: email
            data_type: varchar(255)
            description: "email"
          - name: email_constraint
            data_type: varchar(255)
            description: "email"
          - name: email_verified
            data_type: varchar(2)
            description: "t for verified, f means not verified"
          - name: enabled
            data_type: varchar(2)
            description: "t account enaabled, f account not enabled"
          - name: first_name
            data_type: varchar(255)
            description: "first_name"
          - name: last_name
            data_type: varchar(255)
            description: "last_name"
          - name: created_date
            data_type: timestamp
            description: "created date"
          - name: realm_id
            data_type: varchar(255)
            description: "Keycloak realm"
          - name: service_account_client_link
            data_type: varchar(255)
            description: "the service account link"
a
Drop the database key in the above yaml, maybe?
database == project_id in BigQuery, which is already set in profiles.yml
h
I got a different error now: 2022-06-23T153318.209199Z [error ] Block run completed. block_type=InvokerCommand err=RunnerError('`dbt-bigquery stage-gcs` failed with exit code: 1') exit_codes={} set_number=0 success=False Run invocation could not be completed as block failed: `dbt-bigquery stage-gcs` failed with exit code: 1.
How do you turn on debug?
In the sources.yml, do I need to set the dataset somewhere?
a
You set the schema key in the source.yml, otherwise the dataset should be the same as the source name, in this case keycloak.
Verbose dbt arg: meltano --environment=dev invoke dbt-bigquery --debug run-operation stage_external_sources
Plus the Meltano verbose flag: meltano --environment=dev --log-level=debug invoke dbt-bigquery --debug run-operation stage_external_sources
h
Thanks running it now
```
============================== 2022-06-23 153947.704758 | f8497e5d-44ae-4e59-b10a-bde98bb71864 ==============================
153947.704758 [info ] [MainThread]: Running with dbt=1.0.8
153947.706271 [debug] [MainThread]: running dbt with arguments Namespace(args='{}', cls=<class 'dbt.task.run_operation.RunOperationTask'>, debug=True, defer=None, event_buffer_size=None, fail_fast=None, log_cache_events=False, log_format=None, macro='stage_external_sources', partial_parse=None, printer_width=None, profile=None, profiles_dir='/project/transform/profiles/bigquery', project_dir=None, record_timing_info=None, rpc_method='run-operation', send_anonymous_usage_stats=None, single_threaded=False, state=None, static_parser=None, target=None, use_colors=None, use_experimental_parser=None, vars='{}', version_check=None, warn_error=None, which='run-operation', write_json=None)
153947.708422 [debug] [MainThread]: Tracking: do not track
153952.646894 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
153952.647863 [debug] [MainThread]: Partial parsing enabled, no changes found, skipping parsing
153952.649180 [warn ] [MainThread]: [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources. There are 1 unused configuration paths: - models.entity_user
153952.695760 [debug] [MainThread]: Acquiring new bigquery connection "macro_stage_external_sources"
153952.744671 [info ] [MainThread]: 1 of 1 START external source pre_arc_keycloak_is.user_entity_gcs
153952.758291 [debug] [MainThread]: On "macro_stage_external_sources": cache miss for schema "{self.database}.{self.schema}", this is inefficient
153952.760583 [debug] [MainThread]: Opening a new connection, currently in state init
153953.533291 [info ] [MainThread]: 1 of 1 (1) create or replace external table `quester-d-keycloak-tds-f513`.`pre_arc_keycloak...
153953.545568 [debug] [MainThread]: On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.0.8", "profile_name": "meltano", "target_name": "bigquery", "connection_name": "macro_stage_external_sources"} */ create or replace external table `quester-d-keycloak-tds-f513`.`pre_arc_keycloak_is`.`user_entity_gcs`( id varchar(255), username varchar(255), email varchar(255), email_constraint varchar(255), email_verified varchar(2), enabled varchar(2), first_name varchar(255), last_name varchar(255), created_date timestamp, realm_id varchar(255), service_account_client_link varchar(255)) options ( uris = ['gs://quester-d-lan-keycloak-is/kubernetes/prod/quester-gke-prod/postgres-backup-jobs/postgres-prod/cluster-postgres-prod/user_entity_refresh_220622T04_00_05.csv'], format = 'csv', skip_leading_rows = 1)
153953.939308 [debug] [MainThread]: BigQuery adapter: Retry attempt 1 of 1 after error: BadRequest('Type not found: varchar at [6:20]')
153955.379302 [debug] [MainThread]: BigQuery adapter: Unhandled error while running: macro stage_external_sources
153955.380556 [debug] [MainThread]: BigQuery adapter: Database Error Type not found: varchar at [6:20]
153955.382140 [error] [MainThread]: Encountered an error while running operation: Database Error Type not found: varchar at [6:20]
153955.383299 [debug] [MainThread]: Traceback (most recent call last): File "/project/.meltano/transformers/dbt-bigquery/venv/lib/python3.8/site-packages/dbt/adapters/bigquery/connections.py", line 174, in exception_handler yield File "/project/.meltano/transformers/dbt-bigquery/venv/lib/python3.8/site-packages/dbt/adapters/bigquery/connections.py", line 549, in _retry_and_handle return retry.retry_target( File "/project/.meltano/transformers/dbt-bigquery/venv/lib/python3.8/site-packages/google/api_core/retry…
```
Looks like my sources.yml should have String instead of VARCHAR?
a
Use string. Notice that example also has variant, which is Snowflake-specific, so it's definitely wrong.
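So the columns in your sources.yml would become something like this (just an excerpt, swapping varchar for BigQuery's string type):
```
columns:
  - name: id
    data_type: string
    description: "ID"
  - name: email_verified
    data_type: string
    description: "t for verified, f means not verified"
  - name: created_date
    data_type: timestamp
    description: "created date"
```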
h
I modified the example to my CSV schema, so no variant. Check the sources.yml I sent you earlier. Should I change varchar to string? (At least that is what BQ takes, I think.)
a
I know, I was just making a point that THEIR BigQuery example has variant, which is obviously wrong 😉
This example needs a PR for sure.
h
Ok, got it working. Thanks ever so much @alexander_butler
varchar should be string
a
No problem! 💪
h
@alexander_butler Lastly, does the config here run automatically? or do I have to chain the commands?
plugins:
  transformers:
  - name: dbt-bigquery
    variant: dbt-labs
    pip_url: dbt-core~=1.0.0 dbt-bigquery~=1.0.0
    commands:
      stage-gcs:
        args: run-operation stage_external_sources
        description: Stage external sources in Google Cloud Storage
      stage-gcs-refresh:
        args: "run-operation stage_external_sources --vars 'ext_full_refresh: true'"
        description: Stage external sources in Google Cloud Storage
a
Run it like this: meltano run dbt-bigquery:stage-gcs dbt-bigquery:run
Obviously throw in any EL tap->target blocks before it if needed. Then schedule it in Airflow, cron, or w/e.
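If you go the scheduled route inside Meltano, a schedule tied to a job (like the one sketched earlier) could look roughly like this (names and interval are placeholders; exact keys depend on your Meltano version):
```
schedules:
  - name: stage-and-transform-daily    # hypothetical name
    interval: '@daily'
    job: stage-and-transform           # references the hypothetical job above
```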
h
Great, thanks once again. I was thinking of using a Cloud Function listening to the bucket -> triggers the Cloud Run API instead. More like event-driven in a serverless way. I found an example of a Cloud Function being able to trigger Cloud Build.
Or GCS -> Cloud Function -> Cloud Build -> Cloud Run (if I put Meltano into Docker).
a
Well, consider that once a bucket is staged as an external table with a glob pattern, it doesn't need to keep being run. And yeah, if you build a Meltano image, you can trigger the dbt run off a bucket trigger. However, the big if is identifying your Service Level Objectives concretely and weighing them against the costs of your implementation. Specifically, I would be worried about excessive dbt runs creating table materializations and running up the cloud bill if the bucket update frequency is too high.
h
I think at the moment the CSV gets pushed to GCS once a day, but the timing is a bit ad hoc (not sure why, somewhere between 3am and 7am). I think some script gets kicked off after some other process, i.e. after some chain of processes finishes.
a
Yeah, that's a fair frequency for Cloud Run given how you described it. I'm a fan of serverless too, so that checks out. If it was updated every 10 minutes, it would obviously be another story.
h
Hi @alex (harness.io) is there a way to pass an environment variable into the sources.yml file? I want to be able to pass a GCS location dynamically via an environment variable $GCS_LOCATION, like below:
sources:
  - name: keycloak
    # database == project_id in bigquery which is already set in profiles.yml
    # database: pre_arc_keycloak_is
    schema: pre_arc_keycloak_is
    loader: gcloud storage
  
    tables:
      - name: user_entity_gcs
        description: "External table of keycloak user_entity, stored as CSV files in Cloud Storage"
        external:
          location: $GCS_LOCATION
a
sources:
  - name: keycloak
    # database == project_id in bigquery which is already set in profiles.yml
    # database: pre_arc_keycloak_is
    schema: pre_arc_keycloak_is
    loader: gcloud storage
  
    tables:
      - name: user_entity_gcs
        description: "External table of keycloak user_entity, stored as CSV files in Cloud Storage"
        external:
          location: "{{ env_var('GCS_LOCATION', 'defaultLocationHere') }}"
https://docs.getdbt.com/reference/dbt-jinja-functions/env_var
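And on the Meltano side, one place to set that variable is an env block in meltano.yml, e.g. per environment (the value below is a placeholder):
```
environments:
  - name: dev
    env:
      GCS_LOCATION: gs://my-bucket/path/to/file.csv   # placeholder value
```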
h
Great, thanks
Hi @alexander_butler, sorry to disturb you again but I cannot seem to get rid of this error
[WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.meltano.2_current_copy.selective_daily_registers
a
I think you might wanna dig through the dbt docs, please. Most of this stuff is just self-education on there. That being said, the configuration path is basically models.[DBT PROJECT NAME].[FOLDER1].[FOLDER2]. Find your dbt project name at the top of the dbt_project.yml.
The folder names in the above example come from folders in the models/ directory.
So the above config is looking for a folder named 2_current_copy with a model or source in it named whatever that last part is.
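Concretely, the mapping looks something like this in dbt_project.yml (the project name is taken from the warning above; the +materialized setting is just an illustrative config value):
```
name: meltano

models:
  meltano:
    2_current_copy:                  # must match the folder name under models/ exactly
      selective_daily_registers:
        +materialized: table         # illustrative; your actual config may differ
```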
h
Thanks for that. I did read the docs but did not realise that my directory had a typo, i.e. 2_current-copy, while in the config it's 2_current_copy (notice the - vs _).
Spent ages trying to work out the error when I thought everything looked correct.
Schoolboy error lol...
Hi @alexander_butler, I wonder whether you ever had this problem. I am running meltano invoke dbt-bigquery --debug run-operation stage_external_sources. It was all working fine, then suddenly today it stopped working and I am getting this error:
12:18:48.538921 [debug] [MainThread]: Acquiring new bigquery connection "macro_stage_external_sources"
12:18:48.603770 [info ] [MainThread]: 1 of 1 START external source pre_arc_keycloak_is.user_entity_gcs
12:18:48.625276 [debug] [MainThread]: On "macro_stage_external_sources": cache miss for schema "{self.database}.{self.schema}", this is inefficient
12:18:48.627855 [debug] [MainThread]: Opening a new connection, currently in state init
12:18:49.507631 [info ] [MainThread]: 1 of 1 SKIP
12:18:49.511449 [debug] [MainThread]: Flushing usage events
12:18:49.511920 [debug] [MainThread]: Connection 'macro_stage_external_sources' was properly closed.
On "macro_stage_external_sources": cache miss for schema "{self.database}.{self.schema}", this is inefficient
What does this mean?
a
Not sure, I don't run in debug very often unless something is broken. Looks safe to ignore, really. dbt creates a cache of database schemas/tables in its adapter, that's as much as I know.
h
Where is the cache for the adapter btw?
One more thing, do the *.lock files have to be checked into git? What are they used for?
a
It's cached in memory at parse time, possibly in a msgpack file too. This is entirely dbt-related and has nothing to do with Meltano... Yeah, lock files are a Meltano thing and should be checked into git. They allow deterministic installs, that's the idea.
h
Thanks