# plugins-general
h
Hi everyone, I was just wondering: 1. Is there an extractor/loader plugin that loads data from GCS into BigQuery as an external BQ table? 2. Is there an extractor plugin that allows me to run bash scripts? I have been doing a lot of searching but could not find anything, or the documentation is not clear. Thanks
v
1. I don't know, you'd have to look at the implementations of target-bigquery here https://hub.meltano.com/loaders/target-bigquery to see if one of them loads it as an external BQ table. I'd assume not, as why wouldn't we just load the data into BigQuery if we have all the data? Maybe you could expand on your use case more?
2. No there isn't. If you really wanted to do it in Meltano then I'd say write a utility plugin https://hub.meltano.com/. What I do is, however I'm running Meltano (GitHub Actions, GitLab CI, SQL Agent runs (PowerShell, Batch, etc.), Windows Task Scheduler), I tend to write some bash to get things to happen the way I want, and then I put those definitions in ./orchestrate
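For what it's worth, a bare-bones sketch of what a bash-wrapping utility could look like in meltano.yml (the plugin name and script path are made up, not an existing Hub plugin, and depending on your Meltano version you may also need a pip_url):
```
plugins:
  utilities:
    - name: my-scripts              # hypothetical custom utility
      namespace: my_scripts
      executable: bash
      commands:
        refresh:
          args: ./orchestrate/refresh.sh   # hypothetical script
```
Then you could call it with something like meltano invoke my-scripts:refresh and chain it in meltano run.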
a
I gotcha, I use external stages for this.
First: use https://github.com/dbt-labs/dbt-external-tables
Second: add these 2 commands to dbt-bigquery in your meltano.yml
plugins:
  transformers:
    - name: dbt-bigquery
      pip_url: dbt-core~=1.1.0 dbt-bigquery~=1.1.0
      commands:
        stage-gcs:
          args: run-operation stage_external_sources
          description: Stage external sources in Google Cloud Storage
        stage-gcs-refresh:
          args: "run-operation stage_external_sources --vars 'ext_full_refresh: true'"
          description: Stage external sources in Google Cloud Storage
Third: Add a job with dbt-bigquery:stage-gcs
Take this pattern and adjust how you need to.
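For example, a job wiring that command in might look roughly like this in meltano.yml (job name is made up; check the Meltano docs for the exact jobs syntax on your version):
```
jobs:
  - name: stage-and-transform          # hypothetical name
    tasks:
      - dbt-bigquery:stage-gcs dbt-bigquery:run
```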
h
@alexander_butler Thanks for that. I will certainly look at this.
Hi @visch, I have also looked at https://hub.meltano.com/loaders/target-bigquery but there is no option for extracting from GCS -> BQ. Forgive me as I am new to Meltano, so I might be missing something. I have seen the example tap-s3-csv -> target-bigquery, but nothing for GCS -> target-bigquery, which seems odd.
v
Probably need a better way to discover this on the hub: tap-spreadsheets-anywhere basically handles a ton of those cases (s3-csv, s3-json, gcs-csv, gcs-json, etc.).
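Roughly, the config shape in meltano.yml looks like this for a GCS bucket (the bucket, stream name and key below are placeholders; check the tap's README for the full option list):
```
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      config:
        tables:
          - path: gs://my-bucket/exports        # placeholder bucket/prefix
            name: user_entity                   # placeholder stream name
            pattern: ".*\\.csv"
            start_date: "2022-01-01T00:00:00Z"
            key_properties: [id]
            format: csv
```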
h
Yes, the name sounds a bit misleading, as I was thinking it's for spreadsheets only lol.. thanks again.
Sorry @visch, one last question: how do you pass GCS JSON key credentials to tap-spreadsheets-anywhere?
I can use the GOOGLE_APPLICATION_CREDENTIALS environment variable.
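In meltano.yml that could look something like this (the key file path is a placeholder):
```
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      env:
        GOOGLE_APPLICATION_CREDENTIALS: ${MELTANO_PROJECT_ROOT}/secrets/gcs-key.json   # placeholder path
```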
Hi @alexander_butler, I am trying your method first. Got stuck when trying to install the dependencies. I created a file $MELTANO_ROOT/transform/dbt_packages/packages.yml and added this
packages:
- package: dbt-labs/dbt_external_tables
  version: 0.8.0
ran meltano invoke dbt-bigquery:deps but got no packages found.
a
Yo, the packages.yml should be in the root of the dbt dir, same as dbt_project.yml. The install_path is where dbt puts the installed packages, not where you need to put the yaml.
h
Thanks, where do I put the source.yml file?
c
Your dbt schema yml files (e.g. sources.yml) can go anywhere under transform/models
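Putting the last few answers together, the layout ends up roughly like this (assuming the default Meltano transform directory):
```
transform/
├── dbt_project.yml
├── packages.yml        # packages.yml lives here, next to dbt_project.yml
├── models/
│   └── sources.yml     # schema/source yml files anywhere under models/
└── dbt_packages/       # install_path: dbt puts installed packages here
```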
h
I put the sources.yml in transform/models and added this to the meltano.yml
plugins:
  transformers:
    - name: dbt-bigquery
      pip_url: dbt-core~=1.1.0 dbt-bigquery~=1.1.0
      commands:
        stage-gcs:
          args: run-operation stage_external_sources
          description: Stage external sources in Google Cloud Storage
        stage-gcs-refresh:
          args: "run-operation stage_external_sources --vars 'ext_full_refresh: true'"
          description: Stage external sources in Google Cloud Storage
I ran meltano --environment=dev run dbt-bigquery:run but it seems nothing is picked up. My other transform models are called.
a
Did you run the stage command meltano --environment=dev run dbt-bigquery:stage-gcs??
h
@alexander_butler Thanks, I just tried it and got an error with my config: the project_id is wrong. Let me fix this and see what happens.
The strange error is: Invalid resource name projects/pre_arc_keycloak_is; Project id: pre_arc_keycloak_is cmd_type=command name=dbt-bigquery stdio=stderr
a
Paste your source yaml maybe?
h
Where do you set the project_id?
a
It's set in profiles.yml
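A sketch of the bigquery output in transform/profiles/bigquery/profiles.yml (values are placeholders; the auth method depends on your setup):
```
meltano:
  target: bigquery
  outputs:
    bigquery:
      type: bigquery
      method: oauth              # or service-account + keyfile
      project: my-gcp-project    # <- the project_id ("database" in dbt terms)
      dataset: my_dataset
      threads: 4
```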
h
source.yml:
version: 2

sources:
  - name: keycloak
    database: pre_arc_keycloak_is
    loader: gcloud storage
  
    tables:
      - name: user_entity_gcs
        description: "External table of keycloak user_entity, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://quester-d-lan-keycloak-is/kubernetes/prod/quester-gke-prod/postgres-backup-jobs/postgres-prod/cluster-postgres-prod/user_entity_refresh_220622T04_00_05.csv'
          options:
            format: csv
            skip_leading_rows: 1
          
            # if you want a partitioned table, file paths MUST be Hive-style: 
            #   'gs://bucket/path/collector_hour=2020-01-01/'
            #   'gs://bucket/path/collector_hour=2020-01-02/' (etc)
            # hive_partition_uri_prefix: 'gs://bucket/path/'
          # partitions:               
          #   - name: collector_date
          #     data_type: date
        
        columns:
          - name: id
            data_type: varchar(255)
            description: "ID"
          - name: username
            data_type: varchar(255)
            description: "Username"
          - name: email
            data_type: varchar(255)
            description: "email"
          - name: email_constraint
            data_type: varchar(255)
            description: "email"
          - name: email_verified
            data_type: varchar(2)
            description: "t for verified, f means not verified"
          - name: enabled
            data_type: varchar(2)
            description: "t account enaabled, f account not enabled"
          - name: first_name
            data_type: varchar(255)
            description: "first_name"
          - name: last_name
            data_type: varchar(255)
            description: "last_name"
          - name: created_date
            data_type: timestamp
            description: "created date"
          - name: realm_id
            data_type: varchar(255)
            description: "Keycloak realm"
          - name: service_account_client_link
            data_type: varchar(255)
            description: "the service account link"
a
Drop the database key in the above yaml, maybe?
database == project_id in BigQuery, which is already set in profiles.yml
h
I got a different error now: 2022-06-23T153318.209199Z [error ] Block run completed. block_type=InvokerCommand err=RunnerError('`dbt-bigquery stage-gcs` failed with exit code: 1') exit_codes={} set_number=0 success=False Run invocation could not be completed as block failed: `dbt-bigquery stage-gcs` failed with exit code: 1.
How do you turn on debug?
In the sources.yml, do I need to set the dataset somewhere?
a
You set the schema key in the source.yml, otherwise the dataset should be the same as the source name, in this case keycloak.
Verbose dbt arg: meltano --environment=dev invoke dbt-bigquery --debug run-operation stage_external_sources
Plus the Meltano verbose flag: meltano --environment=dev --log-level=debug invoke dbt-bigquery --debug run-operation stage_external_sources
h
Thanks running it now
```
============================== 2022-06-23 153947.704758 | f8497e5d-44ae-4e59-b10a-bde98bb71864 ==============================
153947.704758 [info ] [MainThread]: Running with dbt=1.0.8
153947.706271 [debug] [MainThread]: running dbt with arguments Namespace(args='{}', cls=<class 'dbt.task.run_operation.RunOperationTask'>, debug=True, defer=None, event_buffer_size=None, fail_fast=None, log_cache_events=False, log_format=None, macro='stage_external_sources', partial_parse=None, printer_width=None, profile=None, profiles_dir='/project/transform/profiles/bigquery', project_dir=None, record_timing_info=None, rpc_method='run-operation', send_anonymous_usage_stats=None, single_threaded=False, state=None, static_parser=None, target=None, use_colors=None, use_experimental_parser=None, vars='{}', version_check=None, warn_error=None, which='run-operation', write_json=None)
153947.708422 [debug] [MainThread]: Tracking: do not track
153952.646894 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
153952.647863 [debug] [MainThread]: Partial parsing enabled, no changes found, skipping parsing
153952.649180 [warn ] [MainThread]: [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources. There are 1 unused configuration paths: - models.entity_user
153952.695760 [debug] [MainThread]: Acquiring new bigquery connection "macro_stage_external_sources"
153952.744671 [info ] [MainThread]: 1 of 1 START external source pre_arc_keycloak_is.user_entity_gcs
153952.758291 [debug] [MainThread]: On "macro_stage_external_sources": cache miss for schema "{self.database}.{self.schema}", this is inefficient
153952.760583 [debug] [MainThread]: Opening a new connection, currently in state init
153953.533291 [info ] [MainThread]: 1 of 1 (1) create or replace external table `quester-d-keycloak-tds-f513`.`pre_arc_keycloak...
153953.545568 [debug] [MainThread]: On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.0.8", "profile_name": "meltano", "target_name": "bigquery", "connection_name": "macro_stage_external_sources"} */ create or replace external table `quester-d-keycloak-tds-f513`.`pre_arc_keycloak_is`.`user_entity_gcs`( id varchar(255), username varchar(255), email varchar(255), email_constraint varchar(255), email_verified varchar(2), enabled varchar(2), first_name varchar(255), last_name varchar(255), created_date timestamp, realm_id varchar(255), service_account_client_link varchar(255)) options ( uris = ['gs://quester-d-lan-keycloak-is/kubernetes/prod/quester-gke-prod/postgres-backup-jobs/postgres-prod/cluster-postgres-prod/user_entity_refresh_220622T04_00_05.csv'], format = 'csv', skip_leading_rows = 1)
153953.939308 [debug] [MainThread]: BigQuery adapter: Retry attempt 1 of 1 after error: BadRequest('Type not found: varchar at [6:20]')
153955.379302 [debug] [MainThread]: BigQuery adapter: Unhandled error while running: macro stage_external_sources
153955.380556 [debug] [MainThread]: BigQuery adapter: Database Error Type not found: varchar at [6:20]
153955.382140 [error] [MainThread]: Encountered an error while running operation: Database Error Type not found: varchar at [6:20]
153955.383299 [debug] [MainThread]: Traceback (most recent call last): File "/project/.meltano/transformers/dbt-bigquery/venv/lib/python3.8/site-packages/dbt/adapters/bigquery/connections.py", line 174, in exception_handler yield File "/project/.meltano/transformers/dbt-bigquery/venv/lib/python3.8/site-packages/dbt/adapters/bigquery/connections.py", line 549, in _retry_and_handle return retry.retry_target( File "/project/.meltano/transformers/dbt-bigquery/venv/lib/python3.8/site-packages/google/api_core/retry…
```
Looks like my sources.yml should have String instead of VARCHAR?
a
Use string. Notice that example also has variant, which is Snowflake-specific, so it's definitely wrong.
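So the columns in your sources.yml would become something like this (just an excerpt, swapping varchar for BigQuery's string type):
```
columns:
  - name: id
    data_type: string
    description: "ID"
  - name: email_verified
    data_type: string
    description: "t for verified, f means not verified"
  - name: created_date
    data_type: timestamp
    description: "created date"
```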
h
I modified the example to my CSV schema, so no variant. Check the sources.yml I sent you earlier. Should I change varchar to string? (At least that is what BQ takes, I think.)
a
I know, I was just making a point that THEIR BigQuery example has variant, which is obviously wrong 😉
This example needs a PR for sure.
h
Ok, got it working. Thanks ever so much @alexander_butler
varchar should be string
a
No problem! 💪
h
@alexander_butler Lastly, does the config here run automatically? or do I have to chain the commands?
plugins:
  transformers:
  - name: dbt-bigquery
    variant: dbt-labs
    pip_url: dbt-core~=1.0.0 dbt-bigquery~=1.0.0
    commands:
      stage-gcs:
        args: run-operation stage_external_sources
        description: Stage external sources in Google Cloud Storage
      stage-gcs-refresh:
        args: "run-operation stage_external_sources --vars 'ext_full_refresh: true'"
        description: Stage external sources in Google Cloud Storage
a
Run it like this: meltano run dbt-bigquery:stage-gcs dbt-bigquery:run
Obviously throw in any EL tap->target blocks before it if needed. Then schedule it in Airflow, cron, or w/e.
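If you go the scheduled route inside Meltano, a schedule tied to a job (like the one sketched earlier) could look roughly like this (names and interval are placeholders; exact keys depend on your Meltano version):
```
schedules:
  - name: stage-and-transform-daily    # hypothetical name
    interval: '@daily'
    job: stage-and-transform           # references the hypothetical job above
```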
h
Great, thanks once again. I was thinking of using a Cloud Function listening to the bucket -> triggers the Cloud Run API instead. More like event-driven in a serverless way. I found an example of a Cloud Function being able to trigger Cloud Build.
Or GCS -> Cloud Function -> Cloud Build -> Cloud Run (if I put Meltano into Docker).
a
Well, consider that once a bucket is staged as an external table with a glob pattern, it doesn't need to keep being run. And yeah, if you build a Meltano image, you can trigger the dbt run off a bucket trigger. However, the big if is identifying your Service Level Objectives concretely and weighing them against the costs of your implementation. Specifically, I would be worried about excessive dbt runs creating table materializations and running up the cloud bill if the bucket update frequency is too high.
h
I think at the moment the CSV gets pushed to GCS once a day, but the timing is a bit ad hoc (not sure why, somewhere between 3am and 7am). I think some script gets kicked off after some other process, i.e. after some chain of processes finishes.
a
Yeah, that's a fair frequency for Cloud Run given how you described it. I'm a fan of serverless too, so that checks out. If it was updated every 10 minutes, it would obviously be another story.
h
Hi @alex (harness.io) is there a way to pass an environment variable into the sources.yml file? I want to be able to pass a GCS location dynamically via an environment variable $GCS_LOCATION, like below:
sources:
  - name: keycloak
    # database == project_id in bigquery which is already set in profiles.yml
    # database: pre_arc_keycloak_is
    schema: pre_arc_keycloak_is
    loader: gcloud storage
  
    tables:
      - name: user_entity_gcs
        description: "External table of keycloak user_entity, stored as CSV files in Cloud Storage"
        external:
          location: $GCS_LOCATION
a
sources:
  - name: keycloak
    # database == project_id in bigquery which is already set in profiles.yml
    # database: pre_arc_keycloak_is
    schema: pre_arc_keycloak_is
    loader: gcloud storage
  
    tables:
      - name: user_entity_gcs
        description: "External table of keycloak user_entity, stored as CSV files in Cloud Storage"
        external:
          location: "{{ env_var('GCS_LOCATION', 'defaultLocationHere') }}"
https://docs.getdbt.com/reference/dbt-jinja-functions/env_var
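And on the Meltano side, one place to set that variable is an env block in meltano.yml, e.g. per environment (the value below is a placeholder):
```
environments:
  - name: dev
    env:
      GCS_LOCATION: gs://my-bucket/path/to/file.csv   # placeholder value
```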
h
Great, thanks
Hi @alexander_butler, sorry to disturb you again but I cannot seem to get rid of this error
[WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.meltano.2_current_copy.selective_daily_registers
a
I think you might wanna dig through the dbt docs, please. Most of this stuff is just self-education on there. That being said, the configuration path is basically models.[DBT PROJECT NAME].[FOLDER1].[FOLDER2]. Find your dbt project name at the top of the dbt_project.yml.
The folder names in the above example come from folders in the models/ directory.
So the above config is looking for a folder named 2_current_copy with a model or source in it named whatever that last part is.
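Concretely, the mapping looks something like this in dbt_project.yml (the project name is taken from the warning above; the +materialized setting is just an illustrative config value):
```
name: meltano

models:
  meltano:
    2_current_copy:                  # must match the folder name under models/ exactly
      selective_daily_registers:
        +materialized: table         # illustrative; your actual config may differ
```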
h
Thanks for that. I did read the docs but did not realise that my directory had a typo, i.e. 2_current-copy, while in the config it's 2_current_copy (notice the - vs _).
Spent ages trying to work out the error when I thought everything looked correct.
Schoolboy error lol...
Hi @alexander_butler, I wonder whether you ever had this problem. I am running meltano invoke dbt-bigquery --debug run-operation stage_external_sources. It was all working fine, then suddenly today it stopped working and I am getting this error:
12:18:48.538921 [debug] [MainThread]: Acquiring new bigquery connection "macro_stage_external_sources"
12:18:48.603770 [info ] [MainThread]: 1 of 1 START external source pre_arc_keycloak_is.user_entity_gcs
12:18:48.625276 [debug] [MainThread]: On "macro_stage_external_sources": cache miss for schema "{self.database}.{self.schema}", this is inefficient
12:18:48.627855 [debug] [MainThread]: Opening a new connection, currently in state init
12:18:49.507631 [info ] [MainThread]: 1 of 1 SKIP
12:18:49.511449 [debug] [MainThread]: Flushing usage events
12:18:49.511920 [debug] [MainThread]: Connection 'macro_stage_external_sources' was properly closed.
On "macro_stage_external_sources": cache miss for schema "{self.database}.{self.schema}", this is inefficient
What does this mean?
a
Not sure, I don't run in debug very often unless something is broken. Looks safe to ignore, really. dbt creates a cache of database schemas/tables in its adapter, that's as much as I know.
h
Where is the cache for the adapter btw?
One more thing, do the *.lock files have to be checked into git? What are they used for?
a
It's cached in memory at parse time, possibly in a msgpack file too. This is entirely dbt-related and has nothing to do with Meltano... Yeah, lock files are a Meltano thing and should be checked into git. They allow deterministic installs, that's the idea.
h
Thanks