# troubleshooting
j
Hi, what is the expected behavior when there is a schema change for a table (added field) being sent to `target-bigquery`? According to this thread, it should handle the schema change, but I am getting `CRITICAL 400 Provided Schema does not match Table hubspot_prod.contacts. Cannot add fields (field: property_hs_date_entered_customer)`. I’m using `tap-hubspot` and it worked earlier this morning (I think Hubspot must have added new fields midday?), and I even ran `meltano install loaders` to make sure I had the latest BigQuery loader. I also have the loader set to `truncate`. The `select` in my `meltano.yml` is set to `contacts.*`, and I’m hoping I don’t need to use a catalog that fixes the schema in time. Thanks! https://meltano.slack.com/archives/CMN8HELB0/p1622053388058800?thread_ts=1622044438.058200&cid=CMN8HELB0
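For context, a `select` rule like that lives under the extractor definition in `meltano.yml`. A minimal illustrative sketch — only `tap-hubspot`, `target-bigquery`, `contacts.*`, the `truncate` setting, and the `hubspot_prod` dataset come from this thread; the exact keys and layout are assumptions, not a verified config:

```yaml
# meltano.yml (illustrative sketch, not a verified config)
plugins:
  extractors:
    - name: tap-hubspot
      select:
        - contacts.*              # select every property on the contacts stream
  loaders:
    - name: target-bigquery
      config:
        dataset_id: hubspot_prod  # assumed; matches the dataset in the error
        replication_method: truncate
```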
r
Hi @joe_bossalini, unfortunately target-bigquery doesn't support handling schema changes like adding a new field. The expected behavior is that it breaks with the error above. A workaround could be to delete the old table and do a backfill (I understand that this is inconvenient and inefficient).
Adswerve considered adding this feature to target-bigquery, but chose not to. Our rationale is in this thread in the singer.io Slack workspace.
j
Ah! Thank you for the quick response! Not what I had hoped to hear, but I’ll move forward with either manually adding the field to BQ or configure meltano to use a fixed schema. Thanks again!
r
Sorry for the inconvenience @joe_bossalini
j
no worries!
a
Thanks for sharing this thread. I will say that this is atypical amongst database targets. Given an expectation that the schema almost never changes, and that human cycles are available to remedy it, I can see the argument from the other side. But I think the expectation should be that taps will frequently add columns to their schemas, and we should strive for the wider EL community (even beyond Meltano/Singer) to treat column additions as a non-breaking change. For the time being, I agree the workarounds should be reasonable, but this target would likely be non-viable for a source like Salesforce, which has user-driven schema changes constantly. I'll open a new issue in the repo to see if we can make a case to reconsider. Cc @amanda.folson @taylor
@ruslan_bergenov - Do you know if there are significant performance hits to adding new columns in BigQuery? Or is it more related to complexity? (Many databases can make column additions almost instantly, but I don't know whether the same applies to BigQuery.)
j
@ruslan_bergenov I just noticed that Meltano was not running with the latest version of the target! The discovery catalog is hardcoded with an old version, `0.10.2`: https://gitlab.com/meltano/meltano/blob/master/src/meltano/core/bundle/discovery.yml#L1558 I removed the version number from my `meltano.yml`, re-ran the loader install, and now presumably have the most recent version. I’ve not done anything other than that, and the load is now progressing! It looks like it has dropped that table and is reloading from scratch, but maybe that is a side effect of reinstalling the plugin — I can live with that.
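For anyone hitting the same pin: a version constraint in the plugin's `pip_url` will keep `meltano install` on the old release, and removing it lets pip resolve the latest one. A hypothetical before/after (the exact pin syntax here is illustrative, not copied from the bundled discovery.yml):

```yaml
# Before (pinned to the old release -- illustrative):
loaders:
  - name: target-bigquery
    pip_url: target-bigquery==0.10.2

# After (unpinned, so `meltano install loaders` pulls the latest release):
loaders:
  - name: target-bigquery
    pip_url: target-bigquery
```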
r
@aaronsteers, I'm not sure. I don't think there should be a significant performance hit. We already handled this situation (adding new columns, changing schema) in our other code, separate from target-bigquery, which we didn't open source. If this feature is very important to you to have in target-bigquery, let me pitch it to our senior engineers again to see if we should add it to target-bigquery roadmap (I can't promise anything though, as it depends on our general workload).
@joe_bossalini, fingers crossed it works now. Yes, it should load the latest release now. You can also point to the latest explicitly: https://github.com/adswerve/target-bigquery/releases/tag/0.11.1
@joe_bossalini, when you run your sync, do you pass a tap-catalog file? Or do you do discovery and generate a new catalog file each time you run your sync?
j
Not passing a tap-catalog. I’m thinking that is the route I should take, and then handle any column changes manually
r
We recommend passing a tap-catalog file. However, it's up to you; there's no right or wrong answer. We personally pass a tap-catalog, as it helps us prevent schema changes (because the tap-catalog file governs the schema). However, for some taps it doesn't work: their schema changes anyway, and the tap doesn't "listen" to what we specified in the tap-catalog.
@aaronsteers, @joe_bossalini We were previously reluctant to add a new feature to Adswerve target-bigquery: being able to add a new column during the append operation. We discussed this feature again, and we're willing to meet you halfway, because we understand that this feature is important to you. However, we don't recommend supporting the following operations:
• Adding or deleting a nested field (we have concerns about performance)
• Deleting any field/column, whether at the top or nested level (this might compromise the integrity of the data)
• Changing a data type
We are willing to support adding a new column at the top level during the append operation. We can do it with a table patch method. The steps would be:
1. Compare the destination table schema and the new data schema.
2. Patch the destination table, adding the new column at the top level only.
Question: if we were to support adding a new column at the top level, would that solve about 80% of your issues with taps changing their schema? Potential next step: we could build this feature ourselves; we did some preliminary POC development.
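The two steps above could be sketched roughly as follows. The `diff_top_level` helper is pure Python and purely illustrative; the commented-out patch step uses the google-cloud-bigquery client's `get_table`/`update_table` calls, but the table name, field type, and `incoming_names` variable are hypothetical, and this is not target-bigquery's actual code:

```python
def diff_top_level(dest_fields, incoming_fields):
    """Step 1: find incoming top-level column names missing from the destination.

    Both arguments are iterables of top-level column names; the result keeps
    the incoming order so the patch is deterministic.
    """
    existing = set(dest_fields)
    return [name for name in incoming_fields if name not in existing]


# Step 2 (sketch only -- needs google-cloud-bigquery and real credentials):
#
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   table = client.get_table("hubspot_prod.contacts")   # hypothetical table
#   new_schema = list(table.schema)
#   for name in diff_top_level([f.name for f in table.schema], incoming_names):
#       new_schema.append(bigquery.SchemaField(name, "STRING", mode="NULLABLE"))
#   table.schema = new_schema
#   client.update_table(table, ["schema"])  # patches only the schema field


if __name__ == "__main__":
    dest = ["id", "email"]
    incoming = ["id", "email", "property_hs_date_entered_customer"]
    print(diff_top_level(dest, incoming))
    # → ['property_hs_date_entered_customer']
```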
My coworker just pointed out that if we do end up adding columns in target-bigquery, this might be an easier approach than checking schemas and sending a patch request:
```python
job_config.schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]
```
we were able to add a new column during the append operation in our POC tests, but only at the top level
@joe_bossalini, @aaronsteers, sorry for the delay; wanted to follow up on the target-bigquery feature that you requested: the ability to add a column during the append operation. In the example below, we were able to add the `addresses` field, but we were not able to add something like `addresses.description` to the existing `addresses` field. Would it work for you if we supported adding a column at the top level (not the nested sub-fields)? Would this cover most of your use cases?
For example, here's the destination table schema before appending (screenshot), and here's the destination table schema after appending (screenshot).
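The top-level-only restriction described above can be illustrated with a small sketch. This is pure Python, not target-bigquery code: new top-level keys are merged into the schema, while a dotted path into an existing record (such as `addresses.description`) is rejected, mirroring the limitation discussed in this thread.

```python
def merge_top_level(schema, new_fields):
    """Return a copy of `schema` with unseen *top-level* fields added.

    `schema` maps column name -> type string. Dotted names such as
    "addresses.description" denote nested sub-fields and are rejected,
    since only top-level additions would be supported.
    """
    merged = dict(schema)
    for name, ftype in new_fields.items():
        if "." in name:
            raise ValueError(f"nested additions unsupported: {name}")
        merged.setdefault(name, ftype)  # never overwrite an existing column
    return merged


if __name__ == "__main__":
    schema = {"id": "STRING", "addresses": "RECORD"}
    # Adding a new top-level column works:
    print(merge_top_level(schema, {"phone": "STRING"}))
    # Adding a nested sub-field does not:
    try:
        merge_top_level(schema, {"addresses.description": "STRING"})
    except ValueError as exc:
        print(exc)
```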