nil
12/30/2020, 6:48 PM `replication_method: truncate`
but I can still see that the data in BigQuery contains duplicate entries (rows) or multiple versions of a row.
Is my understanding correct that "truncate" will upsert/merge (no duplicate rows), whereas "append" will insert data at the end of the table (creating duplicate rows)?
douwe_maan
12/30/2020, 8:55 PM `replication_method` setting here: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/__init__.py#L45-L47 and uses it here: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/processhandler.py#L203-L208 to tell BigQuery to use the TRUNCATE `WriteDisposition`, which works as expected: https://cloud.google.com/bigquery/docs/reference/auditlogs/rest/Shared.Types/WriteDisposition. You should see "Load {table_name} by FULL_TABLE (truncate)" logged as well.
However, when that `_load_to_bq` method is called from `_do_temp_table_based_load`, it actually only loads the data into a temporary table: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/processhandler.py#L136, and then copies it into the "real" table using `APPEND`: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/processhandler.py#L152
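To make the effect concrete, here is a toy in-memory simulation of that two-step flow. It is only a sketch and does not touch BigQuery or target-bigquery: `run_sync` and the sample rows are made up for illustration, and the tables are plain Python lists.

```python
def run_sync(real_table, batch, copy_disposition):
    """Simulate one sync: load a batch into a temp table, then copy it into the real table."""
    temp_table = list(batch)  # the temp table only ever holds the current batch
    if copy_disposition == "WRITE_TRUNCATE":
        return temp_table  # the copy replaces the real table's contents
    return real_table + temp_table  # WRITE_APPEND keeps whatever was already there


rows = [{"id": 1}, {"id": 2}]

# Sync the same two rows twice with each copy disposition.
appended = run_sync(run_sync([], rows, "WRITE_APPEND"), rows, "WRITE_APPEND")
truncated = run_sync(run_sync([], rows, "WRITE_TRUNCATE"), rows, "WRITE_TRUNCATE")
print(len(appended), len(truncated))  # 4 2
```

With an appending copy, every run adds the same rows again, which would match the duplicates described in the question.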
So this looks to me like it could be a bug, and I think this last call site should use the same `truncate=self.truncate if stream not in self.partially_loaded_streams else False` logic used in the loop that calls `_load_to_bq`. If you make that change to your locally installed target-bigquery, does it show the expected truncating behavior?
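As a rough standalone sketch of that logic (not the actual target-bigquery code): the function name `copy_write_disposition`, the `orders` stream, and the step that records a stream as partially loaded are assumptions for illustration, and the dispositions are plain strings that mirror the values of `google.cloud.bigquery.WriteDisposition`.

```python
WRITE_TRUNCATE = "WRITE_TRUNCATE"  # same string values as google.cloud.bigquery.WriteDisposition
WRITE_APPEND = "WRITE_APPEND"


def copy_write_disposition(truncate, stream, partially_loaded_streams):
    """Pick the disposition for copying a stream's temp table into the real table."""
    # Truncate only when truncation is enabled and this stream has not
    # already been partially loaded during the current run.
    if truncate and stream not in partially_loaded_streams:
        return WRITE_TRUNCATE
    return WRITE_APPEND


partially_loaded = set()
first = copy_write_disposition(True, "orders", partially_loaded)
partially_loaded.add("orders")  # assumption: the stream is recorded after its first copy
second = copy_write_disposition(True, "orders", partially_loaded)
print(first, second)  # WRITE_TRUNCATE WRITE_APPEND
```

The first copy of a stream truncates the real table; later copies of the same stream append, so earlier batches from the same run are not wiped.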
If it does indeed work that way, I suggest filing an issue and contributing directly to https://github.com/adswerve/target-bigquery!
nil
12/30/2020, 9:07 PM
nil
01/04/2021, 9:26 PM `write_disposition`)
if self.truncate:
    copy_config.write_disposition = WriteDisposition.WRITE_TRUNCATE
else:
    copy_config.write_disposition = WriteDisposition.WRITE_APPEND
douwe_maan
01/04/2021, 9:56 PM `self.truncate`, I think we should use `self.truncate and stream not in self.partially_loaded_streams` like it does in `_load_to_bq`, since partially loaded streams do need to append on copies after the first one, even if the first copy truncated.
nil
01/04/2021, 10:09 PM
if self.tables[stream] not in self.partially_loaded_streams and self.truncate:
    copy_config.write_disposition = WriteDisposition.WRITE_TRUNCATE
else:
    copy_config.write_disposition = WriteDisposition.WRITE_APPEND
so it will truncate only if it's not in `partially_loaded_streams` and `self.truncate` is true, and `append` otherwise.
douwe_maan
01/04/2021, 10:10 PM `partially_loaded_streams` contains stream names though (like `stream`), not `self.tables[stream]`.
douwe_maan
01/04/2021, 10:10 PM
nil
01/04/2021, 10:12 PM