One more question regarding target-bigquery (adswe...
# plugins-general
n
One more question regarding target-bigquery (adswerve) . I have set
Copy code
replication_method: truncate
but still i can see that the data in bigquery contains duplicated entries (rows) or multiple version of a row. Is my understanding correct that "truncate" will upsert/merge (no duplicate rows), where as "append" will insert data at the end of table (creating duplicate rows).
d
@nil target-bigquery reads the
replication_method
setting here: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/__init__.py#L45-L47 and uses it here: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/processhandler.py#L203-L208 to set choose tell BigQuery to use the
TRUNCATE
WriteDispositition
, which works as expected: https://cloud.google.com/bigquery/docs/reference/auditlogs/rest/Shared.Types/WriteDisposition. You should see
"Load {table_name} by FULL_TABLE (truncate)"
logged as well. However, when that
_load_to_bq
method is called from
_do_temp_table_based_load
, it actually only loads the data into a temporary table: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/processhandler.py#L136, and then copies it into the "real" table using `APPEND`: https://github.com/adswerve/target-bigquery/blob/master/target_bigquery/processhandler.py#L152 So this looks to me like it could be a bug, and I think this last call site should be using the same
truncate=self.truncate if stream not in self.partially_loaded_streams else False
logic used in the loop that calls
_load_to_bq
. If you make that change to the locally installed target-bigquery, does it show the expected truncating behavior? If it does indeed work that way, I suggest filing an issue and contributing directly to https://github.com/adswerve/target-bigquery!
n
I will check that out tomorrow, to see if that solves the problem. As always thank you for your help
ok i have run with the change you suggested, it does truncate 🙂 but does not do what we thought. so when copy_jobs runs it will actually truncate the table and add the new data. this means we loose all the old data from the BQ. so write_append in copy job will copy the temp_table to BQ table, where as write_truncate will overwrite the BQ table (with data from temp_table) So the change i made was (as we only wanted to test if truncate did what we thought in
write_disposition
)
Copy code
if self.truncate:
  copy_config.write_disposition = WriteDisposition.WRITE_TRUNCATE
else:
  copy_config.write_disposition = WriteDisposition.WRITE_APPEND
d
@nil All right, so this is indeed a bug? Will you file an issue/PR on https://github.com/adswerve/target-bigquery? 🙂 Instead of checking
self.truncate
, I think we should use
self.truncate and stream not in self.partially_loaded_streams
like it does in
_load_to_bq
, since partially loaded streams do need to append on copies after the first one, even if the first copy truncated.
n
how about
Copy code
if self.tables[stream] not in self.partially_loaded_streams and self.truncate 
  copy_config.write_disposition = WriteDisposition.WRITE_TRUNCATE
else 
  copy_config.write_disposition = WriteDisposition.WRITE_APPEND
so it will truncate only if its not in
partially_loaded_streams
and
self.truncate
is
true
,
append
otherwise.
d
Isn't that what I suggested? 😄 I think
partially_loaded_streams
contains stream names though (like
stream
), not
self.tables[stream]
.
Either way, this is probably something the authors of the tap will need to weigh in on so I suggest moving this discussion to their issue/PR tracker 😉
n
I am running it again with this change, so hopefully will add issue there soon 🙂