# plugins-general
How are you guys managing tap-google-analytics, which doesn't support state for incremental data? Running it as-is will create duplicate rows since it does full table replication.
@daniel_pettersson What loader are you using? As you point out, tap-google-analytics doesn't currently do incremental replication, but loaders are usually still capable of deduplicating records by upserting based on each stream's `key_properties`, which is set to the `ga_date` column on streams for reports that include a `ga:date` metric.
Postgres and BigQuery
Mostly BigQuery
Note that incremental replication should not be too hard to add per https://gitlab.com/meltano/tap-google-analytics/-/issues/6, if you were interested in contributing it 🙂
Ah OK, then the issue would appear to be that BigQuery can't upsert, so you're dependent on the incremental loading behavior of the tap
Note that you would still see duplicate records at incremental replication boundaries even then: https://meltano.slack.com/archives/CFG3C3C66/p1602522150168400
@daniel_parker Instead of manually loading from the GA API to BigQuery, have you considered https://support.google.com/analytics/answer/9358801?hl=en?
To get this to work with tap-google-analytics and target-bigquery, we'd need to add incremental replication to the tap, plus a setting to fetch new/changed records exclusive of the replication key date (like https://meltano.com/plugins/extractors/bigquery.html#start-always-inclusive does for tap-bigquery).
It's made a little trickier by the fact that a report for the current day already shows up in the morning but keeps changing throughout the day. So if you can't upsert in the destination, you'd probably want to exclude incomplete records entirely and only load data up to and including yesterday.
That's behavior you can implement yourself by dynamically adjusting the `start_date` and `end_date` settings: make sure your pipeline runs once a day and sets both to yesterday's YYYY-MM-DD date.
That is, assuming you're only interested in daily reports.
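For example, a minimal Python sketch of just the date math those two settings would need (the `config_overrides` dict is only illustrative, not a Meltano API):

```python
from datetime import date, timedelta

# Yesterday's date in the YYYY-MM-DD format the tap's date settings expect.
yesterday = (date.today() - timedelta(days=1)).isoformat()

# Both settings point at the same, already-complete day, so each daily run
# extracts exactly one finished report.
config_overrides = {
    "start_date": yesterday,
    "end_date": yesterday,
}
```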
When using the tap with a target that can do upserts, though, there's no reason to worry about the boundary, and it may even be useful to have incomplete records loaded during the day. So if we were to add incremental replication to the tap, we'd probably still want a setting to make it work properly with targets that can't upsert.
Hmm, alright, sounds like a bit of a hassle. Maybe connecting GA v4 with BigQuery would be easier than this
It probably will be; it looks like tap-google-analytics and target-bigquery aren't a great match (right now)
Yeah, will take a look tomorrow, thanks for the pointers!
Hey again, how would I set dynamic dates in the schedule or the config for the tap? I'm referring to yesterday's YYYY-MM-DD date.
@daniel_parker How much control do you have over the execution environment of `meltano elt`? If you can run code there, you can generate timestamps and feed them to the tap through the `TAP_GOOGLE_ANALYTICS_{START,END}_DATE` env vars.
Whatever you're using to execute `meltano elt` on a daily schedule (Airflow, cron, systemd), you can likely configure it to run a script that sets those env vars before running the pipeline itself.
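A rough sketch of such a script (the `target-bigquery` loader name and plain `subprocess` usage are assumptions; adapt to your setup):

```python
import os
import subprocess
from datetime import date, timedelta

yesterday = (date.today() - timedelta(days=1)).isoformat()  # YYYY-MM-DD

# Copy the current environment and add the date overrides for the tap.
env = {
    **os.environ,
    "TAP_GOOGLE_ANALYTICS_START_DATE": yesterday,
    "TAP_GOOGLE_ANALYTICS_END_DATE": yesterday,
}

# Run the pipeline with those overrides in place.
subprocess.run(
    ["meltano", "elt", "tap-google-analytics", "target-bigquery"],
    env=env,
    check=True,
)
```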
Full control, running it on GKE with the Airflow scheduler. Alright 👍
Perfect, then you should be able to modify the DAG generator at `orchestrate/dags/meltano.py` to your liking. Let me know how it goes!
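A rough sketch of what that could look like (DAG id, project path, and Airflow 1.10-style imports are assumptions; adapt the generated DAG rather than replacing it):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "ga_to_bigquery_daily",          # hypothetical DAG id
    start_date=datetime(2020, 10, 1),
    schedule_interval="@daily",
    catchup=False,
)

# For a daily schedule, Airflow's execution date ({{ ds }}) is the day being
# closed out, i.e. "yesterday" relative to when the run actually fires.
elt = BashOperator(
    task_id="meltano_elt_ga",
    bash_command=(
        "cd /path/to/meltano/project && "   # assumed project location
        "TAP_GOOGLE_ANALYTICS_START_DATE={{ ds }} "
        "TAP_GOOGLE_ANALYTICS_END_DATE={{ ds }} "
        "meltano elt tap-google-analytics target-bigquery"
    ),
    dag=dag,
)
```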
Hi @daniel_pettersson! I know this post is a bit old, but I'm wondering whether you ever solved incremental replication for `tap-google-analytics`? I'm facing exactly the same issue, so it would be great if you could share some hints. Thank you!