# troubleshooting
j
Hi, I was trying to create a pipeline between Postgres and BigQuery, and every time I run the ELT process it does a full import: "No state was found, complete import." I'm running Meltano on Docker and I've set `database_uri` to a Postgres DB, where I can see the run table. I'm using the following command:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery
```
If I run this from the UI, I get the message "No state was found, complete import." on the first run, but the second time I try to run it I get:
```
2022-11-17T00:54:50.284064Z [info     ] Found state from 2022-11-17 00:49:10.989026.
```
Any idea what I could be missing here?
s
Hey @jose, that is odd. Have you checked the state from inside Docker by running `meltano state list`? I'm not sure I can help with that problem, it sounds hard to replicate. Maybe @ken_payne or @Will Da Silva (Arch) have some ideas?
w
I'm not sure. Perhaps @cody_hanson has some ideas around what's causing this.
j
Hi @Sven Balnojan @Will Da Silva (Arch), thanks for the replies. Sven, I'm not sure if I'm "inside" Docker; this is my first journey with Docker, to be honest. To run the command you suggested, I need to run it like this:
```
docker run -v $(pwd):/project -w /project meltano/meltano state list
```
If I run it, I get the result below:
I've run the process again using the following command:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery-xxx
```
Looking at the log, I can see this at the end:
and if I run the process again, I get this in the log:
I've set `database_uri` to a Postgres DB, where I can see the run table, so I'm a bit confused why, in the last image above, the process tried to get the state from a `state.json` file.
s
Hm, not sure on this one; I'm still waiting for @cody_hanson. But could you maybe share your meltano.yml?
c
👋 Hi @jose. I've got a few ideas/comments here:
1. That `state.json` log line is expected. Meltano writes the state it finds in its backend to a `state.json` file for the tap to read. If there is no state to write, then you'll see that log line.
2. Could you share the configuration for `default_replication_method`? If you're doing log-based replication, ensure that your source Postgres database is configured correctly for log-based replication. If you're doing incremental replication, make sure that you've set the correct `replication_key` for each of your streams. This behavior could also be caused by the tap not having any replication keys configured.
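For reference, a minimal sketch of what that might look like in meltano.yml, assuming incremental replication on a hypothetical stream `public-my_table` with an `updated_at` timestamp column (both names are placeholders, not from this thread):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        default_replication_method: INCREMENTAL
      metadata:
        # stream name and replication key below are illustrative only
        public-my_table:
          replication-method: INCREMENTAL
          replication-key: updated_at
```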
j
Hi @cody_hanson, thanks. In fact, when I ran this from the UI, the first run did a full load and then the second run just threw an error that I think is associated with wal2json missing on our server (waiting for our devops team to install it). I was using log-based replication. That error does not show up when using the command line, where the process just does a full load every time I run it. I'm assuming that on the first run the process doesn't use the log and just does a full dump of the table, and because of that doesn't need wal2json? Why does running this from the UI and from the command line have different behavior? I will try to configure this as incremental replication.
c
> Why does running this from the UI and from the command line have different behavior?
That's a really good question and would probably be worth filing a bug report. I'll also loop in <!subteam^S02BCD9FFEF> in case anyone else with more in-depth knowledge of the UI has any insight on this.
> I'm assuming that on the first run the process doesn't use the log and just does a full dump of the table, and because of that doesn't need wal2json
This seems likely. My guess would be that the underlying issue is the same (that log-based replication isn't working) but for whatever reason, that error is being swallowed on the CLI. Let me know how incremental replication works and whether or not the issue with log-based replication persists after your environment is fully configured for it.
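For the log-based side, a rough sketch of the Postgres prerequisites for the pipelinewise tap, assuming a self-managed server where you can edit postgresql.conf and install the wal2json output plugin (the slot and sender counts are illustrative):

```
# postgresql.conf -- requires a server restart
wal_level = logical
max_replication_slots = 5   # at least one free slot for the tap
max_wal_senders = 5
```

The connecting user also needs replication privileges, e.g. `ALTER USER melty WITH REPLICATION;` (user name is a placeholder).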
k
@jose what variant of `tap-postgres` are you using? Guessing it's the pipelinewise (PPW) one? This behaviour sounds like PPW fallback to me 🤔 The PPW taps contain logic for intelligently selecting the replication method, intended for running backfills using `FULL_TABLE` and then switching to `LOG_BASED` thereafter. Here is the relevant logic in the `pipelinewise-tap-postgres` codebase - it checks for specific values in the bookmark and "falls back" if they aren't found. So it may be that i) Meltano cannot access your state backend to retrieve state (and so is using an empty sqlite database locally), or ii) your state file is missing the bookmark values expected by the tap and is erroneously marking the stream as `FULL_TABLE`. Given your recent experience with the UI and the `wal2json` error, it seems the tap is correctly receiving state and attempting to read logs via the UI container. So I would check network connectivity/permissions between your workers calling the Meltano CLI and your DB, and also verify you are correctly injecting the Meltano backend database URI secret in prod 🙂
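For context, a sketch of the kind of bookmark the tap looks for once a stream has switched to log-based replication; this is an illustration from memory rather than verbatim tap output, and the stream name and values are made up:

```
{
  "bookmarks": {
    "public-my_table": {
      "last_replication_method": "LOG_BASED",
      "lsn": 29439104,
      "version": 1669119374130
    }
  }
}
```

If values like `lsn` are absent from the bookmark, the tap can fall back to a full-table sync, which matches the behaviour described above.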
> Why does running this from the UI and from the command line have different behavior?
This can happen when using different system databases or state backends, as this is where state is stored between invocations of Meltano. By default, Meltano will use a sqlite database in the local `.meltano` folder if no state backend or system database is configured. It may be that the UI is using a sqlite database, or the workers are using a sqlite database, or both 😅 This can be solved by ensuring both the UI and the machines running commands via the CLI have access and are configured to use the same system database or state backend.
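One way to confirm which system database a given environment has actually resolved is to list Meltano's core settings and check `database_uri`; a quick sketch using the same Docker setup as the commands above:

```
docker run -v $(pwd):/project -w /project meltano/meltano config meltano list | grep database_uri
```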
j
Hi @ken_payne, thanks a lot for the message and the info. I'm indeed using tap-postgres from pipelinewise, and I've seen the code you shared. That's fine; I wouldn't mind this approach if it was working 🙂. If you don't mind, let me share a couple of thoughts related to some questions you've raised. Both methods of running the ELT are reaching the database I've configured in `database_uri`, as I can see a table named "run" where the log of each run (CLI and UI) is stored. Now... I was looking into the run table and just realized that the runs from the CLI always have a different job name, whilst the ones from the UI have the same name.
I went back to the docs, and I think I was missing this in the command line:
`--state-id=postgres-to-bigquery`
I was running this command:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery
```
and if I run it like this:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery --state-id=postgres-to-bigquery
```
I'm getting the same message as in the UI 🙂.
k
Ah, well figured out 👏 It is probably worth saying that `meltano elt` has largely been superseded by `meltano run`, which automatically handles state IDs 😅 We have rough plans to remove `elt` in v3 of Meltano (expected some time in 2023). If you haven't already given the run command a try, I recommend checking it out 👍
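For comparison, the equivalent invocation with `meltano run` (same image and mounts as the `elt` commands above); `run` derives a state ID automatically from the active environment and the plugin names:

```
docker run -v $(pwd):/project -w /project meltano/meltano run tap-postgres target-bigquery
```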
j
Hi @ken_payne, thanks again for sharing that. I've tested the run command and it works as intended. Such a small detail was causing me a lot of headaches 🙂.
k
Glad you were able to get to the bottom of it 🙌 Out of interest, where did you see references to `meltano elt` when you began with your project? We have been updating our docs and tutorials to point everyone to `meltano run`, but we may have missed some 😅
j
@ken_payne I started looking at Meltano around May/June and then, after some research, put in place a small POC (tap-mysql -> target-bigquery). The `meltano elt` comes from that time, and I've seen it in your documentation for sure 🙂. Back then I was running this mainly from the UI, and in the CLI I was using the default sqlite DB. I noticed the discrepancy too, but always thought it was related to using sqlite.
Hi @ken_payne, I did another round on this and it seems that I can't make this work by any means. I've changed the replication of my table to incremental and set the replication key:
```
docker run -v $(pwd):/project -w /project meltano/meltano config tap-postgres set metadata.dw-table1_report.replication-method "INCREMENTAL"
docker run -v $(pwd):/project -w /project meltano/meltano config tap-postgres set metadata.dw-table1_report.replication-key date_changed
```
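If it helps, those two commands should end up writing something roughly like this under the extractor in meltano.yml (a sketch, assuming the default behaviour of `meltano config ... set` persisting to the project file):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        dw-table1_report:
          replication-method: INCREMENTAL
          replication-key: date_changed
```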
I've also set the state backend URI and the database URI to an external Postgres DB:
```
docker run -v $(pwd):/project -w /project meltano/meltano config meltano set state_backend.uri "postgresql://user:pass@server12345.aws.com:5432/dbtest?options=-csearch_path%3Dmelty"
docker run -v $(pwd):/project -w /project meltano/meltano config meltano set database_uri "postgresql://user:pass@server12345.aws.com:5432/dbtest?options=-csearch_path%3Dmelty"
```
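As an alternative to persisting these in meltano.yml, the same settings can be supplied as environment variables at run time (Meltano maps `database_uri` to `MELTANO_DATABASE_URI` and `state_backend.uri` to `MELTANO_STATE_BACKEND_URI`), which keeps credentials out of the project file; the URI below is a placeholder:

```
docker run -v $(pwd):/project -w /project \
  -e MELTANO_DATABASE_URI="postgresql://user:pass@host:5432/dbtest" \
  -e MELTANO_STATE_BACKEND_URI="postgresql://user:pass@host:5432/dbtest" \
  meltano/meltano --log-level=debug run tap-postgres target-bigquery
```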
I'm using this command to run:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug run tap-postgres target-bigquery
```
I have the table already loaded in BigQuery and want to upload only the deltas, basically everything that was updated since the last run. When the process ran, it just did a full load again and didn't consider the state. In the log file I can see this:
```
2022-11-24T00:38:39.402260Z [debug    ] Could not find state.json in /project/.meltano/extractors/tap-postgres/state.json, skipping.
2022-11-24T00:38:39.411062Z [info     ] Found state from 2022-11-24 00:23:12.677537.
```
Here it ran and uploaded all the records again. At the end I'm getting this:
```
2022-11-24T00:39:41.568131Z [info     ] time=2022-11-24 00:39:41 name=target_bigquery level=INFO message=Emitting state {"bookmarks": {"dw-table1_report": {"last_replication_method": "INCREMENTAL", "replication_key": "date_changed", "version": 1669119374130, "replication_key_value": "2022-11-23T09:10:47.560000+00:00"}}, "currently_syncing": null} cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
```
```
2022-11-24T00:39:41.577153Z [info     ] Incremental state has been updated at 2022-11-24 00:39:41.576912.

2022-11-24T00:39:41.577484Z [debug    ] Incremental state: {'bookmarks': {'dw-table1_report': {'last_replication_method': 'INCREMENTAL', 'replication_key': 'date_changed', 'version': 1669119374130, 'replication_key_value': '2022-11-23T09:10:47.560000+00:00'}}, 'currently_syncing': None}

2022-11-24T00:39:41.577649Z [info     ] {"bookmarks": {"dw-table1_report": {"last_replication_method": "INCREMENTAL", "replication_key": "date_changed", "version": 1669119374130, "replication_key_value": "2022-11-23T09:10:47.560000+00:00"}}, "currently_syncing": null} cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stdout string_id=target-bigquery
```
The most recent update in the table has `date_changed = 2022-11-23T09:10:47.560`.
If I run the process again, it just does a full table again, and the log file is similar to what I've posted above. Any idea of what could be happening?
On Postgres, these are the tables that were created:
I'm wondering where the state is being stored. Is it in the "run" table?
Also, if I run:
```
docker run -v $(pwd):/project -w /project meltano/meltano state get dev:tap-postgres-to-target-bigquery
```
I'm getting this info, which is in line with what can be seen in the log:
```
2022-11-24T00:59:48.720352Z [info     ] Found state from 2022-11-24 00:38:39.151956.
{"singer_state": {"bookmarks": {"dw-table1_report": {"last_replication_method": "INCREMENTAL", "replication_key": "date_changed", "version": 1669119374130, "replication_key_value": "2022-11-23T09:10:47.560000+00:00"}}, "currently_syncing": null}}
```
a
Hey @jose, can you please share your target-bigquery config as well? I'm not 100% sure, but I think if it has truncate instead of append, it used to do a full table mirror for me last year when I was setting things up. One other thing you can try is doing `meltano elt` with a `--state-id` flag instead of `meltano run`, and seeing if specifying the state helps.
j
Hi @abhishek_ajmera, thanks for the reply. This is the config for target-bigquery I have in my meltano.yml:
```yaml
add_metadata_columns: true
credentials_path: /project/access.json
dataset_id: ds_test
default_target_schema: ds_test
location: europe-west6
primary_key_required: false
project_id: project-dev
```
From what I can see in the BigQuery project history, Meltano uploads an Avro file, then creates a temp table with its content, and afterwards runs a merge between the temp table and the final table. To be honest, I would prefer this to do an append on the existing table rather than a merge; the merge makes assumptions about uniqueness that I would prefer to control. Do you know if there's a way of doing it? I'll try the `meltano elt` and let you know.
Running the `meltano elt` seemed to work; it always pulls the last updated records, which I can live with. Interestingly, after that I ran a `meltano run` and it had the same behaviour as the `meltano elt`.
a
Weird but wonderful! 😂