# troubleshooting
j
Hi, I was trying to create a pipeline between Postgres and BigQuery, and every time I run the ELT process it does a full import: "No state was found, complete import." I'm running Meltano on Docker and I've set `database_uri` to a Postgres DB, where I can see the run table. I'm using the following command:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery
```
If I run this from the UI, I get the message "No state was found, complete import." on the first run, but the second time I try to run it I get:
```
2022-11-17T00:54:50.284064Z [info     ] Found state from 2022-11-17 00:49:10.989026.
```
Any idea what I could be missing here?
s
Hey @jose, that is odd. Have you checked the state from inside Docker by running `meltano state list`? I'm not sure I can help with that problem, it sounds hard to replicate. Maybe @ken_payne or @Will Da Silva (Arch) have some ideas?
w
I'm not sure. Perhaps @cody_hanson has some ideas around what's causing this.
j
Hi @Sven Balnojan @Will Da Silva (Arch), thanks for the replies. Sven, I'm not sure if I'm "inside" Docker; this is my first journey with Docker, to be honest. To run the command you suggested, I need to run it like this:
```
docker run -v $(pwd):/project -w /project meltano/meltano state list
```
If I run it, I get the result below:
I've run the process again using the following command:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery-xxx
```
Looking at the log, I can see this at the end:
and if I run the process again, I get this in the log:
I've set `database_uri` to a Postgres DB, where I can see the run table, so I'm a bit confused why, in the last image above, the process tried to get the state from a `state.json` file.
s
Hm, not sure on this one; I'm still waiting for @cody_hanson. But could you maybe share your meltano.yml?
c
👋 Hi @jose. I've got a few ideas/comments here:
1. That `state.json` log line is expected. Meltano writes the state it finds in its backend to a `state.json` file for the tap to read. If there is no state to write, then you'll see that log line.
2. Could you share the configuration for `default_replication_method`? If you're doing log-based replication, ensure that your source Postgres database is configured correctly for log-based replication. If you're doing incremental replication, make sure that you've set the correct `replication_key` for each of your streams. This behavior could also be caused by the tap not having any replication keys configured.
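For reference, a minimal sketch of what that might look like in meltano.yml, assuming incremental replication on a hypothetical stream `public-my_table` with an `updated_at` timestamp column (both names are placeholders, not from this thread):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        default_replication_method: INCREMENTAL
      metadata:
        # stream name and replication key below are illustrative only
        public-my_table:
          replication-method: INCREMENTAL
          replication-key: updated_at
```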
j
Hi @cody_hanson, thanks. In fact, when I ran this from the UI, the first run did a full load and then the second run just threw an error that I think is associated with wal2json missing on our server (waiting for our devops team to install it). I was using log-based replication. That error does not show up when using the command line, where the process just does a full load every time I run it. I'm assuming that on the first run the process doesn't use the log and just does a full dump of the table, and because of that doesn't need wal2json? Why does running this from the UI and from the command line have different behavior? I will try to configure this as incremental replication.
c
> Why does running this from the UI and from the command line have different behavior?
That's a really good question and would probably be worth filing a bug report. I'll also loop in <!subteam^S02BCD9FFEF> in case anyone else with more in-depth knowledge of the UI has any insight on this.
> I'm assuming that on the first run the process doesn't use the log and just does a full dump of the table, and because of that doesn't need wal2json
This seems likely. My guess would be that the underlying issue is the same (that log-based replication isn't working) but for whatever reason, that error is being swallowed on the CLI. Let me know how incremental replication works and whether or not the issue with log-based replication persists after your environment is fully configured for it.
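For the log-based side, a rough sketch of the Postgres prerequisites for the pipelinewise tap, assuming a self-managed server where you can edit postgresql.conf and install the wal2json output plugin (the slot and sender counts are illustrative):

```
# postgresql.conf -- requires a server restart
wal_level = logical
max_replication_slots = 5   # at least one free slot for the tap
max_wal_senders = 5
```

The connecting user also needs replication privileges, e.g. `ALTER USER melty WITH REPLICATION;` (user name is a placeholder).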
k
@jose what variant of `tap-postgres` are you using? Guessing it's the pipelinewise (PPW) one? This behaviour sounds like PPW fallback to me 🤔 The PPW taps contain logic for intelligently selecting the replication method, intended for running backfills using `FULL_TABLE` and then switching to `LOG_BASED` thereafter. Here is the relevant logic in the `pipelinewise-tap-postgres` codebase - it checks for specific values in the bookmark and "falls back" if they aren't found. So it may be that i) Meltano cannot access your state backend to retrieve state (and so is using an empty sqlite database locally), or ii) your state file is missing the bookmark values expected by the tap and is erroneously marking the stream as `FULL_TABLE`. Given your recent experience with the UI and the `wal2json` error, it seems the tap is correctly receiving state and attempting to read logs via the UI container. So I would check network connectivity/permissions between your workers calling the Meltano CLI and your DB, and also verify you are correctly injecting the Meltano backend database URI secret in prod 🙂
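For context, a sketch of the kind of bookmark the tap looks for once a stream has switched to log-based replication; this is an illustration from memory rather than verbatim tap output, and the stream name and values are made up:

```
{
  "bookmarks": {
    "public-my_table": {
      "last_replication_method": "LOG_BASED",
      "lsn": 29439104,
      "version": 1669119374130
    }
  }
}
```

If values like `lsn` are absent from the bookmark, the tap can fall back to a full-table sync, which matches the behaviour described above.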
> Why does running this from the UI and from the command line have different behavior?
This can happen when using different system databases or state backends, as this is where state is stored between invocations of Meltano. By default, Meltano will use a sqlite database in the local `.meltano` folder if no state backend or system database is configured. It may be that the UI is using a sqlite database, or the workers are using a sqlite database, or both 😅 This can be solved by ensuring both the UI and the machines running commands via the CLI have access and are configured to use the same system database or state backend.
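One way to confirm which system database a given environment has actually resolved is to list Meltano's core settings and check `database_uri`; a quick sketch using the same Docker setup as the commands above:

```
docker run -v $(pwd):/project -w /project meltano/meltano config meltano list | grep database_uri
```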
j
Hi @ken_payne, thanks a lot for the message and the info. I'm indeed using tap-postgres from pipelinewise, and I've seen the code you shared. That's fine; I wouldn't mind this approach if it was working 🙂. If you don't mind, let me share a couple of thoughts related to some questions you've raised. Both methods of running the ELT are reaching the database I've configured in `database_uri`, as I can see a table named "run" where the log of each run (CLI and UI) is stored. Now... I was looking into the run table and just realized that the runs from the CLI always have a different job name, whilst the ones from the UI have the same name.
I went back to the docs, and I think I was missing this in the command line:
`--state-id=postgres-to-bigquery`
I was running this command:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery
```
and if I run it like this:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug elt tap-postgres target-bigquery --state-id=postgres-to-bigquery
```
I'm getting the same message as in the UI 🙂.
k
Ah, well figured out 👏 It is probably worth saying that `meltano elt` has largely been superseded by `meltano run`, which automatically handles state IDs 😅 We have rough plans to remove `elt` in v3 of Meltano (expected some time in 2023). If you haven't already given the run command a try, I recommend checking it out 👍
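For comparison, the equivalent invocation with `meltano run` (same image and mounts as the `elt` commands above); `run` derives a state ID automatically from the active environment and the plugin names:

```
docker run -v $(pwd):/project -w /project meltano/meltano run tap-postgres target-bigquery
```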
j
Hi @ken_payne, thanks again for sharing that. I've tested the run command and it works as intended. Such a small detail was causing me a lot of headaches 🙂.
k
Glad you were able to get to the bottom of it 🙌 Out of interest, where did you see references to `meltano elt` when you began with your project? We have been updating our docs and tutorials to point everyone to `meltano run`, but we may have missed some 😅
j
@ken_payne I started looking at Meltano around May/June and then, after some research, put in place a small POC (tap-mysql -> target-bigquery). The `meltano elt` comes from that time, and I've seen it in your documentation for sure 🙂. Back then I was running this mainly from the UI, and in the CLI I was using the default sqlite DB. I noticed the discrepancy too, but always thought it was related to using sqlite.
Hi @ken_payne, I did another round on this and it seems that I can't make this work by any means. I've changed the replication of my table to incremental and set the replication key:
```
docker run -v $(pwd):/project -w /project meltano/meltano config tap-postgres set metadata.dw-table1_report.replication-method "INCREMENTAL"
docker run -v $(pwd):/project -w /project meltano/meltano config tap-postgres set metadata.dw-table1_report.replication-key date_changed
```
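If it helps, those two commands should end up writing something roughly like this under the extractor in meltano.yml (a sketch, assuming the default behaviour of `meltano config ... set` persisting to the project file):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        dw-table1_report:
          replication-method: INCREMENTAL
          replication-key: date_changed
```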
I've also set the state backend URI and the database URI to an external Postgres DB:
```
docker run -v $(pwd):/project -w /project meltano/meltano config meltano set state_backend.uri "postgresql://user:pass@server12345.aws.com:5432/dbtest?options=-csearch_path%3Dmelty"
docker run -v $(pwd):/project -w /project meltano/meltano config meltano set database_uri "postgresql://user:pass@server12345.aws.com:5432/dbtest?options=-csearch_path%3Dmelty"
```
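As an alternative to persisting these in meltano.yml, the same settings can be supplied as environment variables at run time (Meltano maps `database_uri` to `MELTANO_DATABASE_URI` and `state_backend.uri` to `MELTANO_STATE_BACKEND_URI`), which keeps credentials out of the project file; the URI below is a placeholder:

```
docker run -v $(pwd):/project -w /project \
  -e MELTANO_DATABASE_URI="postgresql://user:pass@host:5432/dbtest" \
  -e MELTANO_STATE_BACKEND_URI="postgresql://user:pass@host:5432/dbtest" \
  meltano/meltano --log-level=debug run tap-postgres target-bigquery
```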
I'm using this command to run:
```
docker run -v $(pwd):/project -w /project meltano/meltano --log-level=debug run tap-postgres target-bigquery
```
I have the table already loaded in BigQuery and want to upload only the deltas, basically everything that was updated since the last run. When the process ran, it just did a full load again and didn't consider the state. In the log file I can see this:
```
2022-11-24T00:38:39.402260Z [debug    ] Could not find state.json in /project/.meltano/extractors/tap-postgres/state.json, skipping.
2022-11-24T00:38:39.411062Z [info     ] Found state from 2022-11-24 00:23:12.677537.
```
Here it ran and uploaded all the records again. At the end I'm getting this:
```
2022-11-24T00:39:41.568131Z [info     ] time=2022-11-24 00:39:41 name=target_bigquery level=INFO message=Emitting state {"bookmarks": {"dw-table1_report": {"last_replication_method": "INCREMENTAL", "replication_key": "date_changed", "version": 1669119374130, "replication_key_value": "2022-11-23T09:10:47.560000+00:00"}}, "currently_syncing": null} cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stderr string_id=target-bigquery
```
```
2022-11-24T00:39:41.577153Z [info     ] Incremental state has been updated at 2022-11-24 00:39:41.576912.

2022-11-24T00:39:41.577484Z [debug    ] Incremental state: {'bookmarks': {'dw-table1_report': {'last_replication_method': 'INCREMENTAL', 'replication_key': 'date_changed', 'version': 1669119374130, 'replication_key_value': '2022-11-23T09:10:47.560000+00:00'}}, 'currently_syncing': None}

2022-11-24T00:39:41.577649Z [info     ] {"bookmarks": {"dw-table1_report": {"last_replication_method": "INCREMENTAL", "replication_key": "date_changed", "version": 1669119374130, "replication_key_value": "2022-11-23T09:10:47.560000+00:00"}}, "currently_syncing": null} cmd_type=elb consumer=True name=target-bigquery producer=False stdio=stdout string_id=target-bigquery
```
The most recent update in the table has `date_changed = 2022-11-23T09:10:47.560`.
If I run the process again, it just does a full table again, and the log file is similar to what I've posted above. Any idea of what could be happening?
On Postgres, these are the tables that were created:
I'm wondering where the state is being stored. Is it in the "run" table?
Also, if I run:
```
docker run -v $(pwd):/project -w /project meltano/meltano state get dev:tap-postgres-to-target-bigquery
```
I'm getting this info, which is in line with what can be seen in the log:
```
2022-11-24T00:59:48.720352Z [info     ] Found state from 2022-11-24 00:38:39.151956.
{"singer_state": {"bookmarks": {"dw-table1_report": {"last_replication_method": "INCREMENTAL", "replication_key": "date_changed", "version": 1669119374130, "replication_key_value": "2022-11-23T09:10:47.560000+00:00"}}, "currently_syncing": null}}
```
a
Hey @jose, can you please share your target-bigquery config as well? I'm not 100% sure, but I think if it has truncate instead of append, it used to do a full table mirror for me last year when I was setting things up. One other thing you can try is doing `meltano elt` with a `--state-id` flag instead of `meltano run`, and seeing if specifying the state helps.
j
Hi @abhishek_ajmera, thanks for the reply. This is the config for target-bigquery I have in my meltano.yml:
```yaml
add_metadata_columns: true
credentials_path: /project/access.json
dataset_id: ds_test
default_target_schema: ds_test
location: europe-west6
primary_key_required: false
project_id: project-dev
```
From what I can see in the BigQuery project history, Meltano uploads an Avro file, then creates a temp table with its content, and afterwards runs a merge between the temp table and the final table. To be honest, I would prefer this to do an append on the existing table rather than a merge; the merge makes assumptions about uniqueness that I would prefer to control. Do you know if there's a way of doing it? I'll try the `meltano elt` and let you know.
Running the `meltano elt` seemed to work; it always pulls the last updated records, which I can live with. Interestingly, after that I ran a `meltano run` and it had the same behaviour as the `meltano elt`.
a
Weird but wonderful! 😂