# troubleshooting
a
Hello again - I am loading from Postgres and the sink is target-s3 (crowemi). I'm just loading one table with 91093 records. The pipeline runs fine, but when I look at the number of records in the parquet files in the S3 bucket it does not match. Two parquet files are created in S3: one has 10k and the other has only 100 records. If I change the target to target-jsonl, I can see all the records in the JSON file, so the extractor looks fine. It seems I have missed some setting.
Excerpts from the pipeline logs:
```
| Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:47.574519Z [info     ] 2023-10-10 12:06:47,574 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0636 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:58.701592Z [info     ] 2023-10-10 12:06:58,701 | INFO     | target-s3            | Target 'target-s3' completed reading 91093 lines of input (1 schemas, 91080 records, 0 batch manifests, 11 state messages). cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:58.738839Z [info     ] 2023-10-10 12:06:58,738 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0636 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:37:03.868288Z [info     ] 2023-10-10 12:07:03,867 | INFO     | target-s3            | Cleaning up public-users_user cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:37:03.868871Z [info     ] 2023-10-10 12:07:03,868 | INFO     | target-s3            | Emitting completed target state {"bookmarks": {"public-users_user": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1696919723638, "replication_key_value": 114042}}, "currently_syncing": null} cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:37:03.900582Z [info     ] Writing state to AWS S3
2023-10-10T06:37:06.421520Z [info     ] smart_open.s3.MultipartWriter('mds-bh', 'dev/meltano/state/dev:tap-postgres-to-target-s3/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
2023-10-10T06:37:07.579636Z [info     ] smart_open.s3.MultipartWriter('mds-bh', 'dev/meltano/state/dev:tap-postgres-to-target-s3/state.json'): uploading part_num: 1, 239 bytes (total 0.000GB)
2023-10-10T06:37:08.707181Z [info     ] Incremental state has been updated at 2023-10-10 06:37:08.706740.
2023-10-10T06:37:08.723602Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True
```
My meltano config:
```
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      append_date_to_filename: true
      append_date_to_filename_grain: minute
      append_date_to_prefix: false
      cloud_provider:
        aws:
          aws_bucket: mds-bh
          aws_region: us-east-2
      include_process_date: false
      format:
        format_parquet:
          validate: false
        format_type: parquet
      prefix: local/ashu/parquet
```
Using the loader below works just fine:
```
  - name: target-s3-parquet
    variant: jkausti
```
u
What command are you using to run this? One thing I'm wondering is whether there's incremental replication happening and that's only the new records vs. all the records. If you use `run`, then state is tracked based on the combination of tap-x + target-y, so the second time you run it the tap will be incremental and have state (although I don't know if tap-postgres does incremental automatically 🤔). If you switch the target to jsonl there will be no state, so all records syncing would be expected.
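For example, the state Meltano tracks for this pipeline can be inspected directly; a minimal sketch, assuming the state ID shown in the log excerpt above (`dev:tap-postgres-to-target-s3`):

```bash
# List the state IDs Meltano is tracking (one per tap + target combination used with `meltano run`)
meltano state list

# Show the stored bookmark for this pipeline, e.g. the replication_key_value for public-users_user
meltano state get dev:tap-postgres-to-target-s3
```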
u
If that's not the case then it could certainly be a target issue, but it would be good to decipher exactly where the records are being missed.
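One way to narrow that down (a sketch, assuming it's fine to re-run the extraction) is to dump the tap's raw Singer output and count the RECORD messages, then compare that with what lands in S3:

```bash
# Run only the extractor and capture its Singer messages (this re-reads the source table)
meltano invoke tap-postgres > tap_output.singer

# Count RECORD messages the tap emitted; compare against the rows found in the parquet output
grep -c '"type": "RECORD"' tap_output.singer
```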
u
While throwing out random ideas: could the target be accidentally overwriting files? 🤔 Without looking at the code, if 2 files are written in the same minute maybe they share a name and are being overwritten. Probably unlikely, but 🤷
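That theory is easy to check: the log excerpts in this thread show the same key (e.g. `.../public-users_user/20231010-0635`) emitted for several drain cycles, so the objects that actually exist under the prefix can be compared against the number of drains. A sketch, assuming the AWS CLI is configured:

```bash
# List what actually exists under the target prefix
aws s3 ls s3://mds-bh/local/ashu/parquet/public-users_user/ --recursive

# If the object count is much lower than the number of "is full. Draining..." lines in the
# captured run output, each drain is likely overwriting the previous object for that minute.
```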
a
I have two S3 loaders:
```
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      append_date_to_filename: true
      append_date_to_filename_grain: minute
      append_date_to_prefix: false
      cloud_provider:
        aws:
          aws_bucket: mds-bh
          aws_region: us-east-2
      include_process_date: false
      format:
        format_parquet:
          validate: false
        format_type: parquet
      prefix: local/ashu/parquet
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
    config:
      destination_path: ${MELTANO_PROJECT_ROOT}/output/
  - name: target-s3-parquet
    variant: jkausti
    pip_url: git+https://github.com/jkausti/target-s3.git
    config:
      aws_region: us-east-2
      filetype: parquet
      path: mds-bh/local/ashu1/parquet
```
Below is the extractor config:
```
extractors:
  - name: tap-postgres
    variant: transferwise
    pip_url: pipelinewise-tap-postgres
    config:
      host: XXXX
      port: XXXX
      user: XXXX
      dbname: XXX
      filter_schemas: public
      default_replication_method: FULL_TABLE
    select:
    - public-users_user.*
    #- public-customer_customer.*
    metadata:
      public-users_user:
        replication-method: INCREMENTAL
        replication-key: id
```
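To double-check that the metadata override is actually applied (i.e. that the stream really runs as INCREMENTAL on `id`), the selection and the generated catalog can be inspected; a sketch using standard Meltano commands:

```bash
# Show which streams and attributes are currently selected
meltano select tap-postgres --list --all

# Dump the catalog Meltano hands to the tap and inspect the replication metadata for public-users_user
meltano invoke --dump=catalog tap-postgres > catalog.json
python -m json.tool catalog.json | grep -B2 -A2 replication
```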
It is incremental, so I removed state.json and cleaned up the S3 parquet folder too.
target-s3 (crowemi) - creates multiple files, so it created 2 for the fresh run: one had 10k and the other had 1k records, and the files were KB in size.
target-s3-parquet - created 1 file of 9MB.
The extractor pulls 90k records. I could see in the loader logs that it processed 90k records and was draining the target sink, but once the run completed it had not loaded all the records.
I ran the below command
```
meltano run tap-postgres column-remover target-s3
```
Logs of the loader which does not work:
```
2023-10-10T06:35:24.243417Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Target 'target-s3' is listening for input from tap. cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243677Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Initializing 'target-s3' target sink... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243788Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Initializing target sink for stream 'public-users_user'... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243907Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Setting up public-users_user cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243996Z [info     ] 2023-10-10 12:05:24,243 | WARNING  | target-s3            | ACTIVATE_VERSION message received but not implemented by this target. Ignoring. cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:26.312348Z [info     ] 2023-10-10 12:05:26,312 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:26.409979Z [info     ] 2023-10-10 12:05:26,409 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:30.191170Z [info     ] time=2023-10-10 12:05:30 name=singer level=INFO message=METRIC: {"type": "counter", "metric": "record_count", "value": 91080, "tags": {}} cmd_type=elb consumer=False name=tap-postgres producer=True stdio=stderr string_id=tap-postgres
2023-10-10T06:35:38.631532Z [info     ] 2023-10-10 12:05:38,631 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:38.674520Z [info     ] 2023-10-10 12:05:38,674 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:46.790267Z [info     ] 2023-10-10 12:05:46,790 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:46.833582Z [info     ] 2023-10-10 12:05:46,833 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:54.012207Z [info     ] 2023-10-10 12:05:54,012 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:54.032601Z [info     ] 2023-10-10 12:05:54,032 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:04.354533Z [info     ] 2023-10-10 12:06:04,354 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:04.407245Z [info     ] 2023-10-10 12:06:04,407 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0636 cmd_type=elb consumer=True name=target-s3 producer=False stdio=s…
```
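To confirm how many rows actually reached S3, the written objects can be pulled down and their parquet row counts summed; a rough sketch, assuming the AWS CLI and pyarrow are available locally (the prefix comes from the `key:` lines above):

```bash
# Download everything the target wrote for this stream
aws s3 cp s3://mds-bh/local/ashu/parquet/public-users_user/ ./parquet_check/ --recursive

# Sum the row counts stored in each parquet file's footer and compare with the 91080 records the tap emitted
python -c "import glob, pyarrow.parquet as pq; print(sum(pq.read_metadata(f).num_rows for f in glob.glob('parquet_check/*')))"
```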
u
> It is incremental, so I removed state.json and cleaned up the S3 parquet folder too.
@ashutosh_shanker when you say you removed the state.json, what exactly did you do? The state is managed by Meltano, so you'd have to remove it from your state backend, which defaults to the local SQLite db. Alternatively you can run with the `--full-refresh` flag, like `meltano run tap-postgres column-remover target-s3 --full-refresh`, to clear it automatically. Other than that I'd probably recommend opening an issue in the target repository to see if the maintainer has any recommendations.
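For reference, a sketch of clearing the Meltano-managed state and forcing a clean re-sync (the state ID is the one visible in the log excerpt earlier in the thread):

```bash
# Show the state IDs Meltano knows about, then clear the one for this pipeline
meltano state list
meltano state clear dev:tap-postgres-to-target-s3

# Or bypass the stored state for a single run
meltano run tap-postgres column-remover target-s3 --full-refresh
```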
a
I am storing state in S3, so I removed it from S3. Let me check the full-refresh option and the SQLite stuff.