# troubleshooting
a
Hello again - I am loading from Postgres and the sink is target-s3 (crowemi). I'm just loading one table with 91093 records. The pipeline runs fine, but when I look at the number of records in the parquet files in the S3 bucket it does not match. Two parquet files are created in S3: one has 10k and the other has only 100 records. If I change the target to target-jsonl, I can see all the records in the JSON file, so the extractor looks fine. It seems I have missed some setting.
Excerpts from the pipeline logs:
```
| Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:47.574519Z [info     ] 2023-10-10 12:06:47,574 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0636 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:58.701592Z [info     ] 2023-10-10 12:06:58,701 | INFO     | target-s3            | Target 'target-s3' completed reading 91093 lines of input (1 schemas, 91080 records, 0 batch manifests, 11 state messages). cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:58.738839Z [info     ] 2023-10-10 12:06:58,738 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0636 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:37:03.868288Z [info     ] 2023-10-10 12:07:03,867 | INFO     | target-s3            | Cleaning up public-users_user cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:37:03.868871Z [info     ] 2023-10-10 12:07:03,868 | INFO     | target-s3            | Emitting completed target state {"bookmarks": {"public-users_user": {"last_replication_method": "INCREMENTAL", "replication_key": "id", "version": 1696919723638, "replication_key_value": 114042}}, "currently_syncing": null} cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:37:03.900582Z [info     ] Writing state to AWS S3
2023-10-10T06:37:06.421520Z [info     ] smart_open.s3.MultipartWriter('mds-bh', 'dev/meltano/state/dev:tap-postgres-to-target-s3/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
2023-10-10T06:37:07.579636Z [info     ] smart_open.s3.MultipartWriter('mds-bh', 'dev/meltano/state/dev:tap-postgres-to-target-s3/state.json'): uploading part_num: 1, 239 bytes (total 0.000GB)
2023-10-10T06:37:08.707181Z [info     ] Incremental state has been updated at 2023-10-10 06:37:08.706740.
2023-10-10T06:37:08.723602Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True
```
My meltano config:
```
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      append_date_to_filename: true
      append_date_to_filename_grain: minute
      append_date_to_prefix: false
      cloud_provider:
        aws:
          aws_bucket: mds-bh
          aws_region: us-east-2
      include_process_date: false
      format:
        format_parquet:
          validate: false
        format_type: parquet
      prefix: local/ashu/parquet
```
Using the loader below works just fine:
```
  - name: target-s3-parquet
    variant: jkausti
```
u
What command are you using to run this? One thing I'm wondering is whether there's incremental replication happening and that's only the new records vs. all the records. If you use `run`, then state is tracked based on the combination of tap-x + target-y, so the second time you run it the tap will be incremental and have state (although I don't know if tap-postgres does incremental automatically 🤔). If you switch the target to jsonl there will be no state, so all records syncing would be expected.
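For example, the state Meltano tracks for this pipeline can be inspected directly; a minimal sketch, assuming the state ID shown in the log excerpt above (`dev:tap-postgres-to-target-s3`):

```bash
# List the state IDs Meltano is tracking (one per tap + target combination used with `meltano run`)
meltano state list

# Show the stored bookmark for this pipeline, e.g. the replication_key_value for public-users_user
meltano state get dev:tap-postgres-to-target-s3
```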
u
If that's not the case then it could certainly be a target issue, but it would be good to decipher exactly where the records are being missed.
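One way to narrow that down (a sketch, assuming it's fine to re-run the extraction) is to dump the tap's raw Singer output and count the RECORD messages, then compare that with what lands in S3:

```bash
# Run only the extractor and capture its Singer messages (this re-reads the source table)
meltano invoke tap-postgres > tap_output.singer

# Count RECORD messages the tap emitted; compare against the rows found in the parquet output
grep -c '"type": "RECORD"' tap_output.singer
```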
u
While throwing out random ideas: could the target be accidentally overwriting files? 🤔 Without looking at the code, if 2 files are written in the same minute maybe they share a name and are being overwritten. Probably unlikely, but 🤷
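That theory is easy to check: the log excerpts in this thread show the same key (e.g. `.../public-users_user/20231010-0635`) emitted for several drain cycles, so the objects that actually exist under the prefix can be compared against the number of drains. A sketch, assuming the AWS CLI is configured:

```bash
# List what actually exists under the target prefix
aws s3 ls s3://mds-bh/local/ashu/parquet/public-users_user/ --recursive

# If the object count is much lower than the number of "is full. Draining..." lines in the
# captured run output, each drain is likely overwriting the previous object for that minute.
```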
a
I have two S3 loaders:
```
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      append_date_to_filename: true
      append_date_to_filename_grain: minute
      append_date_to_prefix: false
      cloud_provider:
        aws:
          aws_bucket: mds-bh
          aws_region: us-east-2
      include_process_date: false
      format:
        format_parquet:
          validate: false
        format_type: parquet
      prefix: local/ashu/parquet
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
    config:
      destination_path: ${MELTANO_PROJECT_ROOT}/output/
  - name: target-s3-parquet
    variant: jkausti
    pip_url: git+https://github.com/jkausti/target-s3.git
    config:
      aws_region: us-east-2
      filetype: parquet
      path: mds-bh/local/ashu1/parquet
```
Below is the extractor config:
```
extractors:
  - name: tap-postgres
    variant: transferwise
    pip_url: pipelinewise-tap-postgres
    config:
      host: XXXX
      port: XXXX
      user: XXXX
      dbname: XXX
      filter_schemas: public
      default_replication_method: FULL_TABLE
    select:
    - public-users_user.*
    #- public-customer_customer.*
    metadata:
      public-users_user:
        replication-method: INCREMENTAL
        replication-key: id
```
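To double-check that the metadata override is actually applied (i.e. that the stream really runs as INCREMENTAL on `id`), the selection and the generated catalog can be inspected; a sketch using standard Meltano commands:

```bash
# Show which streams and attributes are currently selected
meltano select tap-postgres --list --all

# Dump the catalog Meltano hands to the tap and inspect the replication metadata for public-users_user
meltano invoke --dump=catalog tap-postgres > catalog.json
python -m json.tool catalog.json | grep -B2 -A2 replication
```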
It is incremental, so I removed state.json and cleaned up the S3 parquet folder too.
target-s3 (crowemi) - creates multiple files, so it created 2 for the fresh run: one had 10k and the other had 1k records, and the files were KB in size.
target-s3-parquet - created 1 file of 9MB.
The extractor pulls 90k records. I could see in the loader logs that it processed 90k records and was draining the target sink, but once the run completed it had not loaded all the records.
I ran the below command
```
meltano run tap-postgres column-remover target-s3
```
Logs of the loader which does not work:
```
2023-10-10T06:35:24.243417Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Target 'target-s3' is listening for input from tap. cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243677Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Initializing 'target-s3' target sink... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243788Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Initializing target sink for stream 'public-users_user'... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243907Z [info     ] 2023-10-10 12:05:24,243 | INFO     | target-s3            | Setting up public-users_user cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:24.243996Z [info     ] 2023-10-10 12:05:24,243 | WARNING  | target-s3            | ACTIVATE_VERSION message received but not implemented by this target. Ignoring. cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:26.312348Z [info     ] 2023-10-10 12:05:26,312 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:26.409979Z [info     ] 2023-10-10 12:05:26,409 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:30.191170Z [info     ] time=2023-10-10 12:05:30 name=singer level=INFO message=METRIC: {"type": "counter", "metric": "record_count", "value": 91080, "tags": {}} cmd_type=elb consumer=False name=tap-postgres producer=True stdio=stderr string_id=tap-postgres
2023-10-10T06:35:38.631532Z [info     ] 2023-10-10 12:05:38,631 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:38.674520Z [info     ] 2023-10-10 12:05:38,674 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:46.790267Z [info     ] 2023-10-10 12:05:46,790 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:46.833582Z [info     ] 2023-10-10 12:05:46,833 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:54.012207Z [info     ] 2023-10-10 12:05:54,012 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:35:54.032601Z [info     ] 2023-10-10 12:05:54,032 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0635 cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:04.354533Z [info     ] 2023-10-10 12:06:04,354 | INFO     | target-s3            | Target sink for 'public-users_user' is full. Draining... cmd_type=elb consumer=True name=target-s3 producer=False stdio=stderr string_id=target-s3
2023-10-10T06:36:04.407245Z [info     ] 2023-10-10 12:06:04,407 | INFO     | target-s3            | key: mds-bh/local/ashu/parquet/public-users_user/20231010-0636 cmd_type=elb consumer=True name=target-s3 producer=False stdio=s…
```
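To confirm how many rows actually reached S3, the written objects can be pulled down and their parquet row counts summed; a rough sketch, assuming the AWS CLI and pyarrow are available locally (the prefix comes from the `key:` lines above):

```bash
# Download everything the target wrote for this stream
aws s3 cp s3://mds-bh/local/ashu/parquet/public-users_user/ ./parquet_check/ --recursive

# Sum the row counts stored in each parquet file's footer and compare with the 91080 records the tap emitted
python -c "import glob, pyarrow.parquet as pq; print(sum(pq.read_metadata(f).num_rows for f in glob.glob('parquet_check/*')))"
```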
u
> It is incremental, so I removed state.json and cleaned up the S3 parquet folder too.
@ashutosh_shanker when you say you removed the state.json, what exactly did you do? The state is managed by Meltano, so you'd have to remove it from your state backend, which defaults to the local SQLite db. Alternatively you can run with the `--full-refresh` flag, like `meltano run tap-postgres column-remover target-s3 --full-refresh`, to clear it automatically. Other than that I'd probably recommend opening an issue in the target repository to see if the maintainer has any recommendations.
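For reference, a sketch of clearing the Meltano-managed state and forcing a clean re-sync (the state ID is the one visible in the log excerpt earlier in the thread):

```bash
# Show the state IDs Meltano knows about, then clear the one for this pipeline
meltano state list
meltano state clear dev:tap-postgres-to-target-s3

# Or bypass the stored state for a single run
meltano run tap-postgres column-remover target-s3 --full-refresh
```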
a
I am storing state in S3, so I removed it from S3. Let me check the full-refresh option and the SQLite stuff.