Tony Yun
09/13/2024, 7:41 PM
state.json to incrementally load tap-github. The issues I’m encountering are:
1. The state file is stored in a timestamped folder, so every time the tap runs, no state is found, e.g. 2024-09-13T175856--tap-github--target-jsonl/state.json
2. Even if I provide --state state.json to override the default setting, I can see that all historical data is being fetched, instead of the net-new pull requests that I’m expecting.
Maybe I’m understanding it wrong. Please let me know.
Thanks!
visch
09/13/2024, 7:43 PM
Tony Yun
09/13/2024, 7:48 PM
version: 1
default_environment: dev
project_id: 8d46d142-ee51-42fd-90fd-0f1b3a8bba5d
environments:
- name: dev
- name: staging
- name: prod
state_backend:
  uri: file:///${MELTANO_SYS_DIR_ROOT}/state/
plugins:
  extractors:
  - name: tap-github
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-github.git
    config:
      # organizations:
      # - 'ThirtyMadison'
      repositories:
      - ThirtyMadison/data-eng-admin
    select:
    - pull_requests.*
Commands I ran:
1. meltano el tap-github target-jsonl
2. meltano el tap-github target-jsonl --state state/03/state.json
visch
09/13/2024, 7:49 PM
meltano run
2. Why are you passing in a state file? Meltano handles state for you
Tony Yun
09/13/2024, 7:51 PM
meltano el. Let me try the other.
2. The reason is what I said in 1): the state file is stored with a timestamp in it
visch
09/13/2024, 7:52 PM
visch
09/13/2024, 7:52 PM
Tony Yun
09/13/2024, 7:52 PM
visch
09/13/2024, 7:52 PM
Tony Yun
09/13/2024, 7:53 PM
meltano run does change the folder name to something different now.
visch
09/13/2024, 7:53 PM
ls or picture?
Tony Yun
09/13/2024, 7:54 PM
visch
09/13/2024, 7:54 PM
.meltano directory so it seems?
visch
09/13/2024, 7:55 PM
Tony Yun
09/13/2024, 7:55 PM
visch
09/13/2024, 7:55 PM
tree your project root please
Tony Yun
09/13/2024, 7:56 PM
Tony Yun
09/13/2024, 7:57 PM
visch
09/13/2024, 7:59 PM
ls -l
Tony Yun
09/13/2024, 7:59 PM
2024-09-13T19:57:52.565226Z [info ] 2024-09-13 15:57:52,564 | INFO | tap-github | Beginning incremental sync of 'pull_requests' with context: {'org': 'ThirtyMadison', 'repo': 'data-eng-admin', 'repo_id': 396928701}
2024-09-13T19:57:54.325915Z [info ] 2024-09-13 15:57:54,324 | INFO | singer_sdk.metrics | METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"stream": "pull_requests", "context": {"org": "ThirtyMadison", "repo": "data-eng-admin", "repo_id": 396928701}}} cmd_type=elb consumer=False job_name=dev:tap-github-to-target-jsonl name=tap-github producer=True run_id=7cc53685-c32e-49b5-838c-4af92a6faafc stdio=stderr string_id=tap-github
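The job_name in the log line above (dev:tap-github-to-target-jsonl) looks like the state ID that meltano run derives on its own. A minimal sketch of how such an ID could be composed, assuming the pattern is environment, then tap, then target as the log suggests; the function name default_state_id is hypothetical and not part of Meltano:

```python
def default_state_id(environment: str, tap: str, target: str) -> str:
    """Compose a state ID shaped like the job_name seen in the log.

    Assumption: the pattern '<environment>:<tap>-to-<target>' is inferred
    from the log line, not taken from Meltano's source code.
    """
    return f"{environment}:{tap}-to-{target}"


# Values taken from the project in this thread.
print(default_state_id("dev", "tap-github", "target-jsonl"))
```

With meltano el, an ID like this would presumably have to be passed explicitly via --state-id for state to be looked up and saved between runs.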
visch
09/13/2024, 8:00 PM
state is in your main root directory is the only thing I’m left with
visch
09/13/2024, 8:00 PM
Tony Yun
09/13/2024, 8:01 PM
state_backend:
  uri: file:///${MELTANO_SYS_DIR_ROOT}/state/
visch
09/13/2024, 8:01 PM
.meltano
Tony Yun
09/13/2024, 8:02 PM
Tony Yun
09/13/2024, 8:02 PM
visch
09/13/2024, 8:02 PM
visch
09/13/2024, 8:03 PM
Tony Yun
09/13/2024, 8:03 PM
run and el
visch
09/13/2024, 8:03 PM
run just handles the state-id for you
visch
09/13/2024, 8:03 PM
Tony Yun
09/13/2024, 8:04 PM
visch
09/13/2024, 8:05 PM
Tony Yun
09/13/2024, 8:05 PM
Edgar Ramírez (Arch.dev)
09/13/2024, 11:14 PM
meltano el that were missing an explanation of how to enable incremental replication (i.e. --state-id). All fine if you found it, for example, by just exploring the CLI.
> The 2) issue is- even the second run, it still fetches data that I don’t expect to be net-new.
>
That may be because of the At-Least-Once nature of the replication that Meltano extractors use. Happy to answer other questions.
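To illustrate the at-least-once point, here is a simplified sketch of an incremental sync that compares the replication key inclusively against the saved bookmark. This is not tap-github's actual code: the function incremental_sync, the record shape, and the updated_at key are assumptions made for illustration only.

```python
def incremental_sync(records, bookmark):
    """Return records at or after the bookmark, plus the new bookmark.

    Assumption: the comparison is inclusive (>=), so the record that set
    the bookmark is emitted again on the next run.
    """
    selected = [
        r for r in records
        if bookmark is None or r["updated_at"] >= bookmark
    ]
    # ISO 8601 strings in the same format sort chronologically.
    new_bookmark = max((r["updated_at"] for r in selected), default=bookmark)
    return selected, new_bookmark


# Hypothetical source data; nothing changes between the two runs.
records = [
    {"id": 1, "updated_at": "2024-09-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-09-10T00:00:00Z"},
]

first, bookmark = incremental_sync(records, None)       # full load
second, bookmark = incremental_sync(records, bookmark)  # incremental run

# Downstream de-duplication by primary key absorbs the repeats.
unique = {r["id"]: r for r in first + second}
```

In this sketch the second run still emits the record with id 2 even though nothing is new, so a few repeated rows after a rerun do not by themselves mean state was ignored.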