Steven Searcy
05/06/2025, 3:21 PM
tap-s3-csv seems to have been sunset.
I am trying to set up a pipeline using a CSV, and I am also unsure whether any of these taps supports incremental replication.
Edgar Ramírez (Arch.dev)
05/06/2025, 3:34 PM
Do you mean https://github.com/s7clarke10/pipelinewise-tap-s3-csv? Maybe https://hub.meltano.com/extractors/tap-spreadsheets-anywhere?
Steven Searcy
05/06/2025, 3:39 PM
Steven Searcy
05/06/2025, 9:19 PM
tap-csv? As I understand, that's not exactly offered with the tap. Curious if you know any effective workarounds.
Matt Menzenski
05/06/2025, 9:42 PM
"Maybe https://hub.meltano.com/extractors/tap-spreadsheets-anywhere?"
We’ve been running tap-spreadsheets-anywhere in production for a couple of years now. I think we have ~70 taps that use it across all our environments. It worked fine for incremental loading until we started running meltano within Dagster. Then we started having weirdness with meltano state getting lost when a Dagster run is canceled. I still haven’t gotten to the bottom of that, but since our migration to Dagster this tap has effectively run in “full table” mode even though it’s configured to be incremental. Not ideal. (But IMO it’s more to do with the dagster-shell / dagster-meltano libraries than the tap itself. I’d still recommend the tap for a non-Dagster context.)
Steven Searcy
05/06/2025, 9:53 PM
tap-spreadsheets-anywhere supports incremental loading.
Steven Searcy
05/06/2025, 9:55 PMMatt Menzenski
05/06/2025, 9:55 PMMatt Menzenski
05/06/2025, 9:56 PMSteven Searcy
05/06/2025, 9:58 PMMatt Menzenski
05/06/2025, 9:59 PMThis tap is designed to continually poll a configured directory for any unprocessed files that match a table configuration and to process any that are found. On the first syncing run, the declared start_date will be used to filter the set of files that match the search_prefix and pattern expressions. The last modified date of the most recently synced file will then be written to state and used in place of start_date on the next syncing run.
While state is maintained, only new files will be processed from subsequent runs.So it’s “supposed” to process files incrementally, and not reprocess already-processed files
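A minimal Python sketch of the behaviour that README passage describes, not the tap's actual code. The function name select_new_files, the dict shape of each file entry, and the strict "newer than the bookmark" comparison are assumptions for illustration; only start_date and the modified_since state key come from this thread.

```python
from datetime import datetime, timezone


def select_new_files(files, start_date, state):
    """Return (files newer than the bookmark, updated state).

    Hypothetical helper: `files` is a list of dicts with "name" and
    "last_modified" keys, which is an assumed shape, not the tap's own.
    """
    # Use the saved bookmark if there is one, otherwise fall back to start_date.
    bookmark = state.get("modified_since")
    cutoff = datetime.fromisoformat(bookmark) if bookmark else start_date

    # Keep only files modified after the cutoff, oldest first.
    new_files = sorted(
        (f for f in files if f["last_modified"] > cutoff),
        key=lambda f: f["last_modified"],
    )

    # Advance the bookmark only when something was actually synced.
    if new_files:
        state = {"modified_since": new_files[-1]["last_modified"].isoformat()}
    return new_files, state


files = [
    {"name": "a.csv", "last_modified": datetime(2025, 5, 1, tzinfo=timezone.utc)},
    {"name": "b.csv", "last_modified": datetime(2025, 5, 10, tzinfo=timezone.utc)},
]
start_date = datetime(2025, 1, 1, tzinfo=timezone.utc)

first_run, state = select_new_files(files, start_date, {})
second_run, state = select_new_files(files, start_date, state)
```

In this sketch the first run picks up both files and the second run picks up none, which only holds as long as the state from the first run is still available to the second.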
Matt Menzenski
05/06/2025, 10:00 PM
Steven Searcy
05/06/2025, 10:00 PM
Matt Menzenski
05/06/2025, 10:01 PM
Matt Menzenski
05/06/2025, 10:01 PM
Steven Searcy
05/06/2025, 10:02 PM
Matt Menzenski
05/06/2025, 10:03 PM
Steven Searcy
05/06/2025, 10:04 PM
Matt Menzenski
05/06/2025, 10:05 PM
key_properties:
  - _smart_source_file
  - _smart_source_lineno
We set these in the tap-spreadsheets-anywhere tables config, with the intent that a unique record is identified by its source file and line number.
Matt Menzenski
05/06/2025, 10:06 PM
add_record_metadata: true in your target to get standard record metadata fields
Steven Searcy
05/06/2025, 10:10 PM
Steven Searcy
05/06/2025, 10:10 PM
Edgar Ramírez (Arch.dev)
05/06/2025, 10:12 PM
Andy Carter
05/07/2025, 7:21 AM
tap-spreadsheets-anywhere users, it's a real Swiss Army knife. I also use meltano with dagster, and have found the state management / 'don't reprocess unchanged files' to work fine. Defining your primary keys is really essential, and then the challenge is stopping users chucking any old crap into a sheet :)
Steven Searcy
05/13/2025, 3:06 AM
.meltano directory as a bind mount.
This tap is designed to continually poll a configured directory for any unprocessed files that match a table configuration and to process any that are found. On the first syncing run, the declared start_date will be used to filter the set of files that match the search_prefix and pattern expressions. The last modified date of the most recently synced file will then be written to state and used in place of start_date on the next syncing run.
While state is maintained, only new files will be processed from subsequent runs.
Andy Carter
05/13/2025, 7:25 AM
Steven Searcy
05/13/2025, 2:51 PM
2025-05-13T02:40:29.552698Z [info ] Using S3StateStoreManager add-on state backend
2025-05-13T02:40:29.725469Z [info ] smart_open.s3.MultipartWriter('meltano-state-develop', 'state/dev:tap-spreadsheets-anywhere-to-target-postgres--ndo/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
2025-05-13T02:40:29.781329Z [info ] Writing state to AWS S3
2025-05-13T02:40:29.798834Z [info ] smart_open.s3.MultipartWriter('meltano-state-develop', 'state/dev:tap-spreadsheets-anywhere-to-target-postgres--ndo/state.json'): uploading part_num: 1, 129 bytes (total 0.000GB)
2025-05-13T02:40:29.884007Z [info ] Incremental state has been updated at 2025-05-13 02:40:29.883969+00:00.
2025-05-13T02:40:29.884702Z [info ] 2025-05-13 02:40:29,552 | INFO | singer_sdk.metrics | METRIC: {"type": "counter", "metric": "record_count", "value": 82010, "tags": {"stream": "raw_foods", "pid": 16}} cmd_type=elb consumer=True job_name=dev:tap-spreadsheets-anywhere-to-target-postgres--ndo name=target-postgres--ndo producer=False run_id=2a8aa8be-4f12-470f-8db6-4cf9de33ccbf stdio=stderr string_id=target-postgres--ndo
state.json
{
  "completed": {
    "singer_state": {}
  },
  "partial": {}
}
Steven Searcy
05/13/2025, 3:25 PM
{
  "completed": {
    "singer_state": {
      "raw_data": {
        "modified_since": "2025-05-12T22:10:29.362435+00:00"
      }
    }
  },
  "partial": {}
}
However, if I run the pipeline again using the state above, it does what it's supposed to and skips the extract and load. But it then removes the state from the state.json file:
{
  "completed": {
    "singer_state": {}
  },
  "partial": {}
}
No idea why it is doing this though.
Steven Searcy
05/13/2025, 3:42 PM
Edgar Ramírez (Arch.dev)
05/13/2025, 4:15 PM
"One would think it should re-emit the last known state even if it didn’t find new data, but it doesn't."
Hey @Steven Searcy, I think you're using MeltanoLabs/target-postgres, so this might be solved by addressing this in our Singer SDK: https://github.com/meltano/sdk/issues/3000. (cc @Reuben (Matatika) does that seem like the same problem you were running into?)
Steven Searcy
05/13/2025, 4:17 PM
Steven Searcy
05/13/2025, 4:17 PM
Reuben (Matatika)
05/13/2025, 4:23 PM
tap-spreadsheets-anywhere also. From our internal issue tracking:
Problem is with new target-snowflake meltanolabs variant.
It ALWAYS expects a state message, and SDK based taps will always output a state message even if there was no records synced.
tap-spreadsheets-anywhere (and our connection variants) are not built on the SDK, and it does not output state if there are no records.
In combination, the tap sent no state, the target then set the no state ({}) it “finds”, overwriting all existing state.
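A rough Python sketch of the interaction described above, assuming Singer-style STATE messages of the form {"type": "STATE", "value": ...}. The two function names are hypothetical and this is not how target-postgres, target-snowflake, or the Singer SDK are implemented; it only illustrates why "no STATE message" and "empty state" need to be told apart.

```python
import json


def final_state_naive(messages):
    """Write whatever state was seen, defaulting to {} (hypothetical)."""
    seen = {}
    for line in messages:
        msg = json.loads(line)
        if msg.get("type") == "STATE":
            seen = msg["value"]
    # With no STATE message at all, this returns {} and the stored
    # bookmark would be overwritten with an empty state.
    return seen


def final_state_guarded(messages, stored_state):
    """Keep the stored state when the tap emitted no STATE message."""
    seen = None
    for line in messages:
        msg = json.loads(line)
        if msg.get("type") == "STATE":
            seen = msg["value"]
    return stored_state if seen is None else seen


stored = {"raw_data": {"modified_since": "2025-05-12T22:10:29.362435+00:00"}}

# A run where no new files were found, so the tap emitted nothing.
no_new_files = []
naive_result = final_state_naive(no_new_files)
guarded_result = final_state_guarded(no_new_files, stored)

# A run where the tap did emit a newer bookmark.
newer = {"raw_data": {"modified_since": "2025-05-13T00:00:00+00:00"}}
with_state = [json.dumps({"type": "STATE", "value": newer})]
updated_result = final_state_guarded(with_state, stored)
```

The guarded variant is only one possible workaround; having the tap re-emit its last known state on every run would avoid the problem from the other side.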
Steven Searcy
05/13/2025, 8:41 PM
Edgar Ramírez (Arch.dev)
05/13/2025, 11:37 PM
Matt Menzenski
05/13/2025, 11:41 PM