# plugins-general
s
Having trouble finding currently maintained taps for S3 CSVs. Anyone have any recommendations for this? `tap-s3-csv` seems to have been sunset. I am trying to set up a pipeline using a CSV, and I am also unsure whether any of these taps support incremental replication.
e
s
Yeah, the first link is a fork of the sunset pipelinewise tap, and I'm wondering if I should still use it. Will give the second option a go.
👍 1
@Edgar Ramírez (Arch.dev) Any ideas on incremental loading using `tap-csv`? As I understand, that's not exactly offered with the tap. Curious if you know any effective workarounds.
m
Maybe https://hub.meltano.com/extractors/tap-spreadsheets-anywhere?
We’ve been running tap-spreadsheets-anywhere in production for a couple of years now. I think we have ~70ish taps that use it across all our environments. It worked fine for incremental loading until we started running meltano within Dagster. Then we started having weirdness with meltano state getting lost when a Dagster run is canceled. I still haven’t gotten to the bottom of that - but since our migration to Dagster this tap has effectively run in “full table” mode even though it’s configured to be incremental. Not ideal. (But IMO I think it’s more to do with the dagster-shell / dagster-meltano libraries than the tap itself. I’d still recommend the tap for a non-Dagster context)
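For context, our setup looks roughly like this (a minimal sketch; the bucket, prefix, pattern, and table name below are illustrative placeholders, not our real config):

```yaml
# Hypothetical meltano.yml excerpt for tap-spreadsheets-anywhere.
# All names and paths below are placeholders.
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      config:
        tables:
          - path: s3://example-bucket           # any smart_open-compatible URI
            name: raw_foods                     # destination stream/table name
            search_prefix: exports/             # narrows the S3 listing
            pattern: ".*\\.csv"                 # regex applied to object keys
            start_date: "2023-01-01T00:00:00Z"  # files modified before this are skipped
            format: csv
            key_properties:
              - id
```

On later runs, the last-modified timestamp written to state takes the place of `start_date`, which is what gives you the "only new files" behavior.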
ty 1
s
Maybe I am missing something, but I don't see that `tap-spreadsheets-anywhere` supports incremental loading. Specifically, I am referring to the metadata. Apologies, I am only a few days into Meltano.
m
lol, wouldn’t that be funny if _that_’s why I’m having these issues 😂 It wouldn’t surprise me if you were right. This tap is not built on the Meltano SDK which makes the internals hard for me to follow
I have contributed to this tap; I should know whether it supports this.
s
Yeah, I haven't poked around in the repository, but it's definitely left out of the documentation. I just really don't want to rebuild the table every single time, and would prefer incremental replication.
m
ok I’m not crazy, lol. I don’t see any support for “incremental” replication mode per se, but the readme does say:
This tap is designed to continually poll a configured directory for any unprocessed files that match a table configuration and to process any that are found. On the first syncing run, the declared start_date will be used to filter the set of files that match the search_prefix and pattern expressions. The last modified date of the most recently synced file will then be written to state and used in place of start_date on the next syncing run.
While state is maintained, only new files will be processed from subsequent runs.
So it’s “supposed” to process files incrementally, and not reprocess already-processed files
and that behavior worked for us, before we moved to Dagster
s
Ahh I see. So it's already baked in.
m
you might also want to look at https://github.com/MeltanoLabs/tap-universal-file , which as I understand it is basically doing the same thing as tap-spreadsheets-anywhere (supporting arbitrary files) but built with the Meltano SDK
I’ve been wanting to try that but haven’t yet (it doesn’t support short-lived AWS credentials, which is a requirement for us)
s
Well, actually, that's file-specific. I am thinking more of row-level deduplication.
m
that’s something you should be able to accomplish by setting key_properties (depending on the target)
s
Sorry not deduplication, but upserts based on updated_at, etc.
m
key_properties:
  - _smart_source_file
  - _smart_source_lineno
we set these ones in the tap-spreadsheets-anywhere tables config, with the intent that a unique record is identified by its source file and line number
I think because it’s not based on the SDK, tap-spreadsheets-anywhere doesn’t produce the same metadata that some taps do. But you can generally set `add_record_metadata: true` in your target to get standard record metadata fields.
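In meltano.yml that would look something like this (assuming the MeltanoLabs target-postgres variant; sketch only):

```yaml
# Hypothetical loader config -- add_record_metadata asks the target to add
# _sdc_* metadata columns (e.g. _sdc_extracted_at) to each loaded table.
plugins:
  loaders:
    - name: target-postgres
      variant: meltanolabs
      config:
        add_record_metadata: true
```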
s
Awesome, yeah that definitely helps. I think you're right: I was not able to use stream maps out of the box and had to use a mapper for that.
Thanks again for the guidance!
👍 1
e
I've also been working on https://github.com/meltanolabs/tap-csv-folder (not on the Hub yet) which is an iteration I think is slightly more approachable than tap-universal-file.
👀 3
a
Add me to the list of fairly happy `tap-spreadsheets-anywhere` users, it's a real swiss army knife. I also use meltano with dagster, and have found the state management / 'don't reprocess unchanged files' to work fine. Defining your primary keys is really essential, and then the challenge is stopping users chucking in any old crap into a sheet :)
🤔 1
👍 3
s
@Matt Menzenski So, this works great locally, but I am having issues in production: every time I pull in a new Docker image, the state is essentially reset. I am wondering if I need to set the `.meltano` directory up as a bind mount.
a
@Steven Searcy where are you persisting state to? An external db or s3 like filestore? It sounds like you are using the default sqlite db perhaps. https://docs.meltano.com/concepts/state_backends
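If you do want an external backend, the S3 config is roughly this (a sketch; the bucket name and environment variables are placeholders):

```yaml
# Hypothetical meltano.yml excerpt for an S3 state backend.
state_backend:
  uri: s3://my-meltano-state-bucket/state   # prefix under which state.json files are written
  s3:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
```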
1
s
@Andy Carter That's the odd part, as I am using S3 as the state backend, and it appears to be writing to the file in the bucket, but the file is effectively empty. Pipeline logs:
2025-05-13T02:40:29.552698Z [info     ] Using S3StateStoreManager add-on state backend
2025-05-13T02:40:29.725469Z [info     ] smart_open.s3.MultipartWriter('meltano-state-develop', 'state/dev:tap-spreadsheets-anywhere-to-target-postgres--ndo/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
2025-05-13T02:40:29.781329Z [info     ] Writing state to AWS S3       
2025-05-13T02:40:29.798834Z [info     ] smart_open.s3.MultipartWriter('meltano-state-develop', 'state/dev:tap-spreadsheets-anywhere-to-target-postgres--ndo/state.json'): uploading part_num: 1, 129 bytes (total 0.000GB)
2025-05-13T02:40:29.884007Z [info     ] Incremental state has been updated at 2025-05-13 02:40:29.883969+00:00.
2025-05-13T02:40:29.884702Z [info     ] 2025-05-13 02:40:29,552 | INFO     | singer_sdk.metrics   | METRIC: {"type": "counter", "metric": "record_count", "value": 82010, "tags": {"stream": "raw_foods", "pid": 16}} cmd_type=elb consumer=True job_name=dev:tap-spreadsheets-anywhere-to-target-postgres--ndo name=target-postgres--ndo producer=False run_id=2a8aa8be-4f12-470f-8db6-4cf9de33ccbf stdio=stderr string_id=target-postgres--ndo
state.json
{
  "completed": {
    "singer_state": {}
  },
  "partial": {}
}
I think I've figured out the issue. On the first run without state, the state.json file is written correctly:
{
  "completed": {
    "singer_state": {
      "raw_data": {
        "modified_since": "2025-05-12T22:10:29.362435+00:00"
      }
    }
  },
  "partial": {}
}
However, if I run the pipeline again using that state, it does what it's supposed to, which is to skip extract and load. However, it removes the state from the state.json file:
{
  "completed": {
    "singer_state": {}
  },
  "partial": {}
}
No idea why it is doing this though.
So this appears to be the default behavior, but why? 🤔 One would think it should re-emit the last known state even if it didn't find new data, but it doesn't.
e
One would think it should re-emit the last known state even if it didn't find new data, but it doesn't.
Hey @Steven Searcy, I think you're using MeltanoLabs/target-postgres, so this might be solved by addressing this in our Singer SDK: https://github.com/meltano/sdk/issues/3000. (cc @Reuben (Matatika) does that seem like the same problem you were running into?)
s
Yep, that's the one. I will note on the issue that I am experiencing this as well.
Thanks Edgar!
r
Yes - it was with `tap-spreadsheets-anywhere` also. From our internal issue tracking:
Problem is with the new meltanolabs `target-snowflake` variant. It ALWAYS expects a state message, and SDK-based taps will always output a state message even if no records were synced. `tap-spreadsheets-anywhere` (and our connection variants) are not built on the SDK, and it does not output state if there are no records. In combination, the tap sent no state, and the target then set the "no state" (`{}`) it "finds", overwriting all existing state.
ty 1
1
s
@Edgar Ramírez (Arch.dev) Thanks for the quick fix!
e
np! thanks everyone for the pointers 🙂
m
Whoa!! This is awesome news