# plugins-general
s
Having trouble finding currently maintained taps for S3 CSVs. Anyone have any recommendations for this? `tap-s3-csv` seems to have been sunset. I am trying to set up a pipeline using a CSV, and I am also unsure whether any of these taps support incremental replication.
e
s
Yeah, the first link is a fork of the sunset pipelinewise tap, and I'm wondering if I should still use it. Will give the second option a go.
👍 1
@Edgar Ramírez (Arch.dev) Any ideas on incremental loading using `tap-csv`? As I understand, that's not exactly offered with the tap. Curious if you know any effective workarounds.
m
Maybe https://hub.meltano.com/extractors/tap-spreadsheets-anywhere?
We’ve been running tap-spreadsheets-anywhere in production for a couple of years now. I think we have ~70ish taps that use it across all our environments. It worked fine for incremental loading until we started running meltano within Dagster. Then we started having weirdness with meltano state getting lost when a Dagster run is canceled. I still haven’t gotten to the bottom of that - but since our migration to Dagster this tap has effectively run in “full table” mode even though it’s configured to be incremental. Not ideal. (But IMO I think it’s more to do with the dagster-shell / dagster-meltano libraries than the tap itself. I’d still recommend the tap for a non-Dagster context)
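For context, our setup looks roughly like this (a minimal sketch; the bucket, prefix, pattern, and table name below are illustrative placeholders, not our real config):

```yaml
# Hypothetical meltano.yml excerpt for tap-spreadsheets-anywhere.
# All names and paths below are placeholders.
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      config:
        tables:
          - path: s3://example-bucket           # any smart_open-compatible URI
            name: raw_foods                     # destination stream/table name
            search_prefix: exports/             # narrows the S3 listing
            pattern: ".*\\.csv"                 # regex applied to object keys
            start_date: "2023-01-01T00:00:00Z"  # files modified before this are skipped
            format: csv
            key_properties:
              - id
```

On later runs, the last-modified timestamp written to state takes the place of `start_date`, which is what gives you the "only new files" behavior.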
ty 1
s
Maybe I am missing something, but I don't see that `tap-spreadsheets-anywhere` supports incremental loading. Specifically, I am referring to the metadata. Apologies, I am only a few days into Meltano.
m
lol, wouldn’t that be funny if _that_’s why I’m having these issues 😂 It wouldn’t surprise me if you were right. This tap is not built on the Meltano SDK which makes the internals hard for me to follow
I have contributed to this tap; I should know whether it supports this.
s
Yeah, I haven't poked around in the repository, but it's definitely left out of the documentation. I just really don't want to rebuild the table every single time, and would prefer incremental replication.
m
ok I’m not crazy, lol. I don’t see any support for “incremental” replication mode per se, but the readme does say:
This tap is designed to continually poll a configured directory for any unprocessed files that match a table configuration and to process any that are found. On the first syncing run, the declared start_date will be used to filter the set of files that match the search_prefix and pattern expressions. The last modified date of the most recently synced file will then be written to state and used in place of start_date on the next syncing run.
While state is maintained, only new files will be processed from subsequent runs.
So it’s “supposed” to process files incrementally, and not reprocess already-processed files
and that behavior worked for us, before we moved to Dagster
s
Ahh I see. So it's already baked in.
m
you might also want to look at https://github.com/MeltanoLabs/tap-universal-file , which as I understand it is basically doing the same thing as tap-spreadsheets-anywhere (supporting arbitrary files) but built with the Meltano SDK
I’ve been wanting to try that but haven’t yet (it doesn’t support short-lived AWS credentials, which is a requirement for us)
s
Well, actually, that's file-specific. I am thinking more of row-level deduplication.
m
that’s something you should be able to accomplish by setting key_properties (depending on the target)
s
Sorry not deduplication, but upserts based on updated_at, etc.
m
key_properties:
  - _smart_source_file
  - _smart_source_lineno
we set these ones in the tap-spreadsheets-anywhere tables config, with the intent that a unique record is identified by its source file and line number
I think because it’s not based on the SDK, tap-spreadsheets-anywhere doesn’t produce the same metadata that some taps do. But you can generally set `add_record_metadata: true` in your target to get standard record metadata fields.
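In meltano.yml that would look something like this (assuming the MeltanoLabs target-postgres variant; sketch only):

```yaml
# Hypothetical loader config -- add_record_metadata asks the target to add
# _sdc_* metadata columns (e.g. _sdc_extracted_at) to each loaded table.
plugins:
  loaders:
    - name: target-postgres
      variant: meltanolabs
      config:
        add_record_metadata: true
```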
s
Awesome, yeah that definitely helps. I think you're right: I was not able to use stream maps out of the box and had to use a mapper for that.
Thanks again for the guidance!
👍 1
e
I've also been working on https://github.com/meltanolabs/tap-csv-folder (not on the Hub yet) which is an iteration I think is slightly more approachable than tap-universal-file.
👀 3
a
Add me to the list of fairly happy `tap-spreadsheets-anywhere` users, it's a real swiss army knife. I also use meltano with dagster, and have found the state management / 'don't reprocess unchanged files' to work fine. Defining your primary keys is really essential, and then the challenge is stopping users chucking in any old crap into a sheet :)
🤔 1
👍 3
s
@Matt Menzenski So, this works great locally, but I am having issues in production: every time I pull in a new Docker image, the state is essentially reset. I am wondering if I need to set the `.meltano` directory up as a bind mount.
a
@Steven Searcy where are you persisting state to? An external db or s3 like filestore? It sounds like you are using the default sqlite db perhaps. https://docs.meltano.com/concepts/state_backends
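If you do want an external backend, the S3 config is roughly this (a sketch; the bucket name and environment variables are placeholders):

```yaml
# Hypothetical meltano.yml excerpt for an S3 state backend.
state_backend:
  uri: s3://my-meltano-state-bucket/state   # prefix under which state.json files are written
  s3:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
```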
1
s
@Andy Carter That's the odd part, as I am using S3 as the state backend, and it appears to be writing to the file in the bucket, but the file is effectively empty. Pipeline logs:
2025-05-13T02:40:29.552698Z [info     ] Using S3StateStoreManager add-on state backend
2025-05-13T02:40:29.725469Z [info     ] smart_open.s3.MultipartWriter('meltano-state-develop', 'state/dev:tap-spreadsheets-anywhere-to-target-postgres--ndo/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
2025-05-13T02:40:29.781329Z [info     ] Writing state to AWS S3       
2025-05-13T02:40:29.798834Z [info     ] smart_open.s3.MultipartWriter('meltano-state-develop', 'state/dev:tap-spreadsheets-anywhere-to-target-postgres--ndo/state.json'): uploading part_num: 1, 129 bytes (total 0.000GB)
2025-05-13T02:40:29.884007Z [info     ] Incremental state has been updated at 2025-05-13 02:40:29.883969+00:00.
2025-05-13T02:40:29.884702Z [info     ] 2025-05-13 02:40:29,552 | INFO     | singer_sdk.metrics   | METRIC: {"type": "counter", "metric": "record_count", "value": 82010, "tags": {"stream": "raw_foods", "pid": 16}} cmd_type=elb consumer=True job_name=dev:tap-spreadsheets-anywhere-to-target-postgres--ndo name=target-postgres--ndo producer=False run_id=2a8aa8be-4f12-470f-8db6-4cf9de33ccbf stdio=stderr string_id=target-postgres--ndo
state.json
{
  "completed": {
    "singer_state": {}
  },
  "partial": {}
}
I think I've figured out the issue. On the first run without state, the state.json file is written correctly:
{
  "completed": {
    "singer_state": {
      "raw_data": {
        "modified_since": "2025-05-12T22:10:29.362435+00:00"
      }
    }
  },
  "partial": {}
}
However, if I run the pipeline again using that state, it does what it's supposed to, which is to skip extract and load. However, it removes the state from the state.json file:
{
  "completed": {
    "singer_state": {}
  },
  "partial": {}
}
No idea why it is doing this though.
So this appears to be the default behavior, but why? 🤔 One would think it should re-emit the last known state even if it didn't find new data, but it doesn't.
e
One would think it should re-emit the last known state even if it didn't find new data, but it doesn't.
Hey @Steven Searcy, I think you're using MeltanoLabs/target-postgres, so this might be solved by addressing this in our Singer SDK: https://github.com/meltano/sdk/issues/3000. (cc @Reuben (Matatika) does that seem like the same problem you were running into?)
s
Yep, that's the one. I will note on the issue that I am experiencing this as well.
Thanks Edgar!
r
Yes - it was with `tap-spreadsheets-anywhere` also. From our internal issue tracking:
Problem is with the new meltanolabs `target-snowflake` variant. It ALWAYS expects a state message, and SDK-based taps will always output a state message even if no records were synced. `tap-spreadsheets-anywhere` (and our connection variants) are not built on the SDK, and it does not output state if there are no records. In combination, the tap sent no state, and the target then set the "no state" (`{}`) it "finds", overwriting all existing state.
ty 1
1
s
@Edgar Ramírez (Arch.dev) Thanks for the quick fix!
e
np! thanks everyone for the pointers 🙂
m
Whoa!! This is awesome news