# troubleshooting
c
Hello, I'm using the https://github.com/MeltanoLabs/tap-universal-file tap to get data from S3. Is there any way to override the schema definition in the tap? Currently it auto-detects the schema based on the first record in the file.
a
What do you want to achieve with overriding? Change datatype of a column? Rename a column?
c
I want to hardcode the schema/catalog in the tap. I know it's possible in the tap config, but is there any way to reference a catalog file from the tap config?
v
What type of file are you pulling? For delimited files (CSVs etc.) there's `delimited_override_headers` (see https://github.com/MeltanoLabs/tap-universal-file/tree/main#:~:text=delimited_override_headers ). `select` itself lets you select / deselect fields in the tap. Which gets to @Andy Carter's point: what's your purpose in overriding the schema? Whether the tap supports it depends on what you're trying to do. You can definitely override the schema here: https://docs.meltano.com/concepts/plugins#schema-extra
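For readers following along, here is a rough conceptual model of what a per-property schema override does to a discovered schema. It is only a sketch: the function name `apply_schema_overrides`, the stream name `events`, and the property names are hypothetical, and this is not Meltano's actual implementation.

```python
import copy

def apply_schema_overrides(discovered, overrides):
    """Return a copy of the discovered schemas with per-property overrides merged in."""
    result = copy.deepcopy(discovered)
    for stream, props in overrides.items():
        # Create the stream entry if discovery did not produce one.
        stream_schema = result.setdefault(stream, {"type": "object", "properties": {}})
        stream_props = stream_schema.setdefault("properties", {})
        for prop, definition in props.items():
            # Override keys win; untouched keys of the discovered property are kept.
            stream_props.setdefault(prop, {}).update(definition)
    return result

# Hypothetical discovered schema: only "id" was seen in the first record.
discovered = {"events": {"type": "object", "properties": {"id": {"type": ["integer"]}}}}
# Hypothetical overrides: widen "id" and declare a column the sample missed.
overrides = {"events": {"id": {"type": ["integer", "null"]},
                        "price": {"type": ["number", "null"]}}}

merged = apply_schema_overrides(discovered, overrides)
```

The input is left unmodified, so the detected schema can still be inspected next to the overridden one.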
c
Yep, thanks. We can also provide a schema JSON path in the tap here: https://github.com/meltano/sdk/blob/main/singer_sdk/streams/core.py#L106
@visch The tap creates the schema from the first record of the first file on S3, but some of the files have a different schema, so I needed to provide an aggregated schema manually.
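To make "aggregated schema" concrete, here is a minimal stdlib-only sketch that unions the top-level columns of every record across several JSONL files. The function name `aggregate_schema` and the sample records are made up for illustration, and nested objects are not inspected; this is not what the tap does today.

```python
import json

# Map Python types (as produced by json.loads) to JSON Schema type names.
_JSON_TYPES = {
    bool: "boolean", int: "integer", float: "number",
    str: "string", list: "array", dict: "object", type(None): "null",
}

def aggregate_schema(lines):
    """Union the top-level keys and observed types of every JSONL record."""
    observed = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # assumes each line is a JSON object
        for key, value in record.items():
            observed.setdefault(key, set()).add(_JSON_TYPES[type(value)])
    return {
        "type": "object",
        "properties": {
            # Every column is marked nullable, since any given file may omit it.
            key: {"type": sorted(types | {"null"})}
            for key, types in observed.items()
        },
    }

# Two hypothetical files whose records carry different columns.
file_a = ['{"id": 1, "name": "a"}']
file_b = ['{"id": 2, "price": 9.5}']
schema = aggregate_schema(file_a + file_b)
print(json.dumps(schema, indent=2))
```

The result could be saved to a JSON file and supplied as the manual schema, instead of relying on the first record alone.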
v
Nice, so `delimited_override_headers` worked for you here?
c
I'm fetching JSONL files, not CSV.
v
Did you inherit from the universal file tap to set up multiple taps, one for each of the different files? Mostly curious now about your use case, as we're still trying to understand how to design this tap #C05CNUF699B 😄. We went with a single definition for files instead of offering a list of files, and I'm still not 100% sold on it.
Ok, now we're getting somewhere! A `jsonl` file, so we're talking about the `jsonl_sampling_strategy` configuration here.
c
Tap https://github.com/MeltanoLabs/tap-universal-file from MeltanoLabs already supports multiple file types
v
I know, we made it 😅
It wasn't clear to us how to do `jsonl_sampling_strategy` in a way that made a lot of sense. @chintan_patel you're pulling in multiple jsonl files with the same schema, right? So you want to provide a schema that these files adhere to?
c
No, the schemas are slightly different.
So I wanted to cover all the columns in the schema.
v
Got it, so `jsonl_sampling_strategy` could support `all`, which would sample the whole file using something like https://github.com/wolverdude/GenSON and auto-fill the JSON schema. Then `jsonl_type_coercion_strategy` would fit into this as well. Hmmm. The hack we have right now is to use `jsonl_type_coercion_strategy` and then use `envelope` or `any`, depending on what you're after. `envelope` will put everything in a record object you can expand in your target. I don't love it. Sounds like for you it'd be much better if we sampled everything with GenSON and provided a schema, or if you could provide your own schema mappings for the file.
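As an illustration of the `envelope` idea described above, the sketch below wraps each JSONL record under a single `record` key so that one fixed schema fits files with differing columns. The names `ENVELOPE_SCHEMA` and `envelope` and the sample rows are assumptions for this example, not the tap's actual code.

```python
import json

# Assumed fixed schema for the envelope approach: one object-typed property.
ENVELOPE_SCHEMA = {
    "type": "object",
    "properties": {"record": {"type": "object"}},
}

def envelope(lines):
    """Yield each JSONL record wrapped under a single "record" key."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield {"record": json.loads(line)}

# Hypothetical records with different columns share the same outer shape.
rows = list(envelope(['{"id": 1, "name": "a"}', '{"id": 2, "price": 9.5}']))
```

The trade-off is the one mentioned in the thread: the schema never needs to change, but the target has to expand the `record` object to get real columns.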
The "right" way here, just to be sure I say it (it would take dev work), is to add support for a `jsonl_sampling_strategy` of `all`, and also add a `jsonl_type_coercion_strategy` of `detect` (I think, or something similar). It's not clear how we'd want to provide the manual override; let me think.
c
I already made the change in my fork to provide the schema manually. I can open a PR if you want.
v
@chintan_patel yes please! That would be great 😄