I am testing out the tap-spreadsheets-anywhere tap...
# plugins-general
w
I am testing out the tap-spreadsheets-anywhere tap on a S3 bucket. It looks like I can pass a catalog json file for specification (which includes a s3 bucket and paths for the files I want to ingest). My bucket contains data under a directory structure {year}/{month}/{day}/{hour}. Ideally, I want to pass a dynamic variable to the search path on an hourly execution of the pipeline so that only the files listed under the matching criteria are selected. Is it possible to override search_path as a command line arg when calling
meltano elt
? Then, I could avoid re-writing a catalog json file every hour and instead pass in the relevant execution time as a variable!
d
@wyatt Ideally you could use env var substitution here, so that you could say
search_path: $SEARCH_PATH
in your
meltano.yml
config
for
tap-spreadsheets-anywhere
, and then set
SEARCH_PATH
when calling
meltano elt
as in
SEARCH_PATH=foo meltano elt ...
, but env var substitution doesn't currently work in arrays, and I think
tap-spreadsheets-anywhere
uses an array to define the tables: https://gitlab.com/meltano/meltano/-/issues/2481
That issue hasn't become a priority yet, but it's come up a few times recently, so I've assigned to @aaronsteers to possibly look into next week, or to at least have it on our radar again
@wyatt How are you currently setting the search path through the catalog file? There may be a method to make it dynamic using https://meltano.com/docs/plugins.html#metadata-extra
w
Thanks for the quick response @douwe_maan! My hope was to not use a catalog json file, but set the necessary 'tables' key through environment variables or something similar. I am using Airflow on k8s (k8s executor) and spinning up a Meltano image via k8s pod operator. I would prefer not to mount a volume in order to keep this lean - hence trying to set the catalog either as env vars or when the pipeline command is invoked.
d
What change are you making in the catalog file? It may be possible to automate that through an env var as well
w
I called
meltano config tap-spreadsheets-anywhere list
to check if env vars can be set. It looks like this particular tap does not have any possible configs?
d
That should at least list
tables
, does it not list anything?
w
it returns no results!
meltano select --list --all tap-spreadsheets-anywhere
returns
Exception: Config is missing required keys: ['tables']
d
What does
meltano config tap-spreadsheets-anywhere
say?
w
that returns
{}
d
Can you share your
meltano.yml
definition for the tap?
w
Copy code
- name: tap-spreadsheets-anywhere
    namespace: tap_spreadsheets_anywhere
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    executable: tap-spreadsheets-anywhere
    capabilities:
    - catalog
    - discover
    - state
so no tables var is set, yet
d
Ah, looks like you added it as a custom plugin instead of using the version supported out of the box that has metadata on
tables
. I notice now that https://hub.meltano.com/taps/spreadsheets-anywhere doesn't refer to https://hub.meltano.com/extractors/spreadsheets-anywhere as it should
Can you remove the
namespace
,
executable
, and
capabilities
keys?
Then
meltano config tap-spreadsheets-anywhere list
should at least show the `tables`` setting
w
oh!! I just removed it and added it - heart is beating a little faster because now I see tables as a setting
progress is being made
d
Yay!
So now you can at least override the entire value for
tables
using the env var you should see listed
w
let me now try to set this variable dynamically on my k8s operator
d
Like I mentioned, overriding just the
search_path
is not possible yet because of https://gitlab.com/meltano/meltano/-/issues/2481
w
yes, I will override the whole thing
new error - which is a great thing! going to fix that now...
d
😄
w
okay so I fixed the one error which was caused because I literally passed in "tables" as a key when it expected just the tables array value.
now it displays
Processing 0 selected streams from Catalog
d
That's progress 🙂
Does
meltano select --list --all tap-spreadsheets-anywhere
show anything?
w
Copy code
Legend:
        selected
        excluded
        automatic

Enabled patterns:
        *.*

Selected attributes:
d
OK, if it's not listing any possible streams, the
tables
config is likely incorrect and it's not able to find any files/streams
Can you run
meltano invoke tap-spreadsheets-anywhere --discover
and see if you see any relevant logs?
w
Copy code
...
ERROR Unable to write Catalog entry for {tablename} - it will be skipped due to error {file} could not be parsed: Extra data: line 1 column 1716 (char 1715)
looks like it found the 5 files in s3 correctly though! so definitely getting somewhere
I have to run for now, but will continue down this path tomorrow
thanks for unblocking me! I appreciate it greatly as I had struggled with the original issue for a couple of hours
d
Extra data: line 1 column 1716 (char 1715)
looks like a JSON parsing error
I'm curious what's at char 1715 🙂
hah yes same thought here - json parsing error
{
is the char and it looks like the files are non-valid json
d
😬
w
hah! well now I need to dig into what is writing these upstream...
funny enough, Amazon is writing these logs as part of https://aws.amazon.com/connect/
smells like a bug to me!
okay so I believe the only way to resolve this issue is parse the log file into proper json format, which would require
.replace('}{"', '}\n{"')
this transform would turn the multiple json objects into separate lines
d
The tap may still not expect multiple JSON objects in one file, even when separated by line
w
gotcha - so sounds like to get this to work as off the shelf EL I would have to transform the files before
TEL if you will
d
Yeah, looks like it
w
@douwe_maan thanks again for all the help! Multiple JSON objects in one file are supported, they just need to be wrapped as an array like [{}, {}, {}]