# getting-started
b
Hi everyone, just starting out with Meltano and it seems to be the tool that will make us abandon our own 'import data' scripts. But there is one thing that I cannot figure out. We receive a weekly CSV file with snapshots of a database. We've been getting this CSV for over a decade, so we have quite a large historical database. I've already set up a tap-csv and target-postgres that loads incrementally (hooray!). But… what I would also like to do is convert every CSV file to a corresponding Parquet file and store it on S3 (MinIO, actually). I'm able to get a single Parquet on S3 with all the data, but I would like to split them into separate files. Is this scenario possible with Meltano?
a
How many csvs do you have that you want separate parquets for?
Can you share the `meltano.yml` as it stands?
b
It's about a thousand CSVs. My meltano.yml is below:
```yaml
version: 1
default_environment: dev
project_id: c4547xxxxxxxx
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      add_metadata_columns: true
      csv_files_definition: csv_files_definitions.json
    metadata:
      studielink:
        replication-method: INCREMENTAL
        replication-key: id
        id:
          is-replication-key: true
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    executable: meltano-map-transform
    mappings:
    - name: add_telbestanden_metadata
      config:
        stream_maps:
          telbestanden:
            source_filename: _sdc_source_file.split('/')[-1].split(".")[0]
            collegejaar: _sdc_source_file.split('/')[-1].split(".")[0][14:18]
            volgnummer: _sdc_source_file.split('/')[-1].split(".")[0][20:22]
            peildatum: _sdc_source_file.split('/')[-1].split(".")[0][23:]
            # an id is created for every row in a telbestand; it is a
            # combination of peildatum, collegejaar and row number
            id: "int(_sdc_source_file.split('/')[-1].split('.')[0][23:] + _sdc_source_file.split('/')[-1].split('.')[0][14:18] + str(_sdc_source_lineno).zfill(6))"
            __key_properties__: ["id"]
  loaders:
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      cloud_provider.aws.aws_access_key_id: xxxxxxxxxxxxx
      cloud_provider.aws.aws_bucket: xxx
      cloud_provider.aws.aws_region: eu-west-1
      cloud_provider.aws.aws_endpoint_override: xxxxxxxxxxxxxxxxx
      cloud_provider.aws.aws_secret_access_key: xxxxxxxxxxxxxxxxx
      format.format_type: parquet
      prefix: meltano
      append_date_to_filename: false
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  - name: target-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/target-postgres.git
    config:
      add_record_metadata: true
      database: meltano
      default_target_schema: raw
      user: docker
      host: localhost
      password: docker
```
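For anyone reading along, here is a rough Python sketch of what those stream-map expressions compute. The filename below is invented to match the slice positions in the config (`[14:18]`, `[20:22]`, `[23:]`); the real naming scheme isn't shown in the thread:

```python
# Hypothetical walk-through of the stream-map expressions in the config above.
# NOTE: this filename is made up so that the slices land on plausible fields —
# the actual weekly filenames are not shown in the thread.
_sdc_source_file = "/data/telbestanden__2023__01_20230725.csv"
_sdc_source_lineno = 42

# filename without directory and extension
stem = _sdc_source_file.split("/")[-1].split(".")[0]

collegejaar = stem[14:18]  # academic year, e.g. "2023"
volgnummer = stem[20:22]   # sequence number, e.g. "01"
peildatum = stem[23:]      # snapshot date, e.g. "20230725"

# one id per row: peildatum + collegejaar + zero-padded row number,
# mirroring the `id:` expression in the mapper config
row_id = int(peildatum + collegejaar + str(_sdc_source_lineno).zfill(6))

print(collegejaar, volgnummer, peildatum, row_id)
```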
a
And all the files listed in your definitions.json are ending up in a single parquet? I don't work with these taps, but normally I would expect each individual stream (CSV file name) to have its own parquet target.
b
I don't reference every CSV individually, I just point to a directory containing all CSVs. So in my definitions.json there is only:
```json
"path": "/Users/bramenning/GitHub/open-source-data-stack/data",
```
a
Ah, that explains it. So all files under that path are being treated as one 'stream', and are therefore written to one parquet.
If you were to specify the individual CSVs in defs.json, then each should get its own stream and its own output parquet file. But by the sounds of it, this is non-trivial for your case 🙂
b
Exactly 🙂, because that would mean a change to my Meltano config every week when a new CSV arrives.
a
Do the CSVs follow a naming pattern? Are they all named uniquely?
b
yes they are
a
You can use something like `tap-spreadsheets-anywhere` to pattern-match files and group similarly named files together into one stream.
But you have to know your 'groups' in advance, if that makes sense.
So if you have `db1-2023-07-25.csv` and `db1-2023-07-24.csv` files, these can be matched into a `db1` stream, but if a `db2-` file arrives, nothing will happen with it.
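To make the grouping idea concrete, here's a small illustrative Python sketch (not the tap's actual code): each configured pattern collects a family of files into one stream, and files matching no known pattern are simply ignored.

```python
import re
from collections import defaultdict

# Illustrative sketch of pattern-based stream grouping, using the example
# filenames from the thread. The set of patterns must be known in advance.
files = ["db1-2023-07-25.csv", "db1-2023-07-24.csv", "db2-2023-07-25.csv"]
patterns = {"db1": re.compile(r"^db1-.*\.csv$")}

streams = defaultdict(list)
ignored = []
for f in files:
    for stream, pat in patterns.items():
        if pat.match(f):
            streams[stream].append(f)
            break
    else:
        ignored.append(f)  # the db2- file falls through: nothing happens with it

print(dict(streams), ignored)
```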
b
Thanks, that sounds doable! I’ll give it a try!
a
I think this sort of behaviour is supported with your existing tap if your files are partitioned into separate subdirectories. But if they are all in a single folder, you'll need some kind of regex match to split them into your groups, which `tap-spreadsheets-anywhere` supports.
b
I can't seem to get `tap-spreadsheets-anywhere` working. When I run Meltano it says:
```
2023-07-26T13:33:24.748905Z [info     ] Environment 'dev' is active
2023-07-26T13:33:26.270667Z [warning  ] No state was found, complete import.
2023-07-26T13:33:27.842568Z [info     ] INFO Using supplied catalog /Users/my_name/GitHub/open-source-data-stack/meltano/ed2c/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-07-26T13:33:27.843119Z [info     ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-07-26T13:33:28.071768Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True
```
This is in my meltano.yml:
```yaml
- name: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  namespace: tap_spreadsheets_anywhere
  executable: tap-spreadsheets-anywhere
  capabilities:
  - catalog
  - discover
  - state
  config:
    tables:
    - path: "file:///Users/myname/GitHub/open-source-data-stack/data"
      name: "telbestanden"
      pattern: "*.csv"
      start_date: "2009-12-10T13:49:51.141Z"
      key_properties: []
      format: "csv"
      delimiter: ";"
```
a
What meltano command are you using? There is an optional `--dev` switch to add more info.
b
```
2023-07-27T06:17:10.736204Z [info     ] Environment 'dev' is active
INFO Using supplied catalog /Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json.
INFO Processing 0 selected streams from Catalog
```
It does not get more verbose running with `--dev`.
a
What command are you running? I found `--dev` only works in some contexts.
You could also try clearing the catalog file and see if that helps.
b
Hi, I'm running `meltano invoke tap-spreadsheets-anywhere --dev`, and after removing the catalog files at `.meltano/extractors/tap-spreadsheets-anywhere` the output became a little more verbose:
```
2023-07-28T07:12:23.553324Z [info     ] Environment 'dev' is active
INFO Generating catalog through sampling.
INFO Walking /Users/bramenning/GitHub/open-source-data-stack/data.
INFO Found 7 files.
ERROR Unable to write Catalog entry for 'telbestanden' - it will be skipped due to error nothing to repeat at position 0
INFO Processing 0 selected streams from Catalog
```
Can it be that the tap doesn't find any CSVs that match the pattern, so there is nothing to write to the Catalog?
a
No, that would be a different message. Looks like @visch may have highlighted this: https://github.com/ets/tap-spreadsheets-anywhere/pull/49/commits/fa5b8b6a0cc77f1d153afb58895392212daf28f2 Try `"file://c:/Users/myname/GitHub/open-source-data-stack/data"` for your path? It looks like the files are being found correctly, but then the catalog entry fails before those files can be returned to be processed.
b
I'm on a Mac, so the `c:/` part will not work; I think `file:///` is correct for my setup.
a
My mistake, I assumed Windows.
It looks like the 7 files in your dir are being collected correctly, but the failure is happening between the full `os.walk` stage and the regex filtering stage. Can you try removing the `pattern:` and see if that works at least?
b
The `pattern` keyword is mandatory (voluptuous seems to need it):
```
CRITICAL expected str for dictionary value @ data['tables'][0]['pattern']
Traceback (most recent call last):
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
    sys.exit(main())
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/singer/utils.py", line 235, in wrapped
    return fnc(*args, **kwargs)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/tap_spreadsheets_anywhere/__init__.py", line 147, in main
    tables_config = Config.validate(tables_config)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/tap_spreadsheets_anywhere/configuration.py", line 50, in validate
    CONFIG_CONTRACT(config_json)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 272, in __call__
    return self._compiled([], data)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 595, in validate_dict
    return base_validate(path, iteritems(data), out)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 433, in validate_mapping
    raise er.MultipleInvalid(errors)
voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['tables'][0]['pattern']
```
Bingo! It seems that `*.csv` is not a valid regex. I tried `.csv$` and this worked!
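For reference, a quick Python check shows why the glob failed and the regex works:

```python
import re
from fnmatch import translate

# "*.csv" is a shell glob, not a regex: a regex can't start with "*" because
# there is nothing before it to repeat — hence the tap's error message.
try:
    re.compile("*.csv")
except re.error as exc:
    print(exc)  # nothing to repeat at position 0

# The tap expects a regex, so anchoring on the extension works
# (escaping the dot, as in r"\.csv$", would be slightly stricter):
assert re.search(".csv$", "telbestanden_20230725.csv")

# fnmatch.translate converts a glob into an equivalent regex, if preferred:
assert re.match(translate("*.csv"), "telbestanden_20230725.csv")
```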
a
Nice. I might raise an issue for that on the tap repo as that's pretty counter-intuitive to me.
b
Done, thanks for your patience and help!