# getting-started
b
Hi everyone, just starting out with Meltano and it seems to be the tool that will make us abandon our own 'import data' scripts. But there is one thing that I cannot figure out. We receive a weekly CSV file with snapshots of a database. We've been getting this CSV for over a decade, so we have quite a large historical database. I've already set up a tap-csv and target-postgres that loads incrementally (hooray!). But… what I would also like to do is convert every CSV file to a corresponding Parquet file and store it on S3 (MinIO, actually). I'm able to get a single Parquet on S3 with all the data, but I would like to split them into separate files. Is this scenario possible with Meltano?
a
How many csvs do you have that you want separate parquets for?
Can you share the `meltano.yml` as it stands?
b
It's about a thousand CSVs. My meltano.yml is below:
```yaml
version: 1
default_environment: dev
project_id: c4547xxxxxxxx
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      add_metadata_columns: true
      csv_files_definition: csv_files_definitions.json
    metadata:
      studielink:
        replication-method: INCREMENTAL
        replication-key: id
        id:
          is-replication-key: true
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    executable: meltano-map-transform
    mappings:
    - name: add_telbestanden_metadata
      config:
        stream_maps:
          telbestanden:
            source_filename: _sdc_source_file.split('/')[-1].split(".")[0]
            collegejaar: _sdc_source_file.split('/')[-1].split(".")[0][14:18]
            volgnummer: _sdc_source_file.split('/')[-1].split(".")[0][20:22]
            peildatum: _sdc_source_file.split('/')[-1].split(".")[0][23:]
            # an id is created for every row in a telbestand; it is a
            # combination of peildatum, collegejaar and row number
            id: "int(_sdc_source_file.split('/')[-1].split('.')[0][23:] + _sdc_source_file.split('/')[-1].split('.')[0][14:18] + str(_sdc_source_lineno).zfill(6))"
            __key_properties__: ["id"]
  loaders:
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      cloud_provider.aws.aws_access_key_id: xxxxxxxxxxxxx
      cloud_provider.aws.aws_bucket: xxx
      cloud_provider.aws.aws_region: eu-west-1
      cloud_provider.aws.aws_endpoint_override: xxxxxxxxxxxxxxxxx
      cloud_provider.aws.aws_secret_access_key: xxxxxxxxxxxxxxxxx
      format.format_type: parquet
      prefix: meltano
      append_date_to_filename: false
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  - name: target-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/target-postgres.git
    config:
      add_record_metadata: true
      database: meltano
      default_target_schema: raw
      user: docker
      host: localhost
      password: docker
```
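For anyone reading along, here is a rough Python sketch of what those stream-map expressions compute. The filename below is invented to match the slice positions in the config (`[14:18]`, `[20:22]`, `[23:]`); the real naming scheme isn't shown in the thread:

```python
# Hypothetical walk-through of the stream-map expressions in the config above.
# NOTE: this filename is made up so that the slices land on plausible fields —
# the actual weekly filenames are not shown in the thread.
_sdc_source_file = "/data/telbestanden__2023__01_20230725.csv"
_sdc_source_lineno = 42

# filename without directory and extension
stem = _sdc_source_file.split("/")[-1].split(".")[0]

collegejaar = stem[14:18]  # academic year, e.g. "2023"
volgnummer = stem[20:22]   # sequence number, e.g. "01"
peildatum = stem[23:]      # snapshot date, e.g. "20230725"

# one id per row: peildatum + collegejaar + zero-padded row number,
# mirroring the `id:` expression in the mapper config
row_id = int(peildatum + collegejaar + str(_sdc_source_lineno).zfill(6))

print(collegejaar, volgnummer, peildatum, row_id)
```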
a
And all the files listed in your definitions.json are ending up in a single parquet? I don't work with these taps, but normally I would expect each individual stream (CSV file name) to have its own parquet target.
b
I don't reference every CSV individually, I just point to a directory containing all CSVs. So in my definitions.json there is only:
```json
"path": "/Users/bramenning/GitHub/open-source-data-stack/data",
```
a
Ah, that explains it. So all files under that path are being treated as one 'stream', and are therefore written to one parquet.
If you were to specify the individual CSVs in defs.json, then each should get its own stream and its own output parquet file. But by the sounds of it, this is non-trivial for your case 🙂
b
Exactly 🙂, because that would mean a change to my Meltano config every week when a new CSV arrives.
a
Do the CSVs follow a naming pattern? Are they all named uniquely?
b
yes they are
a
You can use something like `tap-spreadsheets-anywhere` to pattern-match files and group similarly named files together into one stream.
But you have to know your 'groups' in advance, if that makes sense.
So if you have `db1-2023-07-25.csv` and `db1-2023-07-24.csv` files, these can be matched into a `db1` stream, but if a `db2-` file arrives, nothing will happen with it.
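To make the grouping idea concrete, here's a small illustrative Python sketch (not the tap's actual code): each configured pattern collects a family of files into one stream, and files matching no known pattern are simply ignored.

```python
import re
from collections import defaultdict

# Illustrative sketch of pattern-based stream grouping, using the example
# filenames from the thread. The set of patterns must be known in advance.
files = ["db1-2023-07-25.csv", "db1-2023-07-24.csv", "db2-2023-07-25.csv"]
patterns = {"db1": re.compile(r"^db1-.*\.csv$")}

streams = defaultdict(list)
ignored = []
for f in files:
    for stream, pat in patterns.items():
        if pat.match(f):
            streams[stream].append(f)
            break
    else:
        ignored.append(f)  # the db2- file falls through: nothing happens with it

print(dict(streams), ignored)
```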
b
Thanks, that sounds doable! I’ll give it a try!
a
I think this sort of behaviour is supported with your existing tap if your files are partitioned into separate subdirectories. But if they are all in a single folder, you'll need some kind of regex match to split them into your groups, which `tap-spreadsheets-anywhere` supports.
b
I can't seem to get `tap-spreadsheets-anywhere` working. When I run Meltano it says:
```
2023-07-26T13:33:24.748905Z [info     ] Environment 'dev' is active
2023-07-26T13:33:26.270667Z [warning  ] No state was found, complete import.
2023-07-26T13:33:27.842568Z [info     ] INFO Using supplied catalog /Users/my_name/GitHub/open-source-data-stack/meltano/ed2c/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-07-26T13:33:27.843119Z [info     ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-07-26T13:33:28.071768Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True
```
This is in my meltano.yml:
```yaml
- name: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  namespace: tap_spreadsheets_anywhere
  executable: tap-spreadsheets-anywhere
  capabilities:
  - catalog
  - discover
  - state
  config:
    tables:
    - path: "file:///Users/myname/GitHub/open-source-data-stack/data"
      name: "telbestanden"
      pattern: "*.csv"
      start_date: "2009-12-10T13:49:51.141Z"
      key_properties: []
      format: "csv"
      delimiter: ";"
```
a
What meltano command are you using? There is an optional `--dev` switch to add more info.
b
```
2023-07-27T06:17:10.736204Z [info     ] Environment 'dev' is active
INFO Using supplied catalog /Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json.
INFO Processing 0 selected streams from Catalog
```
It does not get more verbose running with `--dev`.
a
What command are you running? I found `--dev` only works in some contexts.
You could also try clearing the catalog file and see if that helps.
b
Hi, I'm running `meltano invoke tap-spreadsheets-anywhere --dev`, and after removing the catalog files at `.meltano/extractors/tap-spreadsheets-anywhere` the output became a little more verbose:
```
2023-07-28T07:12:23.553324Z [info     ] Environment 'dev' is active
INFO Generating catalog through sampling.
INFO Walking /Users/bramenning/GitHub/open-source-data-stack/data.
INFO Found 7 files.
ERROR Unable to write Catalog entry for 'telbestanden' - it will be skipped due to error nothing to repeat at position 0
INFO Processing 0 selected streams from Catalog
```
Can it be that the tap doesn't find any CSVs that match the pattern, so there is nothing to write to the Catalog?
a
No, that would be a different message. Looks like @visch may have highlighted this: https://github.com/ets/tap-spreadsheets-anywhere/pull/49/commits/fa5b8b6a0cc77f1d153afb58895392212daf28f2 Try `"file://c:/Users/myname/GitHub/open-source-data-stack/data"` for your path? It looks like the files are being found correctly, but then the catalog entry fails before those files can be returned to be processed.
b
I'm on a Mac, so the `c:/` part will not work; I think `file:///` is correct for my setup.
a
My mistake, I assumed Windows.
It looks like the 7 files in your dir are being collected correctly, but the failure is happening between the full `os.walk` stage and the regex filtering stage. Can you try removing the `pattern:` and see if that works at least?
b
The `pattern` keyword is mandatory (voluptuous seems to need it):
```
CRITICAL expected str for dictionary value @ data['tables'][0]['pattern']
Traceback (most recent call last):
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
    sys.exit(main())
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/singer/utils.py", line 235, in wrapped
    return fnc(*args, **kwargs)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/tap_spreadsheets_anywhere/__init__.py", line 147, in main
    tables_config = Config.validate(tables_config)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/tap_spreadsheets_anywhere/configuration.py", line 50, in validate
    CONFIG_CONTRACT(config_json)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 272, in __call__
    return self._compiled([], data)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 595, in validate_dict
    return base_validate(path, iteritems(data), out)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 433, in validate_mapping
    raise er.MultipleInvalid(errors)
voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['tables'][0]['pattern']
```
Bingo! It seems that `*.csv` is not a valid regex. I tried `.csv$` and this worked!
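For reference, a quick Python check shows why the glob failed and the regex works:

```python
import re
from fnmatch import translate

# "*.csv" is a shell glob, not a regex: a regex can't start with "*" because
# there is nothing before it to repeat — hence the tap's error message.
try:
    re.compile("*.csv")
except re.error as exc:
    print(exc)  # nothing to repeat at position 0

# The tap expects a regex, so anchoring on the extension works
# (escaping the dot, as in r"\.csv$", would be slightly stricter):
assert re.search(".csv$", "telbestanden_20230725.csv")

# fnmatch.translate converts a glob into an equivalent regex, if preferred:
assert re.match(translate("*.csv"), "telbestanden_20230725.csv")
```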
a
Nice. I might raise an issue for that on the tap repo as that's pretty counter-intuitive to me.
b
Done, thanks for your patience and help!