bram_enning
07/25/2023, 8:01 AM
Andy Carter
07/25/2023, 8:03 AM
Andy Carter
07/25/2023, 8:03 AM
bram_enning
07/25/2023, 8:07 AM
version: 1
default_environment: dev
project_id: c4547xxxxxxxx
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      add_metadata_columns: true
      csv_files_definition: csv_files_definitions.json
    metadata:
      studielink:
        replication-method: INCREMENTAL
        replication-key: id
        id:
          is-replication-key: true
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git
    executable: meltano-map-transform
    mappings:
    - name: add_telbestanden_metadata
      config:
        stream_maps:
          telbestanden:
            source_filename: _sdc_source_file.split('/')[-1].split(".")[0]
            collegejaar: _sdc_source_file.split('/')[-1].split(".")[0][14:18]
            volgnummer: _sdc_source_file.split('/')[-1].split(".")[0][20:22]
            peildatum: _sdc_source_file.split('/')[-1].split(".")[0][23:]
            # an id is created for every row in a telbestand. It is a combination of peildatum, collegejaar and rownumber.
            id: "int(_sdc_source_file.split('/')[-1].split('.')[0][23:] + _sdc_source_file.split('/')[-1].split('.')[0][14:18] + str(_sdc_source_lineno).zfill(6))"
            __key_properties__: ["id"]
  loaders:
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      cloud_provider.aws.aws_access_key_id: xxxxxxxxxxxxx
      cloud_provider.aws.aws_bucket: xxx
      cloud_provider.aws.aws_region: eu-west-1
      cloud_provider.aws.aws_endpoint_override: xxxxxxxxxxxxxxxxx
      cloud_provider.aws.aws_secret_access_key: xxxxxxxxxxxxxxxxx
      format.format_type: parquet
      prefix: meltano
      append_date_to_filename: false
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  - name: target-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/target-postgres.git
    config:
      add_record_metadata: true
      database: meltano
      default_target_schema: raw
      user: docker
      host: localhost
      password: docker
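The stream map expressions in the config above are plain Python string slices on the source file name. A minimal sketch of what they compute, assuming a hypothetical file name (`telbestand_wo_2023_v01_20230725.csv` is invented here so that the offsets line up; the real file names are not shown in this thread) and a hypothetical helper `derive_fields`:

```python
def derive_fields(source_file: str, lineno: int) -> dict:
    """Apply the same slicing as the stream map to one record's metadata."""
    # File name without directory and without extension
    stem = source_file.split("/")[-1].split(".")[0]
    collegejaar = stem[14:18]
    volgnummer = stem[20:22]
    peildatum = stem[23:]
    return {
        "source_filename": stem,
        "collegejaar": collegejaar,
        "volgnummer": volgnummer,
        "peildatum": peildatum,
        # peildatum + collegejaar + zero-padded row number, cast to int
        "id": int(peildatum + collegejaar + str(lineno).zfill(6)),
    }


# Hypothetical path and line number, for illustration only
example = derive_fields("data/telbestand_wo_2023_v01_20230725.csv", 42)
print(example)
```

Because the fields depend on fixed character offsets, a file whose name has a different length or layout would silently produce wrong values, or fail on the `int(...)` cast.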
Andy Carter
07/25/2023, 8:16 AM
definitions.json are ending up in a single parquet? I don't work with these taps, but normally I would expect each individual stream (CSV file name) to have its own parquet target.
bram_enning
07/25/2023, 8:18 AM
bram_enning
07/25/2023, 8:18 AM
Andy Carter
07/25/2023, 8:19 AM
Andy Carter
07/25/2023, 8:20 AM
bram_enning
07/25/2023, 8:21 AM
Andy Carter
07/25/2023, 8:22 AM
bram_enning
07/25/2023, 8:22 AM
Andy Carter
07/25/2023, 8:23 AM
tap-spreadsheets-anywhere to pattern match files, and group similarly named files together into one stream.
Andy Carter
07/25/2023, 8:23 AM
Andy Carter
07/25/2023, 8:24 AM
db1-2023-07-25.csv and db1-2023-07-24.csv files, these can be matched into a db1 stream, but if a db2- file arrives, nothing will happen with it.
bram_enning
07/25/2023, 8:25 AM
Andy Carter
07/25/2023, 8:28 AM
tap-spreadsheets-anywhere supports.
bram_enning
07/26/2023, 1:34 PM
tap-spreadsheets-anywhere to get working. When I run meltano it says:
2023-07-26T13:33:24.748905Z [info ] Environment 'dev' is active
2023-07-26T13:33:26.270667Z [warning ] No state was found, complete import.
2023-07-26T13:33:27.842568Z [info ] INFO Using supplied catalog /Users/my_name/GitHub/open-source-data-stack/meltano/ed2c/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-07-26T13:33:27.843119Z [info ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-07-26T13:33:28.071768Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True
This is in my meltano.yml:
- name: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  namespace: tap_spreadsheets_anywhere
  executable: tap-spreadsheets-anywhere
  capabilities:
  - catalog
  - discover
  - state
  config:
    tables:
    - path: "file:///Users/myname/GitHub/open-source-data-stack/data"
      name: "telbestanden"
      pattern: "*.csv"
      start_date: "2009-12-10T13:49:51.141Z"
      key_properties: []
      format: "csv"
      delimiter: ";"
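The idea of grouping similarly named files into one stream by pattern, using the db1-/db2- file names mentioned earlier in the thread, can be sketched as below. This is not the tap's own code: the `TABLES` mapping and the `group_files` helper are hypothetical, and it assumes each table's pattern is treated as a Python regular expression searched against the file name.

```python
import re
from collections import defaultdict

# Hypothetical table specs: stream name -> regex applied to the file name
TABLES = {"db1": r"^db1-.*\.csv$"}


def group_files(filenames, tables=TABLES):
    """Assign each file to every stream whose pattern matches its name."""
    streams = defaultdict(list)
    for filename in filenames:
        for stream, pattern in tables.items():
            if re.search(pattern, filename):
                streams[stream].append(filename)
    return dict(streams)


grouped = group_files(
    ["db1-2023-07-25.csv", "db1-2023-07-24.csv", "db2-2023-07-25.csv"]
)
print(grouped)
```

With only a db1 table defined, the db2- file matches nothing and is left out of the result.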
Andy Carter
07/26/2023, 2:30 PM
--dev switch to add more info
Andy Carter
07/26/2023, 2:30 PM
Andy Carter
07/26/2023, 2:31 PM
bram_enning
07/27/2023, 6:18 AM
2023-07-27T06:17:10.736204Z [info ] Environment 'dev' is active
INFO Using supplied catalog /Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json.
INFO Processing 0 selected streams from Catalog
bram_enning
07/27/2023, 6:18 AM
--dev
Andy Carter
07/27/2023, 7:14 AM
--dev only works in some contexts
Andy Carter
07/27/2023, 7:15 AM
bram_enning
07/28/2023, 7:15 AM
meltano invoke tap-spreadsheets-anywhere --dev and after removing the catalog files at .meltano/extractors/tap-spreadsheets-anywhere the output became a little more verbose.
2023-07-28T07:12:23.553324Z [info ] Environment 'dev' is active
INFO Generating catalog through sampling.
INFO Walking /Users/bramenning/GitHub/open-source-data-stack/data.
INFO Found 7 files.
ERROR Unable to write Catalog entry for 'telbestanden' - it will be skipped due to error nothing to repeat at position 0
INFO Processing 0 selected streams from Catalog
Can it be that the tap doesn't find any CSVs that match the pattern, so there is nothing to write to the Catalog?
Andy Carter
07/28/2023, 7:57 AM
"file://c:/Users/myname/GitHub/open-source-data-stack/data" for your path?
It looks like the files are being found correctly, but then the catalog entry fails before those files can be returned to be processed.
bram_enning
07/28/2023, 11:23 AM
c:/ part will not work, I think file:/// is correct for my setup.
Andy Carter
07/28/2023, 11:33 AM
Andy Carter
07/28/2023, 11:39 AM
os.walk stage the regex filtering stage. Can you try removing the pattern: and see if that works at least?
Andy Carter
07/28/2023, 11:39 AM
bram_enning
07/28/2023, 12:20 PM
pattern keyword is mandatory (voluptuous seems to need it):
CRITICAL expected str for dictionary value @ data['tables'][0]['pattern']
Traceback (most recent call last):
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
    sys.exit(main())
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/singer/utils.py", line 235, in wrapped
    return fnc(*args, **kwargs)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/tap_spreadsheets_anywhere/__init__.py", line 147, in main
    tables_config = Config.validate(tables_config)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/tap_spreadsheets_anywhere/configuration.py", line 50, in validate
    CONFIG_CONTRACT(config_json)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 272, in __call__
    return self._compiled([], data)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 595, in validate_dict
    return base_validate(path, iteritems(data), out)
  File "/Users/bramenning/GitHub/open-source-data-stack/meltano/ed2c/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.8/site-packages/voluptuous/schema_builder.py", line 433, in validate_mapping
    raise er.MultipleInvalid(errors)
voluptuous.error.MultipleInvalid: expected str for dictionary value @ data['tables'][0]['pattern']
bram_enning
07/28/2023, 12:28 PM
*.csv is not a valid regex. I tried .csv$ and this worked!
Andy Carter
07/28/2023, 12:30 PM
bram_enning
07/28/2023, 12:45 PM
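For reference, the "nothing to repeat at position 0" error earlier in the thread is the message Python's `re` module gives when a pattern starts with `*`, which is a glob wildcard but a bare quantifier in a regex. A minimal sketch, assuming the tap hands `pattern` to `re` and searches it against the file name (the `regex_error` helper and the `notes_csv` file name are hypothetical):

```python
import re


def regex_error(pattern):
    """Return the re module's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# Glob-style pattern: '*' has nothing before it to repeat
print(regex_error("*.csv"))
# Valid regex: any character, then "csv" at the end of the name
print(regex_error(".csv$"))

# The unescaped '.' matches any character, so this also matches "notes_csv"
loose = bool(re.search(".csv$", "notes_csv"))
# Escaping the dot restricts the match to a literal ".csv" suffix
strict = bool(re.search(r"\.csv$", "notes_csv"))
print(loose, strict)
```

So `.csv$` works, and `\.csv$` is the stricter form if only a literal `.csv` extension should match.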