Matt Menzenski
12/05/2022, 11:15 PM
pat_nadolny
12/06/2022, 3:59 PM
Matt Menzenski
12/06/2022, 4:00 PM
pat_nadolny
12/06/2022, 4:02 PM
pat_nadolny
12/06/2022, 4:10 PM
Matt Menzenski
12/06/2022, 9:37 PM
douwe_maan
12/06/2022, 10:21 PM
pat_nadolny
12/06/2022, 10:52 PM
config:
  tables:
    - path: s3://devtest-meltano-bucket-01
      format: json
      key_properties: []
      name: test123
      start_date: "2020-01-01T00:00:00Z"
      pattern: "spreadsheet_test/json_sample.json"
Matt Menzenski
12/06/2022, 10:55 PM
pat_nadolny
12/06/2022, 10:55 PM
{"my_data": "abc"}
{"my_data": "def"}
Matt Menzenski
12/06/2022, 10:56 PMpat_nadolny
12/06/2022, 10:58 PMdouwe_maan
12/06/2022, 10:59 PMMatt Menzenski
12/07/2022, 2:47 PM
path: value has a trailing slash / it fails silently 😬
Matt Menzenski
12/07/2022, 2:47 PM
Matt Menzenski
12/07/2022, 2:47 PM
Matt Menzenski
12/07/2022, 2:58 PM
pat_nadolny
12/07/2022, 3:44 PM
pat_nadolny
12/07/2022, 3:45 PM
might have spoken too soon, it’s not always finding files
Could it be your regex pattern?
Matt Menzenski
12/07/2022, 4:57 PM
pattern is definitely acting funny
Matt Menzenski
12/07/2022, 4:58 PM
environments:
  - name: system
    config:
      plugins:
        extractors:
          - name: tap-spreadsheets-anywhere
            config:
              tables:
                - path: "s3://payit-paw-raw-system"
                  name: "s3_system"
                  pattern: "topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events+0+0000087043.json"
                  start_date: "2020-01-01T00:00:00Z"
                  key_properties: [ ]
                  format: json
$ meltano --environment=system --log-level=info run --full-refresh tap-spreadsheets-anywhere target-jsonl
2022-12-07T16:57:48.863420Z [info ] Environment 'system' is active
2022-12-07T16:57:49.641880Z [info ] Performing full refresh, ignoring state left behind by any previous runs.
2022-12-07T16:57:53.240141Z [info ] INFO Using supplied catalog /Users/matt/dev/pudl/src/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:53.240760Z [info ] INFO Processing 1 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:53.241110Z [info ] INFO Syncing stream:s3_system cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:53.267621Z [info ] INFO Found credentials in environment variables. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.401112Z [info ] INFO Found 5102 files. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.402651Z [info ] INFO Checking 5102 resolved objects for any that match regular expression "topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events+0+0000087043.json" and were modified since 2020-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.408553Z [info ] INFO Processing 0 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.409551Z [info ] INFO Wrote 0 records for stream "s3_system". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.516206Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True
Matt Menzenski
12/07/2022, 5:00 PM
pattern: ".*" instead, it doesn’t even find credentials anymore?
$ meltano --environment=system run --full-refresh tap-spreadsheets-anywhere target-jsonl
2022-12-07T16:59:49.428831Z [info ] Environment 'system' is active
2022-12-07T16:59:50.162329Z [info ] Performing full refresh, ignoring state left behind by any previous runs.
2022-12-07T16:59:54.631578Z [info ] INFO Using supplied catalog /Users/matt/dev/pudl/src/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:59:54.632079Z [info ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:59:54.709339Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True
pat_nadolny
12/07/2022, 5:30 PM
pat_nadolny
12/07/2022, 5:34 PM
pattern: "topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events\+0\+0000087043.json"
Matt Menzenski
12/07/2022, 5:34 PM
$ meltano --environment=system run --full-refresh tap-spreadsheets-anywhere target-jsonl
2022-12-07T17:32:34.122391Z [critical ] Error while parsing YAML file: /Users/matt/dev/pudl/src/environments/system.meltano.yml
while scanning a double-quoted scalar
in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 28
found unknown escape character '+'
in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 108
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.
while scanning a double-quoted scalar
in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 28
found unknown escape character '+'
in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 108
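A hedged note on the two failures above. In a regular expression an unescaped `+` is a quantifier, so the original pattern asks for one or more `s` followed by one or more `0` rather than a literal plus sign. In YAML, a double-quoted scalar treats the backslash as an escape character, which is why `\+` is rejected there, while a single-quoted scalar keeps the backslash literally. The sketch below assumes the tap applies `pattern` with Python's `re` module (the log wording "match regular expression" suggests this, but the thread does not confirm the exact call); `matches` is a hypothetical helper, not part of the tap.

```python
import re

# Object key copied from the thread; it contains literal "+" characters.
key = (
    "topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/"
    "paw_mongodb_events+0+0000087043.json"
)

# Using the key itself as the pattern: "s+" and "0+" are read as quantifiers.
unescaped = key

# Escaping the plus signs makes them literal (the "." still matches any character).
escaped = (
    r"topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/"
    r"paw_mongodb_events\+0\+0000087043.json"
)


def matches(pattern: str, candidate: str) -> bool:
    """Hypothetical helper: True if the pattern is found anywhere in candidate."""
    return re.search(pattern, candidate) is not None


print(matches(unescaped, key))       # False
print(matches(escaped, key))         # True
print(matches(re.escape(key), key))  # True
```

In the Meltano config, writing the escaped pattern inside single quotes lets the backslashes reach the tap unchanged.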
Matt Menzenski
12/07/2022, 5:35 PM
\\+ gives the silent error
pat_nadolny
12/07/2022, 5:36 PM
config:
  tables:
    - path: file:///Users/XXX/data/
      format: json
      key_properties: []
      name: test123
      start_date: '2020-01-01T00:00:00Z'
      pattern: 'prefix=paw_mongodb_events\+0\+0000087043.json'
pat_nadolny
12/07/2022, 5:37 PM
Matt Menzenski
12/07/2022, 5:38 PM
pattern: 'prefix=XXX'
Matt Menzenski
12/07/2022, 5:38 PM
pat_nadolny
12/07/2022, 5:38 PM
prefix= isn't needed
Matt Menzenski
12/07/2022, 5:39 PM
pat_nadolny
12/07/2022, 5:39 PM
Matt Menzenski
12/07/2022, 5:39 PM
pattern: 'topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events\+0\+0000087043.json'
gives no errors but fails silently for me
Matt Menzenski
12/07/2022, 5:39 PM
pat_nadolny
12/07/2022, 5:40 PM
.meltano/run/tap-spreadsheets-anywhere/ directory? I'm not sure it's related, but the catalog might be cached and causing an issue. I'll also set up a test on S3 to see if I can get it working
Matt Menzenski
12/07/2022, 5:42 PM
pat_nadolny
12/07/2022, 5:59 PM
- name: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  config:
    tables:
      - path: s3://x-x-x
        format: json
        key_properties: []
        name: test123
        start_date: '2020-01-01T00:00:00Z'
        pattern: 'spreadsheet_test/year=2022/paw_mongodb_events\+0\+0000087043.json'
pat_nadolny
12/07/2022, 5:59 PM
spreadsheet_test/year=2022/paw_mongodb_events+0+0000087043.json on S3
pat_nadolny
12/07/2022, 6:00 PM
ets variant?
Matt Menzenski
12/07/2022, 6:01 PM
$ cat meltano.yml
version: 1
include_paths:
  - ./environments/dev.meltano.yml
  - ./environments/system.meltano.yml
  - ./environments/staging.meltano.yml
  - ./environments/prod.meltano.yml
  - ./environments/ca-staging.meltano.yml
  - ./environments/ca-prod.meltano.yml
default_environment: dev
project_id: acff2bdd-2726-48c7-b239-207f84ce4eb3
send_anonymous_usage_stats: false
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git@5d9115985d3f9e7a568c6dcc68975f0c038253ff
  loaders:
    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
$ cat environments/system.meltano.yml
environments:
  - name: system
    config:
      plugins:
        extractors:
          - name: tap-spreadsheets-anywhere
            config:
              tables:
                - path: "s3://payit-paw-raw-system"
                  # search_prefix: "topics/paw_mongodb_events/"
                  name: "s3_system_raw_5"
                  pattern: "topics/paw_mongodb_events/year=2022/month=12/day=01/hour=01/.*json"
                  start_date: "2020-01-01T00:00:00Z"
                  key_properties: [ ]
                  format: json
Matt Menzenski
12/07/2022, 6:07 PM
--log-level=debug to the meltano command I don’t get this debug log in the output: https://github.com/ets/tap-spreadsheets-anywhere/blob/master/tap_spreadsheets_anywhere/file_utils.py#L154
Matt Menzenski
12/07/2022, 6:10 PM
.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/file_utils.py
pat_nadolny
12/07/2022, 6:10 PM
Matt Menzenski
12/07/2022, 6:11 PM
Matt Menzenski
12/07/2022, 6:14 PM
Matt Menzenski
12/07/2022, 6:14 PM
ERROR Unable to write Catalog entry for 's3_system_raw_5' - it will be skipped due to error 'str' object has no attribute 'items'
Matt Menzenski
12/07/2022, 6:28 PM
- path: "s3://payit-paw-raw-system"
  name: "s3_system_raw_5"
  pattern: "topics/paw_mongodb_events/.*json"
  start_date: "2020-01-01T00:00:00Z"
  key_properties: [ ]
  format: json
with this ^ config, if I remove that .meltano/run/tap-spreadsheets-anywhere directory, and then run meltano --environment=system --log-level=debug run --full-refresh tap-spreadsheets-anywhere target-jsonl, I can see that the files I expect to be included are actually included in the output of the tap
Matt Menzenski
12/07/2022, 6:29 PM
DEBUG Including key "topics/paw_mongodb_events/year=2022/month=12/day=07/hour=15/paw_mongodb_events+0+0000087047.json"
DEBUG Last modified: 2022-12-07 15:35:58+00:00 comparing to 2020-01-01 00:00:00+00:00
DEBUG Including key "topics/paw_mongodb_events/year=2022/month=12/day=07/hour=15/paw_mongodb_events+0+0000087048.json"
DEBUG Last modified: 2022-12-07 15:38:32+00:00 comparing to 2020-01-01 00:00:00+00:00
DEBUG Including key "topics/paw_mongodb_events/year=2022/month=12/day=07/hour=15/paw_mongodb_events+0+0000087049.json"
DEBUG Last modified: 2022-12-07 18:20:05+00:00 comparing to 2020-01-01 00:00:00+00:00
INFO Sampling topics/paw_mongodb_events/year=2022/month=04/day=29/hour=22/paw_mongodb_events+0+0000000000.json (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 's3_system_raw_7' - it will be skipped due to error 'str' object has no attribute 'items'
Matt Menzenski
12/07/2022, 6:58 PM
--dump=catalog command:
$ meltano --environment=system invoke --dump=catalog tap-spreadsheets-anywhere
2022-12-07T18:57:10.586276Z [info ] Environment 'system' is active
{
"streams": []
}
Does this help any? Should there be something in that array?
pat_nadolny
12/07/2022, 7:32 PM
{
"streams": [
{
"tap_stream_id": "test123",
"key_properties": [],
"schema": {
"properties": {
"my_data": {
"type": [
"null",
"string"
]
},
"_smart_source_bucket": {
"type": "string"
},
"_smart_source_file": {
"type": "string"
},
"_smart_source_lineno": {
"type": "integer"
}
},
"selected": true,
"type": "object"
},
"stream": "test123",
"metadata": [
{
"breadcrumb": [],
"metadata": {
"inclusion": "automatic",
"selected": true
}
},
{
"breadcrumb": [
"properties",
"my_data"
],
"metadata": {
"inclusion": "automatic",
"selected": true
}
},
{
"breadcrumb": [
"properties",
"_smart_source_bucket"
],
"metadata": {
"inclusion": "automatic",
"selected": true
}
},
{
"breadcrumb": [
"properties",
"_smart_source_file"
],
"metadata": {
"inclusion": "automatic",
"selected": true
}
},
{
"breadcrumb": [
"properties",
"_smart_source_lineno"
],
"metadata": {
"inclusion": "automatic",
"selected": true
}
}
],
"selected": true
}
]
}
for the sample I sent in https://meltano.slack.com/archives/C01UW1W4D5Y/p1670367316662309?thread_ts=1670282159.443659&cid=C01UW1W4D5Y
pat_nadolny
12/07/2022, 7:32 PM
Matt Menzenski
12/07/2022, 7:35 PM
{
"checkpoint": 1670347122,
"mongoEventType": "UPSERT",
"payitId": "3d011309-3aa5-4477-bce7-04969559f8f3",
"addedToKafka": "2022-12-06T17:18:42.698296059Z",
"kafkaTopic": "paw_mongodb_events",
"mongoCollectionName": "SignIn",
"mongoDocument": "{\"_id\": {\"$oid\": \"638f79728f8f021e3c9ea5c3\"}, \"className\": \"com.payit.SignIn\", \"timestamp\": {\"$date\": \"2022-12-06T17:18:42.692Z\"}, \"signInAppName\": \"undefined\", \"userAgent\": \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36\", \"accountType\": \"ADMIN\", \"signUpOriginEnum\": \"WEB\", \"userId\": \"1776b519-c79c-4af4-994f-2ff534581107\", \"id\": \"3d011309-3aa5-4477-bce7-04969559f8f3\"}",
"mongoDatabaseName": "signin_service",
"mongoDocumentId": "638f79728f8f021e3c9ea5c3"
}
Matt Menzenski
12/07/2022, 7:35 PM
Matt Menzenski
12/07/2022, 7:47 PM
Matt Menzenski
12/07/2022, 7:48 PM
{"my_data": "abc"}
{"my_data": "def"}
to
{
"my_data": "abc"
}
Matt Menzenski
12/07/2022, 7:52 PM
obj in that iterator.
Sometimes it’s a JSON object, and throws no exception, but sometimes it’s a string, and throws the exception:
DEBUG obj: 'checkpoint'
DEBUG line 110
ERROR 'str' object has no attribute 'items'
Traceback (most recent call last):
File "/Users/matt/dev/pudl/src/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 89, in sample_file
for row in iterator:
File "/Users/matt/dev/pudl/src/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/json_handler.py", line 13, in generator_wrapper
for key, value in obj.items():
AttributeError: 'str' object has no attribute 'items'
INFO Sampled 0 records.
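One reading of the traceback above, offered as a sketch rather than a confirmed diagnosis: when a file holds a single pretty-printed JSON object, iterating over the parsed value yields its keys, so `obj` is the string `'checkpoint'` and `.items()` fails. The snippet below reproduces that behaviour with the standard library and shows one way a row iterator could accept a JSON array, a single object, or JSON Lines. `iter_rows` is a hypothetical helper written for illustration; it is not the tap's `get_row_iterator`.

```python
import io
import json


def iter_rows(reader):
    """Hypothetical helper: yield one dict per record from a text stream.

    Accepts a JSON array of objects, a single JSON object, or JSON Lines.
    """
    text = reader.read()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not one JSON document; treat each non-empty line as its own document.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
        return
    if isinstance(parsed, dict):
        # A single object is one record; iterating it would yield its keys.
        yield parsed
    elif isinstance(parsed, list):
        for item in parsed:
            if isinstance(item, dict):
                yield item


pretty = '{\n  "checkpoint": 1670347122,\n  "kafkaTopic": "paw_mongodb_events"\n}'

# Iterating the parsed object directly gives strings, as in the DEBUG output.
print(list(iter(json.loads(pretty))))

# The helper yields the object itself as one row.
print(list(iter_rows(io.StringIO(pretty))))
```

The same helper reads the two-line `{"my_data": ...}` sample as two rows, because parsing the whole text as one document fails and it falls back to line-by-line parsing.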
Matt Menzenski
12/07/2022, 7:57 PM
Matt Menzenski
12/07/2022, 7:57 PM
Matt Menzenski
12/07/2022, 7:58 PM
get_row_iterator function to skip rows that are type dict, it proceeds without issue and I end up with a non-empty streams array in the catalog
pat_nadolny
12/07/2022, 7:59 PM
Matt Menzenski
12/07/2022, 8:00 PM
Matt Menzenski
12/07/2022, 8:00 PM
pat_nadolny
12/07/2022, 8:05 PM
elt with the --catalog flag to pass in your catalog manually if you were able to generate one after hacking the code a bit
Matt Menzenski
12/07/2022, 8:05 PM
Matt Menzenski
12/07/2022, 8:29 PM
Matt Menzenski
12/07/2022, 8:29 PM
$ cat output/s3_system_raw_7.jsonl | jq 'select(._smart_source_lineno == 1)'
gives no output
Matt Menzenski
12/08/2022, 7:07 PM
Matt Menzenski
12/08/2022, 9:55 PM
Denis I.
03/24/2023, 1:54 PM
Denis I.
03/25/2023, 6:11 PM
Matt Menzenski
03/25/2023, 6:15 PM