I have JSONL (JSON Lines) files in AWS S3 - is the...
# singer-taps
m
I have JSONL (JSON Lines) files in AWS S3 - is there a tap I can use to fetch that data?
p
@Matt Menzenski I havent personally used it for this reason but I think tap-spreadsheets-anywhere will allow you to read S3 jsonl files
m
ooh! thank you, I will try that
p
hmm actually this note in the README makes me think this isnt true https://github.com/ets/tap-spreadsheets-anywhere#json-support. I'm not sure, you might have to try it out
m
I am not able to get https://github.com/ets/tap-spreadsheets-anywhere to work for me for anything in S3 (JSON or CSV). I have confirmed that my AWS credential can access these files, and I can use this tap to read a local CSV fine. So I’m not sure what the issue is but it doesn’t look good for my JSONL use case
d
@Matt Menzenski Can you share your config and what error or broken behavior you’re seeing?
p
I was able to get a test working with jsonl. My config looks like this
Copy code
config:
      tables:
        - path: <s3://devtest-meltano-bucket-01>
          format: json
          key_properties: []
          name: test123
          start_date: "2020-01-01T00:00:00Z"
          pattern: "spreadsheet_test/json_sample.json"
m
I think mine is functionally identical to that. I will check
p
with my file containing
Copy code
{"my_data": "abc"}
{"my_data": "def"}
m
No containing array? Just json lines like that?
p
yeah that worked for me. What does yours look like?
d
I assume the tap requires one row per line (JSONL, not a JSON array)
m
oh geez - if the
path:
value has a trailing slash
/
it fails silently 😬
I removed that and now I can see that it’s actually connecting to S3 and finding files
thanks @pat_nadolny!
might have spoken too soon, it’s not always finding files 😕
p
Yeah I found that it doesnt log very well so its hard to know whats failing. I ended up configuring a local file first to get it working then moved it to S3
might have spoken too soon, it’s not always finding files
Could it be your regex
pattern
?
m
the
pattern
is definitely acting funny
this should find exactly one file, but does not output anything:
Copy code
environments:
  - name: system
    config:
      plugins:
        extractors:
          - name: tap-spreadsheets-anywhere
            config:
              tables:
                - path: "<s3://payit-paw-raw-system>"
                  name: "s3_system"
                  pattern: "topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events+0+0000087043.json"
                  start_date: "2020-01-01T00:00:00Z"
                  key_properties: [ ]
                  format: json
Copy code
$ meltano --environment=system --log-level=info run --full-refresh tap-spreadsheets-anywhere target-jsonl
2022-12-07T16:57:48.863420Z [info     ] Environment 'system' is active
2022-12-07T16:57:49.641880Z [info     ] Performing full refresh, ignoring state left behind by any previous runs.
2022-12-07T16:57:53.240141Z [info     ] INFO Using supplied catalog /Users/matt/dev/pudl/src/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:53.240760Z [info     ] INFO Processing 1 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:53.241110Z [info     ] INFO Syncing stream:s3_system  cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:53.267621Z [info     ] INFO Found credentials in environment variables. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.401112Z [info     ] INFO Found 5102 files.         cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.402651Z [info     ] INFO Checking 5102 resolved objects for any that match regular expression "topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events+0+0000087043.json" and were modified since 2020-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.408553Z [info     ] INFO Processing 0 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.409551Z [info     ] INFO Wrote 0 records for stream "s3_system". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:57:55.516206Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True
if I use
pattern: ".*"
instead, it doesn’t even find credentials anymore?
Copy code
$ meltano --environment=system run --full-refresh tap-spreadsheets-anywhere target-jsonl
2022-12-07T16:59:49.428831Z [info     ] Environment 'system' is active
2022-12-07T16:59:50.162329Z [info     ] Performing full refresh, ignoring state left behind by any previous runs.
2022-12-07T16:59:54.631578Z [info     ] INFO Using supplied catalog /Users/matt/dev/pudl/src/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:59:54.632079Z [info     ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2022-12-07T16:59:54.709339Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True
p
I'm wondering if this is a related to escaping some of those characters `This is an escaped regular expression that the tap will use to filter the listing result set returned from the listing request. This pattern potentially reduces the number of listed files that are considered as sources for the declared table. It's a bit strange, since this is an escaped string inside of an escaped string, any backslashes in the RegEx will need to be double-escaped.`I'm not too good at regexes but I wonder if the plus signs are causing issues 🤔
ahh yeah I just ran a local test, try updating to this with backslashes
pattern: "topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events\+0\+0000087043.json"
m
that doesn’t work, I get
Copy code
$ meltano --environment=system run --full-refresh tap-spreadsheets-anywhere target-jsonl
2022-12-07T17:32:34.122391Z [critical ] Error while parsing YAML file: /Users/matt/dev/pudl/src/environments/system.meltano.yml
 while scanning a double-quoted scalar
  in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 28
found unknown escape character '+'
  in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 108
Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to
join our friendly Slack community.

while scanning a double-quoted scalar
  in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 28
found unknown escape character '+'
  in "/Users/matt/dev/pudl/src/environments/system.meltano.yml", line 11, column 108
and double backslashes
\\+
gives the silent error
p
Mine looks like this for a file I created called `prefix=paw_mongodb_events+0+0000087043.json`:
Copy code
config:
      tables:
      - path: file:///Users/XXX/data/
        format: json
        key_properties: []
        name: test123
        start_date: '2020-01-01T00:00:00Z'
        pattern: 'prefix=paw_mongodb_events\+0\+0000087043.json'
^^ I only tested locally though so maybe graduating to S3 causes other problems. But I did see it skip that file until I escaped the plus signs
m
you have
pattern: 'prefix=XXX'
I haven’t tried that
p
oh that was just to test that equal signs dont cause issues the
prefix=
isnt needed
m
ah
p
are you on windows?
m
pattern: 'topics/paw_mongodb_events/year=2022/month=12/day=06/hour=17/paw_mongodb_events\+0\+0000087043.json'
gives no errors but fails silently for me
no, mac
p
can you try deleting your
.meltano/run/tap-spreasheets-anywhere/
directory? I'm not sure its related but the catalog might be cached and causing an issue. I'll also setup a test on S3 to see if I can get it working
m
that seems to have had no effect
p
really strange I'm able to get it working with this:
Copy code
- name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    config:
      tables:
      - path: <s3://x-x-x>
        format: json
        key_properties: []
        name: test123
        start_date: '2020-01-01T00:00:00Z'
        pattern: 'spreadsheet_test/year=2022/paw_mongodb_events\+0\+0000087043.json'
my file is
spreadsheet_test/year=2022/paw_mongodb_events+0+0000087043.json
on S3
can you share your meltano.yml (anything secret excluded)? Are you using the
ets
variant?
m
Copy code
$ cat meltano.yml
version: 1
include_paths:
  - ./environments/dev.meltano.yml
  - ./environments/system.meltano.yml
  - ./environments/staging.meltano.yml
  - ./environments/prod.meltano.yml
  - ./environments/ca-staging.meltano.yml
  - ./environments/ca-prod.meltano.yml
default_environment: dev
project_id: acff2bdd-2726-48c7-b239-207f84ce4eb3
send_anonymous_usage_stats: false
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git@5d9115985d3f9e7a568c6dcc68975f0c038253ff>
  loaders:
    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
Copy code
$ cat environments/system.meltano.yml
environments:
  - name: system
    config:
      plugins:
        extractors:
          - name: tap-spreadsheets-anywhere
            config:
              tables:
                - path: "<s3://payit-paw-raw-system>"
                  # search_prefix: "topics/paw_mongodb_events/"
                  name: "s3_system_raw_5"
                  pattern: "topics/paw_mongodb_events/year=2022/month=12/day=01/hour=01/.*json"
                  start_date: "2020-01-01T00:00:00Z"
                  key_properties: [ ]
                  format: json
can I set log levels for a specific plugin? If I pass
--log-level=debug
to the
meltano
command I don’t get this debug log in the output: https://github.com/ets/tap-spreadsheets-anywhere/blob/master/tap_spreadsheets_anywhere/file_utils.py#L154
actually I just went into the installed plugin and changed it
Copy code
.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/file_utils.py
p
@edgar_ramirez_mondragon do you know how to modify the tap log level?
m
I found a hacky way to do it
whoa
Copy code
ERROR Unable to write Catalog entry for 's3_system_raw_5' - it will be skipped due to error 'str' object has no attribute 'items'
Copy code
- path: "<s3://payit-paw-raw-system>"
                  name: "s3_system_raw_5"
                  pattern: "topics/paw_mongodb_events/.*json"
                  start_date: "2020-01-01T00:00:00Z"
                  key_properties: [ ]
                  format: json
with this ^ config, if I remove that
.meltano/run/tap-spreadsheets-anywhere
directory, and then run
meltano --environment=system --log-level=debug run --full-refresh tap-spreadsheets-anywhere target-jsonl
, I can see that the files I expect to be included are actually included in the output of the tap
Copy code
DEBUG Including key "topics/paw_mongodb_events/year=2022/month=12/day=07/hour=15/paw_mongodb_events+0+0000087047.json"
DEBUG Last modified: 2022-12-07 15:35:58+00:00 comparing to 2020-01-01 00:00:00+00:00
DEBUG Including key "topics/paw_mongodb_events/year=2022/month=12/day=07/hour=15/paw_mongodb_events+0+0000087048.json"
DEBUG Last modified: 2022-12-07 15:38:32+00:00 comparing to 2020-01-01 00:00:00+00:00
DEBUG Including key "topics/paw_mongodb_events/year=2022/month=12/day=07/hour=15/paw_mongodb_events+0+0000087049.json"
DEBUG Last modified: 2022-12-07 18:20:05+00:00 comparing to 2020-01-01 00:00:00+00:00
INFO Sampling topics/paw_mongodb_events/year=2022/month=04/day=29/hour=22/paw_mongodb_events+0+0000000000.json (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 's3_system_raw_7' - it will be skipped due to error 'str' object has no attribute 'items'
I found https://meltano.slack.com/archives/CMN8HELB0/p1651677808805029?thread_ts=1651351035.318969&amp;cid=CMN8HELB0 and ran that
--dump=catalog
command:
Copy code
$ meltano --environment=system invoke --dump=catalog tap-spreadsheets-anywhere
2022-12-07T18:57:10.586276Z [info     ] Environment 'system' is active
{
  "streams": []
}
Does this help any? Should there be something in that array?
p
ahhh yeah thats a problem mine looks like
Copy code
{
  "streams": [
    {
      "tap_stream_id": "test123",
      "key_properties": [],
      "schema": {
        "properties": {
          "my_data": {
            "type": [
              "null",
              "string"
            ]
          },
          "_smart_source_bucket": {
            "type": "string"
          },
          "_smart_source_file": {
            "type": "string"
          },
          "_smart_source_lineno": {
            "type": "integer"
          }
        },
        "selected": true,
        "type": "object"
      },
      "stream": "test123",
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "inclusion": "automatic",
            "selected": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "my_data"
          ],
          "metadata": {
            "inclusion": "automatic",
            "selected": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_smart_source_bucket"
          ],
          "metadata": {
            "inclusion": "automatic",
            "selected": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_smart_source_file"
          ],
          "metadata": {
            "inclusion": "automatic",
            "selected": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_smart_source_lineno"
          ],
          "metadata": {
            "inclusion": "automatic",
            "selected": true
          }
        }
      ],
      "selected": true
    }
  ]
}
for the sample I sent in https://meltano.slack.com/archives/C01UW1W4D5Y/p1670367316662309?thread_ts=1670282159.443659&amp;cid=C01UW1W4D5Y
What are the contents of your files? Is there a non-sensitive example you can share?
m
Copy code
{
  "checkpoint": 1670347122,
  "mongoEventType": "UPSERT",
  "payitId": "3d011309-3aa5-4477-bce7-04969559f8f3",
  "addedToKafka": "2022-12-06T17:18:42.698296059Z",
  "kafkaTopic": "paw_mongodb_events",
  "mongoCollectionName": "SignIn",
  "mongoDocument": "{\"_id\": {\"$oid\": \"638f79728f8f021e3c9ea5c3\"}, \"className\": \"com.payit.SignIn\", \"timestamp\": {\"$date\": \"2022-12-06T17:18:42.692Z\"}, \"signInAppName\": \"undefined\", \"userAgent\": \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36\", \"accountType\": \"ADMIN\", \"signUpOriginEnum\": \"WEB\", \"userId\": \"1776b519-c79c-4af4-994f-2ff534581107\", \"id\": \"3d011309-3aa5-4477-bce7-04969559f8f3\"}",
  "mongoDatabaseName": "signin_service",
  "mongoDocumentId": "638f79728f8f021e3c9ea5c3"
}
that’s the contents of one, i’ve sanitized it enough to share here
I’m starting to be curious whether it’s because this file (and many I’d be interested in) contain only one line - is it possibly getting parsed as a JSON file instead of a JSONL file? 🤔
@pat_nadolny i’d be curious what happens if you changed your test file from
Copy code
{"my_data": "abc"}
{"my_data": "def"}
to
Copy code
{
  "my_data": "abc"
}
ok I’m definitely getting somewhere… I updated https://github.com/menzenski/tap-spreadsheets-anywhere/blob/master/tap_spreadsheets_anywhere/json_handler.py#L9 to log the value of
obj
in that iterator. Sometimes it’s a JSON object, and throws no exception, but sometimes it’s a string, and throws the exception:
Copy code
DEBUG obj: 'checkpoint'
DEBUG line 110
ERROR 'str' object has no attribute 'items'
Traceback (most recent call last):
  File "/Users/matt/dev/pudl/src/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 89, in sample_file
    for row in iterator:
  File "/Users/matt/dev/pudl/src/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/json_handler.py", line 13, in generator_wrapper
    for key, value in obj.items():
AttributeError: 'str' object has no attribute 'items'
INFO Sampled 0 records.
AHA
that’s totally what’s happening
if I update the
get_row_iterator
function to skip rows that are type
dict
, it proceeds without issue and I end up with a non-empty
streams
array in the catalog
p
yeah youre 100% right - I tried with 1 record in a file and it gives me an empty schema
m
awesome
I’m going to log an issue with this tap and see about contributing a fix
p
awesome - glad you got to the bottom of it! To solve your immediate use case you can use
elt
with the
--catalog
flag to pass in your catalog manually if you were able generate one after hacking the code a bit
m
thanks! yes, I did end up with a catalog in the end
I think the issue is not actually that a one-line file is read into a dict rather than a list, but that the first line of every file is. I have 72k lines of output and none of them are from the first line of a file 😕
Copy code
$ cat output/s3_system_raw_7.jsonl| jq  'select(._smart_source_lineno == 1)'
gives no output
When I use this fork ^ in my meltano project I successfully pull all 5k files / 325k records from my s3 bucket 🎉
d
Hi @Matt Menzenski! Found this thread while trying to add encoding support for this tap and improve jsonpath feature (using https://pypi.org/project/jsonpath-ng/ to extract more complex json files than just array of object at the top level [{}.{},{}]) Did you manage to run tests? I tried to build venv, but it keeps failing with google dependencies. For encoding it doesn’t seem a big deal, but for jsonpath is critical to run json-related tests.
Hi! I opened two pull requests to tap-spreadsheets-anywhere repository: • Add option to set source file encoding https://github.com/ets/tap-spreadsheets-anywhere/pull/43 • Add JSONPath parser support https://github.com/ets/tap-spreadsheets-anywhere/pull/44 @Matt Menzenski it would be great to doublecheck changes, if you still have a source code of the branch you mentioned above that uses github actions to run the test suite
m
Yes, I can run tests - I was added as a maintainer of the repo as well (so I can review and merge pull requests). I’ll look at these on Monday, thanks for contributing!