Another question about the `tap-spreadsheets-anywh...
# troubleshooting
j
Another question about the
tap-spreadsheets-anywhere
tap. I'm pulling json files from S3, each file contains 1 JSON object. From what I can tell the tap requires the json files contains an array. Is there any way to configure the tap to work with JSON docs that contain just 1 object that's not wrapped in
[...]
?
p
I think based on the jsonl example in that section you shouldnt have a problem with your single line json files
m
Yes, pass type jsonl instead of json
I added support for that a few months ago, we have the same need
p
@Matt Menzenski is that hub example incorrect then since its configured as
json
?
m
My understanding is that if the files only ever contain exactly one record, both json and jsonl will work. But if a file can contain multiple records you’ll need to use jsonl type. If using json type in that scenario (multiple records not wrapped in an array), records after the first will not be captured.
j
Here's the error I get when using JSONL.
Unable to write Catalog entry for 'ccda_documents' - it will be skipped due to error Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
m
are you able to share an example JSON file?
j
And here's the error with JSON:
Copy code
ERROR Unable to write Catalog entry for 'ccda_documents' - it will be skipped due to error 'str' object has no attribute 'items'
m
looking at that error message, the questions I would ask are “is it truly valid JSON”? and “are there newline characters in the JSON object?”
j
let me take a look and see if I can get a sample file.
m
Copy code
ERROR Unable to write Catalog entry for 'ccda_documents' - it will be skipped due to error 'str' object has no attribute 'items'
This error matches my experience trying to use JSON type FWIW
j
it's auto generated JSON by Rails from an XML document 🥲
m
👍🏼 then it’s probably valid JSON
does that
pattern
match any JSON files that do have a containing array / more than one item?
j
no, each file will only contain 1 object
Here's a sample from one of these files:
Copy code
{
  "+p_xml": "version=\"1.0\" ",
  "ClinicalDocument": {
    "+@xmlns": "urn:hl7-org:v3",
    "realmCode": {
      "+@code": "US"
    },
    "typeId": {
      "+@root": "2.16.840.1.abcdefg",
      "+@extension": "POCD_HD000040"
    },
    "templateId": [
      {
        "+@root": "2.16.840.abcdefg"
      },
      {
        "+@root": "2.16.840.1.abcdefg"
      }
    ],
    "id": {
      "+@root": "dce8b808-c812-11ed-redacted"
    },
    "code": {
      "+@code": "34133-9",
      "+@codeSystem": "2.16.840.1.113883.6.1",
      "+@codeSystemName": "LN",
      "+@displayName": "Summarization of Episode Note"
    },
}
The
+
represent XML properties. I wonder if the leading
+
causes any problems? Also, these files are ~10-20k lines long
m
Is the actual record on multiple lines like that, or is it minified? I believe the tap expects the JSON record to be minified (all on a single line)
j
the record is not minified, it is indeed on 10k+ lines
is there anything that we can do at the tap level so that it can handle multi-line JSON files?
m
yeah, I’m not sure what the best approach would be but I’m sure there’s something
maybe a new “type” that tells the tap to try to parse an entire document into one JSON object and emit that ? thinkspin
PRs welcome 🙂 (I’m a maintainer of that tap, I can merge them)
j
i have some more time tomorrow and will look into it 👀 to give me a head start, do you happen to know the line of file/function/line of code that would be failing?
i'll validate that the tap works when the json is all on one line by manually updating a file. If so, another option for me would be to adjust the Rails script to put the json onto one line before writing to the file..
but i'd sleep better at night if the tap could handle this 🙂
m
here’s the PR where I added
jsonl
support, if that’s useful - you might start by adding test cases that have newlines in the JSON objects and seeing what error is thrown. https://github.com/ets/tap-spreadsheets-anywhere/pull/28/files
j
ty! I'll start there.
also, i can confirm that when i make the json file 1 line then the ingest does work
m
I believe the generator_wrapper function will be where a lot of this happens https://github.com/ets/tap-spreadsheets-anywhere/blob/main/tap_spreadsheets_anywhere/json_handler.py#L9 that
for obj in root_iterator:
is essentially equal to
for line in file:
I believe
p
@Matt Menzenski for what its worth I'm having success with both json and jsonl format type when I test with a file of jsonl like this thinkspin
Copy code
{"user_id": 1, "first_name": "John", "last_name": "Doe"}
{"user_id": 2, "first_name": "Sarah", "last_name": "Smith"}
{"user_id": 3, "first_name": "Joe", "last_name": "Momma"}
{"user_id": 4, "first_name": "Steve", "last_name": "Madden"}
with my meltano.yml
Copy code
- name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    config:
      tables:
      - path: s3://<bucket>
        format: jsonl
        key_properties: [user_id]
        name: user_names
        start_date: '2020-01-01T00:00:00Z'
        pattern: "spreadsheets_test/user_names\\.json"
or as
Copy code
- name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    config:
      tables:
      - path: s3://<bucket>
        format: json
        key_properties: [user_id]
        name: user_names
        start_date: '2020-01-01T00:00:00Z'
        pattern: "spreadsheets_test/user_names\\.json"
m
that’s interesting, I wouldn’t have expected that behavior
p
I was trying to confirm that the hub docs section was incorrect before updating but it worked and now I'm confused lol
The array style works with
json
as expected also
Copy code
[
    {"user_id": 1, "first_name": "John", "last_name": "Doe"},
    {"user_id": 2, "first_name": "Sarah", "last_name": "Smith"},
    {"user_id": 3, "first_name": "Joe", "last_name": "Momma"},
    {"user_id": 4, "first_name": "Steve", "last_name": "Madden"}
]
m
if you have a one-line file, type
JSON
will error
p
ahh good find - ok that makes sense. 5 months ago...I forget things fast 😭
j
showing some ignorance here, python isn't my first language!.. it looks like smart_open, the package under the hood that actually pulls the data from S3, is streaming data 1 line at a time here. these lines are then immediately being iterated over one at a time. i'm not really sure how to modify this to handle files with multi-line JSON(L). the streaming piece aside, a few options i was thinking about: 1. ingest the whole file first then use a JSON formatter to convert multiline JSON to be single line 2. wrap multiline json in
[..]
to create a list of 1 which we already know the package handles today
if someone wants to pair program on it, i'd be happy to 👋
otherwise, i'm now going to spend some time seeing if i can instead change the way the JSON file is created by Rails so that it's either on 1 line or is wrapped in square brackets