Hi All, I’ve been using meltano for a tap-shopify...
# troubleshooting
n
Hi All, I’ve been using meltano for a tap-shopify to target-athena. I’ve opened and followed some issues and have a work around for my target-athena to get data loaded in: tap: https://github.com/singer-io/tap-shopify/issues/125 target: https://github.com/MeltanoLabs/target-athena/issues/27 https://github.com/MeltanoLabs/target-athena/issues/28 Two issues would appreciate some help with: 1 - target-athena (dataops-tk’s) requires me to add a schema into the meltano.yml file due to some issues with {} within the orders schema. When I load the data into Athena, all the columns come up as a varchar of type string making it inconvenient to use. What changes can I make so this isn’t coming up as a varchar for all columns?
Copy code
From my meltano.yml

    schema:
      orders:
        subtotal_price_set:
          type:
          - object
          - 'null'
        total_discounts_set:
          type:
          - object
          - 'null'
        total_line_items_price_set:
          type:
          - object
          - 'null'
        total_price_set:
          type:
          - object
          - 'null'
        total_shipping_price_set:
          type:
          - object
          - 'null'
        total_tax_set:
          type:
          - object
          - 'null'
        discounted_price_set:
          type:
          - object
          - 'null'
        price_set:
          type:
          - object
          - 'null'
2 - I’ve experimented with target-bigquery instead, which has it’s own problems. It fails because of the same schema issue in orders, but this time I’m not able to resolve this using the same schema fix as I use for target-athena. So instead I need to make a catalog.json file.
Copy code
TAP_SHOPIFY__CATALOG=extract/tap-shopify.catalog.json
here the issue is that i’m only getting customers, orders, products, and transactions. I’m failing to get inventory_levels and inventory_item which are supported by the tap. I added those two by running
Copy code
meltano invoke tap-shopify --discover > test.json
and copying the streams into my tap-shopify.catalog from test. Why doesn’t the same schema override for athena work for here? Why can’t i get inventory levels and items piping through?
v
How are you running Meltano?
meltano elt tap-shopify target-athena --job_id=abc
?|
When I load the data into Athena, all the columns come up as a varchar of type string making it inconvenient to use.
From https://github.com/andrewcstewart/target-athena/blob/master/target_athena/utils.py#L162 , target-athena looks to build a struct in athena. Which means your DDL should work fine. What's the DDL for your table that's created in Athena? Is it a struct currently or a string?
Why doesn’t the same schema override for athena work for here?
Because you're overriding the catalog with TAP_SHOPIFY__CATALOG=extract/tap-shopify.catalog.json Select doesn't work if you're passing in a catalog. Select builds the catalog for you.
My guess for 1. is that there's already a table made in athena, if you add a new database or remove the old athena database it'll get recreated for you. (you could also manually update the DDL). I had an example going https://github.com/AutoIDM/shopify2athena , I didn't check the DDL on those specific fields in the database. Removed the data and access already on my end so I can't double check it 😕
n
@visch The tables were cleared and automatically recreated before rerunning, but not the Athena database. Why would that matter? The DDL is naturally all strings ```CREATE EXTERNAL TABLE `orders`(
presentment_currency
string COMMENT 'from deserializer',
subtotal_price_set
string COMMENT 'from deserializer',
total_discounts_set
string COMMENT 'from deserializer',
total_line_items_price_set
string COMMENT 'from deserializer',
total_price_set
string COMMENT 'from deserializer',
total_shipping_price_set
string COMMENT 'from deserializer',
total_tax_set
string COMMENT 'from deserializer',
total_price
string COMMENT 'from deserializer',
line_items
string COMMENT 'from deserializer',
processing_method
string COMMENT 'from deserializer',
order_number
string COMMENT 'from deserializer',
confirmed
string COMMENT 'from deserializer',
total_discounts
string COMMENT 'from deserializer',
total_line_items_price
string COMMENT 'from deserializer',
order_adjustments
string COMMENT 'from deserializer',
shipping_lines
string COMMENT 'from deserializer',
admin_graphql_api_id
string COMMENT 'from deserializer',
device_id
string COMMENT 'from deserializer',
cancel_reason
string COMMENT 'from deserializer',
currency
string COMMENT 'from deserializer',
payment_gateway_names
string COMMENT 'from deserializer',
source_identifier
string COMMENT 'from deserializer',
id
string COMMENT 'from deserializer',
processed_at
string COMMENT 'from deserializer',
referring_site
string COMMENT 'from deserializer',
contact_email
string COMMENT 'from deserializer',
location_id
string COMMENT 'from deserializer',
fulfillments
string COMMENT 'from deserializer',
customer
string COMMENT 'from deserializer',
test
string COMMENT 'from deserializer',
total_tax
string COMMENT 'from deserializer',
payment_details
string COMMENT 'from deserializer',
number
string COMMENT 'from deserializer',
email
string COMMENT 'from deserializer',
source_name
string COMMENT 'from deserializer',
landing_site_ref
string COMMENT 'from deserializer',
shipping_address
string COMMENT 'from deserializer',
total_price_usd
string COMMENT 'from deserializer',
closed_at
string COMMENT 'from deserializer',
discount_applications
string COMMENT 'from deserializer',
name
string COMMENT 'from deserializer',
note
string COMMENT 'from deserializer',
user_id
string COMMENT 'from deserializer',
source_url
string COMMENT 'from deserializer',
subtotal_price
string COMMENT 'from deserializer',
billing_address
string COMMENT 'from deserializer',
landing_site
string COMMENT 'from deserializer',
taxes_included
string COMMENT 'from deserializer',
token
string COMMENT 'from deserializer',
app_id
string COMMENT 'from deserializer',
total_tip_received
string COMMENT 'from deserializer',
browser_ip
string COMMENT 'from deserializer',
discount_codes
string COMMENT 'from deserializer',
tax_lines
string COMMENT 'from deserializer',
phone
string COMMENT 'from deserializer',
note_attributes
string COMMENT 'from deserializer',
fulfillment_status
string COMMENT 'from deserializer',
order_status_url
string COMMENT 'from deserializer',
client_details
string COMMENT 'from deserializer',
buyer_accepts_marketing
string COMMENT 'from deserializer',
checkout_token
string COMMENT 'from deserializer',
tags
string COMMENT 'from deserializer',
financial_status
string COMMENT 'from deserializer',
customer_locale
string COMMENT 'from deserializer',
checkout_id
string COMMENT 'from deserializer',
total_weight
string COMMENT 'from deserializer',
gateway
string COMMENT 'from deserializer',
cart_token
string COMMENT 'from deserializer',
cancelled_at
string COMMENT 'from deserializer',
refunds
tried again with fresh database, still same issue
As for the BigQuery question:
Q: Why doesn’t the same schema override for athena work for here?
A: Because you’re overriding the catalog with TAP_SHOPIFY__CATALOG=extract/tap-shopify.catalog.json
Unfortunately if I unset TAP_SHOPIFY__CATALOG or even just start out with just the schema in the YML like doing for athena, it fails because of the schema
v
https://meltano.slack.com/archives/C01TCRBBJD7/p1636219908139000?thread_ts=1636127769.131300&cid=C01TCRBBJD7 is helpful How did you run Meltano to get that created? What was your run command? invoke? etl? Your entire meltano.yml would be helpful What's curious is
subtotal_price
string COMMENT 'from deserializer', ` , with your override this should be an object. Something's causing that not to happen. Are you also running meltano with the
TAP_SHOPIFY__CATALOG
set?
n
respective yml files are attached and named accordingly 1. Athena: TAP_SHOPIFY__CATALOG is NOT SET as an environment variable. The schema section in the yml is sufficient. This project is ran inside it’s own virtual environment using
meltano elt tap-shopify target-athena
2. BigQuery: I run TAP_SHOPIFY__CATALOG set for Bigquery with the following command meltano elt tap-shopify target-bigquery. This is ran it it’s own virtual environment. This command fails if just using the schema in the yml (see error below). The way to resolve is to use the catalog file I attached previously and setting the TAP_SHOPIFY__CATALOG environment variable within the virtualenv.
Copy code
line 22, in get_type\n    raise ValueError(\n', "ValueError: 'type' or 'anyOf' are required fields in property: {}\n"]
@visch
v
Which one do we want to fix #1 or #2 ? 😄
Focusing on both isn't working!
n
More interested in why nothing works 🙂. As for 1 vs 2, whichever is simpler. I’ve moved to custom pipelines since meltano has been slowing me down. Now just interested in solving the problem and having this documented for others.
v
To be fair you have data flowing into Athena, with open source tools! That's pretty cool, but we can try to do better as always. No one's forcing you to use this tooling, so if you want to write and manage the stuff yourself go ahead. I choose these tools because it saves me a ton of time. I'll chose one then, but it's your call!
What's curious is
subtotal_price
string COMMENT 'from deserializer', ` , with your override this should be an object. Something's causing that not to happen.
My best two guesses without diving: 1. Either https://github.com/andrewcstewart/target-athena/blob/master/target_athena/utils.py#L162 isn't working at all (I doubt, but I don't know) 2. The schema isn't actually set properly for
subtotal_price
, if you run
meltano invoke tap-shopify > out
and look at the schema for the
orders
stream I wonder if
subtotal_price
is actually being set to Object properly. If that's working then I'd run
cat out | meltano invoke target-athena
and see if athena properly sets the tables for you. Then it's more obvious where to go I think
n
it is cool! I want it to work not just for me, but for others. Thats the only reason i’m still engaging 🙂 plus the community has been helpful
after running meltano invoke tap-shopify > out and searching for subtotal_price
Copy code
"subtotal_price": {"format": "singer.decimal", "type": ["null", "string"]}
v
Ahh nice that is very helpful. With all of that it looks like my original worry about
subtotal_price
is just wrong. The field you're overriding to be an object is actually
subtotal_price_set
, so I'm going to focus on that field right now. Focusing just on
subtotal_price_set
, in the Schema that you posted it is set properly to an object from
Copy code
{"type": "SCHEMA", "stream": "orders", "schema": {"properties": {"presentment_currency": {"type": ["null", "string"]}, "subtotal_price_set": {"type": ["object", "null"]},...
From the Athena ddl
Copy code
`subtotal_price_set` string COMMENT 'from deserializer',
Odd for sure. From here next steps I'd take are: 1. Take the schema data you listed for the
orders
stream 2. Snag one record line for the
orders
stream, make a test file just for that, let's call it
orders_test
3. Run
cat orders_test | meltano invoke target-athena
, that should replicate the error again (I'd try to point to a different database just to seperate your tests from the stuff in production). 4. After you have that easy file around to do some testing, now we have a failing test and we just need to make it pass 5. I'd throw a log line here https://github.com/andrewcstewart/target-athena/blob/master/target_athena/utils.py#L162 (or setup a break point here in a debugger) , and just see what's happening. Is this condition getting hit? What is
attributes['type']
when
subtotal_price_set
is ran through here?
Another option to skip all the debugging would be to change the column type of
subtotal_price_set
in athena to a struct instead
n
@visch I tried:
meltano invoke tap-shopify > output_test.json
I then opened up output_test.json and picked a random row of type RECORD. I copied everything to the right of {“type”: “RECORD”, “stream”: “orders”, “record”: into a file
orders_test
when i tried to run
cat orders_test | meltano invoke target-athena
I get the following error
Copy code
RuntimeError: Config validation failed: 's3_bucket' is a required property

Failed validating 'required' in schema:
    {'properties': {'add_metadata_columns': {'type': ['boolean', 'null']},
                    'athena_database': {'type': ['string']},
                    'aws_access_key_id': {'type': ['string', 'null']},
                    'aws_profile': {'type': ['string', 'null']},
                    'aws_region': {'type': ['string']},
                    'aws_secret_access_key': {'type': ['string', 'null']},
                    'aws_session_token': {'type': ['string', 'null']},
                    'compression': {'default': 'gzip',
                                    'type': ['string', 'null']},
                    'delimiter': {'default': ',',
                                  'type': ['string', 'null']},
                    'encryption_key': {'type': ['string', 'null']},
                    'encryption_type': {'type': ['string', 'null']},
                    'flatten_records': {'type': ['boolean', 'null']},
                    'naming_convention': {'type': ['string', 'null']},
                    'object_format': {'default': 'csv',
                                      'type': ['string', 'null']},
                    'quotechar': {'default': '"',
                                  'type': ['string', 'null']},
                    's3_bucket': {'type': ['string']},
                    's3_key_prefix': {'type': ['string', 'null']},
                    'stream_map_config': {'properties': {},
                                          'type': ['object', 'null']},
                    'stream_maps': {'properties': {},
                                    'type': ['object', 'null']},
                    'temp_dir': {'type': ['string', 'null']}},
     'required': ['s3_bucket', 'athena_database', 'aws_region'],
     'type': 'object'}

On instance:
    {'aws_region': 'us-west-2',
     'compression': 'gzip',
     'delimiter': ',',
     'object_format': 'jsonl',
     'quotechar': '"',
     'schema_name': 'shopify'}
JSONSchema was: {'type': 'object', 'properties': {'s3_bucket': {'type': ['string']}, 'athena_database': {'type': ['string']}, 'aws_region': {'type': ['string']}, 'aws_access_key_id': {'type': ['string', 'null']}, 'aws_secret_access_key': {'type': ['string', 'null']}, 'aws_session_token': {'type': ['string', 'null']}, 'aws_profile': {'type': ['string', 'null']}, 's3_key_prefix': {'type': ['string', 'null']}, 'naming_convention': {'type': ['string', 'null']}, 'object_format': {'type': ['string', 'null'], 'default': 'csv'}, 'compression': {'type': ['string', 'null'], 'default': 'gzip'}, 'encryption_type': {'type': ['string', 'null']}, 'encryption_key': {'type': ['string', 'null']}, 'add_metadata_columns': {'type': ['boolean', 'null']}, 'flatten_records': {'type': ['boolean', 'null']}, 'delimiter': {'type': ['string', 'null'], 'default': ','}, 'quotechar': {'type': ['string', 'null'], 'default': '"'}, 'temp_dir': {'type': ['string', 'null']}, 'stream_maps': {'type': ['object', 'null'], 'properties': {}}, 'stream_map_config': {'type': ['object', 'null'], 'properties': {}}}, 'required': ['s3_bucket', 'athena_database', 'aws_region']}
Not sure if that error is helpful at all ^ Is the point of just taking a single order in orders_test just to speed the failure process? One of the stores i’m pulling from has a very small amount of data so it’s not that much of a time save to rerun the whole pipe, so i can skip to 5? Thanks for showing that tho, can be useful in the future. Where can i find the code for target-athena relative to my meltano project?
Another option to skip all the debugging would be to change the column type of 
subtotal_price_set
 in athena to a struct instead
Is there a shortcut other than copying to a new column and casting to struct by specifying the full schema? Can i get the assumed struct meltano wants to use anywhere?
v
RuntimeError: Config validation failed: 's3_bucket' is a required property
Have you tried to fix this one?
Can i get the assumed struct meltano wants to use anywhere?
It's
target-athena
, not meltano subtle but very key difference.
Is there a shortcut other than copying to a new column and casting to struct by specifying the full schema?
Maybe, I don't know.
target-athena
can do it based on the code I linked to but you have to debug that
n
ah opps, had to rerun my environment variables
so do i need to branch target-athena, make a change a change, push the code, and in the meltano.yml file point to my edited repo? Or can i reference a local version of my code?
v
Yeah each tap/target can be slightly different. There's generic guides on the meltano site. For target-athena I'd read the Readme, and it even has a
meltano.yml
in the repo here https://github.com/MeltanoLabs/target-athena/blob/main/meltano.yml which points to the local executable which shows how to run a local copy pretty easily
so do i need to branch target-athena, make a change a change, push the code, and in the meltano.yml file point to my edited repo?
Or can i reference a local version of my code?
Both can work, however you like to work
When I'm debugging I'll hop into my site packages in my venv and manually add things. Although a faster way to do this is probably setup a debugger 🤷