# troubleshooting
s
Here is proof of the inconsistent column ordering between the S3 CSV file and the Athena table.
Header sequence of the CSV file:

headlinesMultiAssetResponsiveDisplay,longHeadlineMultiAssetResponsiveDisplay,responsiveSearchAdDescriptions,responsiveSearchAdHeadlines,imprAbsTop,currency,account,timeZone,activeViewAvgCPM,activeViewViewableCTR,activeViewViewableImpressions,activeViewMeasurableImprImpr,activeViewMeasurableCost,activeViewMeasurableImpr,activeViewViewableImprMeasurableImpr,adGroupID,adGroup,adGroupState,network,adStrength,adType,allConvRate,allConv,allConvValue,autoAppliedAdSuggestion,avgCost,avgCPC,avgCPE,avgCPM,avgCPV,avgPosition,businessName,callToActionTextResponsive,campaignID,campaign,campaignState,clicks,approvalStatus,convRate,conversions,totalConvValue,cost,costAllConv,costConv,costConvCurrentModel,finalURL,crossDeviceConv,ctr,conversionsCurrentModel,convValueCurrentModel,clientName,day,description,descriptionLine1,descriptionLine2,device,displayURL,engagementRate,engagements,customerID,gmailForwards,gmailSaves,gmailClicksToWebsite,gmailAdBusinessName,gmailAdDescription,gmailAdHeadline,ad,headline1,headline2,adID,imageAdName,impressions,interactionRate,interactions,interactionTypes,longHeadline,businessNameMultiAssetResponsiveDisplay,callToActionTextMultiAssetResponsiveDisplay,promotionTextMultiAssetResponsiveDisplay,path1,promotionTextResponsive,responsiveSearchAdPath1,shortHeadline,adState,imprTop,valueAllConv,valueConv,valueConvCurrentModel,videoPlayedTo100,videoPlayedTo25,videoPlayedTo50,videoPlayedTo75,viewRate,views,viewThroughConv,_sdc_customer_id,_sdc_report_datetime
The screencap shows the header sequence used when inserting data into the AWS Athena table.
a
Hi @sai_tai - This is known behavior with the CSV serialization that the target inherited from target-csv-s3; see either https://github.com/dataops-tk/target-athena/issues/13 or https://github.com/dataops-tk/target-athena/issues/6
s
Thank you @andrew_stewart. I think issue 13 is relevant to my case. So even if we've defined the schema (like accounts, ad_groups, ads, and campaigns in tap-adwords) in the extractor, target-athena will still reorder the column sequence under the current behavior? Is there any workaround you can think of so that I can ensure both the column order and the data are consistent in the final Athena table?
a
The CSV serializer needs to be completely rewritten imho. I would suggest using the jsonl serializer instead.
(Hoping to add parquet at some point soon)
s
Thank you @andrew_stewart. Since I'm creating an urgent proof-of-concept product at the moment, let me switch to another destination for now.
Do you have a plan to rewrite the CSV serializer part, or are you looking for someone to contribute that part?
a
I'd certainly welcome a contributor on that part.
(or even at least contributed suggestions on the CSV serializer.. you can see from those issues that part of the problem is that the field order in the schema dict from some taps isn’t guaranteed to be consistent, and that particular CSV serializer basically just writes each record dict as-is as a new line in the temp csv file. One idea is for the SDK to use an OrderedDict instead, but that doesn’t really solve the problem of a row-minded CSV serializer not being able to anticipate vastly different field sets. Maybe pandas or some other lib has a good way to deal with this.)
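One possible direction for that contribution, as a rough sketch (the function name and signature here are hypothetical, not the target's actual API): pin the column order by deriving a deterministic header from the schema's property names rather than from each record dict's key order, and let `csv.DictWriter` fill in any fields a given record is missing.

```python
import csv
import io


def serialize_records(schema, records):
    """Serialize record dicts to CSV with a column order fixed by the schema.

    Sorting the schema's property names yields the same header even when the
    tap emits the schema dict's keys in a different order on each run.
    """
    fieldnames = sorted(schema["properties"].keys())
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        restval="",              # blank out fields missing from a record
        extrasaction="ignore",   # drop fields not declared in the schema
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()
```

This doesn't solve the harder problem of records with vastly different field sets mid-stream, but it would make the header stable for a fixed schema.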
s
Would you please give me the line number so that I can play around with that part locally? Would you tell me the quickest way to test and develop the package? Is it using
some-singer-tap | target-athena --config [config.json]
? I haven't ever tried to develop any Singer tap before.. I think that is a good exercise for me, but I need some guidance haha
I tried to fork the repo to run it locally but unfortunately, I've no clue where to start.. After forking and cloning a copy locally, I tried to run the nosetests command. It returns
ModuleNotFoundError: No module named 'target_athena'
even though I have used
pip install .
to install the package.
a
I'm not that far ahead of you on the Singer learning curve, so I might not be the best person to ask 😄 . But what I have generally done is to test the target from within Meltano.
If you want to test completely locally, I think what you can do is configure the target in meltano.yml with the pip_url parameter pointed to /local/path/to/pkg
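For illustration, a minimal meltano.yml fragment along those lines might look like this (the local path is a placeholder; the `-e` prefix gives a pip editable install so local code changes take effect without reinstalling):

```yaml
# meltano.yml (sketch; the path is illustrative)
plugins:
  loaders:
    - name: target-athena
      pip_url: -e /local/path/to/target-athena
```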
s
Thank you @andrew_stewart. I'll try to take a look tomorrow. Regarding the point that you mention here: https://meltano.slack.com/archives/C01TCRBBJD7/p1626199005062400 - I believe this is the part related to the issue you mentioned above?
a
No
that logic is actually being bypassed by default
s
Great, thank you. I wonder, when you run
nosetests --where=tests/unit
, did you face the same issue, which is
AttributeError: module 'target_athena' has no attribute 'utils'
from tests/unit/test_unit.py?
Or do you just use Meltano to directly run the extractor code without using the unit test file?
One thing that I don't understand - where did we define the schema for creating the AthenaSink object? I thought the schema we defined in tap-adwords should be the one that AthenaSink uses, but it seems to me it isn't.. To understand that part, I should read through the Meltano SDK docs, right? https://sdk.meltano.com/en/main/sinks.html
class AthenaSink(BatchSink):
    def __init__(
        self,
        target,
        stream_name,
        schema,
        key_properties,
    ):
a
All the tests there are leftovers from the repo that this was forked from. We haven't updated any of them yet.
Though please feel free to.
But yeah, I’ve just been using Meltano as a test harness.
Questions about the SDK classes are probably best aimed at @aaronsteers
a
One thing that I don't understand - Where did we define the schema for creating the AthenaSink object?
I thought the schema that we defined in tap-adwords should be the one that AthenaSink uses, but it seems to me it doesn't..
Hi, @sai_tai - The SDK internal code handles receiving SCHEMA messages from the tap and initializing your Sink to handle any records it receives with the same stream name as the SCHEMA messages. So, as you are writing the Sink implementation, you can assume you are only handling records that correspond to the matched SCHEMA message and its containing info.
If there seems to be a mismatch between the Sink's schema and the records it gets passed, that would be a bug on the SDK side (or the tap side).
I'll also mention that (for now at least) if a stream's schema is replaced with a new SCHEMA message, that forces a flush of the existing stream and the SDK will auto-create a new instance of your Sink to handle records with the new schema. In this way, you can assume a constant schema while you are working with a single batch of records.
s
Thank you, folks. That is a very clear explanation about how the Sinks schema works.