# troubleshooting
s
Here is proof of the inconsistent column ordering between the S3 CSV file and the Athena table.
Header sequence of the CSV file:

headlinesMultiAssetResponsiveDisplay,longHeadlineMultiAssetResponsiveDisplay,responsiveSearchAdDescriptions,responsiveSearchAdHeadlines,imprAbsTop,currency,account,timeZone,activeViewAvgCPM,activeViewViewableCTR,activeViewViewableImpressions,activeViewMeasurableImprImpr,activeViewMeasurableCost,activeViewMeasurableImpr,activeViewViewableImprMeasurableImpr,adGroupID,adGroup,adGroupState,network,adStrength,adType,allConvRate,allConv,allConvValue,autoAppliedAdSuggestion,avgCost,avgCPC,avgCPE,avgCPM,avgCPV,avgPosition,businessName,callToActionTextResponsive,campaignID,campaign,campaignState,clicks,approvalStatus,convRate,conversions,totalConvValue,cost,costAllConv,costConv,costConvCurrentModel,finalURL,crossDeviceConv,ctr,conversionsCurrentModel,convValueCurrentModel,clientName,day,description,descriptionLine1,descriptionLine2,device,displayURL,engagementRate,engagements,customerID,gmailForwards,gmailSaves,gmailClicksToWebsite,gmailAdBusinessName,gmailAdDescription,gmailAdHeadline,ad,headline1,headline2,adID,imageAdName,impressions,interactionRate,interactions,interactionTypes,longHeadline,businessNameMultiAssetResponsiveDisplay,callToActionTextMultiAssetResponsiveDisplay,promotionTextMultiAssetResponsiveDisplay,path1,promotionTextResponsive,responsiveSearchAdPath1,shortHeadline,adState,imprTop,valueAllConv,valueConv,valueConvCurrentModel,videoPlayedTo100,videoPlayedTo25,videoPlayedTo50,videoPlayedTo75,viewRate,views,viewThroughConv,_sdc_customer_id,_sdc_report_datetime
The screencap shows the header sequence used when inserting data into the AWS Athena table.
a
Hi @sai_tai - This is known behavior with the CSV serialization that the target inherited from target-csv-s3; see either https://github.com/dataops-tk/target-athena/issues/13 or https://github.com/dataops-tk/target-athena/issues/6
s
Thank you @andrew_stewart. I think issue 13 is relevant to my case. So even if we've defined the schema (like accounts, ad_groups, ads, and campaigns in tap-adwords) in the extractor, target-athena will still reorder the column sequence under the current behavior? Is there any workaround you can think of so that I can ensure both the column order and the data are consistent in the final Athena table?
a
The CSV serializer needs to be completely rewritten imho. I would suggest using the jsonl serializer instead.
(Hoping to add parquet at some point soon)
s
Thank you @andrew_stewart. Since I'm creating an urgent proof-of-concept product at the moment, let me switch to another destination for now.
Do you have a plan to rewrite the CSV serializer part, or are you looking for someone to contribute that part?
a
I'd certainly welcome a contributor on that part.
(or even at least contributed suggestions on the CSV serializer.. you can see from those issues that part of the problem is that the field order in the schema dict from some taps isn’t guaranteed to be consistent, and that particular CSV serializer basically just writes each record dict as-is as a new line in the temp csv file. One idea is for the SDK to use an OrderedDict instead, but that doesn’t really solve the problem of a row-minded CSV serializer not being able to anticipate vastly different field sets. Maybe pandas or some other lib has a good way to deal with this.)
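One possible direction for that contribution, as a rough sketch (the function name and signature here are hypothetical, not the target's actual API): pin the column order by deriving a deterministic header from the schema's property names rather than from each record dict's key order, and let `csv.DictWriter` fill in any fields a given record is missing.

```python
import csv
import io


def serialize_records(schema, records):
    """Serialize record dicts to CSV with a column order fixed by the schema.

    Sorting the schema's property names yields the same header even when the
    tap emits the schema dict's keys in a different order on each run.
    """
    fieldnames = sorted(schema["properties"].keys())
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        restval="",              # blank out fields missing from a record
        extrasaction="ignore",   # drop fields not declared in the schema
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()
```

This doesn't solve the harder problem of records with vastly different field sets mid-stream, but it would make the header stable for a fixed schema.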
s
Would you please give me the line number so that I can play around with that part locally? Would you tell me the quickest way to test and develop the package? Is it using
some-singer-tap | target-athena --config [config.json]
? I haven't ever tried to develop any Singer tap before.. I think that is a good exercise for me, but I need some guidance haha
I tried to fork the repo to run it locally but unfortunately, I've no clue where to start.. After forking and cloning a copy locally, I tried to run the nosetests command. It returns
ModuleNotFoundError: No module named 'target_athena'
even though I have used
pip install .
to install the package.
a
I'm not that far ahead of you on the Singer learning curve, so I might not be the best person to ask 😄 . But what I have generally done is to test the target from within Meltano.
If you want to test completely locally, I think what you can do is configure the target in meltano.yml with the pip_url parameter pointed to /local/path/to/pkg
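For illustration, a minimal meltano.yml fragment along those lines might look like this (the local path is a placeholder; the `-e` prefix gives a pip editable install so local code changes take effect without reinstalling):

```yaml
# meltano.yml (sketch; the path is illustrative)
plugins:
  loaders:
    - name: target-athena
      pip_url: -e /local/path/to/target-athena
```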
s
Thank you @andrew_stewart. I'll try to take a look tomorrow. Regarding the point that you mention here: https://meltano.slack.com/archives/C01TCRBBJD7/p1626199005062400 - I believe this is the part related to the issue you mentioned above?
a
No
that logic is actually being bypassed by default
s
Great, thank you. I wonder, when you run
nosetests --where=tests/unit
, did you face the same issue, which is
AttributeError: module 'target_athena' has no attribute 'utils'
from tests/unit/test_unit.py?
Or do you just use Meltano to directly run the extractor code without using the unit test file?
One thing that I don't understand - where did we define the schema for creating the AthenaSink object? I thought the schema we defined in tap-adwords should be the one that AthenaSink uses, but it seems to me it isn't.. To understand that part, I should read through the Meltano SDK docs, right? https://sdk.meltano.com/en/main/sinks.html
class AthenaSink(BatchSink):
    def __init__(
        self,
        target,
        stream_name,
        schema,
        key_properties,
    ):
a
All the tests there are leftovers from the repo that this was forked from. We haven't updated any of them yet.
Though please feel free to.
But yeah, I’ve just been using Meltano as a test harness.
Questions about the SDK classes are probably best aimed at @aaronsteers
a
One thing that I don't understand - Where did we define the schema for creating the AthenaSink object?
I thought the schema that we defined in tap-adwords should be the one that AthenaSink uses, but it seems to me it doesn't..
Hi, @sai_tai - The SDK internal code handles receiving SCHEMA messages from the tap and initializing your Sink to handle any records it receives with the same stream name as the SCHEMA messages. So, as you are writing the Sink implementation, you can assume you are only handling records that correspond to the matched SCHEMA message and its containing info.
If there seems to be a mismatch between the Sink's schema and the records it gets passed, that would be a bug on the SDK side (or the tap side).
I'll also mention that (for now at least) if a stream's schema is replaced with a new SCHEMA message, that forces a flush of the existing stream and the SDK will auto-create a new instance of your Sink to handle records with the new schema. In this way, you can assume a constant schema while you are working with a single batch of records.
s
Thank you, folks. That is a very clear explanation about how the Sinks schema works.