# troubleshooting
steve_clarke
I have been using the default batch_size_rows setting for the pipelinewise target-snowflake loader, i.e. batch_size_rows = 100,000 records, when loading the data in Parquet format via an S3 bucket. I’m using Parquet because the job crashed with an invalid-character error when I used the CSV file format:
```
Found character '/' instead of field delimiter ','
```
The issue is that the job refuses to process if I use the default setting of 100,000 records. I decided to back it off to 50,000 records and it loads. I would have thought, however, that given the default is 100,000 rows it should be okay. Please note: I am using the latest Docker image meltano/meltano:v1.86.0-python3.8. I am running this in AWS ECS with a task memory of 4096 MiB and a task CPU of 1024 units. Perhaps, given the volume of data, it is sized a little too small for this initial load of a 500,000-row table with batches of 100,000 records? My other theory was that processing 100,000 records on the Snowflake side takes longer and some ELT process is timing out. I would be interested in your thoughts on this and on the two types of errors I received (see my logs below). @florian.hines, I am wondering if this is similar to the issue reported by @nick_hamlin, because he is having to back off batch_size_rows in a pipelinewise target as well. His target is Redshift and mine is Snowflake.

Thanks, Steve

First attempt, using a batch_size_rows of 100,000 for Snowflake:
```
2021-10-29T184547.545+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.059658050537109375, "tags": {"endpoint": "batch_result_list", "sobject": "My_Salesforce_Table", "status": "succeeded"}}
2021-10-29T184547.546+13:00 tap-salesforce | extractor | INFO Making GET request to https://my_salesforce_instance/services/async/41.0/job/7505P000000Y5jxQAC/batch/7515P000000p1D3QAI/result/7525P000000P84G with params: None
2021-10-29T184621.977+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 1, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T184721.979+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7580, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T184821.980+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7455, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T184921.981+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7689, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T185021.983+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7773, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T185121.983+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7683, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T185221.987+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7576, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T185321.989+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7715, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T185422.041+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 8031, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T185522.043+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7987, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29T185622.044+13:00 tap-salesforce | extractor | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 7576, "tags": {"endpoint": "My_Salesforce_Table"}}
2021-10-29…
```
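For context, the loader settings being discussed would look roughly like this in meltano.yml. This is only a sketch: the variant name, bucket, stage, and file-format values are placeholders, and the exact setting names should be checked against the pipelinewise target-snowflake README.
```yaml
plugins:
  loaders:
    - name: target-snowflake
      variant: transferwise                       # the pipelinewise variant (assumed name)
      config:
        s3_bucket: my-staging-bucket              # placeholder
        stage: analytics.my_s3_stage              # placeholder external stage
        file_format: analytics.my_parquet_format  # named file format used for the Parquet load
        batch_size_rows: 50000                    # backed off from the 100,000 default
```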
florian.hines
I’m guessing both the 1st error (the broken pipe while communicating with the tap or target) and the 2nd are actually the same: the loader failing (it seems like it’s timing out, perhaps?) is triggering an error in different spots depending on what Meltano was doing internally at the time of failure. @aaronsteers, I think you’ve used Snowflake quite a bit. Is there an easy way to track down whether Snowflake is timing out or the like? (Turning on debug logging is going to be super noisy because of the volume.)
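One comparatively low-noise option is to ask Snowflake itself rather than the pipeline: the INFORMATION_SCHEMA.QUERY_HISTORY table function records status, error message, and elapsed time for recent statements. A minimal sketch using the Snowflake Python connector (all connection values are placeholders):
```python
# A minimal sketch: inspect Snowflake's query history for the COPY/MERGE
# statements issued by the target, instead of enabling debug logging.
# All connection parameters below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="********",
    warehouse="MY_WH",
    database="MY_DB",
)

sql = """
    select start_time,
           execution_status,
           error_message,
           total_elapsed_time / 1000 as elapsed_seconds,
           query_text
    from table(information_schema.query_history(result_limit => 1000))
    where query_text ilike 'COPY INTO%'
       or query_text ilike 'MERGE INTO%'
    order by start_time desc
"""

# Print status and timing for each recent load statement.
for row in conn.cursor().execute(sql):
    print(row)
```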
aaronsteers
@steve_clarke - I don't know the details of the pipelinewise variant's implementation, but a similar batch issue was discussed recently, where the UPSERT max entries for a list is just over 16K. I think that one was for the (soon-to-be-deprecated) 'Meltano' variant, though: https://meltano.slack.com/archives/CMN8HELB0/p1634209714296300?thread_ts=1634116216.278500&cid=CMN8HELB0
I read through this thread a few times, and I don't know what would cause the 50K batch size to work while the 100K batch size does not. It seems like two issues might be helpful to log on the pipelinewise variant's issue tracker: one for the CSV parsing issue and another for the batch size failure. Since Parquet was a recent addition, I wonder if there are batch size constraints specific to Parquet in play here.
steve_clarke
Thank you @aaronsteers and @florian.hines for your feedback - I appreciate your time. I will contact the Pipelinewise maintainers about the two issues I have encountered and see what their feedback is. I will report back to this thread once I have a bit more information.
aaronsteers
Thanks! Much appreciated 👍 As was the case on the Meltano variant's thread, I'll advocate that if there are inherent limits, we should document them in the respective README, and ideally also "fail fast" during tap initialization if a config setting is outside the range that Snowflake can support.
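To illustrate the "fail fast" idea, here is a purely hypothetical sketch (not the target's actual code; MAX_BATCH_SIZE_ROWS stands in for whatever limit the target would actually document):
```python
# Hypothetical fail-fast config validation, run before any records are read.
# MAX_BATCH_SIZE_ROWS is a stand-in for the documented limit.
MAX_BATCH_SIZE_ROWS = 100_000


def validate_config(config: dict) -> None:
    batch_size = config.get("batch_size_rows", 100_000)
    if not 1 <= batch_size <= MAX_BATCH_SIZE_ROWS:
        raise ValueError(
            f"batch_size_rows={batch_size} is outside the supported range "
            f"1..{MAX_BATCH_SIZE_ROWS}; refusing to start the sync."
        )


validate_config({"batch_size_rows": 250_000})  # raises immediately at startup
```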
ken_payne
@steve_clarke have you already ruled out out-of-memory errors? Many warehouse targets (including the PPW snowflake variant) do record-bucketing in memory before flushing to file for upload/insert/copy (roughly the pattern sketched below). We found that higher batch sizes (even the default) consume a lot of memory (upwards of 8GB). This is somewhat compounded by the 'incomplete' bucket challenge inherent in per-stream in-memory bucketing (half-empty buckets hanging around in memory until a final flush call once all records have been received). Some ways to mitigate are:
• As you have found, use a smaller batch size.
• Increase memory resources.
• Split streams into separate ECS Tasks.
Hope that helps 🙂
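For anyone unfamiliar with the bucketing pattern described above, here is a rough sketch of the general idea (not the pipelinewise implementation): records are buffered per stream and only flushed once batch_size_rows is reached, so a partially filled buffer for every stream stays in memory until the final flush.
```python
# Rough sketch of per-stream in-memory bucketing with a batch_size_rows
# threshold; not the actual pipelinewise target code.
from collections import defaultdict

BATCH_SIZE_ROWS = 100_000
buckets = defaultdict(list)  # stream name -> buffered records


def handle_record(stream, record):
    # Buffer the record; flush only once this stream's bucket is full.
    buckets[stream].append(record)
    if len(buckets[stream]) >= BATCH_SIZE_ROWS:
        flush(stream)


def flush(stream):
    batch = buckets.pop(stream, [])
    # ...serialize `batch` to a local file, upload to S3, COPY into the warehouse...
    print(f"flushed {len(batch)} records for stream {stream}")


def finalize():
    # Final flush: drain the half-empty buckets that accumulated per stream.
    for stream in list(buckets):
        flush(stream)
```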
steve_clarke
Hi @ken_payne, thanks very much for contributing to this conversation. I have been experimenting with my ECS settings today. Here are my findings.
Update: I have continued to experiment with adjusting the resources allocated to my Docker / Fargate environment.
1. Increased the ephemeral storage from 20 GiB -> 200 GiB. Result: still failing.
2. Doubled the memory and CPU resources:
• 1024 CPU (1 vCPU) -> 2048 (2 vCPU)
• 4096 memory (4 GB) -> 8192 (8 GB)
With the increase in CPU and memory I was able to successfully ingest a large table with the batch size set to 100,000 records. I'm guessing that some sort of limit was being hit. My initial comment was "Will need to experiment further to determine whether it is CPU or memory related; I'm guessing it is memory related." Thanks very much for your suggestion - it is great to get some confirmation around what is happening with the memory. I do have a mega table with many rows and more than 500 columns. This one continues to trouble me, so I would like to use the batch_wait_limit_seconds parameter in target-snowflake to force more regular writes (see the sketch below). NOTE: It appears this may be a parameter missing from target-snowflake, like the role parameter. I'll message Taylor about this because he might be able to add it to the same issue. Thanks again.
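For reference, the kind of override being described would look roughly like this in meltano.yml, assuming batch_wait_limit_seconds is (or becomes) exposed on the Meltano plugin definition; the values and variant name are illustrative only:
```yaml
plugins:
  loaders:
    - name: target-snowflake
      variant: transferwise               # the pipelinewise variant (assumed name)
      config:
        batch_size_rows: 100000
        batch_wait_limit_seconds: 300     # flush at least every 5 minutes, even if the batch is not full
```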