# troubleshooting
n
Hey all, getting a few loader errors when trying to load Campaign Monitor data using target-bigquery: the target hangs once it starts loading data from the tap and won't exit. When setting the loader method to `batch_job`, I notice we get certain 400 errors from the Google API and a `can't set attribute` error (first screenshot). However, the pipeline will not exit; it continues to pull data, but no data is loaded and the program never finishes. When trying to load the data with the `storage_write_api` method, we instead get a loader failed error right away, specifically a `Failed to parse CustomFields field: expected string or bytes-like object` error, and the pipeline exits (full trace in the second screenshot). Given the errors we are getting, I believe it may be an issue with data types, but I haven't found a resolution. Has anyone had a similar error to this? cc: @alexander_butler
a
Are you using the snake case option?
n
Not currently, should we?
a
No. That was just related to a schema evolution error I'd heard about. If you want to go with the most bulletproof load method first, my recommendation is batch job and denormalized false. Storage write should ship with a disclaimer that I would only recommend it for simple JSON schemas, because the runtime translation to protobuf is not straightforward. So for simple data like Salesforce or flat structures like databases it's fine. Even then, I find batch job can surprisingly outperform it.
The jobs attempts thing should have already been fixed. If you're certain you are running on main I can give it a look. Maybe it was reverted?
n
I set denormalized to false using `batch_job` and that solved the 400 issue. I'll let you know if we are still facing the hanging issue once the data finishes loading. Thank you for the help!
Noticed a new issue: after loading data for a few minutes we get the following error: `OSError: [Errno 24] Too many open files`. The extractor appears to make a few more API calls before stalling out, and no more data is loaded. Have you seen this error before?
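(Side note for anyone hitting the same thing: errno 24 means the process ran out of file descriptors. A quick, hedged way to inspect and, up to the hard limit, raise the per-process limit from Python on POSIX systems, independent of anything target-bigquery does:)

```python
import resource

# POSIX-only: inspect the per-process open-file limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft limit={soft}, hard limit={hard}")

# Optionally raise the soft limit for this process
# (cannot exceed the hard limit without elevated privileges).
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))
```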
Update on this: after stalling for 20ish minutes we get a loader error: `AttributeError: 'Compressor' object has no attribute '_gzip'`
a
Ah, the batch size is probably very low, isn't it? `batch_job` should be set along with a `batch_size: 100000` or more, depending on the capacity of the node you run on. It's compressed on the fly. It says so in the readme. We should have better defaults, I think: basically `method: batch_job`, `batch_size: 100000`, and `denormalized: false`, since that should be fairly infallible. So there's less experimentation 🤷
Then again, anyone could PR it if they cared enough.
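(For reference, a minimal sketch of those recommended settings as a standalone Singer config file; only the options discussed in this thread are shown, and connection settings plus exact key names should be checked against the target-bigquery README. In Meltano, the same keys would live under the loader's `config` block.)

```python
import json

# Hypothetical config.json for running the target directly, e.g.
#   target-bigquery --config config.json
config = {
    "method": "batch_job",    # most bulletproof load method per the advice above
    "batch_size": 100_000,    # or more, depending on the node's capacity
    "denormalized": False,    # avoids the schema-translation 400 errors above
}

with open("config.json", "w") as handle:
    json.dump(config, handle, indent=2)
```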
n
Perfect, that appears to have done the trick!
One other issue I found: it looks like with every pipeline run a new table is being created rather than loading into the previous table. Any ideas on what's causing this?
a
Those are temp tables used to atomically update the target table. They have an expiration and will clear themselves automatically.
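(If you want to confirm those really are expiring temp tables, one hedged way to check is to list each table's expiration with the official BigQuery client; the dataset name below is a placeholder:)

```python
from google.cloud import bigquery

client = bigquery.Client()

# "my-project.my_dataset" is a placeholder for the target dataset.
for item in client.list_tables("my-project.my_dataset"):
    table = client.get_table(item.reference)
    # Temp tables created by the loader should show an expiration timestamp;
    # the real target table should show None.
    print(f"{table.table_id}: expires={table.expires}")
```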
n
Hey @alexander_butler, for certain streams we are getting a similar `'Compressor' object has no attribute '_gzip'` error as before. This appears to be an issue when the temp tables for these streams are merged into the main table. For each of these streams we are loading data successfully into the temp tables, so I believe it is an error on the loader side. We have tested up to a batch size of 1 million and set denormalized to both `true` and `false`, but in every case the same attribute error arises. Do you have any insight into what could be causing an error when merging data from the temp table into the main table?
Hey @alexander_butler, sorry for all the pings, but any ideas on this? It is only happening for a select few larger tables, which makes me think it's an issue with batch size, but we have yet to find a solution; we have tried increasing the batch size and testing the other methods.
a
Can you test removing the `__del__` method from the compressor in `core.py`? It's all I can think of. There should never be a case where `._gzip` is not set; we have an if-else branch in the `__init__` that guarantees it's set, so one branch or the other should run. Also, I never saw this issue in all my use. Let me know if that helps?
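(A minimal sketch of the failure mode being discussed, assuming nothing about the real Compressor beyond what is described in this thread: if `__init__` raises or is interrupted before the branch that assigns `_gzip`, Python still calls `__del__` on the half-built object, which is one way a missing `_gzip` attribute can surface.)

```python
import gzip
import io


class Compressor:
    """Hypothetical stand-in, not the actual target-bigquery class."""

    def __init__(self, path=None):
        if path is not None and not isinstance(path, str):
            # Raising here exits __init__ before either branch assigns _gzip.
            raise TypeError("path must be a string")
        if path is None:
            self._buffer = io.BytesIO()
            self._gzip = gzip.GzipFile(fileobj=self._buffer, mode="wb")
        else:
            self._gzip = gzip.open(path, "wb")

    def __del__(self):
        # Runs even when __init__ raised, so the lookup can fail with
        # AttributeError: 'Compressor' object has no attribute '_gzip'.
        # A defensive alternative: gz = getattr(self, "_gzip", None).
        self._gzip.close()


# Demonstration of the hazard: construction fails, the half-built object is
# garbage collected, and Python reports the AttributeError from __del__ as
# "Exception ignored in: <function Compressor.__del__ ...>".
try:
    Compressor(path=123)
except TypeError:
    pass
```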