# troubleshooting
a
Hello, I have a single table that I am having a difficult time getting to load. After one batch is loaded, the process as a whole "hangs" (the best word I could think of to describe the situation). The table has about 370 columns and 8GB of total data. I have followed the execution flow to the
get_batches()
method (https://github.com/meltano/sdk/blob/main/singer_sdk/streams/core.py#L1431), which is the point at which execution hangs. A second "record_count" metric is sent for what would be the second batch, but nothing comes of it. After 30 minutes of waiting, if I kill the process manually, a Python process using about 20GB of RAM remains, which I also have to kill manually. This does not happen with any other load. Any ideas on what might be causing this? The system has more than enough resources to handle the load (usually only about 20% of CPU and 30% of memory), so I do not suspect that is the issue. I get the same results running from a Windows 11 machine, a Windows Server 2022 VM, and an Ubuntu 22.04 VM.
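For context, this is roughly how I understand get_batches() to work from the linked core.py, written as a standalone sketch; the function names here are made up and simplified, not the SDK's actual code. The point is that each chunk gets fully materialized and written out as a gzipped JSONL file before the next BATCH message goes out, and the hang happens somewhere between finishing one chunk and starting the next.

```python
# Simplified sketch of the batching flow as I understand it from core.py;
# names and details are illustrative, not the SDK's actual implementation.
import gzip
import itertools
import json
from collections.abc import Iterable, Iterator


def chunks(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Lazily group the record stream into lists of at most `size` records."""
    it = iter(records)
    while chunk := list(itertools.islice(it, size)):
        yield chunk


def write_batches(records: Iterable[dict], size: int, prefix: str) -> Iterator[str]:
    """Write each chunk to its own gzipped JSONL file and yield its path.

    The tap would then emit a BATCH message pointing at each file.
    """
    for i, chunk in enumerate(chunks(records, size), start=1):
        path = f"{prefix}-{i}.json.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for record in chunk:
                f.write(json.dumps(record, default=str) + "\n")
        yield path
```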
v
What I'd do is run
meltano invoke tap-oracle > out
does that hang? If not
meltano invoke target-redshift < out
does that hang? If so, where? Is it one record? Can you isolate it, etc.?
They shouldn't ever hang like that, so it's a pretty nasty bug.
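If it does turn out to be the target side, something like this could help slice the out file down and bisect toward a bad record. Rough sketch only: the helper name is made up, and it assumes the out file contains plain newline-delimited Singer messages with RECORD lines (in batch mode you'd see BATCH messages pointing at files instead).

```python
# Hypothetical helper for bisecting a captured Singer stream: keep all
# non-RECORD messages (SCHEMA, STATE) but only the first `limit` RECORD
# messages, then feed the trimmed file to the target and adjust `limit`
# until the offending record is cornered.
import json
import sys


def take_records(in_path: str, out_path: str, limit: int) -> None:
    kept = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if json.loads(line).get("type") == "RECORD":
                if kept >= limit:
                    continue
                kept += 1
            dst.write(line)


if __name__ == "__main__":
    # usage: python take_records.py out out.trimmed 100000
    take_records(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```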
a
I ran both commands and
meltano invoke tap-oracle > out
hung in exactly the same spot. After killing the process (I waited about 20 minutes to see if it would finish), I ran
meltano invoke target-redshift < out
and it completed successfully for the one batch file that did get written.
I added a row count log and it always loads up to 1,000,000 records (so it fills the sink fully a second time, since that batch limit is 500k records), sends out the record_count metric for the second sink, and then just... stops. The only unique thing I am seeing is that the first record_count metric pings with a count of "1". It could be nothing, but I don't see that being the case for other streams.
To add some diversity to the tests, I set the start_date value to see what would happen:
1. start_date: NULL
   a. Hangs at 1,000,000 records (2 full batches)
2. start_date: 2000-01-01
   a. Hangs at 1,000,000 records (2 full batches)
3. start_date: 2023-10-01
   a. Hangs at 920,000 records (1 full batch, 1 partial batch)
4. start_date: 2024-10-01
   a. Completes successfully (1 partial batch, 120,000 records)
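For reference, this is roughly how I re-ran the sync for each case. It is only a sketch: TapOracle and its import path stand in for our internal custom tap, and the real config has connection settings I am omitting.

```python
# Rough sketch of repeating the sync with different start_date values.
# TapOracle is our internal SDK-based tap class (import path is illustrative);
# connection settings are omitted.
from datetime import datetime, timezone

from tap_oracle.tap import TapOracle  # hypothetical path to our custom tap

BASE_CONFIG = {
    # host, port, service name, credentials, batch_config, etc. omitted
}

for start_date in [None, "2000-01-01", "2023-10-01", "2024-10-01"]:
    config = dict(BASE_CONFIG)
    if start_date:
        config["start_date"] = start_date
    started = datetime.now(timezone.utc)
    TapOracle(config=config).sync_all()
    elapsed = datetime.now(timezone.utc) - started
    print(f"start_date={start_date!r}: finished in {elapsed}")
```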
e
> Could be nothing but I don't see that being the case for other loads.
Do the other loads use different connectors or do they not use batch?
a
I corrected my message. I meant other streams in the same run.
@Edgar Ramírez (Arch.dev) I spent the day trying out many different configurations and found that, regardless of start_date or number of records, it always hung when the table had at least 342 columns. 341 columns was totally fine and ran like normal; 342 columns and it hung. I tried many different columns of different types and content for the 342nd column and it made no difference. Any thoughts?
(Just taking live notes for anyone else who may experience this.) I dropped the batch count down from 1_000_000 to 500_000 with no luck, but dropping it down to 200_000 records per batch did end up clearing the hurdle. I am struggling to understand why, because there is absolutely no resource contention, but this is where my Python knowledge starts to hit its limit.
The second batch of 200_000 records takes only 17s to retrieve and process during the `_sync_records()` call. However, after the 200_000 records are retrieved, the subsequent call to
lazy_chunked_generator()
does not occur for another 128s. I cannot figure out what is happening in that gap.
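In case it helps anyone else narrow down a gap like this, here is the kind of instrumentation I added in our custom stream. It is a sketch: the class name is illustrative, the override just wraps the SDK's get_batches() generator with timing logs, and faulthandler periodically dumps every thread's stack so a hang still leaves a trace of where it is stuck.

```python
# Sketch of timing instrumentation for our custom stream class: log how long
# each batch takes to become ready, and dump all thread stacks every 10 minutes
# so a stalled process shows where it is spending time.
import faulthandler
import time

from singer_sdk import SQLStream

faulthandler.dump_traceback_later(timeout=600, repeat=True)


class OracleStream(SQLStream):  # illustrative name for our custom stream
    # connector_class and other required attributes omitted for brevity

    def get_batches(self, batch_config, context=None):
        last = time.monotonic()
        for i, batch in enumerate(super().get_batches(batch_config, context), start=1):
            now = time.monotonic()
            self.logger.info("Batch %d ready after %.1fs", i, now - last)
            last = now
            yield batch
```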
v
I'd try querying that column yourself with your own query engine; maybe something is up? Make sure that it works when you pull it, then it's just a matter of narrowing it down. This is a weird one for sure.
DBeaver or something?
My guess is the data has something weird about it that's causing issues; the problem is the issue could be in a bunch of places
(Oracle DB itself, the SQL driver, network/OS, the tap itself, etc.). The unfortunate part of that tap is it isn't super clean (not SDK based), but it's still a hard problem regardless. Debugging options should be better.
a
I thought the same as well, but as mentioned above I tried many different columns with different types (date, string, number, etc.) and with different contents, and nothing changed. The last set of columns (320-370) is pretty much exclusively NULL other than the two final auditing columns. We are using our own custom tap for Oracle built directly from the SDK. Our custom tap has virtually no overrides beyond the basics like
get_sqlalchemy_url
and
create_engine
for the SQLConnector. (I queried the data directly and it returned fine. I even exported the table as a whole from Oracle SQL Developer and DataGrip.)
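For completeness, the extent of our overrides is essentially this (a sketch; the config keys are placeholders and the oracle+oracledb URL format is just what we happen to use):

```python
# Sketch of our only customizations: the connection URL and engine construction.
import sqlalchemy
from singer_sdk import SQLConnector


class OracleConnector(SQLConnector):  # illustrative name for our custom connector
    def get_sqlalchemy_url(self, config: dict) -> str:
        # Build the SQLAlchemy URL from tap config (keys here are placeholders).
        return (
            f"oracle+oracledb://{config['user']}:{config['password']}"
            f"@{config['host']}:{config['port']}/?service_name={config['service_name']}"
        )

    def create_engine(self) -> sqlalchemy.engine.Engine:
        # Only here so we can pass driver-specific engine options; otherwise the
        # base implementation would be fine.
        return sqlalchemy.create_engine(
            self.get_sqlalchemy_url(self.config),
            pool_pre_ping=True,
        )
```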
I would think that if it were a content issue, no batch size would solve the problem, but my mental logic could be off.
You are right though, it could be any number of things that are hard to debug through a Slack message chain. At least dropping the batch size allowed me to continue forward.