# troubleshooting
a
Hello, I have a single table that I am having a difficult time getting to load. After one batch is loaded, the process as a whole "hangs" (the best word I could think of to describe the situation). The table has about 370 columns and 8GB of total data. I have followed the execution flow to the
get_batches()
method (https://github.com/meltano/sdk/blob/main/singer_sdk/streams/core.py#L1431), which is the point at which execution hangs. A second "record_count" metric is sent for what would be the second batch, but nothing comes of it. After 30 minutes of waiting, if I kill the process manually, a Python process using about 20GB of RAM remains, which I also have to kill manually. This does not happen with any other load. Any ideas on what might be causing this? The system has more than enough resources to handle the load (usually only about 20% of CPU and 30% of memory), so I do not suspect that is the issue. I get the same results running from a Windows 11 machine, a Windows Server 2022 VM, and an Ubuntu 22.04 VM.
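For context, this is roughly how I understand get_batches() to work from the linked core.py, written as a standalone sketch; the function names here are made up and simplified, not the SDK's actual code. The point is that each chunk gets fully materialized and written out as a gzipped JSONL file before the next BATCH message goes out, and the hang happens somewhere between finishing one chunk and starting the next.

```python
# Simplified sketch of the batching flow as I understand it from core.py;
# names and details are illustrative, not the SDK's actual implementation.
import gzip
import itertools
import json
from collections.abc import Iterable, Iterator


def chunks(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Lazily group the record stream into lists of at most `size` records."""
    it = iter(records)
    while chunk := list(itertools.islice(it, size)):
        yield chunk


def write_batches(records: Iterable[dict], size: int, prefix: str) -> Iterator[str]:
    """Write each chunk to its own gzipped JSONL file and yield its path.

    The tap would then emit a BATCH message pointing at each file.
    """
    for i, chunk in enumerate(chunks(records, size), start=1):
        path = f"{prefix}-{i}.json.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for record in chunk:
                f.write(json.dumps(record, default=str) + "\n")
        yield path
```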
v
What I'd do is run
meltano invoke tap-oracle > out
does that hang? If not
meltano invoke target-redshift < out
does that hang? If so, where? Is it one record? Can you isolate it, etc.?
They shouldn't ever hang like that, so it's a pretty nasty bug.
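If it does turn out to be the target side, something like this could help slice the out file down and bisect toward a bad record. Rough sketch only: the helper name is made up, and it assumes the out file contains plain newline-delimited Singer messages with RECORD lines (in batch mode you'd see BATCH messages pointing at files instead).

```python
# Hypothetical helper for bisecting a captured Singer stream: keep all
# non-RECORD messages (SCHEMA, STATE) but only the first `limit` RECORD
# messages, then feed the trimmed file to the target and adjust `limit`
# until the offending record is cornered.
import json
import sys


def take_records(in_path: str, out_path: str, limit: int) -> None:
    kept = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if json.loads(line).get("type") == "RECORD":
                if kept >= limit:
                    continue
                kept += 1
            dst.write(line)


if __name__ == "__main__":
    # usage: python take_records.py out out.trimmed 100000
    take_records(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```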
a
I ran both commands and
meltano invoke tap-oracle > out
hung in exactly the same spot. After killing the process (I waited about 20 minutes to see if it would finish), I ran
meltano invoke target-redshift < out
and it completed successfully for the one batch file that did get written.
I added a row count log and it always loads up to 1,000,000 records (so it fills the sink fully a second time, since that batch limit is 500k records), sends out the record_count metric for the second sink, and then just... stops. The only unique thing I am seeing is that the first record_count metric pings with a count of "1". It could be nothing, but I don't see that being the case for other streams.
To add some diversity to the tests, I set the start_date value to see what would happen:
1. start_date: NULL
   a. Hangs at 1,000,000 records (2 full batches)
2. start_date: 2000-01-01
   a. Hangs at 1,000,000 records (2 full batches)
3. start_date: 2023-10-01
   a. Hangs at 920,000 records (1 full batch, 1 partial batch)
4. start_date: 2024-10-01
   a. Completes successfully (1 partial batch, 120,000 records)
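For reference, this is roughly how I re-ran the sync for each case. It is only a sketch: TapOracle and its import path stand in for our internal custom tap, and the real config has connection settings I am omitting.

```python
# Rough sketch of repeating the sync with different start_date values.
# TapOracle is our internal SDK-based tap class (import path is illustrative);
# connection settings are omitted.
from datetime import datetime, timezone

from tap_oracle.tap import TapOracle  # hypothetical path to our custom tap

BASE_CONFIG = {
    # host, port, service name, credentials, batch_config, etc. omitted
}

for start_date in [None, "2000-01-01", "2023-10-01", "2024-10-01"]:
    config = dict(BASE_CONFIG)
    if start_date:
        config["start_date"] = start_date
    started = datetime.now(timezone.utc)
    TapOracle(config=config).sync_all()
    elapsed = datetime.now(timezone.utc) - started
    print(f"start_date={start_date!r}: finished in {elapsed}")
```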
e
> Could be nothing but I don't see that being the case for other loads.
Do the other loads use different connectors or do they not use batch?
a
I corrected my message. I meant other streams in the same run.
@Edgar Ramírez (Arch.dev) I spent the day trying out many different configurations and found that, regardless of start_date or number of records, it always hung when the table had at least 342 columns. 341 columns was totally fine and ran like normal; 342 columns and it hung. I tried many different columns of different types and content for the 342nd column and it made no difference. Any thoughts?
(Just taking live notes for anyone else who may experience this.) I dropped the batch count down from 1_000_000 to 500_000 with no luck, but dropping it down to 200_000 records per batch did end up clearing the hurdle. I am struggling to understand why, because there is absolutely no resource contention, but this is where my Python knowledge starts to hit its limit.
The second batch of 200_000 records takes only 17s to retrieve and process during the `_sync_records()` call. However, after the 200_000 records are retrieved, the subsequent call to
lazy_chunked_generator()
does not occur for another 128s. I cannot figure out what is happening in that gap.
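In case it helps anyone else narrow down a gap like this, here is the kind of instrumentation I added in our custom stream. It is a sketch: the class name is illustrative, the override just wraps the SDK's get_batches() generator with timing logs, and faulthandler periodically dumps every thread's stack so a hang still leaves a trace of where it is stuck.

```python
# Sketch of timing instrumentation for our custom stream class: log how long
# each batch takes to become ready, and dump all thread stacks every 10 minutes
# so a stalled process shows where it is spending time.
import faulthandler
import time

from singer_sdk import SQLStream

faulthandler.dump_traceback_later(timeout=600, repeat=True)


class OracleStream(SQLStream):  # illustrative name for our custom stream
    # connector_class and other required attributes omitted for brevity

    def get_batches(self, batch_config, context=None):
        last = time.monotonic()
        for i, batch in enumerate(super().get_batches(batch_config, context), start=1):
            now = time.monotonic()
            self.logger.info("Batch %d ready after %.1fs", i, now - last)
            last = now
            yield batch
```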
v
I'd try querying that column yourself with your own query engine; maybe something is up? Make sure that it works when you pull it, then it's just a matter of narrowing it down. This is a weird one for sure.
DBeaver or something?
My guess is the data has something weird about it that's causing issues; the problem is the issue could be in a bunch of places
(Oracle DB itself, the SQL driver, network/OS, the tap itself, etc.). The unfortunate part of that tap is it isn't super clean (not SDK based), but it's still a hard problem regardless. Debugging options should be better.
a
I thought the same as well, but as mentioned above I tried many different columns with different types (date, string, number, etc.) and with different contents, and nothing changed. The last set of columns (320-370) is pretty much exclusively NULL other than the two final auditing columns. We are using our own custom tap for Oracle built directly from the SDK. Our custom tap has virtually no overrides beyond the basics like
get_sqlalchemy_url
and
create_engine
for the SQLConnector. (I queried the data directly and it returned fine. I even exported the table as a whole from Oracle SQL Developer and DataGrip.)
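For completeness, the extent of our overrides is essentially this (a sketch; the config keys are placeholders and the oracle+oracledb URL format is just what we happen to use):

```python
# Sketch of our only customizations: the connection URL and engine construction.
import sqlalchemy
from singer_sdk import SQLConnector


class OracleConnector(SQLConnector):  # illustrative name for our custom connector
    def get_sqlalchemy_url(self, config: dict) -> str:
        # Build the SQLAlchemy URL from tap config (keys here are placeholders).
        return (
            f"oracle+oracledb://{config['user']}:{config['password']}"
            f"@{config['host']}:{config['port']}/?service_name={config['service_name']}"
        )

    def create_engine(self) -> sqlalchemy.engine.Engine:
        # Only here so we can pass driver-specific engine options; otherwise the
        # base implementation would be fine.
        return sqlalchemy.create_engine(
            self.get_sqlalchemy_url(self.config),
            pool_pre_ping=True,
        )
```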
I would think that if it were a content issue, no batch size would solve the problem, but my mental logic could be off.
You are right though, it could be any number of things that are hard to debug through a Slack message chain. At least dropping the batch size allowed me to continue forward.