# troubleshooting
l
Wondering if anyone has seen anything like this:
• i’m using the most recent version of meltano, and the most recent (as of a few days ago) versions of the postgres tap (transferwise variant) and bigquery loader (z3z1ma variant), with incremental replication on a ~260M row table
• i’m using the GCS staging functionality of the bigquery loader and denormalization, otherwise default settings (this is perhaps part of my problem: the default batch size is 500, which produces a ton of small-ish jsonl files that are staged in GCS; see the config sketch at the end of this message)
• the loader has been hanging for about 14 hours. it appears to be stuck doing `self.logger.info("URIs: %s", ", ".join(self.uris))`, where `self.uris` is probably a few hundred thousand URIs. see https://github.com/z3z1ma/target-bigquery/blob/c4cc54f760032c8e9f2575adbe7d72b7a0a92d94/target_bigquery/gcs_stage.py#L185
• strace on the ec2 instance i’m running this on shows this for pid 18758:
```
ubuntu@ip-10-1-4-224:~/meltano-elt$ sudo strace -f -p 18758
strace: Process 18758 attached with 2 threads
[pid 18760] futex(0x7fbf31416630, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 18758] write(2, "jsonl.gz,gs://some_bucket/targ"..., 56696505^Cstrace: Process 18758 detached
 <detached ...>
strace: Process 18760 detached
```
• this seems to confirm that it is indeed stuck trying to write this giant string to fd 2 (i think 56696505 is the write size in bytes, i.e. about 54 MB, which is roughly in line with a few hundred thousand URIs). fd 2 is normally STDERR, but looking at `/proc/18758/fd` i see `2 -> 'pipe:[361373]'`, and looking at lsof i see:
```
ubuntu@ip-10-1-4-224:~/meltano-elt$ lsof | grep 361373
meltano    18753                           ubuntu   16r     FIFO               0,13       0t0     361373 pipe
meltano    18753  18759 meltano            ubuntu   16r     FIFO               0,13       0t0     361373 pipe
target-bi  18758                           ubuntu    2w     FIFO               0,13       0t0     361373 pipe
target-bi  18758  18760 jemalloc_          ubuntu    2w     FIFO               0,13       0t0     361373 pipe
gzip       19815                           ubuntu    2w     FIFO               0,13       0t0     361373 pipe
gzip       19816                           ubuntu    2w     FIFO               0,13       0t0     361373 pipe
gzip      575232                           ubuntu    2w     FIFO               0,13       0t0     361373 pipe
```
• which suggests that pid 18758 (among others) is writing to this fifo and pid 18753 is reading from it (or at least, would normally be trying to read from it), probably to write logs to stdout. 18753 is the parent process, see pstree below (hiding threads); a minimal sketch of this pipe-blocking situation follows it:
```
ubuntu@ip-10-1-4-224:~/meltano-elt$ pstree -p 18753 -T
meltano(18753)───target-bigquery(18758)─┬─gzip(19815)
                                        ├─gzip(19816)
                                        └─gzip(575232)
```
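to make the failure mode concrete, here’s a minimal sketch (an assumed reproduction, not the actual meltano/target-bigquery code) of what i think is happening: the child makes one enormous write to fd 2, fd 2 is a pipe, and the parent never drains it, so the write blocks as soon as the pipe buffer (~64 KiB on linux) fills:
```python
import subprocess
import sys
import time

# child: write a ~50 MB string to stderr in a single call, mimicking the giant
# "URIs: %s" log line
CHILD = 'import sys; sys.stderr.write("x" * 50_000_000); print("done", flush=True)'

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,  # parent holds the read end of the stderr pipe but never reads it
)

time.sleep(5)
# the child is still alive: its write() to the stderr pipe is blocked
print("child still running (blocked on pipe write):", proc.poll() is None)

# draining the read end "unclogs" it, much like reading the parent's pipe fd did below
proc.stderr.read()
proc.wait()
print("child exited with code", proc.returncode)
```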
• the gzip processes are doing nothing; it looks like they are just waiting to read input
• the parent process, 18753, is doing stuff according to strace, but i can’t really tell what. it seems to be opening/writing/reading the meltano sqlite db over and over, and there’s also activity related to sqlite itself (wal/shm files). the db doesn’t seem to be large or growing, just chilling at 176 kB
• i was eventually able to “unclog” this situation by manually running `head -n 1` on the parent process fd that pointed at the named pipe. according to strace, the child process 18758 then began attempting to hit the bigquery API, which was nice, but this eventually failed as per below:
```
2023-05-03T17:37:38.276785Z [error ] Loader failed
2023-05-03T17:37:38.279070Z [error ] Block run completed. block_type=ExtractLoadBlocks err=RunnerError('Loader failed') exit_codes={<PluginType.LOADERS: 'loaders…
```
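since the default batch size of 500 seems to be at the root of the huge URI list, the obvious mitigation is to raise it. here’s a rough sketch of what that could look like in meltano.yml; the setting name `batch_size` and the value shown are assumptions, so check the target-bigquery docs for the exact names:
```yaml
# meltano.yml (sketch; setting name and value are assumptions)
plugins:
  loaders:
    - name: target-bigquery
      variant: z3z1ma
      config:
        # default is 500, which produced hundreds of thousands of small
        # jsonl.gz files staged in GCS; a larger batch means far fewer URIs
        batch_size: 100000
```
fewer staged files would also keep that "URIs: %s" log line down to a manageable size.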
v
I don't have anything offhand to help, great writeup. I wonder if you could turn the log level up to warning for target-bigquery to confirm your theory that it's that log line (a monkey patch would work just as well). https://github.com/benfred/py-spy might also be able to triple-confirm your theory 🤷 Sorry I'm not more help, busy day today 😄
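for what it's worth, here's a rough sketch of the log-level idea, assuming the target uses stdlib logging and a logger named after the package; the logger names are guesses, so verify them inside the running process:
```python
import logging

# raise the level on the target's logger(s) so INFO records like the giant
# "URIs: %s" line are dropped before they are ever formatted or written to
# the pipe. the logger names here are guesses; inspect
# logging.Logger.manager.loggerDict in the running target to find the real ones.
for name in ("target_bigquery", "target-bigquery"):
    logging.getLogger(name).setLevel(logging.WARNING)
```
and for the py-spy route, something like `py-spy dump --pid 18758` should print the blocked Python stack without disturbing the process.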