# best-practices
j
very related to this parallel execution. I am using tap-mysql with binlog replication. The problem we are facing is that the MySQL database generates loads of binlog data, so the tap cannot loop through the binlog as fast as it is generated and it ends up being way too late. So I was wondering, does anyone know of any way to optimize that? Whether it is writing some threads for processing the event log and discarding events we don't care about in parallel, or any other ideas
d
One thing you can try is increasing the ELT buffer size (https://meltano.com/docs/settings.html#elt-buffer-size), so that the tap won't be sitting there blocked and waiting too long when the target is slow to work through the extracted records
j
thanks @douwe_maan! actually the problem we have is the opposite, the tap is too slow compared to the source database and is at 100% CPU usage
d
Ah, I thought the tap may be appearing too slow because of the target slowing it down, but if you've confirmed it's actually the tap at fault then the buffer size won't help 🙂
The only route would then be to optimize the implementation of the tap. I'd start by finding out where the bottleneck is
j
sure! from what I could see, when `tap-mysql` uses binlog replication it uses this library. Profiling the code, it looked like most of the CPU time is spent here: https://github.com/noplay/python-mysql-replication/blob/98a4ecf6dbfff842078da46de39ca463e24e08d2/pymysqlreplication/binlogstream.py#L430
if I just query using SQL, the throughput is much higher than what this library can do, so I don't think it's the network
but I'm not an expert
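For anyone wanting to reproduce that measurement, here is a minimal profiling sketch using `cProfile` from the standard library; `hot_loop` is a made-up placeholder, which you would replace with the tap's actual BinLogStreamReader iteration to profile the real thing:

```python
import cProfile
import io
import pstats

def hot_loop():
    # Stand-in for the tap's binlog fetch loop; swap in the real
    # BinLogStreamReader iteration to profile the actual tap.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()

# Print the functions with the most cumulative time, highest first;
# in the real tap this is where _read_packet showed up.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Sorting by cumulative time makes it easy to see whether the hot spot is in the tap's own code or down in the driver's packet reading.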
d
That `_read_packet` method is defined here: https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/connections.py#L683. It looks like a pretty low-level method that reads directly from the MySQL connection, so I don't think there's much to optimize there
k
This is an interesting problem. My 2c: If I understand correctly, the problem is that decoding binary files sequentially is cpu-bound and slow. So to speed things up we'd like to decode several files in parallel (one per thread). From the MySQL docs on the binlog format:
The final event is a log-rotation event that specifies the next binary log filename.
I don't see an easy way to get the last record from a remote log file, but you can retrieve from an offset. With the file size (`SHOW BINARY LOGS;`) and an understanding of the max row size, you could quite easily fetch the last few records to get the next binlog file. This could then be started in its own thread or process before the current one has finished processing, for n parallel threads 😅 It would take a fair amount of bashing at `tap-mysql`, but given the current implementation instantiates a single BinLogStreamReader and emits records as they arrive, doing a few files at a time in parallel should be much faster 🚀 In a parallel world, that raises the question of ordering - whilst it's probably OK to retrieve and decode binary files in parallel, you still want to preserve ordering and emit messages in order (after filtering etc.) rather than as they are encountered in each thread. This would have to be handled by in-memory buffering in the calling/parent thread, but there is no getting around that, I don't think. Hope that makes sense 😅
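A rough sketch of that decode-in-parallel, emit-in-order idea; `decode_binlog_file` is a hypothetical stand-in for a per-file BinLogStreamReader, not a real library call:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_binlog_file(filename):
    # Hypothetical: in the real tap this would wrap a
    # BinLogStreamReader pinned to a single log_file.
    return [f"{filename}:event{i}" for i in range(3)]

def decode_in_parallel(filenames, workers=4):
    """Decode several binlog files concurrently, but emit their
    events in file order so downstream ordering is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the in-memory
        # buffering the parent thread needs comes for free here.
        for events in pool.map(decode_binlog_file, filenames):
            yield from events

events = list(decode_in_parallel(["binlog.000001", "binlog.000002"]))
print(events[0])  # binlog.000001:event0
```

Because `map()` returns results in submission order, a file that decodes quickly still waits for its predecessors before being emitted, which is exactly the buffering trade-off described above.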
Actually, that assumes that having the binlog file as the unit of parallelism is desirable, but some binlog files might grow quite large before closing 🤔 It might actually be simpler to just get the current log file and position from the bookmark, the end file and position from this existing function, and split the resulting row index into n size chunks (one per thread) 😅 Ordering is still important, but retrieval becomes slightly easier.
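A toy sketch of that chunking step, assuming the start and end positions are already known from the bookmark; note that real binlog offsets would have to land on event boundaries, which this simplified version ignores:

```python
def chunk_positions(start, end, n):
    """Split the index range [start, end) into at most n roughly
    equal (chunk_start, chunk_end) pairs, one per worker thread.
    Real binlog reads would need chunk edges aligned to events."""
    size = -(-(end - start) // n)  # ceiling division
    chunks = []
    pos = start
    while pos < end:
        chunks.append((pos, min(pos + size, end)))
        pos += size
    return chunks

print(chunk_positions(0, 10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
```

Each worker would then seek to its `chunk_start` and decode until `chunk_end`, with the parent thread stitching results back together in order.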
@aaronsteers is database tap support on the horizon for the SDK? It would be great to separate these 'data retrieval concerns' from 'singer spec concerns' 😅
a
@ken_payne, yes, we actually removed it from the initial Tap launch but there's still a (now very stale) MR containing the archived code. SDK support for Database-type Streams (#74) · Issues · Meltano / Meltano SDK for Singer Taps and Targets · GitLab
Another community member also suggested "TabularStream" as a common base class for DatabaseStreams, but also useful for things like Parquet and CSV which are tabular but not SQL-based.
k
I wonder if binlogs break the SDK mould somewhat 🤔 Records for all streams are mixed in, so it isn't really possible to do `for stream in tap.streams` without decoding the binlog and bucketing records into their respective streams first...
A TabularStream base class would work for `FULL_TABLE` and `INCREMENTAL` as-is - those just rely on `select * from ... where ...`, which is 'stream-wise' 🙂
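A toy illustration of the read-once-then-bucket approach described above; the event dicts here are made up, standing in for decoded pymysqlreplication row events:

```python
from collections import defaultdict

# Hypothetical decoded binlog events; a real event would come from
# pymysqlreplication and carry schema/table plus row values.
events = [
    {"table": "orders", "row": {"id": 1}},
    {"table": "users", "row": {"id": 7}},
    {"table": "orders", "row": {"id": 2}},
]

def bucket_by_stream(events):
    """Read the mixed binlog once and bucket records per stream,
    preserving each record's order within its own stream."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event["table"]].append(event["row"])
    return buckets

buckets = bucket_by_stream(events)
print(sorted(buckets))  # ['orders', 'users']
```

This is why a `for stream in tap.streams` loop doesn't fit naturally: the decode pass has to happen once, up front, before any per-stream iteration can begin.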
a
I wonder if binlogs break the SDK mould somewhat 🤔
Yes, perhaps so. Said another way, the `get_records()` equivalent for binlog data retrieval might always be a custom overloaded method.
For SQL Server, I think the implementation is just a very messy SQL statement, so the developer can just override how the SQL is formulated, but then the sync operation itself might be almost the same as a typical INCREMENTAL stream: `SELECT .. FROM ... WHERE ...`
I'm a little out of my depth with the mysql binlog implementation, except that I expect it's probably a custom method.
@ken_payne and I were chatting in DM on this a bit more, and he's convinced me the implementation is trickier than I'd presented. Because multiple streams' data can be in the same binlog stream, we'd likely want to read once and emit records individually... and this might require new base classes in the SDK.
j
just seeing this now. thanks everyone for your help! for now I'll just try to see if we can find a faster server to run this on. I'll check activity on this to see if I can collaborate somehow