# best-practices
j
very related to this parallel execution. I am using tap-mysql with binlog replication. The problem we are facing is that the MySQL database generates loads of binlog data, so the tap cannot loop through the binlog as fast as it is generated and it ends up being way too late. So I was wondering, does anyone know of any way to optimize that? Whether it is writing some threads for processing the event log and discarding events we don't care about in parallel, or any other ideas
d
One thing you can try is increasing the ELT buffer size (https://meltano.com/docs/settings.html#elt-buffer-size), so that the tap won't be sitting there blocked and waiting too long when the target is slow to work through the extracted records
j
thanks @douwe_maan! actually the problem we have is the opposite, the tap is too slow compared to the source database and is at 100% CPU usage
d
Ah, I thought the tap may be appearing too slow because of the target slowing it down, but if you've confirmed it's actually the tap at fault then the buffer size won't help 🙂
The only route would then be to optimize the implementation of the tap. I'd start by finding out where the bottleneck is
j
sure! from what I could see, when `tap-mysql` uses binlog replication it uses this library. Profiling the code, it looked like most of the CPU time is spent here: https://github.com/noplay/python-mysql-replication/blob/98a4ecf6dbfff842078da46de39ca463e24e08d2/pymysqlreplication/binlogstream.py#L430
if I just query using SQL, the throughput is much higher than what this library can do, so I don't think it's the network
but I'm not an expert
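For anyone wanting to reproduce that measurement, here is a minimal profiling sketch using `cProfile` from the standard library; `hot_loop` is a made-up placeholder, which you would replace with the tap's actual BinLogStreamReader iteration to profile the real thing:

```python
import cProfile
import io
import pstats

def hot_loop():
    # Stand-in for the tap's binlog fetch loop; swap in the real
    # BinLogStreamReader iteration to profile the actual tap.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()

# Print the functions with the most cumulative time, highest first;
# in the real tap this is where _read_packet showed up.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Sorting by cumulative time makes it easy to see whether the hot spot is in the tap's own code or down in the driver's packet reading.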
d
That `_read_packet` method is defined here: https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/connections.py#L683. It looks like a pretty low-level method that reads directly from the MySQL connection, so I don't think there's much to optimize there
k
This is an interesting problem. My 2c: If I understand correctly, the problem is that decoding binary files sequentially is cpu-bound and slow. So to speed things up we'd like to decode several files in parallel (one per thread). From the MySQL docs on the binlog format:
The final event is a log-rotation event that specifies the next binary log filename.
I don't see an easy way to get the last record from a remote log file, but you can retrieve from an offset. With the file size (`SHOW BINARY LOGS;`) and an understanding of the max row size, you could quite easily fetch the last few records to get the next binlog file. This could then be started in its own thread or process before the current one has finished processing, for n parallel threads 😅 It would take a fair amount of bashing at `tap-mysql`, but given the current implementation instantiates a single BinLogStreamReader and emits records as they arrive, doing a few files at a time in parallel should be much faster 🚀 In a parallel world, that raises the question of ordering - whilst it's probably OK to retrieve and decode binary files in parallel, you still want to preserve ordering and emit messages in order (after filtering etc.) rather than as they are encountered in each thread. This would have to be handled by in-memory buffering in the calling/parent thread, but there is no getting around that, I don't think. Hope that makes sense 😅
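A rough sketch of that decode-in-parallel, emit-in-order idea; `decode_binlog_file` is a hypothetical stand-in for a per-file BinLogStreamReader, not a real library call:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_binlog_file(filename):
    # Hypothetical: in the real tap this would wrap a
    # BinLogStreamReader pinned to a single log_file.
    return [f"{filename}:event{i}" for i in range(3)]

def decode_in_parallel(filenames, workers=4):
    """Decode several binlog files concurrently, but emit their
    events in file order so downstream ordering is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the in-memory
        # buffering the parent thread needs comes for free here.
        for events in pool.map(decode_binlog_file, filenames):
            yield from events

events = list(decode_in_parallel(["binlog.000001", "binlog.000002"]))
print(events[0])  # binlog.000001:event0
```

Because `map()` returns results in submission order, a file that decodes quickly still waits for its predecessors before being emitted, which is exactly the buffering trade-off described above.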
Actually, that assumes that having the binlog file as the unit of parallelism is desirable, but some binlog files might grow quite large before closing 🤔 It might actually be simpler to just get the current log file and position from the bookmark, the end file and position from this existing function, and split the resulting row index into n size chunks (one per thread) 😅 Ordering is still important, but retrieval becomes slightly easier.
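A toy sketch of that chunking step, assuming the start and end positions are already known from the bookmark; note that real binlog offsets would have to land on event boundaries, which this simplified version ignores:

```python
def chunk_positions(start, end, n):
    """Split the index range [start, end) into at most n roughly
    equal (chunk_start, chunk_end) pairs, one per worker thread.
    Real binlog reads would need chunk edges aligned to events."""
    size = -(-(end - start) // n)  # ceiling division
    chunks = []
    pos = start
    while pos < end:
        chunks.append((pos, min(pos + size, end)))
        pos += size
    return chunks

print(chunk_positions(0, 10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
```

Each worker would then seek to its `chunk_start` and decode until `chunk_end`, with the parent thread stitching results back together in order.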
@aaronsteers is database tap support on the horizon for the SDK? It would be great to separate these 'data retrieval concerns' from 'singer spec concerns' 😅
a
@ken_payne, yes, we actually removed it from the initial Tap launch but there's still a (now very stale) MR containing the archived code. SDK support for Database-type Streams (#74) · Issues · Meltano / Meltano SDK for Singer Taps and Targets · GitLab
Another community member also suggested "TabularStream" as a common base class for DatabaseStreams, but also useful for things like Parquet and CSV which are tabular but not SQL-based.
k
I wonder if binlogs break the SDK mould somewhat 🤔 Records for all streams are mixed in, so it isn't really possible to do `for stream in tap.streams` without decoding the binlog and bucketing records into their respective streams first...
A TabularStream base class would work for `FULL_TABLE` and `INCREMENTAL` as-is - those just rely on `select * from ... where ...`, which is 'stream-wise' 🙂
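A toy illustration of the read-once-then-bucket approach described above; the event dicts here are made up, standing in for decoded pymysqlreplication row events:

```python
from collections import defaultdict

# Hypothetical decoded binlog events; a real event would come from
# pymysqlreplication and carry schema/table plus row values.
events = [
    {"table": "orders", "row": {"id": 1}},
    {"table": "users", "row": {"id": 7}},
    {"table": "orders", "row": {"id": 2}},
]

def bucket_by_stream(events):
    """Read the mixed binlog once and bucket records per stream,
    preserving each record's order within its own stream."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event["table"]].append(event["row"])
    return buckets

buckets = bucket_by_stream(events)
print(sorted(buckets))  # ['orders', 'users']
```

This is why a `for stream in tap.streams` loop doesn't fit naturally: the decode pass has to happen once, up front, before any per-stream iteration can begin.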
a
I wonder if binlogs break the SDK mould somewhat 🤔
Yes, perhaps so. Said another way, the `get_records()` equivalent for binlog data retrieval might always be a custom overloaded method.
For SQL Server, I think the implementation is just a very messy SQL statement, so the developer can just override how the SQL is formulated, but then the sync operation itself might be almost the same as a typical INCREMENTAL stream: `SELECT .. FROM ... WHERE ...`
I'm a little out of my depth with the mysql binlog implementation, except that I expect it's probably a custom method.
@ken_payne and I were chatting in DM on this a bit more, and he's convinced me the implementation is trickier than I'd presented. Because multiple streams' data can be in the same binlog stream, we'd likely want to read once and emit records individually... and this might require new base classes in the SDK.
j
just seeing this now. thanks everyone for your help! for now I'll just try to see if we can find a faster server to run this on. I'll check activity on this to see if I can collaborate somehow