jose_riego_valenzuela
07/08/2021, 3:08 PM
douwe_maan
07/08/2021, 3:10 PM
jose_riego_valenzuela
07/08/2021, 3:17 PM
douwe_maan
07/08/2021, 3:20 PM
douwe_maan
07/08/2021, 3:21 PM
jose_riego_valenzuela
07/08/2021, 3:36 PM
When tap-mysql uses binlog replication, it uses this library. Profiling the code, it looked like most of the CPU time is spent here:
https://github.com/noplay/python-mysql-replication/blob/98a4ecf6dbfff842078da46de39ca463e24e08d2/pymysqlreplication/binlogstream.py#L430
jose_riego_valenzuela
07/08/2021, 3:36 PM
jose_riego_valenzuela
07/08/2021, 3:37 PM
douwe_maan
07/08/2021, 3:55 PM
_read_packet method is defined here: https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/connections.py#L683. It looks like a pretty low-level method that reads directly from the MySQL connection, so I don't think there's much to optimize there.
ken_payne
07/08/2021, 6:14 PM
The final event is a log-rotation event that specifies the next binary log filename.
I don't see an easy way to get the last record from a remote log file, but you can retrieve from an offset. With the file size (SHOW BINARY LOGS;) and an understanding of the max row size, you could quite easily fetch the last few records to get the next binlog file. This could then be started in its own thread or process before the current one has started processing, for n parallel threads 😅
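As a rough sketch of the offset arithmetic only: given a file size as reported by SHOW BINARY LOGS and an assumed maximum row size, you could pick a starting position near the end of the file. The function name and the rows_to_cover parameter are made up for illustration. The result is not guaranteed to land on an event boundary, so a real reader would still need to find a valid event start from there.

```python
# Binlog files begin with a 4-byte magic number, so the first event starts at position 4.
BINLOG_MAGIC_HEADER_BYTES = 4


def tail_start_offset(file_size: int, max_row_size: int, rows_to_cover: int = 3) -> int:
    """Return a byte offset near the end of a binlog file to start reading from.

    file_size:     size in bytes, e.g. the File_size column of SHOW BINARY LOGS
    max_row_size:  assumed upper bound on the size of one row event, in bytes
    rows_to_cover: hypothetical safety margin, in number of max-size rows
    """
    if file_size < BINLOG_MAGIC_HEADER_BYTES:
        raise ValueError("file_size is smaller than the binlog header")
    candidate = file_size - max_row_size * rows_to_cover
    # Never return a position inside the magic header.
    return max(BINLOG_MAGIC_HEADER_BYTES, candidate)
```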
It would take a fair amount of bashing at tap-mysql, but given the current implementation instantiates a single BinLogStreamReader and emits records as they arrive, doing a few files at a time in parallel should be much faster 🚀
In a parallel world, that raises the question of ordering - whilst it's probably OK to retrieve and decode binary log files in parallel, you still want to preserve ordering and emit messages in their original order (after filtering etc.) rather than as they are encountered in each thread. This would have to be handled by in-memory buffering in the calling/parent thread, but I don't think there is any getting around that.
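A minimal sketch of that buffering, assuming each binlog file can be decoded independently by some decode_file callable (hypothetical here; in tap-mysql it would presumably wrap a BinLogStreamReader). The files are decoded in worker threads, but the parent only emits a file's records once every earlier file has been emitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def emit_in_order(
    binlog_files: Sequence[str],
    decode_file: Callable[[str], List[dict]],
    max_workers: int = 4,
) -> Iterator[dict]:
    """Decode binlog files in parallel, but yield records in file order.

    decode_file is a hypothetical callable that returns the already-filtered
    records of one binlog file, in the order they appear in that file.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every file up front; results are buffered in memory by the futures.
        futures = [pool.submit(decode_file, name) for name in binlog_files]
        # Iterate in submission order, not completion order, so the output
        # order matches the order of binlog_files.
        for future in futures:
            yield from future.result()
```

The trade-off is memory: a file that finishes early sits fully decoded in its future until all files before it have been emitted.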
Hope that makes sense 😅
ken_payne
07/08/2021, 6:29 PM
ken_payne
07/08/2021, 6:32 PM
aaronsteers
07/08/2021, 6:34 PM
aaronsteers
07/08/2021, 6:35 PM
ken_payne
07/08/2021, 6:43 PM
for stream in tap.streams without decoding the binlog and bucketing records into their respective streams first...
ken_payne
07/08/2021, 6:45 PM
FULL_TABLE and INCREMENTAL as is - those just rely on select * from where ... which is 'stream-wise' 🙂
aaronsteers
07/08/2021, 6:47 PM
I wonder if binlogs break the SDK mould somewhat
Yes, perhaps so. Said another way, the get_records() equivalent for binlog data retrieval might always be a custom overridden method.
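To make the "bucket first, then iterate stream-wise" idea concrete, here is a small sketch. BinlogStream is a hypothetical stand-in, not the SDK's real base class, and the event shape (a dict with "table" and "row" keys) is an assumption for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def bucket_by_stream(events: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group decoded binlog events by table name, preserving arrival order."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        buckets[event["table"]].append(event["row"])
    return dict(buckets)


class BinlogStream:
    """Hypothetical stream whose get_records() reads from pre-bucketed events."""

    def __init__(self, name: str, buckets: Dict[str, List[dict]]) -> None:
        self.name = name
        self._buckets = buckets

    def get_records(self) -> Iterator[dict]:
        # Custom override: no per-stream query, only this stream's bucket.
        yield from self._buckets.get(self.name, [])
```

With the buckets built once from a single pass over the binlog, a loop over the tap's streams can call get_records() on each without decoding the log again.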
aaronsteers
07/08/2021, 6:48 PM
SELECT ... FROM ... WHERE ...
aaronsteers
07/08/2021, 6:49 PM
aaronsteers
07/08/2021, 7:25 PM
jose_riego_valenzuela
07/09/2021, 7:26 AM