farid
05/14/2023, 3:14 PMmeltano --environment prod elt tap-mysql target-duckdb
2023-05-14T14:44:06.306455Z [info ] Environment 'prod' is active
2023-05-14T14:44:10.693964Z [info ] Running extract & load... name=meltano run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb
2023-05-14T14:44:11.317200Z [info ] Reading state from Local Filesystem
2023-05-14T14:44:11.329636Z [info ] No state found for 2023-05-14T144406--tap-mysql--target-duckdb.
2023-05-14T14:44:11.329888Z [warning ] No state was found, complete import.
2023-05-14T14:44:12.290767Z [info ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Attempting SSL connection cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.315352Z [info ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Server Parameters: version: 5.7.12-log, wait_timeout: 28800, innodb_lock_wait_timeout: 3600, max_allowed_packet: 4194304, interactive_timeout: 28800 cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.317000Z [info ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Server SSL Parameters(blank means SSL is not active): [ssl_version: TLSv1.2], [ssl_cipher: ECDHE-RSA-AES256-GCM-SHA384] cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.435909Z [info ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Beginning sync for InnoDB table base_schema.my_table cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.436729Z [info ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Stream base_schema-my_table is using incremental replication cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
...
4518--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:53:27.651602Z [info ] time=2023-05-14 14:53:27 name=singer level=INFO message=METRIC: {"type": "counter", "metric": "record_count", "value": 276018, "tags": {"database": "base_schema", "table": "my_table"}} cmd_type=extractor name=tap-mysql run_id=5274c17b-d65e-4266-8368-152634ec8304 state_id=2023-05-14T144518--tap-mysql--target-duckdb stdio=stderr
...
stop wrting logs just hanged or taking long
meltano.yml
version: 1
default_environment: dev
project_id: b99aa32d-65ef-4cae-a5bf-a7bdd94f5ab3
environments:
- name: dev
- name: prod
config:
plugins:
extractors:
- name: tap-mysql
config:
host: hostname
user: user
ssl: true
select:
- '!!(base_schema-.*).*'
- base_schema-my_table.*
loaders:
- name: target-duckdb
config:
filepath: ${MELTANO_PROJECT_ROOT}/output/warehouse-prod.duckdb
default_target_schema: prod_raw_data
plugins:
extractors:
- name: tap-mysql
variant: transferwise
pip_url: pipelinewise-tap-mysql
config:
batch_size: 1000000
metadata:
'*':
replication-method: INCREMENTAL
replication-key: updated
loaders:
- name: target-duckdb
variant: jwills
pip_url: target-duckdb~=0.4 duckdb==0.7.1
config:
default_target_schema: dev_raw_data
batch_size_rows: 1000000
parallelism: '4'
elt:
buffer_size: 1048576000
luke_rodgers
05/15/2023, 1:25 PMbatch_size
option for the mysql tap https://github.com/transferwise/pipelinewise-tap-mysql
• also don’t see a parallelism option for the duckdb loader https://github.com/jwills/target-duckdb
• why is the elt buffer so large? that strikes as way larger than should be necessary here
• what do CPU and RAM usage look like during this process? I had a similar issue where the tap was working fine but the loader had a bug which prevented it from actually processing records so they just got stored in memory, causing RAM usage to creep up until the machine became unresponsive. my best guess as to what’s going on here is something similar.
• you could try configuring debug logging for meltano and see if that provides any insight
• i would also try using default batch_size_rows
for the loader and see what happens (maybe you’ve already tried this)farid
05/15/2023, 2:11 PMmeltano config tap-mysql set batch_size 300000
for target-duckdb I noticed also the output size is not reasonable
It seems target-duckdb need improvements or I am not able to use it correctly
as of now I am thinking to switch to target-postgres which is much better
also postgress tables can be attached in duckdb for any transformation: not sure how this will goes though
There is a replication method in singer fast sync that could help for first time load but seems not implemented in meltanofarid
05/15/2023, 2:22 PMPipelineWise detects automatically when Fast Sync gives better performance than the singer components and uses it automatically whenever it's possible.
luke_rodgers
05/15/2023, 3:10 PMalexander_butler
05/22/2023, 11:31 AM