# troubleshooting
f
hi all, I am trying to load a table of about 500MB (9,366,932 records) but it is taking forever. Is there a way to enhance and speed up the extract/load? p.s. I also tried target-jsonl but it also took very long. Details below, thanks
meltano --environment prod  elt tap-mysql target-duckdb
2023-05-14T14:44:06.306455Z [info     ] Environment 'prod' is active
2023-05-14T14:44:10.693964Z [info     ] Running extract & load...      name=meltano run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb
2023-05-14T14:44:11.317200Z [info     ] Reading state from Local Filesystem
2023-05-14T14:44:11.329636Z [info     ] No state found for 2023-05-14T144406--tap-mysql--target-duckdb.
2023-05-14T14:44:11.329888Z [warning  ] No state was found, complete import.
2023-05-14T14:44:12.290767Z [info     ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Attempting SSL connection cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.315352Z [info     ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Server Parameters: version: 5.7.12-log, wait_timeout: 28800, innodb_lock_wait_timeout: 3600, max_allowed_packet: 4194304, interactive_timeout: 28800 cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.317000Z [info     ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Server SSL Parameters(blank means SSL is not active): [ssl_version: TLSv1.2], [ssl_cipher: ECDHE-RSA-AES256-GCM-SHA384] cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.435909Z [info     ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Beginning sync for InnoDB table base_schema.my_table cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:44:12.436729Z [info     ] time=2023-05-14 14:44:12 name=tap_mysql level=INFO message=Stream base_schema-my_table is using incremental replication cmd_type=extractor name=tap-mysql run_id=c2f9e1b6-8080-414a-b3b8-01fa9cc90af9 state_id=2023-05-14T144406--tap-mysql--target-duckdb stdio=stderr
...
4518--tap-mysql--target-duckdb stdio=stderr
2023-05-14T14:53:27.651602Z [info     ] time=2023-05-14 14:53:27 name=singer level=INFO message=METRIC: {"type": "counter", "metric": "record_count", "value": 276018, "tags": {"database": "base_schema", "table": "my_table"}} cmd_type=extractor name=tap-mysql run_id=5274c17b-d65e-4266-8368-152634ec8304 state_id=2023-05-14T144518--tap-mysql--target-duckdb stdio=stderr
...
then it stops writing logs and just hangs, or takes very long




meltano.yml 
version: 1
default_environment: dev
project_id: b99aa32d-65ef-4cae-a5bf-a7bdd94f5ab3
environments:
- name: dev
- name: prod
  config:
    plugins:
      extractors:
      - name: tap-mysql
        config:
          host: hostname
          user: user
          ssl: true
        select:
        - '!!(base_schema-.*).*'
        - base_schema-my_table.*
      loaders:
      - name: target-duckdb
        config:
          filepath: ${MELTANO_PROJECT_ROOT}/output/warehouse-prod.duckdb
          default_target_schema: prod_raw_data
plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
    config:
      batch_size: 1000000
    metadata:
      '*':
        replication-method: INCREMENTAL
        replication-key: updated
  loaders:
  - name: target-duckdb
    variant: jwills
    pip_url: target-duckdb~=0.4 duckdb==0.7.1
    config:
      default_target_schema: dev_raw_data
      batch_size_rows: 1000000
      parallelism: '4'
elt:
  buffer_size: 1048576000
l
i haven’t used these particular extractors/loaders, but have you tried just using the defaults?
• i don’t see a batch_size option for the mysql tap https://github.com/transferwise/pipelinewise-tap-mysql
• also don’t see a parallelism option for the duckdb loader https://github.com/jwills/target-duckdb
• why is the elt buffer so large? that strikes me as way larger than should be necessary here
• what do CPU and RAM usage look like during this process? I had a similar issue where the tap was working fine but the loader had a bug which prevented it from actually processing records, so they just got stored in memory, causing RAM usage to creep up until the machine became unresponsive. my best guess as to what’s going on here is something similar.
• you could try configuring debug logging for meltano and see if that provides any insight (rough sketch below)
• i would also try using the default batch_size_rows for the loader and see what happens (maybe you’ve already tried this)
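for what it’s worth, a minimal sketch of what reverting to the defaults and enabling debug logging could look like with the Meltano CLI; it assumes those extra options live in your meltano.yml or were set with meltano config, and the oversized elt.buffer_size would still need to be edited down in meltano.yml by hand:
# drop the options the tap/target docs don't list, so the plugins fall back to their defaults
meltano config tap-mysql unset batch_size
meltano config target-duckdb unset batch_size_rows
meltano config target-duckdb unset parallelism
# re-run the pipeline with debug logging to see where it stalls
meltano --log-level=debug --environment prod elt tap-mysql target-duckdb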
f
thanks @luke_rodgers, I did try the defaults; it works for small tables, but for larger tables it takes very long. As for batch_size, it is not documented for tap-mysql, but the command below works and takes effect on the next run:
meltano config tap-mysql set batch_size 300000
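to double-check it stuck before the next run, you can list the tap's resolved settings (quick sketch):
# show the settings Meltano will pass to tap-mysql, including batch_size
meltano config tap-mysql list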
For target-duckdb I also noticed the output file size is not reasonable. It seems target-duckdb needs improvements, or I am not able to use it correctly. For now I am thinking of switching to target-postgres, which is much better; Postgres tables can also be attached in DuckDB for any transformation (not sure how that will go though). There is a replication method in Singer, fast sync, that could help for the first-time load, but it seems it is not implemented in Meltano.
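a rough sketch of that plan (the loader setting names depend on the target-postgres variant, the DuckDB side assumes the postgres_scanner extension, and hosts/credentials here are placeholders):
# add and configure a Postgres loader, then point the pipeline at it
meltano add loader target-postgres
meltano config target-postgres set host localhost
meltano config target-postgres set dbname warehouse
meltano config target-postgres set default_target_schema prod_raw_data
meltano --environment prod elt tap-mysql target-postgres
# afterwards, attach the Postgres tables from an ad-hoc DuckDB session for transformations
duckdb -c "INSTALL postgres_scanner; LOAD postgres_scanner; CALL postgres_attach('dbname=warehouse host=localhost user=meltano');"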
just a small update: fast sync might be implemented internally for full loads, not sure though, see below: https://github.com/transferwise/pipelinewise/tree/e0a8ee15c05f019916f5400158b81de72cf33dc8/pipelinewise/fastsync
PipelineWise detects automatically when Fast Sync gives better performance than the singer components and uses it automatically whenever it's possible.
l
interesting - for 9 million rows, based solely on my experience with large-ish tables but different extractors/loaders (postgres and bigquery), i wouldn’t be surprised to see it take more than an hour, maybe even several hours (obviously depending significantly on network speed between meltano and the source/target)
a
You should write to Parquet, not directly to DuckDB, and have DuckDB read the Parquet files afterwards. That is more standard.
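a minimal sketch of that approach, assuming a Parquet-writing loader such as target-parquet from Meltano Hub, with illustrative paths and table names:
# write the extracted records to Parquet files instead of loading DuckDB directly
meltano add loader target-parquet
meltano --environment prod elt tap-mysql target-parquet
# then let DuckDB read (or import) the Parquet output
duckdb output/warehouse-prod.duckdb -c "CREATE OR REPLACE TABLE my_table AS SELECT * FROM read_parquet('output/*.parquet');"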