mark_johnston
06/02/2023, 11:05 AMfunctools
and applying @functools.cache
and @functools.cached_property
on SQLStream.get_selected_schema
and SQLStream.schema
respectively. Earlier I did some profiling and saw that the schema property was being called for each row of my table, slowing the process down quite considerably over a 250,000 ish row table.
The results based on enabling caching on this property and method are:
### Control - SDK as is:
2023-06-02 11:04:53,606 | INFO | tap-db2 | Beginning incremental sync of 'TAPDB2-TAP_TEST_500M'...
2023-06-02 11:04:53,606 | INFO | tap-db2 | Tap has custom mapper. Using 1 provided map(s).
2023-06-02 11:05:53,607 | INFO | singer_sdk.metrics | METRIC: {"type": "counter", "metric": "record_count", "value": 211043, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
2023-06-02 11:06:04,117 | INFO | singer_sdk.metrics | METRIC: {"type": "timer", "metric": "sync_duration", "value": 70.51015210151672, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}, "status": "succeeded"}}
2023-06-02 11:06:04,118 | INFO | singer_sdk.metrics | METRIC: {"type": "counter", "metric": "record_count", "value": 34027, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
### Caching on schema data party
023-06-02 11:06:40,672 | INFO | tap-db2 | Beginning incremental sync of 'TAPDB2-TAP_TEST_500M'...
2023-06-02 11:06:40,672 | INFO | tap-db2 | Tap has custom mapper. Using 1 provided map(s).
2023-06-02 11:07:11,594 | INFO | singer_sdk.metrics | METRIC: {"type": "timer", "metric": "sync_duration", "value": 30.920732259750366, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}, "status": "succeeded"}}
2023-06-02 11:07:11,594 | INFO | singer_sdk.metrics | METRIC: {"type": "counter", "metric": "record_count", "value": 245070, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
### Caching on schema and get_selected_schema
2023-06-02 11:07:36,851 | INFO | tap-db2 | Beginning incremental sync of 'TAPDB2-TAP_TEST_500M'...
2023-06-02 11:07:36,852 | INFO | tap-db2 | Tap has custom mapper. Using 1 provided map(s).
2023-06-02 11:08:18,430 | INFO | singer_sdk.metrics | METRIC: {"type": "timer", "metric": "sync_duration", "value": 41.577131271362305, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}, "status": "succeeded"}}
2023-06-02 11:08:18,430 | INFO | singer_sdk.metrics | METRIC: {"type": "counter", "metric": "record_count", "value": 245070, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
These aren't compatible with Python <= 3.7 - I took a bit of inspiration by looking at this link:
https://stackoverflow.com/questions/7388258/replace-property-for-perfomance-gainmark_johnston
06/02/2023, 11:18 AMvisch
06/02/2023, 12:40 PMjan_soubusta
06/02/2023, 12:56 PMtaylor
06/02/2023, 3:29 PMmark_johnston
06/02/2023, 7:38 PMschema
property being set to @property
and `@functools.cached_property`:
### Profiling script and command:
# ~/repos/cprofile_db2_sdk.py
#!/usr/bin/env python3.8
from tap_db2.tap import Tapdb2
Tapdb2.cli()
poetry run python -m cProfile -o ~/repos/concept_sdk_cache.log ~/repos/cprofile_db2_sdk.py --config config.json --catalog concept_catalog.json >/dev/null^C
### Getting profiling results (sorted by CUMULATIVE)
import os
import pstats
from pstats import SortKey
p = pstats.Stats(f"{os.getenv('HOME')}/repos/concept_sdk_cache.log")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(100)
### @property
- current
ncalls tottime percall cumtime percall filename:lineno(function)
1063/1 0.024 0.000 213.664 213.664 {built-in method builtins.exec}
1 0.000 0.000 213.664 213.664 cprofile_db2_sdk.py:2(<module>)
1 0.000 0.000 212.933 212.933 core.py:1128(__call__)
1 0.000 0.000 212.933 212.933 core.py:987(main)
1 0.000 0.000 212.932 212.932 core.py:1393(invoke)
1 0.000 0.000 212.932 212.932 core.py:709(invoke)
1 0.000 0.000 212.932 212.932 tap_base.py:442(cli)
1 0.000 0.000 212.922 212.922 tap_base.py:402(sync_all)
1 0.320 0.320 212.921 212.921 core.py:1159(sync)
532966 3.094 0.000 212.600 0.000 core.py:1041(_sync_records)
532965 1.313 0.000 188.601 0.000 core.py:856(_write_record_message)
1065930 4.237 0.000 163.886 0.000 core.py:823(_generate_record_messages)
1065933 3.114 0.000 122.297 0.000 sql.py:77(schema)
1065937 1.353 0.000 85.552 0.000 sql.py:48(_singer_catalog_entry)
1065938 5.448 0.000 84.061 0.000 catalog.py:298(from_dict)
1065938 5.230 0.000 49.470 0.000 catalog.py:105(from_iterable)
6395628 7.541 0.000 44.239 0.000 catalog.py:56(from_dict)
6395610/1065935 22.225 0.000 33.484 0.000 schema.py:71(to_dict)
1065935 1.852 0.000 27.244 0.000 schema.py:80(<dictcomp>)
6395628/1065938 14.069 0.000 26.782 0.000 schema.py:91(from_dict)
533020 1.076 0.000 23.410 0.000 messages.py:193(write_message)
1065938 3.374 0.000 21.729 0.000 schema.py:111(<dictcomp>)
532965 0.581 0.000 20.785 0.000 _typing.py:364(conform_record_data_types)
532965 3.594 0.000 20.090 0.000 _typing.py:384(_conform_record_data_types)
6395628 11.298 0.000 18.009 0.000 catalog.py:67(<dictcomp>)
533020 0.834 0.000 17.999 0.000 messages.py:181(format_message)
533020 2.537 0.000 16.581 0.000 __init__.py:276(dumps)
6395689 9.984 0.000 16.308 0.000 dataclasses.py:1022(fields)
2664825 3.239 0.000 15.774 0.000 _typing.py:466(_conform_primitive_property)
174286139 15.292 0.000 15.292 0.000 {method 'get' of 'dict' objects}
533020 1.156 0.000 12.858 0.000 encoder.py:277(encode)
532966 2.557 0.000 12.684 0.000 client.py:107(get_records)
### @functools.cached_property
``` ncalls tottime percall cumtime percall filename:lineno(function)
1066/1 0.019 0.000 58.711 58.711 {built-in method builtins.exec}
1 0.000 0.000 58.711 58.711 cprofile_db2_sdk.py:2(<module>)
1 0.000 0.000 57.101 57.101 core.py:1128(call)
1 0.000 0.000 57.101 57.101 core.py:987(main)
1 0.000 0.000 57.100 57.100 core.py:1393(invoke)
1 0.000 0.000 57.100 57.100 core.py:709(invoke)
1 0.000 0.000 57.100 57.100 tap_base.py:445(cli)
1 0.000 0.000 57.091 57.091 tap_base.py:405(sync_all)
1 0.248 0.248 57.090 57.090 core.py:1134(sync)
532966 2.311 0.000 56.841 0.000 core.py:101…mark_johnston
06/02/2023, 8:31 PM