Possibly relevant to <@U06BXCNMFBR>’s message, I'd...
# contributing
m
Possibly relevant to @jan_soubusta’s message, I'd already started typing it up in a separate thread. I have just been toying with
functools
and applying
@functools.cache
and
@functools.cached_property
on
SQLStream.get_selected_schema
and
SQLStream.schema
respectively. Earlier I did some profiling and saw that the schema property was being called for each row of my table, slowing the process down quite considerably over a 250,000 ish row table. The results based on enabling caching on this property and method are: ### Control - SDK as is:
Copy code
2023-06-02 11:04:53,606 | INFO     | tap-db2              | Beginning incremental sync of 'TAPDB2-TAP_TEST_500M'...
2023-06-02 11:04:53,606 | INFO     | tap-db2              | Tap has custom mapper. Using 1 provided map(s).
2023-06-02 11:05:53,607 | INFO     | singer_sdk.metrics   | METRIC: {"type": "counter", "metric": "record_count", "value": 211043, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
2023-06-02 11:06:04,117 | INFO     | singer_sdk.metrics   | METRIC: {"type": "timer", "metric": "sync_duration", "value": 70.51015210151672, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}, "status": "succeeded"}}
2023-06-02 11:06:04,118 | INFO     | singer_sdk.metrics   | METRIC: {"type": "counter", "metric": "record_count", "value": 34027, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
### Caching on schema data party
Copy code
023-06-02 11:06:40,672 | INFO     | tap-db2              | Beginning incremental sync of 'TAPDB2-TAP_TEST_500M'...
2023-06-02 11:06:40,672 | INFO     | tap-db2              | Tap has custom mapper. Using 1 provided map(s).
2023-06-02 11:07:11,594 | INFO     | singer_sdk.metrics   | METRIC: {"type": "timer", "metric": "sync_duration", "value": 30.920732259750366, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}, "status": "succeeded"}}
2023-06-02 11:07:11,594 | INFO     | singer_sdk.metrics   | METRIC: {"type": "counter", "metric": "record_count", "value": 245070, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
### Caching on schema and get_selected_schema
Copy code
2023-06-02 11:07:36,851 | INFO     | tap-db2              | Beginning incremental sync of 'TAPDB2-TAP_TEST_500M'...
2023-06-02 11:07:36,852 | INFO     | tap-db2              | Tap has custom mapper. Using 1 provided map(s).
2023-06-02 11:08:18,430 | INFO     | singer_sdk.metrics   | METRIC: {"type": "timer", "metric": "sync_duration", "value": 41.577131271362305, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}, "status": "succeeded"}}
2023-06-02 11:08:18,430 | INFO     | singer_sdk.metrics   | METRIC: {"type": "counter", "metric": "record_count", "value": 245070, "tags": {"stream": "TAPDB2-TAP_TEST_500M", "context": {}}}
These aren't compatible with Python <= 3.7 - I took a bit of inspiration by looking at this link: https://stackoverflow.com/questions/7388258/replace-property-for-perfomance-gain
v
Looks nice to me! Could probably get that into the SDK, 3.7 is almost EOL anyway (this month I think)
j
This is cool! Let me know if I could help in any way here. Obviously, the issue would remain in taps not based on SDK - this would be a great leverage to push owners of taps to migrate to SDK.
t
yep, we’re planning the 1.0 release to include an end of support for 3.7. targeting July
m
Thanks for the feedback! I decided to profile a before and after with the
schema
property being set to
@property
and `@functools.cached_property`: ### Profiling script and command:
Copy code
# ~/repos/cprofile_db2_sdk.py 
#!/usr/bin/env python3.8
from tap_db2.tap import Tapdb2
Tapdb2.cli()

poetry run python -m cProfile -o ~/repos/concept_sdk_cache.log ~/repos/cprofile_db2_sdk.py --config config.json --catalog concept_catalog.json >/dev/null^C
### Getting profiling results (sorted by CUMULATIVE)
Copy code
import os
import pstats
from pstats import SortKey
p = pstats.Stats(f"{os.getenv('HOME')}/repos/concept_sdk_cache.log")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(100)
###
@property
- current
Copy code
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1063/1    0.024    0.000  213.664  213.664 {built-in method builtins.exec}
        1    0.000    0.000  213.664  213.664 cprofile_db2_sdk.py:2(<module>)
        1    0.000    0.000  212.933  212.933 core.py:1128(__call__)
        1    0.000    0.000  212.933  212.933 core.py:987(main)
        1    0.000    0.000  212.932  212.932 core.py:1393(invoke)
        1    0.000    0.000  212.932  212.932 core.py:709(invoke)
        1    0.000    0.000  212.932  212.932 tap_base.py:442(cli)
        1    0.000    0.000  212.922  212.922 tap_base.py:402(sync_all)
        1    0.320    0.320  212.921  212.921 core.py:1159(sync)
   532966    3.094    0.000  212.600    0.000 core.py:1041(_sync_records)
   532965    1.313    0.000  188.601    0.000 core.py:856(_write_record_message)
  1065930    4.237    0.000  163.886    0.000 core.py:823(_generate_record_messages)
  1065933    3.114    0.000  122.297    0.000 sql.py:77(schema)
  1065937    1.353    0.000   85.552    0.000 sql.py:48(_singer_catalog_entry)
  1065938    5.448    0.000   84.061    0.000 catalog.py:298(from_dict)
  1065938    5.230    0.000   49.470    0.000 catalog.py:105(from_iterable)
  6395628    7.541    0.000   44.239    0.000 catalog.py:56(from_dict)
6395610/1065935   22.225    0.000   33.484    0.000 schema.py:71(to_dict)
  1065935    1.852    0.000   27.244    0.000 schema.py:80(<dictcomp>)
6395628/1065938   14.069    0.000   26.782    0.000 schema.py:91(from_dict)
   533020    1.076    0.000   23.410    0.000 messages.py:193(write_message)
  1065938    3.374    0.000   21.729    0.000 schema.py:111(<dictcomp>)
   532965    0.581    0.000   20.785    0.000 _typing.py:364(conform_record_data_types)
   532965    3.594    0.000   20.090    0.000 _typing.py:384(_conform_record_data_types)
  6395628   11.298    0.000   18.009    0.000 catalog.py:67(<dictcomp>)
   533020    0.834    0.000   17.999    0.000 messages.py:181(format_message)
   533020    2.537    0.000   16.581    0.000 __init__.py:276(dumps)
  6395689    9.984    0.000   16.308    0.000 dataclasses.py:1022(fields)
  2664825    3.239    0.000   15.774    0.000 _typing.py:466(_conform_primitive_property)
174286139   15.292    0.000   15.292    0.000 {method 'get' of 'dict' objects}
   533020    1.156    0.000   12.858    0.000 encoder.py:277(encode)
   532966    2.557    0.000   12.684    0.000 client.py:107(get_records)
###
@functools.cached_property
``` ncalls tottime percall cumtime percall filename:lineno(function) 1066/1 0.019 0.000 58.711 58.711 {built-in method builtins.exec} 1 0.000 0.000 58.711 58.711 cprofile_db2_sdk.py:2(<module>) 1 0.000 0.000 57.101 57.101 core.py:1128(call) 1 0.000 0.000 57.101 57.101 core.py:987(main) 1 0.000 0.000 57.100 57.100 core.py:1393(invoke) 1 0.000 0.000 57.100 57.100 core.py:709(invoke) 1 0.000 0.000 57.100 57.100 tap_base.py:445(cli) 1 0.000 0.000 57.091 57.091 tap_base.py:405(sync_all) 1 0.248 0.248 57.090 57.090 core.py:1134(sync) 532966 2.311 0.000 56.841 0.000 core.py:101…
I raised a PR - needed a few tweaks to get it working for 3.7, but got there in the end: feat(taps): Change SQLStream schema into a cached_property https://github.com/meltano/sdk/pull/1745