# getting-started
a
Hey everyone. I don't know if this is the right channel for my question. How do I go about finding a performance bottleneck in a pipeline? I have an extremely simple custom tap that fetches a single, relatively large JSON object from an API endpoint of eodhistoricaldata.com. My pipeline is configured to persist the data into a Postgres target. It all works fine, but it takes about a couple of minutes to complete, which I find very unsatisfying. Yes, the JSON object is big, but I still don't think it should take even a couple of seconds end to end; it certainly shouldn't take minutes to parse and validate.
d
I’d start by running the tap and target separately, and seeing where the slowness arises:
1. Run the tap by itself and dump its output to a file:
```shell
meltano invoke tap-foo > output.json
```
2. Pipe the captured output into the target by itself:
```shell
cat output.json | meltano invoke target-foo
```
That will tell you whether the slow part is extracting the data from the source, or loading it into the destination.
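To quantify which half dominates, the two steps above can be timed. A minimal sketch (the `timed_run` helper is made up, the plugin names are the placeholders from above, and it assumes `meltano` is on the PATH):

```python
import subprocess
import time


def timed_run(cmd: str) -> float:
    """Run a shell command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True)
    return time.perf_counter() - start


# Hypothetical usage (plugin names are placeholders):
# extract_seconds = timed_run("meltano invoke tap-foo > output.json")
# load_seconds = timed_run("cat output.json | meltano invoke target-foo")
```

Comparing the two durations tells you whether to profile the tap or the target first.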
a
I was able to isolate the issue
Here is my base class:
```python
"""REST client handling, including eodhistoricaldataStream base class."""

from pathlib import Path
from typing import Any, Dict, Optional

import requests

from singer_sdk.streams import RESTStream

SCHEMAS_DIR = Path(__file__).parent / "schemas"


class eodhistoricaldataStream(RESTStream):
    """eodhistoricaldata stream class."""

    url_base = "https://eodhistoricaldata.com/api"

    def get_next_page_token(
        self, response: requests.Response, previous_token: Optional[Any]
    ) -> Optional[Any]:
        # The endpoint is not paginated, so there is never a next page.
        return None

    def get_url_params(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Dict[str, Any]:
        """Return a dictionary of values to be used in URL parameterization."""
        params: dict = {"api_token": self.config["api_token"]}
        return params
```
And here is my top-level stream:
```python
"""Stream type classes for tap-eodhistoricaldata."""

from pathlib import Path
from typing import Any, Dict, List, Optional

from tap_eodhistoricaldata.client import eodhistoricaldataStream

SCHEMAS_DIR = Path(__file__).parent / "schemas"


class Fundamentals(eodhistoricaldataStream):
    """Define custom stream."""

    name = "fundamentals"
    path = "/fundamentals/{Code}"
    primary_keys = ["Code"]
    selected_by_default = True

    replication_key = None
    schema_filepath = SCHEMAS_DIR / "fundamentals.json"

    @property
    def partitions(self) -> List[Dict[str, Any]]:
        # Return a list rather than a one-shot iterator, so the SDK can
        # safely iterate over the partitions more than once.
        return [{"Code": symbol} for symbol in self.config["symbols"]]

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        row["Code"] = context["Code"]
        return row
```
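For illustration only (the symbol values below are made up), the `partitions` property expands the `symbols` config into one partition context per symbol, and `post_process` then stamps that `Code` back onto each row:

```python
# Hypothetical config values; mirrors what the partitions property
# and post_process in the stream class above do.
symbols = ["AAPL", "MSFT"]

# One partition context per configured symbol.
partitions = [{"Code": s} for s in symbols]
print(partitions)  # [{'Code': 'AAPL'}, {'Code': 'MSFT'}]

# post_process copies the partition's Code onto each emitted row.
row = {"Highlights": {}}
row["Code"] = partitions[0]["Code"]
print(row["Code"])  # AAPL
```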
As you can see, the implementation is extremely simple.
I wrote a very simple test and added a profiler:
```python
import pytest

# Assumes the standard cookiecutter tap layout; SAMPLE_CONFIG (not shown)
# holds the tap configuration used by the test.
from tap_eodhistoricaldata.tap import Tapeodhistoricaldata


@pytest.mark.vcr
def test_selected():
    tap1 = Tapeodhistoricaldata(config=SAMPLE_CONFIG, parse_env_config=True)

    tap1.sync_all()
```
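For reference, a sketch of how such profiling can be wired up with the standard library's cProfile (the helper name `profile_call` is made up; it works for any callable, e.g. `tap1.sync_all`):

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, report of the hottest functions)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
    return result, buffer.getvalue()


# Hypothetical usage with the test above:
# _, report = profile_call(tap1.sync_all)
# print(report)
```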
Here are my profiling results:
[screenshot has since been deleted]
As you can see, the tap spends most of its time in `_catalog:27:is_property_selected` and `keys_order_dependent:4:make_key`.
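This is not the SDK's actual code, but a sketch of the memoization pattern that typically removes this class of hot spot: if a pure selection check is re-evaluated once per property per record, caching it by its arguments collapses the cost to one evaluation per unique property.

```python
from functools import lru_cache


# Hypothetical stand-in for a per-record selection check; the real SDK
# function and its arguments differ. This only illustrates the caching idea.
@lru_cache(maxsize=None)
def is_property_selected(stream_name: str, breadcrumb: tuple) -> bool:
    # Imagine an expensive catalog traversal here; with the cache,
    # it runs once per unique (stream, breadcrumb) pair.
    return not breadcrumb or breadcrumb[-1] != "_sdc_internal"


# Repeated calls with the same arguments hit the cache.
for _ in range(1000):
    is_property_selected("fundamentals", ("properties", "Code"))
print(is_property_selected.cache_info().misses)  # 1 unique computation
```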
d
Yep, I see. Looks like we have an optimization opportunity there! @aaronsteers (who leads SDK development) is out today (since it’s a US holiday), but he’ll be back tomorrow. Would you mind creating an issue with these findings so that he can look into it once he’s back? @edgar_ramirez_mondragon joined our team today and may also feel inspired to dive into this 🙂
(Also, not quite the same thing, but potentially relevant: https://gitlab.com/DouweM/tap-investing)
a
@douwe_maan yeah, I saw it, but I think it will suffer from the same bottleneck. The issue is not in the implementation of the custom tap but in the base tap class: it spends too much time traversing the response.
d
Yeah I agree, it’s an SDK issue, not a specific tap issue. AJ/Edgar should be able to look into that and resolve it sooner rather than later.
a
Great
d
@artem_vysotsky If you could move this info into an issue that’d be much appreciated!