# getting-started
a
Hey everyone. I don't know if this is the right channel for my question. How do I go about finding a performance bottleneck in a pipeline? I have an extremely simple custom tap that fetches a single, relatively large JSON object from an API endpoint of eodhistoricaldata.com. My pipeline is configured to persist the data into a Postgres target. It all works fine, but it takes about a couple of minutes to complete, which I find very unsatisfying. Yes, the JSON object is big, but I still don't think it should take even a couple of seconds end to end; it certainly shouldn't take minutes to parse and validate.
d
I’d start by running the tap and target separately, and seeing where the slowness arises:
1. Run the tap by itself and dump its output to a file:
```shell
meltano invoke tap-foo > output.json
```
2. Pipe the captured output into the target by itself:
```shell
cat output.json | meltano invoke target-foo
```
That will tell you whether the slow part is extracting the data from the source, or loading it into the destination.
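To quantify which half dominates, the two steps above can be timed. A minimal sketch (the `timed_run` helper is made up, the plugin names are the placeholders from above, and it assumes `meltano` is on the PATH):

```python
import subprocess
import time


def timed_run(cmd: str) -> float:
    """Run a shell command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True)
    return time.perf_counter() - start


# Hypothetical usage (plugin names are placeholders):
# extract_seconds = timed_run("meltano invoke tap-foo > output.json")
# load_seconds = timed_run("cat output.json | meltano invoke target-foo")
```

Comparing the two durations tells you whether to profile the tap or the target first.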
a
I was able to isolate the issue
Here is my base class:
```python
"""REST client handling, including eodhistoricaldataStream base class."""

from pathlib import Path
from typing import Any, Dict, Optional

import requests

from singer_sdk.streams import RESTStream

SCHEMAS_DIR = Path(__file__).parent / "schemas"


class eodhistoricaldataStream(RESTStream):
    """eodhistoricaldata stream class."""

    url_base = "https://eodhistoricaldata.com/api"

    def get_next_page_token(
        self, response: requests.Response, previous_token: Optional[Any]
    ) -> Optional[Any]:
        # The endpoint is not paginated, so there is never a next page.
        return None

    def get_url_params(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Dict[str, Any]:
        """Return a dictionary of values to be used in URL parameterization."""
        params: dict = {"api_token": self.config["api_token"]}
        return params
```
And here is my top-level stream:
```python
"""Stream type classes for tap-eodhistoricaldata."""

from pathlib import Path
from typing import Any, Dict, List, Optional

from tap_eodhistoricaldata.client import eodhistoricaldataStream

SCHEMAS_DIR = Path(__file__).parent / "schemas"


class Fundamentals(eodhistoricaldataStream):
    """Define custom stream."""

    name = "fundamentals"
    path = "/fundamentals/{Code}"
    primary_keys = ["Code"]
    selected_by_default = True

    replication_key = None
    schema_filepath = SCHEMAS_DIR / "fundamentals.json"

    @property
    def partitions(self) -> List[Dict[str, Any]]:
        # Return a list rather than a one-shot iterator, so the SDK can
        # safely iterate over the partitions more than once.
        return [{"Code": symbol} for symbol in self.config["symbols"]]

    def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
        row["Code"] = context["Code"]
        return row
```
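For illustration only (the symbol values below are made up), the `partitions` property expands the `symbols` config into one partition context per symbol, and `post_process` then stamps that `Code` back onto each row:

```python
# Hypothetical config values; mirrors what the partitions property
# and post_process in the stream class above do.
symbols = ["AAPL", "MSFT"]

# One partition context per configured symbol.
partitions = [{"Code": s} for s in symbols]
print(partitions)  # [{'Code': 'AAPL'}, {'Code': 'MSFT'}]

# post_process copies the partition's Code onto each emitted row.
row = {"Highlights": {}}
row["Code"] = partitions[0]["Code"]
print(row["Code"])  # AAPL
```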
As you can see, the implementation is extremely simple.
I wrote a very simple test and added a profiler:
```python
import pytest

# Assumes the standard cookiecutter tap layout; SAMPLE_CONFIG (not shown)
# holds the tap configuration used by the test.
from tap_eodhistoricaldata.tap import Tapeodhistoricaldata


@pytest.mark.vcr
def test_selected():
    tap1 = Tapeodhistoricaldata(config=SAMPLE_CONFIG, parse_env_config=True)

    tap1.sync_all()
```
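For reference, a sketch of how such profiling can be wired up with the standard library's cProfile (the helper name `profile_call` is made up; it works for any callable, e.g. `tap1.sync_all`):

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile; return (result, report of the hottest functions)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
    return result, buffer.getvalue()


# Hypothetical usage with the test above:
# _, report = profile_call(tap1.sync_all)
# print(report)
```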
Here are my profiling results:
[screenshot has since been deleted]
As you can see, the tap spends most of its time in `_catalog:27:is_property_selected` and `keys_order_dependent:4:make_key`.
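This is not the SDK's actual code, but a sketch of the memoization pattern that typically removes this class of hot spot: if a pure selection check is re-evaluated once per property per record, caching it by its arguments collapses the cost to one evaluation per unique property.

```python
from functools import lru_cache


# Hypothetical stand-in for a per-record selection check; the real SDK
# function and its arguments differ. This only illustrates the caching idea.
@lru_cache(maxsize=None)
def is_property_selected(stream_name: str, breadcrumb: tuple) -> bool:
    # Imagine an expensive catalog traversal here; with the cache,
    # it runs once per unique (stream, breadcrumb) pair.
    return not breadcrumb or breadcrumb[-1] != "_sdc_internal"


# Repeated calls with the same arguments hit the cache.
for _ in range(1000):
    is_property_selected("fundamentals", ("properties", "Code"))
print(is_property_selected.cache_info().misses)  # 1 unique computation
```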
d
Yep, I see. Looks like we have an optimization opportunity there! @aaronsteers (who leads SDK development) is out today (since it’s a US holiday), but he’ll be back tomorrow. Would you mind creating an issue with these findings so that he can look into it once he’s back? @edgar_ramirez_mondragon joined our team today and may also feel inspired to dive into this 🙂
(Also, not quite the same thing, but potentially relevant: https://gitlab.com/DouweM/tap-investing)
a
@douwe_maan yeah, I saw it, but I think it will suffer from the same bottleneck. The issue is not in the implementation of the custom tap but in the base tap class: it spends too much time traversing the response.
d
Yeah I agree, it’s an SDK issue, not a specific tap issue. AJ/Edgar should be able to look into that and resolve it sooner rather than later.
a
Great
d
@artem_vysotsky If you could move this info into an issue that’d be much appreciated!