# singer-tap-development
n
Hi! I am working on a custom tap and working with an API that does not really have paths and running into a 403 error message in pytest. This is a REST API that only accepts POST requests and expects an "entity" to be passed as a config setting vs. as a path in the url. I.e. - Instead of www.api.com/characters I just need to specify the base url and then add a config setting that looks like "entity": "characters". I tried setting the path to an empty string in
streams.py
and added the following to the stream class:
rest_method = "POST"
It is at this point I am getting a 403 error. I am able to get this request to work via Postman where I have to enter the data as "form-data" in the body and the form-data section seems to correspond to the properties list in
taps.py
, but it's not working for some reason. My best guess is that it has something to do with the endpoint property. I have tried leaving it blank as mentioned as well as specifying the whole url and leaving the base url blank and vice versa, but all return the same error. Any help would be much appreciated!
r
Let me see if I understand correctly: • You have a REST API with some base url -
www.api.com
• There are no endpoints, so every request is made to
www.api.com
• All requests are
POST
requests that submit entity data to access resources --- Assuming you are using the Singer SDK, you will need to supply
url_base
and override
rest_method
in the base tap stream class generated for you in
client.py
- let’s refer to this as
APIStream
.
client.py
from singer_sdk.streams import RESTStream


class APIStream(RESTStream):
    """API stream class."""

    url_base = "http://www.api.com"
    rest_method = "POST"
RESTStream::url_base --- The streams you define in
streams.py
that inherit from
APIStream
(e.g.
CharactersStream
) will then need to override the
RESTStream::prepare_request_payload
method to send entity data.
streams.py
from typing import Any, Optional

from tap_api.client import APIStream


class CharactersStream(APIStream):
    """Characters stream class"""

    name = "stream_characters"

    # overrides RESTStream::prepare_request_payload
    def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Optional[dict]:

        # entity data
        return {
            "entity": "characters",
        }
RESTStream::prepare_request_payload --- If you want to get entity data passed as a setting to the tap, you can access this through `Stream::config`:
def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Optional[dict]:
        return {
            "entity": self.config["characters_entity"],
        }
Stream::config --- This use-case doesn’t really make sense to me though, since you would need to supply an entity identifier setting for each entity stream, which doesn’t scale very well. If you have an
entities
setting that specifies an array of entity identifiers, a better approach might be to do something like override
Tap::discover_streams
in
tap.py
(done for you if you used the SDK project
cookiecutter
) and dynamically generate a stream for each entity from a generic stream class (e.g.
EntityStream
).
streams.py
from typing import Any, Optional

from singer_sdk.plugin_base import PluginBase as TapBaseClass
from singer_sdk.streams import RESTStream


class EntityStream(RESTStream):
    """Entity stream class."""

    def __init__(self, tap: TapBaseClass, entity: str):
        super().__init__(tap, f"stream_{entity}")
        self.entity = entity

    def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Optional[dict]:
        return {
            "entity": self.entity,
        }
tap.py
from typing import List

from singer_sdk import Stream, Tap

from tap_api.streams import EntityStream


class TapAPI(Tap):
    """API tap class."""

    def discover_streams(self) -> List[Stream]:
        """Return a list of discovered streams."""

        entities: List[str] = self.config["entities"]
        return [EntityStream(self, entity) for entity in entities]
Tap::discover_streams
You added some extra details while I was writing this, so I might have repeated some stuff you already said. 😅
n
Thanks so much for this response Reuben and sorry if I caused you any extra work changing details (was making incremental progress myself). I think your idea here to create a more generic class makes sense, however my main issue still seems to be an authentication issue where the config settings I am defining that include my access token are not being recognized. So my tap has a property list that includes a token and a few other properties:
config_jsonschema = th.PropertiesList(
    th.Property(
        "token",
        th.StringType,
        required=True,
        description="The token to authenticate against the API service",
    ),
    # ... remaining properties omitted in the original message
).to_dict()
I am adding a property for token and any other properties into the SAMPLE_CONFIG in the initial pytest setup, but still getting a 403 forbidden error. I see your suggestion to add
url_base
directly to the APIStream class, but
Path
is a required attribute. That is why I have tried leaving this blank and trying to just use the
url_base
in client.py to specify the full url and also tried adding full url to the
Path
attribute. Neither has worked. It seems like I need to potentially override the
Path
attribute?
r
Not sure what you are referring to with
Path
, but
RESTStream::path
is not a required attribute.
RESTStream::get_url
will use
""
instead of
self.path
if not specified:
singer_sdk.streams.RESTStream
def get_url(self, context: Optional[dict]) -> str:
        """Get stream entity URL.

        Developers override this method to perform dynamic URL generation.

        Args:
            context: Stream partition or context dictionary.

        Returns:
            A URL, optionally targeted to a specific partition or context.
        """
        url = "".join([self.url_base, self.path or ""])
        vals = copy.copy(dict(self.config))
        vals.update(context or {})
        for k, v in vals.items():
            search_text = "".join(["{", k, "}"])
            if search_text in url:
                url = url.replace(search_text, self._url_encode(v))
        return url
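To make the behaviour of `get_url` concrete, here is a minimal standalone sketch of the same join-then-substitute logic using only the standard library. `build_url` is a hypothetical name for illustration, and `urllib.parse.quote` stands in for the SDK's internal `_url_encode`, so treat this as an approximation rather than the SDK's code.

```python
from typing import Optional
from urllib.parse import quote


def build_url(url_base: str, path: Optional[str], values: dict) -> str:
    """Join base and path, then fill in any {placeholder} found in the URL."""
    url = "".join([url_base, path or ""])
    for key, value in values.items():
        placeholder = "{" + key + "}"
        if placeholder in url:
            # Assumption: percent-encode the value, roughly like the SDK helper.
            url = url.replace(placeholder, quote(str(value), safe=""))
    return url


# An empty (or missing) path leaves only the base URL.
print(build_url("https://www.api.com", "", {}))
# A templated path is filled in from config or context values.
print(build_url("https://www.api.com", "/{entity}", {"entity": "characters"}))
```

With an empty path, every stream ends up requesting the same address, which is why the entity has to travel in the request body instead.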
I assume you are writing unit tests (as opposed to integration tests) with
pytest
, in which case are you sure you are mocking the API calls correctly? If you are writing integration tests, then I would expect the API to return a
401 Unauthorized
response given invalid credentials, but this might not be the case since the API seems pretty unconventional.
403 Forbidden
implies you are authenticated but not permitted to access the resource, which could be caused by a request made to an incorrect URL, as you say. Maybe this is an issue with your client stream authenticator. What does your
auth.py
look like?
n
Thanks again for all of your advice! When I omit path, I get an AttributeError in the test that my stream "object has no attribute 'path' " I am simply running
poetry run pytest
and the default test is failing on this. I searched and I see no other reference to a path attribute, so maybe I need to dig a bit further into this test. Looks like it's using Singer's SDK for tests: singer_sdk.testing
r
I’ve had a look at the SDK source a bit more and it looks like
self.path
is not assigned if a Falsy value for
path
is passed to
RESTStream::__init__
. So you will need to supply a
path
of
""
in your stream class after all, in order to circumvent this - sorry for the confusion! Any thoughts on this @aaronsteers @edgar_ramirez_mondragon? Looks like a stream inheriting from
RESTStream
has to supply
path
as a static/class property, or assign
self.path
in an instance method before
RESTStream::get_url
is called. Onto the
403 Forbidden
issue: given that your
SAMPLE_CONFIG
is correct, there must be an issue with your client authenticator class, if you are making a request to the same URL with the same credentials in Postman successfully.
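The `path` behaviour described above can be reproduced with a small standalone model. This is a simplified sketch and not SDK code: `SketchBaseStream`, `WithoutPath`, `WithEmptyPath`, and `read_path` are hypothetical names, and the only assumption carried over is that the constructor assigns `self.path` for truthy values only.

```python
from typing import Optional


class SketchBaseStream:
    """Simplified stand-in for the constructor behaviour described above."""

    def __init__(self, path: Optional[str] = None) -> None:
        # The instance attribute is only assigned for truthy values.
        if path:
            self.path = path


class WithoutPath(SketchBaseStream):
    """No class-level path, so reading self.path raises AttributeError."""


class WithEmptyPath(SketchBaseStream):
    """A class-level empty string gives attribute lookup something to find."""

    path = ""


def read_path(stream: SketchBaseStream) -> str:
    """Return the repr of the stream's path, or a marker if it is missing."""
    try:
        return repr(stream.path)
    except AttributeError:
        return "AttributeError"


print(read_path(WithoutPath()))
print(read_path(WithEmptyPath()))
```

Declaring `path = ""` on the class works because attribute lookup falls back to the class when the instance has no attribute of that name.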
n
Thanks so much! When you mention "client authenticator" class, I do not believe I have one because I selected "6" in the cookiecutter setup. My API does not require any authentication and I have Postman set to No Auth. Authentication happens via a token passed as a config setting which is the snippet I shared above. This is exactly how I have it setup in Postman and it works. I am thinking the url must be formatting incorrectly with the empty string, but not sure.
Ah yes, setting it up as you described there is actually a slightly different error too, not the 403:
FAILED tests/test_core.py::test_standard_tap_tests - requests.exceptions.MissingSchema: Invalid URL '': No scheme supplied.
r
Authentication happens via a token passed as a config setting
In Postman, do you pass the token as a URL parameter or in the request body?
```
requests.exceptions.MissingSchema: Invalid URL '': No scheme supplied.
```
Do you have
url_base
defined on your client stream?
n
Yes, I do have the full url_base specified. In Postman, the token is included in the request body, not as a URL parameter.
r
full url_base specified
Do you have it prefixed with a scheme (i.e.
http://
or
https://
)? A scheme is required in
url_base
.
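A quick way to check a candidate `url_base` before running the tap is to parse it with the standard library. This is only a sketch: `has_http_scheme` is a hypothetical helper, not part of the SDK or of `requests`.

```python
from urllib.parse import urlsplit


def has_http_scheme(url: str) -> bool:
    """Return True when the URL has an http(s) scheme and names a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# An empty string and a bare host name both lack a scheme.
for candidate in ("", "www.api.com", "https://www.api.com"):
    print(repr(candidate), has_http_scheme(candidate))
```

An empty string fails this check, which matches the `Invalid URL ''` in the error above and suggests the URL reaching `requests` was empty rather than merely missing its prefix.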
the token is included in the request body
Great - in that case you will need to supply
token
in the
prepare_request_payload
method of your client stream!
def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Optional[dict]:
        return {
            "token": self.config["token"],
        }
n
ah, thanks for pointing this out. I did not put it together the first time you suggested it! I set up this method now in the stream, however the 403 error still persists! I double checked and I am using the exact url I am using in Postman and it's prefixed with
https://
When I add this as the url_base and leave path as an empty string, the specific error is:
403 Client Error: Forbidden for path:
So it seems like I need a way to override or ignore the path setting.
r
Since
path
is an empty string, naturally it is not displayed in the error message. What’s happening there is you are getting
403 Forbidden
on
https://api.com
(just the
url_base
value). There must be some difference in how you are supplying the value of
token
in your tap versus in Postman. Can you share your client stream
prepare_request_payload
method implementation?
n
ah, thanks. Yes, I am simply following your code example and adding in a few additional attributes that are required as part of the body:
# overrides RESTStream::prepare_request_payload
    def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Optional[dict]:
        return {
            "token": self.config["token"],
            "content": self.config["content"],
            "action": self.config["action"],
            "format": self.config["format"],
        }
I am then specifying actual values for these in
test_core.py
in SAMPLE_CONFIG
Ah, one specific thing I am noticing in my Postman test, is these values MUST be specified in the "form-data" option within the body. I am not sure how I would make sure the tap is sending the payload in this way.
r
Looks like you might have to override
RESTStream::prepare_request
, since the SDK does not provide the
files
parameter to the underlying
requests.Request
object, which seems to be required for
multipart/form-data
POST requests - see here.
from typing import Any, Optional, cast

import requests
from singer_sdk.streams import RESTStream


class APIStream(RESTStream):
    """API stream class."""

    # overrides RESTStream::prepare_request_payload
    def prepare_request_payload(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Optional[dict]:
        return {
            "token": self.config["token"],
            "content": self.config["content"],
            "action": self.config["action"],
            "format": self.config["format"],
        }

    # overrides RESTStream::prepare_request
    def prepare_request(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> requests.PreparedRequest:
        request_data = self.prepare_request_payload(context, next_page_token)

        request = cast(
            requests.PreparedRequest,
            self.requests_session.prepare_request(
                requests.Request(
                    method="POST",
                    url=self.url_base,
                    files=request_data,
                ),
            ),
        )
        return request
You then do not need to define the
url_base
and
rest_method
properties in your client stream, since these are now supplied inside
prepare_request
. No call to
RESTStream::get_url
will happen either, so you shouldn’t see the
AttributeError
for
path
you were getting earlier.
n
@Reuben (Matatika) - Thank you so much for all of your help, this is working! I just had to make one minor change in
prepare_request
updating
files
variable name to
data
. Really appreciate your help. If you're curious, I am working with an open-source research data capture system called REDCap, and its API seems to be rather non-standard. Appreciate all of your help!
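A likely explanation for why `data` worked: `requests` sends a dict passed as `data=` as an `application/x-www-form-urlencoded` body, whereas a dict passed as `json=` (which appears to be what the SDK's default `prepare_request` does) becomes a JSON document. The sketch below shows the two encodings side by side with the standard library only. The payload values and `build_form_request` are hypothetical placeholders, and no request is actually sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder values; in the tap these would come from self.config.
payload = {"token": "EXAMPLE_TOKEN", "content": "example", "format": "json"}

# What a JSON body looks like on the wire.
json_body = json.dumps(payload).encode("utf-8")

# What a form-encoded body looks like on the wire.
form_body = urlencode(payload).encode("utf-8")


def build_form_request(url: str, fields: dict) -> Request:
    """Build (but do not send) a POST request with a form-encoded body."""
    return Request(
        url,
        data=urlencode(fields).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


print(json_body)
print(form_body)
print(build_form_request("https://www.api.com", payload).get_method())
```

A server that only parses form fields will not find `token` in the JSON variant, which would be consistent with the 403 seen before the override.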
r
Happy to help! Was an interesting issue for sure! 😁
a
I'm late to this thread, but thank you to @Reuben (Matatika) for the insight here. I've not run into this use case before but the approach proposed seems very solid:
Looks like you might have to override
RESTStream::prepare_request
, since the SDK does not provide the
files
parameter to the underlying
requests.Request
object, which seems to be required for
multipart/form-data
POST requests - see here.
If there's a more natural integration point we can add into the SDK, we are always open to improvements. But for this case, the overriding of
prepare_request()
does seem like a great solution.