matthew_funk
08/16/2023, 8:32 PM
Reuben (Matatika)
08/16/2023, 9:02 PM
`20230815`) stored in state, so you know where to continue from on subsequent syncs.
Unless you are getting some token from the response that you can use to construct the next URL, you will need to parse and increment the date value used in the previous request manually in a custom `Paginator` class (like you said) - that logic should probably sit in the `get_next` implementation. I think the `has_more` implementation can simply check the date from the previous request against the current date (e.g. `20230816`), and just assume there are no more records to process if they match.
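A minimal sketch of the date-increment logic described above, written as standalone functions so it can be run without the SDK. The function signatures here are illustrative assumptions: in a real paginator subclass the equivalent methods receive the HTTP response rather than a date string, so the previous date would first have to be recovered from the request or from the paginator's current value.

```python
from datetime import date, datetime, timedelta
from typing import Optional

DATE_FORMAT = "%Y%m%d"  # matches the YYYYMMDD value used in the URL


def has_more(previous: str, today: date) -> bool:
    """Return True while the previously requested date is before today."""
    return datetime.strptime(previous, DATE_FORMAT).date() < today


def get_next(previous: str, today: date) -> Optional[str]:
    """Return the day after `previous` as YYYYMMDD, or None once caught up."""
    if not has_more(previous, today):
        return None
    next_day = datetime.strptime(previous, DATE_FORMAT).date() + timedelta(days=1)
    return next_day.strftime(DATE_FORMAT)
```

Parsing to a real date before adding one day keeps month and year rollovers correct, which naive integer arithmetic on `20230831` would not.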
Bear in mind that you probably want a sensible default date value to sync from if no other is provided, either from state or config. This could be dynamic (e.g. last two weeks).
matthew_funk
08/17/2023, 1:57 PM
Reuben (Matatika)
08/17/2023, 2:04 PM
matthew_funk
08/17/2023, 2:38 PM
matthew_funk
08/17/2023, 2:38 PM
matthew_funk
08/17/2023, 2:47 PM
Reuben (Matatika)
08/17/2023, 4:33 PM
`get_next` will be called by the default implementation of `RESTStream`, given that you are registering your `Paginator` in your stream class inheriting from `RESTStream`. To make sure it is set for the URL, you will probably have to override `prepare_request` like so:
def prepare_request(self, context, next_page_token):
    prepared_request = super().prepare_request(context, next_page_token)
    # `next_page_token`: from paginator
    # `self.get_starting_timestamp(context)`: from state (rest of state logic needs implementing)
    # `self.config.get("start_date")`: from config (`.get` avoids a KeyError when it is unset)
    # `self.default_start_date`: from default (needs implementing)
    date = (
        next_page_token
        or self.get_starting_timestamp(context)
        or self.config.get("start_date")
        or self.default_start_date
    )
    prepared_request.prepare_url(
        # where `url_base` is something like `https://www.caiso.com/outlook/SP/History/{date}`
        self.url_base.format(date=date),
        self.get_url_params(context, next_page_token),
    )
    return prepared_request
Reuben (Matatika)
08/17/2023, 4:45 PM
`url_base` - it is a property (`@property`) so just access it like
url = self.url_base
matthew_funk
08/17/2023, 5:56 PM
matthew_funk
08/17/2023, 5:57 PM
Reuben (Matatika)
08/17/2023, 5:59 PM
`meltano invoke tap-caiso` from the root of your project personally. That will probably give you a stack trace to follow.
Reuben (Matatika)
08/17/2023, 6:03 PM
`format` is a Python string method that replaces keys enclosed in curly braces with kwarg values:
>>> test = "hello world: {my_key}"
>>> test
'hello world: {my_key}'
>>> test.format(my_key="test")
'hello world: test'
matthew_funk
08/17/2023, 6:05 PM
Reuben (Matatika)
08/17/2023, 6:05 PM
matthew_funk
08/17/2023, 6:06 PM
matthew_funk
08/17/2023, 6:06 PM
Reuben (Matatika)
08/17/2023, 6:09 PM
`Accept` header in the request to `application/json`? Otherwise you can probably override `parse_response` to deserialise the CSV.
matthew_funk
08/17/2023, 6:13 PM
Reuben (Matatika)
08/17/2023, 6:17 PM
matthew_funk
08/17/2023, 6:24 PM
Reuben (Matatika)
08/17/2023, 6:27 PM
import csv
from io import StringIO
from singer_sdk.helpers.jsonpath import extract_jsonpath
# ...
def parse_response(self, response):
    # Each CSV row becomes a dict keyed by the header row
    data = list(csv.DictReader(StringIO(response.text)))
    yield from extract_jsonpath(self.records_jsonpath, input=data)
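To see what the CSV step produces on its own, here is a small standalone sketch using only the standard library. The sample body and its column names are made up for illustration; the real file's headers may differ.

```python
import csv
from io import StringIO

# Hypothetical response body; the real column names may differ.
sample_text = "Time,Current demand\n00:00,25000\n00:05,24950\n"

# DictReader uses the first line as the header and yields one dict per row.
rows = list(csv.DictReader(StringIO(sample_text)))
for row in rows:
    print(row)
```

Note that every value comes back as a string, so numeric columns would likely need casting before they match a typed stream schema.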
Reuben (Matatika)
08/17/2023, 6:34 PM
matthew_funk
08/17/2023, 6:35 PM
Reuben (Matatika)
08/17/2023, 6:38 PM
`__init__` for `caisoPaginator`, you can instantiate it passing in `self.url_base` from `get_new_paginator` in your stream class.
See this too.
matthew_funk
08/17/2023, 6:45 PM
Reuben (Matatika)
08/17/2023, 6:49 PM
`BaseAPIPaginator` though, so you could just call `super().__init__` to achieve the same thing.
matthew_funk
08/17/2023, 6:49 PM
matthew_funk
08/17/2023, 6:50 PM
Reuben (Matatika)
08/17/2023, 6:52 PM
Reuben (Matatika)
08/17/2023, 7:01 PM
`caisoPaginator` with an initial start date instead of the URL, then use `self.current_value` to extract `year`, `month` and `day` from in `has_more` and `get_next`.
`__init__` and just parse out the date from `response.request.url` in `has_more` and `get_next`.
matthew_funk
08/17/2023, 7:16 PM
matthew_funk
08/17/2023, 7:18 PM
Reuben (Matatika)
08/17/2023, 7:26 PM
`get_new_paginator` without any arguments like before. You need to figure out how you can get your date back from the paginator `get_next` in the format `YYYYMMDD` expected for the URL. Also, this is where your logic for reforming the URL is not correct, since you are missing the `/demand.csv` path segment.
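As a rough illustration of the two points above, the sketch below builds the full URL (including the `/demand.csv` segment) from a `YYYYMMDD` string and recovers the date from such a URL again. The helper names `build_url` and `date_from_url` are hypothetical and not part of the SDK; the regular expression assumes the date always sits directly after `/History/`.

```python
import re
from typing import Optional

# Template from the thread, with the /demand.csv path segment appended.
URL_TEMPLATE = "https://www.caiso.com/outlook/SP/History/{date}/demand.csv"


def build_url(date_str: str) -> str:
    """Fill the {date} placeholder with a YYYYMMDD string."""
    return URL_TEMPLATE.format(date=date_str)


def date_from_url(url: str) -> Optional[str]:
    """Pull the eight-digit date back out of a request URL, if present."""
    match = re.search(r"/History/(\d{8})/", url)
    return match.group(1) if match else None
```

Round-tripping the date through the URL like this is one way a paginator could work out the previous date from `response.request.url` without holding extra state.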
You can also call `self.logger.info` from the stream class methods to print out runtime information - even if it's just for debugging during development. Here, I'd recommend you log out `next_page_token`, `self.url_base` and `url` in `prepare_request`, just so we can see how they are changing together.
Reuben (Matatika)
08/17/2023, 7:29 PM
`format` stuff I shared earlier is probably relevant here too now.
matthew_funk
08/17/2023, 7:29 PM
matthew_funk
08/17/2023, 7:34 PM
Reuben (Matatika)
08/17/2023, 7:35 PM
`demandStream` is extending `caisoStream`, then you can override the URL from there:
client.py
class caisoStream(RESTStream):
    url_base = "https://www.caiso.com/outlook/SP/History/{date}"
streams.py
class demandStream(caisoStream):
    @property
    def url_base(self):
        return super().url_base + "/demand.csv"
matthew_funk
08/17/2023, 7:39 PM
Reuben (Matatika)
08/17/2023, 7:43 PM
>>> datetime.now()
datetime.datetime(2023, 8, 17, 20, 41, 31, 978053)
>>> datetime.now().strftime("%Y%m%d")
'20230817'
so you probably need to do the same `strftime` call on `date` in `prepare_request` (it's still a `datetime` object) - most likely it's making the first request with the default `datetime` string representation:
>>> str(datetime.now())
'2023-08-17 20:43:34.482836'
Reuben (Matatika)
08/17/2023, 7:49 PM
"Okay, it still wont let me create instance of caisoPaginator without a start_value"
Pass it `None` when instantiating the paginator
def get_new_paginator(self):
    return caisoPaginator(None)
or declare
def __init__(self, *args, **kwargs):
    super().__init__(None, *args, **kwargs)
so you don't have to pass `None` explicitly on instantiation (a lot of the inbuilt paginator classes do this):
def get_new_paginator(self):
    return caisoPaginator()
matthew_funk
08/17/2023, 7:56 PM
Reuben (Matatika)
08/17/2023, 7:59 PM
>>> date = datetime.now()
>>> isinstance(date, datetime)
True
>>> isinstance(date, str)
False
It's probably better to do this though:
start_date = self.get_starting_timestamp(context) or self.config.get("start_date") or self.default_start_date
date = next_page_token or start_date.strftime("%Y%m%d")
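The same fallback chain can be tried out in isolation with a plain function. This is only a sketch: `resolve_date` and its parameters are hypothetical stand-ins for the stream attributes, and it assumes the state, config and default values have already been parsed into `datetime` objects (a `start_date` read straight from config is often a string and would need parsing first).

```python
from datetime import datetime
from typing import Optional


def resolve_date(
    next_page_token: Optional[str],
    state_timestamp: Optional[datetime],
    config_start_date: Optional[datetime],
    default_start_date: datetime,
) -> str:
    """Pick the date for the next request as a YYYYMMDD string."""
    # The paginator already hands back a formatted string, so use it as-is.
    if next_page_token:
        return next_page_token
    # Otherwise fall back through state, config and the default, in that order.
    start = state_timestamp or config_start_date or default_start_date
    return start.strftime("%Y%m%d")
```

Formatting only the fallback value avoids calling `strftime` on a token that is already a string.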
Reuben (Matatika)
08/17/2023, 8:06 PM
`url_base` implementation?
matthew_funk
08/17/2023, 8:07 PM
matthew_funk
08/17/2023, 8:07 PM
Reuben (Matatika)
08/17/2023, 8:09 PM
`self.logger.info(url)` to print the URL out)
matthew_funk
08/17/2023, 8:09 PM
Reuben (Matatika)
08/17/2023, 8:10 PM
matthew_funk
08/17/2023, 8:11 PM
Reuben (Matatika)
08/17/2023, 8:12 PM
matthew_funk
08/17/2023, 8:18 PM
Reuben (Matatika)
08/17/2023, 8:20 PM
matthew_funk
08/17/2023, 8:24 PM
matthew_funk
08/17/2023, 8:25 PM
matthew_funk
08/17/2023, 8:26 PM
Reuben (Matatika)
08/17/2023, 8:27 PM
matthew_funk
08/17/2023, 8:33 PM
Reuben (Matatika)
08/17/2023, 8:39 PM
Reuben (Matatika)
08/17/2023, 8:40 PM
`poetry.lock` and `pyproject.toml` files? They weren't included in the zip.
matthew_funk
08/17/2023, 8:41 PM
matthew_funk
08/17/2023, 8:42 PM
Reuben (Matatika)
08/17/2023, 8:46 PM
Reuben (Matatika)
08/17/2023, 8:46 PM
matthew_funk
08/17/2023, 8:47 PM
Reuben (Matatika)
08/17/2023, 8:48 PM
matthew_funk
08/17/2023, 8:51 PM
Reuben (Matatika)
08/18/2023, 12:14 AM
`start_date` config now, the tap will get the last 4 weeks of data as a fallback.
meltano invoke tap-caiso
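One way a dynamic "last 4 weeks" fallback could be computed is sketched below. The function name and its parameters are assumptions for illustration, not the code that was shared in the thread.

```python
from datetime import date, timedelta
from typing import Optional


def default_start_date(today: Optional[date] = None, weeks: int = 4) -> str:
    """Return the date `weeks` weeks before `today` as a YYYYMMDD string."""
    # Allow the caller to inject `today` so the result is reproducible in tests.
    today = today or date.today()
    return (today - timedelta(weeks=weeks)).strftime("%Y%m%d")
```

For example, `default_start_date(date(2023, 8, 18))` gives `'20230721'`.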
I've only got the request and pagination logic working here intentionally, since I didn't want to step on your toes in learning the SDK and how it works with the Singer spec. There are definitely still some things to do, like:
• Define your config
• Define your stream schemas
• Last processed date state implementation (i.e. start from where you left off on the next sync)
• More streams!
matthew_funk
08/18/2023, 1:09 PM
Reuben (Matatika)
08/18/2023, 1:52 PM
matthew_funk
08/18/2023, 3:27 PM