# troubleshooting
j
Does anyone have a clue why such a long delay happens between writing the lock to MinIO S3 and printing the first metadata info messages related to tap-salesforce?
```
2023-03-03T13:56:04.857208Z [info     ] smart_open.s3.MultipartWriter('meltano', 'meltano_state/dev_local:tap-salesforce-to-target-postgres-sfdc/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
2023-03-03T13:59:45.719671Z [info     ] INFO Starting sync             cmd_type=elb consumer=False name=tap-salesforce producer=True stdio=stderr string_id=tap-salesforce
```
Hm, with --full-refresh the S3 state is not considered, but it is still lagging.
Minutes and minutes... strange.
But then it finishes successfully.
Could the fact that it is running on a custom domain play a role? Our endpoint is https://gooddata--full.sandbox.my.salesforce.com. And yes, I set is_sandbox: true in meltano.yml.
The delay disappears if I move the plugin config from the particular dev_local environment to the top-level plugin config.
```yaml
environments:
  - name: dev_local
#    config:
#      plugins:
#        extractors:
#          - name: tap-salesforce
#            config:
#              api_type: "BULK"
#              select_fields_by_default: true
#              start_date: "2023-01-01T00:00:00Z"
#              username: integration.internal@gooddata.com.full
#              is_sandbox: true
.....
plugins:
  extractors:
  - name: tap-salesforce
    variant: meltanolabs
    pip_url: git+https://github.com/meltanolabs/tap-salesforce.git
    config:
      api_type: "BULK"
      select_fields_by_default: true
      start_date: "2023-01-01T00:00:00Z"
      username: integration.internal@gooddata.com.full
      is_sandbox: true
```
Also, `start_date` is not respected; I can see rows much older than `start_date`, e.g. in the `lead` table.
@alexander_butler have you observed such behavior? I mean, is it expected that `start_date` is not applied to the `lead` table? The lag at the beginning is really strange. Is my config in the environments section correct?
I ran meltano in debug mode.
At the beginning, it makes a lot of calls like this:
```
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.1849195957183838, "tags": {"endpoint": "ConnectedApplication", "status": "succeeded"}}
INFO Making GET request to https://gooddata--full.sandbox.my.salesforce.com/services/data/v53.0/sobjects/UserProvisioningRequestShare/describe with params: None
INFO Used 7149 of 282600 daily REST API quota
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.3235626220703125, "tags": {"endpoint": "UserProvisioningRequestShare", "status": "succeeded"}}
INFO Making GET request to https://gooddata--full.sandbox.my.salesforce.com/services/data/v53.0/sobjects/DOZISF__ZoomInfo_Scoop__Tag/describe with params: None
INFO Used 7139 of 282600 daily REST API quota
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.11390542984008789, "tags": {"endpoint": "DOZISF__ZoomInfo_Scoop__Tag", "status": "succeeded"}}
INFO Making GET request to https://gooddata--full.sandbox.my.salesforce.com/services/data/v53.0/sobjects/IndividualShare/describe with params: None
INFO Used 7152 of 282600 daily REST API quota
```
Looks like it collects metadata through the REST API for all entities, even though I specify a very small set of entities I want to extract.
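To quantify how much of the startup lag these per-object `describe` calls account for, the debug output can be totalled with a short script. This is just a sketch based on the `INFO METRIC:` lines shown above; the exact log format may vary by tap version:

```python
import json
import re

# Matches the tap's timer metric lines, e.g.:
# INFO METRIC: {"type": "timer", "metric": "http_request_duration", ...}
METRIC_RE = re.compile(r"METRIC: (\{.*\})")

def total_request_time(log_lines):
    """Sum http_request_duration timers and count matching requests."""
    total, count = 0.0, 0
    for line in log_lines:
        m = METRIC_RE.search(line)
        if not m:
            continue
        metric = json.loads(m.group(1))
        if metric.get("metric") == "http_request_duration":
            total += metric["value"]
            count += 1
    return total, count

# Two sample lines in the same shape as the debug output above.
sample = [
    'INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.18, "tags": {"endpoint": "ConnectedApplication", "status": "succeeded"}}',
    'INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.32, "tags": {"endpoint": "UserProvisioningRequestShare", "status": "succeeded"}}',
]
total, count = total_request_time(sample)
print(f"{count} requests, {total:.2f}s of HTTP time")  # prints "2 requests, 0.50s of HTTP time"
```

Piping the full debug log through this makes it easy to see whether the 3-4 minute gap is dominated by discovery HTTP calls.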
I am a moron; I should not use config props I do not understand. It was caused by:
```yaml
select_fields_by_default: true
```
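For reference, the shape that avoids pulling every field is to disable that flag and list the wanted streams/fields explicitly via Meltano's `select` extra. A sketch of the relevant meltano.yml fragment; the `Lead` stream and field names here are illustrative, not taken from my actual config:

```yaml
plugins:
  extractors:
  - name: tap-salesforce
    config:
      select_fields_by_default: false
    select:
    - Lead.Id
    - Lead.Email
    - Lead.SystemModstamp
```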
a
I don't think I have seen any issues. I wonder if `start_date` is applied to `SystemModstamp`, which is what is used as the replication key most of the time.
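If the replication key turns out to be the issue, Meltano's stream-level `metadata` extra can pin it explicitly. A sketch only; the `Lead` stream name and `SystemModstamp` key are assumptions based on the table discussed above, so check the tap's discovered catalog for the real names:

```yaml
plugins:
  extractors:
  - name: tap-salesforce
    metadata:
      Lead:
        replication-method: INCREMENTAL
        replication-key: SystemModstamp
```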
j
So the issue continues. With the local config, which looks correct and contains a custom `select` limiting the list of entities and fields, it still freezes at the beginning. When I turn on debug, a huge number of REST GET requests are issued against the Salesforce instance, just as if the `select` were ignored.
This is how it looks:
Any idea why this is happening?
a
It's not freezing, it's running discovery (it has to do this to generate a catalog).
Seems like it's not caching it, maybe? Any changes in the select spec should invalidate the cached catalog AFAIK, so there's that too.
It might not cache it at all, since its capabilities include `properties` instead of `catalog`.
Not sure though.
j
OK, now I understand it better. It runs discovery over all entities; it does not respect the list of requested entities in the `select:` section. The discovery is very slow, it takes 3-4 minutes. Any time I change the `select:` section, the cache is invalidated and the expensive discovery is executed again. Is the state of the discovery stored only locally, or is it stored in the state backend as well (AWS S3/MinIO in my case)?
I'll let it finish the discovery and then execute it once again, without the local state, only with the state in MinIO.
OK, the second execution is fast; it does not run discovery.
Good for production, but annoying for development, when I change the select config often. Do you think it would make sense to dive into the tap-salesforce source code and try to change the discovery to handle only selected entities? Or is this behavior tap-independent, managed at a higher level?
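One tap-independent workaround during development is to run discovery once, dump the catalog to a file, and pass it back in on later runs so discovery is skipped. This assumes the MeltanoLabs variant accepts a pre-built catalog (Meltano translates `--catalog` to `--properties` for taps that only declare that capability), so treat it as a sketch to verify locally:

```shell
# Run discovery once and save the catalog (slow, ~3-4 minutes here).
meltano invoke --dump=catalog tap-salesforce > catalog.json

# Reuse it on subsequent runs; discovery is skipped while the file is passed.
meltano elt tap-salesforce target-postgres --catalog=catalog.json
```

The trade-off is that the saved catalog goes stale: after changing the `select:` section, the dump step has to be repeated.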