# troubleshooting
j
Does anyone have a clue why such a long delay happens between writing the lock to MinIO S3 and printing the first metadata info messages related to tap-salesforce?
```
2023-03-03T13:56:04.857208Z [info     ] smart_open.s3.MultipartWriter('meltano', 'meltano_state/dev_local:tap-salesforce-to-target-postgres-sfdc/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
2023-03-03T13:59:45.719671Z [info     ] INFO Starting sync             cmd_type=elb consumer=False name=tap-salesforce producer=True stdio=stderr string_id=tap-salesforce
```
Hm, with --full-refresh the S3 state is not considered, but it is still lagging.
Minutes and minutes... strange.
But then it finishes successfully.
Could the fact that it is running on a custom domain play a role? Our endpoint is https://gooddata--full.sandbox.my.salesforce.com. And yes, I set is_sandbox: true in meltano.yml.
The delay disappears if I move the plugin config from the particular dev_local environment to the top-level plugin config.
```yaml
environments:
  - name: dev_local
#    config:
#      plugins:
#        extractors:
#          - name: tap-salesforce
#            config:
#              api_type: "BULK"
#              select_fields_by_default: true
#              start_date: "2023-01-01T00:00:00Z"
#              username: integration.internal@gooddata.com.full
#              is_sandbox: true
.....
plugins:
  extractors:
  - name: tap-salesforce
    variant: meltanolabs
    pip_url: git+https://github.com/meltanolabs/tap-salesforce.git
    config:
      api_type: "BULK"
      select_fields_by_default: true
      start_date: "2023-01-01T00:00:00Z"
      username: integration.internal@gooddata.com.full
      is_sandbox: true
```
Also, `start_date` is not respected; I can see rows much older than `start_date`, e.g. in the `lead` table.
@alexander_butler have you observed such behavior? I mean, is it expected that `start_date` is not applied to the `lead` table? The lag at the beginning is really strange. Is my config in the environments section correct?
I ran meltano in debug mode.
At the beginning, it makes a lot of calls like this:
```
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.1849195957183838, "tags": {"endpoint": "ConnectedApplication", "status": "succeeded"}}
INFO Making GET request to https://gooddata--full.sandbox.my.salesforce.com/services/data/v53.0/sobjects/UserProvisioningRequestShare/describe with params: None
INFO Used 7149 of 282600 daily REST API quota
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.3235626220703125, "tags": {"endpoint": "UserProvisioningRequestShare", "status": "succeeded"}}
INFO Making GET request to https://gooddata--full.sandbox.my.salesforce.com/services/data/v53.0/sobjects/DOZISF__ZoomInfo_Scoop__Tag/describe with params: None
INFO Used 7139 of 282600 daily REST API quota
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.11390542984008789, "tags": {"endpoint": "DOZISF__ZoomInfo_Scoop__Tag", "status": "succeeded"}}
INFO Making GET request to https://gooddata--full.sandbox.my.salesforce.com/services/data/v53.0/sobjects/IndividualShare/describe with params: None
INFO Used 7152 of 282600 daily REST API quota
```
Looks like it collects metadata through the REST API for all entities, even though I specify a very small set of entities I want to extract.
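To quantify how much of the startup lag these per-object `describe` calls account for, the debug output can be totalled with a short script. This is just a sketch based on the `INFO METRIC:` lines shown above; the exact log format may vary by tap version:

```python
import json
import re

# Matches the tap's timer metric lines, e.g.:
# INFO METRIC: {"type": "timer", "metric": "http_request_duration", ...}
METRIC_RE = re.compile(r"METRIC: (\{.*\})")

def total_request_time(log_lines):
    """Sum http_request_duration timers and count matching requests."""
    total, count = 0.0, 0
    for line in log_lines:
        m = METRIC_RE.search(line)
        if not m:
            continue
        metric = json.loads(m.group(1))
        if metric.get("metric") == "http_request_duration":
            total += metric["value"]
            count += 1
    return total, count

# Two sample lines in the same shape as the debug output above.
sample = [
    'INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.18, "tags": {"endpoint": "ConnectedApplication", "status": "succeeded"}}',
    'INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.32, "tags": {"endpoint": "UserProvisioningRequestShare", "status": "succeeded"}}',
]
total, count = total_request_time(sample)
print(f"{count} requests, {total:.2f}s of HTTP time")  # prints "2 requests, 0.50s of HTTP time"
```

Piping the full debug log through this makes it easy to see whether the 3-4 minute gap is dominated by discovery HTTP calls.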
I am a moron; I should not use config props I do not understand. It was caused by:
```yaml
select_fields_by_default: true
```
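For reference, the shape that avoids pulling every field is to disable that flag and list the wanted streams/fields explicitly via Meltano's `select` extra. A sketch of the relevant meltano.yml fragment; the `Lead` stream and field names here are illustrative, not taken from my actual config:

```yaml
plugins:
  extractors:
  - name: tap-salesforce
    config:
      select_fields_by_default: false
    select:
    - Lead.Id
    - Lead.Email
    - Lead.SystemModstamp
```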
a
I don't think I have seen any issues. I wonder if `start_date` is applied to `SystemModstamp`, which is what is used as the replication key most of the time.
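If the replication key turns out to be the issue, Meltano's stream-level `metadata` extra can pin it explicitly. A sketch only; the `Lead` stream name and `SystemModstamp` key are assumptions based on the table discussed above, so check the tap's discovered catalog for the real names:

```yaml
plugins:
  extractors:
  - name: tap-salesforce
    metadata:
      Lead:
        replication-method: INCREMENTAL
        replication-key: SystemModstamp
```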
j
So the issue continues. With the local config, which looks correct and contains a custom `select` limiting the list of entities and fields, it still freezes at the beginning. When I turn on debug, a huge number of REST GET requests are issued against the Salesforce instance, just as if the `select` were ignored.
This is how it looks:
Any idea why this is happening?
a
It's not freezing, it's running discovery (it has to do this to generate a catalog).
Seems like it's not caching it, maybe? Any changes in the select spec should invalidate the cached catalog AFAIK, so there's that too.
It might not cache it at all, since its capabilities include `properties` instead of `catalog`.
Not sure though.
j
OK, now I understand it better. It runs discovery over all entities; it does not respect the list of requested entities in the `select:` section. The discovery is very slow, it takes 3-4 minutes. Any time I change the `select:` section, the cache is invalidated and the expensive discovery is executed again. Is the state of the discovery stored only locally, or is it stored in the state backend as well (AWS S3/MinIO in my case)?
I'll let it finish the discovery and then execute it once again, without the local state, only with the state in MinIO.
OK, the second execution is fast; it does not run discovery.
Good for production, but annoying for development, when I change the select config often. Do you think it would make sense to dive into the tap-salesforce source code and try to change the discovery to handle only selected entities? Or is this behavior tap-independent, managed at a higher level?
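One tap-independent workaround during development is to run discovery once, dump the catalog to a file, and pass it back in on later runs so discovery is skipped. This assumes the MeltanoLabs variant accepts a pre-built catalog (Meltano translates `--catalog` to `--properties` for taps that only declare that capability), so treat it as a sketch to verify locally:

```shell
# Run discovery once and save the catalog (slow, ~3-4 minutes here).
meltano invoke --dump=catalog tap-salesforce > catalog.json

# Reuse it on subsequent runs; discovery is skipped while the file is passed.
meltano elt tap-salesforce target-postgres --catalog=catalog.json
```

The trade-off is that the saved catalog goes stale: after changing the `select:` section, the dump step has to be repeated.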