# plugins-general
e
Hey all, I’m experimenting with Meltano and have hit a snag with the
pipelinewise-tap-s3-csv
extractor. For my PoC I’m loading CSVs into Postgres. When using local CSVs with the
tap-csv
extractor, the data is loaded into my Postgres target, but when using the
tap-s3-csv
extractor along with the
--log-level=debug
option I can see that the CSVs on S3 are found, but they aren’t loaded into the postgres target. The command that works is
meltano --log-level=debug elt tap-csv target-postgres
while
meltano --log-level=debug elt tap-s3-csv target-postgres
doesn’t. If anyone is aware of what I may be missing, the help would be greatly appreciated 🙂. Part of my
meltano.yml
is in the thread.
plugins:
  extractors:
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      files:
      - entity: competitors
        path: ../tmp/competitors.csv
        keys:
        - ID
  - name: tap-s3-csv
    namespace: musicleague_s3
    pip_url: git+https://github.com/transferwise/pipelinewise-tap-s3-csv
    executable: tap-s3-csv
    capabilities:
    - catalog
    - discover
    - state
    config:
      bucket: bucket_name
      start_date: "2021-12-13"
      tables:
        - table_name: competitors
          search_prefix: musicleague
          search_pattern: "competitors\\.csv"
          key_properties: ["ID"]
          delimiter: ","
  loaders:
  - name: target-postgres
    variant: transferwise
    pip_url: pipelinewise-target-postgres
  transformers:
  - name: dbt
    pip_url: dbt==0.21.1
  files:
  - name: dbt
    pip_url: git+https://gitlab.com/meltano/files-dbt.git@config-version-2
    update:
      transform/profile/profiles.yml: false
v
I can see that the CSVs on S3 are found, but they aren't loaded into the postgres target
Can you show how you "see" that csvs are found?
tables:
        - table_name: competitors
          search_prefix: musicleague
          search_pattern: "competitors\\.csv"
          key_properties: ["ID"]
          delimiter: ","
is almost certainly going to be the issue. It's just a matter of finding which part is off. What jumps out to me is
competitors\\.csv
I'd try just using
competitors.csv
or
.csv
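For anyone comparing these patterns, here is a minimal Python sketch of how each one behaves as a regular expression. It assumes the tap compiles search_pattern as a regex and searches each S3 key with it, which this thread does not confirm. The key names other than musicleague/competitors.csv are hypothetical.

```python
import re

# In double-quoted YAML, "competitors\\.csv" parses to the regex competitors\.csv
escaped_pattern = r"competitors\.csv"
unescaped_pattern = "competitors.csv"

# Hypothetical keys; only the first one appears in the logs in this thread.
keys = [
    "musicleague/competitors.csv",
    "musicleague/competitorsXcsv",
    "musicleague/rounds.csv",
]


def matching_keys(pattern, candidates):
    """Return the keys that contain a match for the regex pattern."""
    matcher = re.compile(pattern)
    return [key for key in candidates if matcher.search(key)]


escaped = matching_keys(escaped_pattern, keys)
unescaped = matching_keys(unescaped_pattern, keys)

print(escaped)    # ['musicleague/competitors.csv']
print(unescaped)  # ['musicleague/competitors.csv', 'musicleague/competitorsXcsv']
```

Under that assumption both forms match the real file, so the escaped dot is stricter but not wrong; the unescaped dot simply matches any character.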
e
I had tried the
competitors.csv
first, with the same results. I can see the files were found because the logs contain
2021-12-14T14:35:18.384766Z [info     ] time=2021-12-14 08:35:18 name=tap_s3_csv level=INFO message=Checking bucket "<bucket>" for keys matching "competitors\.csv" name=tap-s3-csv stdio=stderr type=discovery
2021-12-14T14:35:18.384848Z [info     ] time=2021-12-14 08:35:18 name=tap_s3_csv level=INFO message=Skipping files which have a LastModified value older than 2021-12-13 00:00:00+00:00 name=tap-s3-csv stdio=stderr type=discovery
2021-12-14T14:35:18.634940Z [info     ] time=2021-12-14 08:35:18 name=tap_s3_csv level=INFO message=Found 4 files. name=tap-s3-csv stdio=stderr type=discovery
2021-12-14T14:35:18.636623Z [info     ] time=2021-12-14 08:35:18 name=tap_s3_csv level=INFO message=Will download key "musicleague/competitors.csv" as it was last modified 2021-12-13 19:15:50+00:00 name=tap-s3-csv stdio=stderr type=discovery
2021-12-14T14:35:18.637040Z [info     ] time=2021-12-14 08:35:18 name=tap_s3_csv level=INFO message=Sampling musicleague/competitors.csv (max records: 1000, sample rate: 5) name=tap-s3-csv stdio=stderr type=discovery
2021-12-14T14:35:18.986515Z [info     ] time=2021-12-14 08:35:18 name=tap_s3_csv level=INFO message=Sampled 7 rows from musicleague/competitors.csv name=tap-s3-csv stdio=stderr type=discovery
2021-12-14T14:35:19.000312Z [info     ] time=2021-12-14 08:35:18 name=tap_s3_csv level=INFO message=Finished discover name=tap-s3-csv stdio=stderr type=discovery
2021-12-14T14:35:19.075018Z [info     ]                                name=tap-s3-csv stdio=stderr type=discovery
The last log lines when using the
tap-s3-csv
extractor are
2021-12-14T14:35:19.421533Z [info     ] time=2021-12-14 08:35:19 name=botocore.credentials level=INFO message=Found credentials in environment variables. cmd_type=extractor job_id=2021-12-14T143516--tap-s3-csv--target-postgres name=tap-s3-csv run_id=750da636-c1db-4592-b615-04bb2e09cd45 stdio=stderr
2021-12-14T14:35:19.853839Z [info     ] time=2021-12-14 08:35:19 name=tap_s3_csv level=WARNING message=I have direct access to the bucket without assuming the configured role. cmd_type=extractor job_id=2021-12-14T143516--tap-s3-csv--target-postgres name=tap-s3-csv run_id=750da636-c1db-4592-b615-04bb2e09cd45 stdio=stderr
2021-12-14T14:35:19.921320Z [debug    ] Deleted configuration at /Users/eric/dev/learning/musicleague-dbt/meltano/.meltano/run/elt/2021-12-14T143516--tap-s3-csv--target-postgres/750da636-c1db-4592-b615-04bb2e09cd45/target.b6758593-d498-4680-8855-37fcea3e1f49.config.json
2021-12-14T14:35:19.921720Z [debug    ] Deleted configuration at /Users/eric/dev/learning/musicleague-dbt/meltano/.meltano/run/elt/2021-12-14T143516--tap-s3-csv--target-postgres/750da636-c1db-4592-b615-04bb2e09cd45/tap.83533fcf-6206-435a-81b0-a054d676d220.config.json
2021-12-14T14:35:19.921836Z [info     ] Extract & load complete!       job_id=2021-12-14T143516--tap-s3-csv--target-postgres name=meltano run_id=750da636-c1db-4592-b615-04bb2e09cd45
2021-12-14T14:35:19.921965Z [info     ] Transformation skipped.        job_id=2021-12-14T143516--tap-s3-csv--target-postgres name=meltano run_id=750da636-c1db-4592-b615-04bb2e09cd45
When I run the command using the
tap-csv
extractor, the logs contain all of the inserts into Postgres.
v
can you try
meltano select --all tap-s3-csv
A missing select could be it
e
that returns nothing
so is that something that should go under
capabilities
?
v
When you ran
meltano select --all tap-s3-csv
it added a select: '*.*' entry to your meltano.yml
Try running it again / try
meltano select --list tap-s3-csv
e
running
meltano select --list tap-s3-csv
duplicated the
select:
  - '*.*'
meltano select --list tap-s3-csv
now returns:
meltano select --list tap-s3-csv
Legend:
        selected
        excluded
        automatic

Enabled patterns:
        *.*

Selected attributes:
        [automatic] competitors.ID
        [selected ] competitors.Name
        [selected ] competitors._sdc_extra
        [selected ] competitors._sdc_source_bucket
        [selected ] competitors._sdc_source_file
        [selected ] competitors._sdc_source_lineno
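As a rough illustration of why the single '*.*' pattern covers every attribute listed above, here is a small Python sketch using shell-style glob matching. It assumes select patterns are matched as entity.attribute globs; the helper name is hypothetical and this is not Meltano's actual implementation.

```python
from fnmatch import fnmatchcase

# Attribute names copied from the `meltano select --list` output above.
attributes = [
    "competitors.ID",
    "competitors.Name",
    "competitors._sdc_source_file",
]


def is_selected(attribute, patterns):
    """Return True if any glob pattern matches the entity.attribute name."""
    return any(fnmatchcase(attribute, pattern) for pattern in patterns)


# '*.*' matches any entity and any attribute.
select_all = [attr for attr in attributes if is_selected(attr, ["*.*"])]

# A narrower pattern (hypothetical) selects a single attribute.
select_id = [attr for attr in attributes if is_selected(attr, ["competitors.ID"])]

print(select_all)
print(select_id)
```

If the glob assumption holds, the selection rules are already as permissive as they can be, so a missing select would not explain records failing to arrive.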
v
So now when you run elt do you get the results you'd expect?
e
nope 😞 still doesn’t get loaded into the postgres db
e
@eric_goddard can you try invoking the tap without a target to see if any record or schema messages are output?
meltano invoke tap-s3-csv
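A quick way to check that output is to count the Singer message types the tap writes to stdout. The sketch below assumes the output has been captured as a list of lines; Singer messages are JSON objects with a "type" field (SCHEMA, RECORD, STATE), and anything that is not JSON is treated as a log line and skipped. The function name and the sample record values are hypothetical.

```python
import json
from collections import Counter


def count_singer_messages(lines):
    """Count Singer messages by type, ignoring lines that are not JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain log output, not a Singer message
        if isinstance(message, dict) and "type" in message:
            counts[message["type"]] += 1
    return counts


# Hypothetical captured output: one log line plus three Singer messages.
sample_output = [
    "time=2021-12-14 11:13:40 name=botocore.credentials level=INFO "
    "message=Found credentials in environment variables.",
    '{"type": "SCHEMA", "stream": "competitors", "schema": {"properties": {}}, "key_properties": ["ID"]}',
    '{"type": "RECORD", "stream": "competitors", "record": {"ID": "1", "Name": "example"}}',
    '{"type": "STATE", "value": {}}',
]

counts = count_singer_messages(sample_output)
print(dict(counts))
```

An empty result means the tap emitted no SCHEMA or RECORD messages at all, in which case the target has nothing to load. The same function could be fed sys.stdin to read piped output directly.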
e
❯ meltano invoke tap-s3-csv 
time=2021-12-14 11:13:40 name=botocore.credentials level=INFO message=Found credentials in environment variables.
time=2021-12-14 11:13:40 name=tap_s3_csv level=WARNING message=I have direct access to the bucket without assuming the configured role.
meltano invoke --dump=catalog tap-s3-csv
outputs info about the stream and metadata
@boggdan_barrientos shared his fork with me: https://github.com/boggdan95/tap-s3-csv and it's working like a charm. My config looks like
plugins:
  extractors:
  - name: tap-s3-csv
    namespace: tap_s3_csv
    variant: fishtown-analytics
    pip_url: git+https://github.com/boggdan95/tap-s3-csv.git
    executable: tap-s3-csv
    capabilities:
    - state
    settings:
    - name: aws_access_key_id
      kind: string
    - name: aws_secret_access_key
      kind: password
    - name: start_date
      kind: string
    - name: bucket
      kind: string
    - name: tables
      kind: object
    config:
      aws_access_key_id: $AWS_ACCESS_KEY_ID
      aws_secret_access_key: $AWS_SECRET_ACCESS_KEY
      bucket: bucket_name
      start_date: '2021-12-13 00:00:00'
      tables:
      - name: competitors
        pattern: musicleague/competitors.csv
        key_properties:
        - ID
        search_prefix: musicleague
        format: csv
        delimiter: ','
Putting this here so that hopefully it can help someone else searching for
tap-s3-csv
. Thanks everyone!