#announcements

brainy-appointment-14137

01/18/2021, 6:24 AM
Has anyone seen loading from CSV to Postgres end up with fewer rows than the source? I used two CSV files as sources:
```yaml
extractors:
  - name: tap-csv
    variant: meltano
    pip_url: git+https://gitlab.com/meltano/tap-csv.git
    config:
      files:
        - entity: titanic
          file: train.csv
          keys: ["survived", "sex", "age", "n_siblings_spouses", "parch", "fare", "class", "deck", "embark_town", "alone"]
        - entity: abalone
          file: abalone_train.csv
          keys: ["Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Age"]

loaders:
  - name: target-postgres
    variant: meltano
    pip_url: git+https://github.com/meltano/target-postgres.git
```
These are the links to download them: titanic and abalone. I set up a local Postgres server. The abalone file synced correctly to Postgres (3320 rows), but the titanic table in Postgres had only 558 rows, while the CSV file has 628 🤔 This is the command I ran:
```shell
meltano elt tap-csv target-postgres --transform=skip --job_id=csv-postgres
```
Am I doing something wrong?
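(A quick way to sanity-check the expected record count on the source side is to count the CSV's data rows, excluding the header. This is a sketch, not part of the original thread; the file name is taken from the config above.)

```python
# Sketch: count data rows in a CSV (header excluded), to compare against
# the row count that ends up in the Postgres table.
import csv

def data_row_count(path: str) -> int:
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

# e.g. data_row_count("train.csv") for the titanic source
```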

ripe-musician-59933

01/18/2021, 5:06 PM
@brainy-appointment-14137 That's weird... Before we start debugging the tap and target themselves, are you on the latest version of Meltano? (1.65.0)
When I run `meltano invoke tap-csv` locally with your `train.csv`, I get 627 records as expected (since the first line in the CSV just has headers), so the issue isn't in tap-csv.
That leaves Meltano itself (not unlikely if you're on an older version) or target-postgres.
Did the titanic table in Postgres end up with the first 558 records, or some random selection?

brainy-appointment-14137

01/18/2021, 8:13 PM
@ripe-musician-59933 I tried tap-csv to target-csv and it worked fine, with the same number of rows, so I doubt it's something with the CSV side. And yes, I am on the latest version of Meltano. I'm not sure whether they were the first 558 records or a random selection; how do I validate that, since they are not in the same order in the Postgres target? This is my project, if you need to know something from my configs.

ripe-musician-59933

01/18/2021, 8:45 PM
@brainy-appointment-14137 In the project I see that you're now using the `transferwise` variant of target-postgres, while you were using the `meltano` variant in the snippet you posted before. Are you seeing this same incomplete sync issue with both variants?
That'll help us determine if it's a bug in the `meltano` variant specifically, or if it's instead Postgres-specific.
It may have something to do with the fact that Postgres will use the configured `keys` as the compound primary key, meaning that they are treated as together uniquely identifying the record. If multiple records in your CSV have the exact same set of values, they would only be stored in Postgres once, because the second record would be interpreted as a duplicate of the first record.
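(The dedup-on-primary-key behaviour described here can be reproduced with any SQL database. Below is a minimal sketch, not from the original thread, using SQLite as a stand-in for Postgres, with every column in the compound primary key and a reduced, hypothetical column set.)

```python
# Sketch: when all columns form a compound primary key, inserting an exact
# duplicate row conflicts with the first copy and is not stored again.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titanic (sex TEXT, age REAL, fare REAL, "
    "PRIMARY KEY (sex, age, fare))"
)

records = [
    ("male", 22.0, 7.25),
    ("female", 38.0, 71.28),
    ("male", 22.0, 7.25),  # exact duplicate of the first record
]
for rec in records:
    # The duplicate is silently dropped instead of creating a new row
    conn.execute("INSERT OR IGNORE INTO titanic VALUES (?, ?, ?)", rec)

count = conn.execute("SELECT COUNT(*) FROM titanic").fetchone()[0]
print(count)  # 2 rows stored, not 3
```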
And indeed, the tap outputs a few identical records that would only show up in the DB once:
```shell
$ meltano invoke tap-csv > train.jsonl
$ cat train.jsonl | grep RECORD | wc -l
627
$ cat train.jsonl | grep RECORD | uniq | wc -l
624
```
That doesn't explain why another 66 rows were also missing, but it's an issue with the current approach either way
That issue should go away if you don't specify primary keys at all: `keys: []`
You'll probably need to drop the table in Postgres before running the pipeline again to make sure the index gets dropped, though

brainy-appointment-14137

01/18/2021, 9:36 PM
@ripe-musician-59933 This explains everything! I just checked train.csv and found 69 duplicates: 627 - 69 = 558. So my problem was misinterpreting what the keys meant in tap-csv. Thank you for your help 🙏 😄
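(The duplicate count can be confirmed directly with a full scan of the file, since `uniq` only collapses adjacent lines. A sketch, not from the original thread; the file name is assumed from the config above.)

```python
# Sketch: count how many CSV rows are exact duplicates of an earlier row,
# i.e. how many records a compound primary key over all columns would drop.
import csv
from collections import Counter

def duplicate_count(path: str) -> int:
    with open(path, newline="") as f:
        rows = [tuple(r) for r in csv.reader(f)][1:]  # drop the header
    counts = Counter(rows)
    return sum(n - 1 for n in counts.values())

# For this train.csv it should report 69, matching 627 - 69 = 558 unique rows
```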

ripe-musician-59933

01/18/2021, 9:46 PM
@brainy-appointment-14137 Great, I'm happy I could help!