laurent
04/02/2021, 2:38 PM
tap-spreadsheets-anywhere to load a CSV with about 5 million rows (only 2 columns of fixed-length strings). The file is 180 MB, but it takes over an hour to load into target-postgres, which seems impossibly long. My previous code uses pandas and loads the file into a dataframe in a minute or less, so I'm wondering whether the csv.DictReader in the tap is too slow, or if there's something else at play. Everything happens on a single machine with 32 GB of RAM, so it's neither a network issue nor a memory constraint. I'll try to do some testing over the weekend, but if anyone has any tips, happy to take them 🙂
taylor
04/02/2021, 2:45 PM
laurent
04/02/2021, 2:55 PM
laurent
04/02/2021, 2:56 PM
target-postgres | INFO METRIC: {"type": "timer", "metric": "job_duration", "value": 72.2939760684967, "tags": {"job_type": "table", "path":...
taylor
04/02/2021, 3:37 PM
laurent
04/02/2021, 3:39 PM
laurent
04/02/2021, 3:40 PM
laurent
04/02/2021, 3:46 PM
target-postgres. Too many weekend projects 😞
taylor
04/02/2021, 4:01 PM
taylor
04/02/2021, 4:01 PM
laurent
04/02/2021, 4:19 PM
taylor
04/02/2021, 4:21 PM
laurent
04/02/2021, 4:26 PM
taylor
04/02/2021, 4:27 PM
visch
04/02/2021, 5:27 PM
visch
04/02/2021, 5:28 PM
taylor
04/02/2021, 6:37 PM
laurent
04/03/2021, 5:53 PM
lru_cache on a few calls:
• calculating the field name, which is called about num_rows * (num_cols + 5 metadata cols) times; caching works well because there seems to be only a limited number of argument combinations
• formatting timestamps (the same timestamp is formatted once per row; with caching it's more or less one run per batch)
This dropped exec time to about 5s for the target.
Then I replaced a deepcopy (of the default row value) with pickling/unpickling, and that shaved off another 1.5-2 seconds.
Most of the time left now is actually spent in a postgres COPY operation, which seems fair.
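For anyone wanting to try the same two tricks, here is a rough sketch of the idea rather than the actual target-postgres code. The function names (canonical_field_name, format_timestamp, fresh_row) and the shape of DEFAULT_ROW are hypothetical, invented for illustration. It also pickles the default row once up front and only unpickles per row, which is one possible variant of the pickling/unpickling approach and not necessarily the one used.

```python
import pickle
from datetime import datetime, timezone
from functools import lru_cache


# Hypothetical stand-in for a per-field name calculation.
# Arguments must be hashable (hence a tuple path) for lru_cache to work.
@lru_cache(maxsize=None)
def canonical_field_name(path: tuple, max_length: int = 63) -> str:
    return "__".join(path).lower()[:max_length]


# Hypothetical timestamp formatter: within a batch the same value repeats
# for every row, so nearly all calls are served from the cache.
@lru_cache(maxsize=128)
def format_timestamp(epoch_seconds: int) -> str:
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Hypothetical default row; the real structure is an assumption.
DEFAULT_ROW = {"meta_batched_at": None, "meta_sequence": None, "values": {}}

# Serialize once; each row then pays only for pickle.loads,
# which replaces a copy.deepcopy(DEFAULT_ROW) call per row.
_DEFAULT_ROW_PICKLED = pickle.dumps(DEFAULT_ROW, protocol=pickle.HIGHEST_PROTOCOL)


def fresh_row() -> dict:
    """Return an independent copy of DEFAULT_ROW, including nested objects."""
    return pickle.loads(_DEFAULT_ROW_PICKLED)
```

Caching like this is only safe when the function is pure and its arguments are hashable, and the pickle round-trip only works for objects that pickle cleanly. It is worth timing both against the original on real data before keeping either.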
With these changes, my initial file (5 million rows) now loads in about 8 min, vs. 50+ before 🎉 ⏩
taylor
04/03/2021, 7:07 PM
visch
04/03/2021, 7:26 PM
laurent
04/04/2021, 1:19 AM
target-postgres to suggest these changes: https://github.com/datamill-co/target-postgres/pull/204
laurent
04/04/2021, 11:38 PM
taylor
04/05/2021, 1:21 PM
laurent
04/05/2021, 1:47 PM