# random
f
Hello there! I’m wondering if a large CSV file can be extracted in chunks and what the best way to do that is. I tried with `tap-spreadsheets-anywhere` and it works perfectly up to about 100k rows, but after that it froze my PC (I have a MacBook Pro M2). I’m quite new here, so any suggestions are welcome!
e
Hi @Facundo Miño! What loader are you using? You may be able to configure it to have a smaller batch size so records are persisted earlier instead of letting them fill up memory.
f
Hello @Edgar Ramírez (Arch.dev)! Thanks for your reply. I’m using the `target-postgres` loader and `dbt-postgres` to transform the data, as well as `tap-spreadsheets-anywhere` to extract the data from an SFTP server.
c
if you can download the file locally first, you can use the CLI program `split` to chunk the file up into a bunch of smaller CSVs and then process those individually
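A minimal sketch of that approach, assuming a local file named `large.csv` with a single header row (the filename and the 100k-line chunk size are illustrative):

```bash
# Set the header aside so each chunk can get its own copy.
head -n 1 large.csv > header.csv

# Split the remaining rows into 100k-line chunks: chunk_aa, chunk_ab, ...
tail -n +2 large.csv | split -l 100000 - chunk_

# Prepend the header to every chunk so each file is a valid CSV on its own.
for f in chunk_*; do
  cat header.csv "$f" > "$f.csv" && rm "$f"
done
```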
f
Hey @Charles Feduke! Is there a tap that can do that? Or is there some docs reference I can read? I’ve been trying to find something like that, but without results.
c
I haven’t run across anything that does that in the Meltano world myself, unfortunately. It’s the kind of thing we’d do with big datasets back when Hadoop was popular.
a
@Facundo Miño spreadsheets-anywhere uses smart_open, which should stream quite efficiently: https://pypi.org/project/smart-open/ Could you limit the number of rows you keep in memory before loading into Postgres with `batch_size_rows` on the target config? https://hub.meltano.com/loaders/target-postgres#batch_size_rows-setting How wide is your CSV? Are there some large columns in there? Edit: just noticed you mention SFTP; this might not be as efficient as S3 or GCS with their streaming options.
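For anyone following along, setting that option from the Meltano CLI looks roughly like this (the value 10000 is an illustrative starting point, not a recommendation from the thread):

```bash
# Flush records to Postgres every 10k rows instead of buffering larger batches in memory.
meltano config target-postgres set batch_size_rows 10000
```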
a
@Edgar Ramírez (Arch.dev) Are there any target-postgres variants that support BATCH messages? That should give amazing throughput, with a little enhancement to tap-spreadsheets-anywhere to batch the files as well!?
e
💪 1
a
Need to write up a little article and PR it back to MeltanoLabs, but @Reuben (Matatika) recently added batch message support to tap-bigquery. Throughput went from 500k rows/hour into Snowflake to 1M rows/minute. 🔥
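As a rough sketch of what enabling that looks like for an SDK-based tap, batch behaviour is driven by the plugin’s `batch_config` setting; the jsonl/gzip encoding and local `file://` storage root below are assumptions for illustration, not details from the thread:

```bash
# Emit BATCH messages as gzipped JSONL files staged in a local directory.
# (Nested settings can be addressed with dot notation in `meltano config set`.)
meltano config tap-bigquery set batch_config.encoding.format jsonl
meltano config tap-bigquery set batch_config.encoding.compression gzip
meltano config tap-bigquery set batch_config.storage.root "file:///tmp/meltano-batches"
```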
wow 1
🔥 1
🫡 1
e
That is incredible!
😁 1