Larger files will be faster and easier for most systems (both within and outside GCP). An ideal file size is between 128MB and 1GB. The partitions form the natural seams, and if a single partition has fewer records than would fill 128MB, I would (probably) put them all in one file per partition. You will hit real bottlenecks when you scale up the number of files at sizes that small, because each one has to be read with its own frame, API call, etc.; it's not a contiguous set of bytes. Most people who can't control the size of their files have an intermediate step where they concatenate/merge the files into a staging bucket for subsequent processing by Spark or whatever.
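As a rough illustration, that compaction step in PySpark might look something like the sketch below. The bucket paths, dataset size, and target file size are all hypothetical placeholders, not values from this thread:

```python
# Hypothetical compaction step: read many small files and rewrite them
# as a smaller number of larger files (~128MB-1GB each).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read the small files as one DataFrame (paths are placeholders).
df = spark.read.parquet("gs://my-staging-bucket/raw/")

# Rough sizing: estimate total data volume and divide by the target file size.
# Here we assume ~50GB of data and aim for ~512MB per output file.
estimated_bytes = 50 * 1024**3        # assumed dataset size
target_file_bytes = 512 * 1024**2     # ~512MB per output file
num_files = max(1, estimated_bytes // target_file_bytes)

# Repartition so each partition (and thus each output file) lands near the target size.
df.repartition(int(num_files)).write.mode("overwrite").parquet(
    "gs://my-processed-bucket/compacted/"
)
```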
Is 1MB per file good? 10MB? 100MB?
It might help to know more details of your use case, but in general I'd say definitely aim for files >10MB. For reference, Snowflake uses internal "micro-partition" files of around 16MB compressed, which is roughly 50-500MB uncompressed: https://stackoverflow.com/a/67945808 You can certainly use larger files, but anything smaller and the per-file processing overhead starts to compete with the cost of reading the data itself.
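To make that overhead argument concrete, here's a back-of-the-envelope model. The per-file latency and throughput figures are purely illustrative assumptions, not measured numbers:

```python
# Rough model: total read time ~= (per-file overhead) * (number of files)
#                               + (total bytes) / (throughput).
# At small file sizes the per-file term dominates.
PER_FILE_OVERHEAD_S = 0.05       # assumed ~50ms per object open / API call
THROUGHPUT_BPS = 100 * 1024**2   # assumed ~100MB/s sustained read throughput
TOTAL_BYTES = 100 * 1024**3      # assumed 100GB dataset

for file_mb in (1, 10, 100, 1024):
    n_files = TOTAL_BYTES // (file_mb * 1024**2)
    total_s = n_files * PER_FILE_OVERHEAD_S + TOTAL_BYTES / THROUGHPUT_BPS
    print(f"{file_mb:>5} MB files -> {n_files:>7} files, ~{total_s / 60:.1f} min")
```

Under these (assumed) numbers, the fixed per-file cost swamps the actual read time at 1MB files and becomes negligible somewhere past 100MB, which is the intuition behind the >10MB guidance.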