# getting-started
l
Hi all, is there any way that I could extract chunks of data spanning 5-10 days on each run? That would be very nice
t
If the tap supports incremental loading like that, then absolutely! Which tap are you working with?
l
I am looking at MySQL/Postgres/Mongo
Do the existing taps support it? And if so, is there any example of how to configure it (in meltano.yml, right)? Thanks a lot
t
I misread what you asked originally. Most loaders should support that kind of incremental loading, but it's a bigger question whether the tap supports incremental extraction
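For reference, a rough sketch of what key-based incremental extraction can look like in meltano.yml, assuming the transferwise variant of tap-postgres; the stream name `public-orders` and the `updated_at` column are just placeholders for whatever table and bookmark column you actually have:

```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: transferwise
    config:
      host: db.example.com       # placeholder connection settings
      dbname: mydb
      user: melty
    metadata:
      # Stream IDs are <schema>-<table> for this variant
      public-orders:
        replication-method: INCREMENTAL
        replication-key: updated_at
```

The replication key should be a column that only ever increases (a timestamp or serial ID); on each run the tap only selects rows beyond the last bookmarked value, so each run pulls a smaller chunk instead of the whole table.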
l
oh, thank you!
e
@ly_pham, this comes with no warranty, but I modified tap-postgres to do a TIME_BASED sync. There is a description of how to use it in the README.md: https://github.com/ers81239/pipelinewise-tap-postgres/tree/time_based_sync
k
I'm having trouble running an initial extract of a large table, because the Heroku PostgreSQL instance runs out of temporary disk space. Presumably this would allow me to run it in smaller chunks and avoid that problem?
Nope, it still seems to be running out of disk space.
e
Is Heroku Postgres your target database?
k
Nope, it's the tap. The target is Snowflake.
e
Gotcha... that is tough. I built the above time_based_sync to deal with a similar problem. It might work for you as well
k
Heroku PostgreSQL has a ridiculously small amount of temp file space; it runs out with only 30,000 rows! To do my initial import, I've decided to run on my laptop with a downloaded copy of the production database.
It feels like your branch should have worked for me, although the number of rows per time period grows a lot over the 6 years of data.
I'd still have to run the pipeline about 1000 times to get all 100 million rows of data in anyway!
We used to use Stitch to pull this table successfully. I'm wondering how they got around it 🙂
I've made a similar branch that limits the number of rows selected rather than restricting to a time period. It works better for my use case, and I can set it to one value for Heroku and another for my laptop, where 80 GB of spare space still isn't enough.
For the benefit of anyone who finds this thread: another, probably easier, way to reduce the disk space used is to set the `batch_size_rows` configuration on your target. I discovered this when investigating memory usage rather than disk, but it has the same effect.
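A rough sketch of where that setting goes in meltano.yml, assuming the transferwise (pipelinewise) variant of target-snowflake; the exact value is just an illustration to tune for your environment:

```yaml
plugins:
  loaders:
  - name: target-snowflake
    variant: transferwise
    config:
      # Flush to Snowflake every 10,000 rows rather than the much larger
      # default batch, so less data is buffered in memory/temp files at once.
      batch_size_rows: 10000
```

Smaller batches mean more frequent loads, so there's a throughput trade-off, but they keep the per-flush footprint down on constrained machines.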