# getting-started
l
Hi all, is there any way that I could extract chunks of data spanning 5-10 days on each run? That would be very nice
t
If the tap supports incremental loading like that, then absolutely! Which tap are you working with?
l
I am looking at MySQL/Postgres/Mongo
Do the existing taps support it? And if so, is there any example of how to configure it (in meltano.yml, right)? Thanks a lot
t
I misread what you asked originally. Most loaders should support that kind of incremental loading, but it's a bigger question whether the tap supports incremental extraction
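For reference, a rough sketch of what key-based incremental extraction can look like in meltano.yml, assuming the transferwise variant of tap-postgres; the stream name `public-orders` and the `updated_at` column are just placeholders for whatever table and bookmark column you actually have:

```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: transferwise
    config:
      host: db.example.com       # placeholder connection settings
      dbname: mydb
      user: melty
    metadata:
      # Stream IDs are <schema>-<table> for this variant
      public-orders:
        replication-method: INCREMENTAL
        replication-key: updated_at
```

The replication key should be a column that only ever increases (a timestamp or serial ID); on each run the tap only selects rows beyond the last bookmarked value, so each run pulls a smaller chunk instead of the whole table.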
l
oh, thank you!
e
@ly_pham, this comes with no warranty, but I modified tap-postgres to do a TIME_BASED sync. There is a description of how to use it in the README.md: https://github.com/ers81239/pipelinewise-tap-postgres/tree/time_based_sync
k
I'm having trouble running an initial extract of a large table, because the Heroku PostgreSQL instance runs out of temporary disk space. Presumably this would allow me to run it in smaller chunks and avoid that problem?
Nope, it still seems to be running out of disk space.
e
Is Heroku Postgres your target database?
k
Nope, it's the tap. The target is Snowflake.
e
Gotcha... that is tough. I built the above time_based_sync to deal with a similar problem. It might work for you as well
k
Heroku PostgreSQL has a ridiculously small amount of temp file space; it runs out with only 30,000 rows! To do my initial import, I've decided to run on my laptop with a downloaded copy of the production database.
It feels like your branch should have worked for me, although the number of rows per time period grows a lot over the 6 years of data.
I'd still have to run the pipeline about 1000 times to get all 100 million rows of data in anyway!
We used to use Stitch to pull this table successfully. I'm wondering how they got around it 🙂
I've made a similar branch that limits the number of rows selected rather than restricting to a time period. It works better for my use case, and I can set it to one value for Heroku and another for my laptop, where 80 GB of spare space still isn't enough.
For the benefit of anyone who finds this thread: another, probably easier, way to reduce the disk space used is to set the `batch_size_rows` configuration on your target. I discovered this when investigating memory usage rather than disk, but it has the same effect.
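A rough sketch of where that setting goes in meltano.yml, assuming the transferwise (pipelinewise) variant of target-snowflake; the exact value is just an illustration to tune for your environment:

```yaml
plugins:
  loaders:
  - name: target-snowflake
    variant: transferwise
    config:
      # Flush to Snowflake every 10,000 rows rather than the much larger
      # default batch, so less data is buffered in memory/temp files at once.
      batch_size_rows: 10000
```

Smaller batches mean more frequent loads, so there's a throughput trade-off, but they keep the per-flush footprint down on constrained machines.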