# best-practices
Ruben Vereecken
Conceptual question: say we’re fetching analytics from one API, where each row contains a location. We’d like to look up each location on Google Maps, and store the Google Maps place id. What’s the best way to set this up in Meltano? Using mappers (https://hub.meltano.com/mappers/)? Or a separate pipeline that goes over all imported records and checks against Google Maps for each new entry?
Sven Balnojan
Hey @Ruben Vereecken, so you could do that using a mapper, BUT the best practice would be to first dump all data into your raw area and then batch-enrich it.
• Step 1: Ingest all your analytics API data.
• Step 2: Ingest your place id data (raw).
• Step 3: Join the two together.
Why? Because it's much faster and safer for your data, and the pipeline is simpler and easier to debug. So it's not a separate pipeline we recommend, but rather a separate ingestion.
Ruben Vereecken
@Sven Balnojan really appreciate the input! I guess that would require some way to generate a list of queries (= locations) for step 2. Conceptually (having just started with Meltano), I can think of two ways:
1. Generate a list of queries from step 1 that is passed as input to step 2.
2. Do the same, but let step 2 compute this list of queries itself (higher coupling).
Which of these two is the Meltano way 🙂?
Sven Balnojan
Hmm, to me, (2) sounds like lower coupling. So here's what I'd do, though I'm curious what e.g. @visch would do:
1. Pull in your analytics data as raw data (with your location as a column).
2. Have a separate location => place id mapping table.
3. Join 1 & 2, pick the ones that are empty in the mapping table, and fill those in.
And run this in one pipeline.
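As a rough sketch, step 3 could look like this in Python against a Postgres warehouse (the tables raw_analytics and location_place_ids and their columns are hypothetical names for illustration, not from an actual setup):

```python
import psycopg2  # assumes the raw data landed in Postgres

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    # Pick the locations from the raw analytics data that have no place id yet.
    cur.execute(
        """
        SELECT DISTINCT a.location
        FROM raw_analytics AS a
        LEFT JOIN location_place_ids AS m ON m.location = a.location
        WHERE m.place_id IS NULL
        """
    )
    missing = [row[0] for row in cur.fetchall()]

# `missing` is the list of queries to feed into the place-id ingestion (step 2).
```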
visch
I'd look deeper at the architecture and what you're after first. I.e., what are you after with "location"? What does "location" mean to you? What does the place id in Google Maps do for you? Etc. Then build the solution. But skipping all that (there are a bunch of branches above that could lead us all kinds of places, like SmartyStreets, potentially CASS-certified addresses, etc. It gets deep depending on the goal): if I have data in a DB and I want to enrich it, and I really think a Meltano tap is a great way to do it (it may be; it just depends on your setup), I'd make a tap-google-maps with a "gps_coordinates" configuration setting: a list of "lat,long" strings.
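To make that concrete, here's a minimal sketch of such a tap using the Meltano Singer SDK. The reverse-geocoding call is the real Google Geocoding API, but the tap itself and all names in it are assumptions, not an existing plugin:

```python
import requests
from singer_sdk import Stream, Tap
from singer_sdk import typing as th

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


class PlaceIdStream(Stream):
    """Emits one record per configured 'lat,long' string."""

    name = "place_ids"
    primary_keys = ["gps_coordinates"]
    schema = th.PropertiesList(
        th.Property("gps_coordinates", th.StringType),
        th.Property("place_id", th.StringType),
    ).to_dict()

    def get_records(self, context):
        for coords in self.config.get("gps_coordinates", []):
            # Reverse-geocode the coordinates and take the first match, if any.
            resp = requests.get(
                GEOCODE_URL,
                params={"latlng": coords, "key": self.config["api_key"]},
            )
            resp.raise_for_status()
            results = resp.json().get("results", [])
            yield {
                "gps_coordinates": coords,
                "place_id": results[0]["place_id"] if results else None,
            }


class TapGoogleMaps(Tap):
    name = "tap-google-maps"
    config_jsonschema = th.PropertiesList(
        th.Property("gps_coordinates", th.ArrayType(th.StringType)),
        th.Property("api_key", th.StringType, secret=True),
    ).to_dict()

    def discover_streams(self):
        return [PlaceIdStream(self)]


if __name__ == "__main__":
    TapGoogleMaps.cli()
```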
👍 1
Ruben Vereecken
Interesting, so @Sven Balnojan's take is to let the tap that makes the Google Places API calls figure out which fields are empty, and run on those (that's what I meant by coupling, btw: the tap now has to figure out its own input). @visch, what I'm getting from you is to have "gps_coordinates" as config for the tap (I'm assuming this is what I called the input method). I'll take a look at how to pass config from one stage to the next, if that's at all possible.
visch
Env variables are probably the way to go; there's nothing baked into Meltano for this at this point.
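As a sketch of that pattern: Meltano derives a setting's env var from the plugin and setting names, so a wrapper script can compute the list and inject it. The coordinate values are made up, and passing the array as a JSON string is my assumption:

```python
import json
import os
import subprocess

# e.g. the `missing` list computed earlier; values here are made up.
coords = ["52.3676,4.9041", "51.2194,4.4025"]

env = os.environ.copy()
# Meltano maps the `gps_coordinates` setting of `tap-google-maps`
# to this environment variable.
env["TAP_GOOGLE_MAPS_GPS_COORDINATES"] = json.dumps(coords)

# Run the enrichment pipeline with the injected setting.
subprocess.run(
    ["meltano", "run", "tap-google-maps", "target-postgres"],
    env=env,
    check=True,
)
```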
🙏 1
Pat Nadolny (Arch)
I'd add that this comes up every once in a while and there's no easy solution. One crazy approach I'd toss out there for full coverage is a tap/mapper combo like the map-gpt-embeddings one I built for this blog post (https://meltano.com/blog/llm-apps-are-mostly-data-pipelines/), which would allow you to run something like:
tap-place-ids tap-mapper-google-maps target-postgres
The first tap returns place IDs, and the second tap/mapper combo accepts each one as input and makes a second API call before passing the final output to the target.
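At its core, a mapper is just a process sitting between the tap and the target that rewrites the Singer message stream. A bare-bones illustration in plain Python (the actual lookup is stubbed out, and all field names are made up):

```python
#!/usr/bin/env python3
"""Toy stream filter: enrich each Singer RECORD with a place_id."""
import json
import sys


def lookup_place_id(location):
    # Stub: call the Google Maps API here, as in the tap sketch above.
    raise NotImplementedError


for line in sys.stdin:
    msg = json.loads(line)
    if msg.get("type") == "RECORD":
        msg["record"]["place_id"] = lookup_place_id(msg["record"]["location"])
    elif msg.get("type") == "SCHEMA":
        # Advertise the new column to the target.
        msg["schema"]["properties"]["place_id"] = {"type": ["string", "null"]}
    sys.stdout.write(json.dumps(msg) + "\n")
```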
🙌 1
I've also heard of people building something similar to your new tap-google-maps in a way that reads a CSV file for its inputs. So you could basically do what Sven mentioned:
1. Load all your raw place IDs into Postgres.
2. Run dbt to create a view of all the IDs you need to look up.
3. Use tap-postgres to extract those view IDs and target-csv to write them to a CSV file.
4. Build your new tap-google-maps to consume a CSV as input and request each ID from the API. Point it at the CSV file you just output.
5. Write to target-postgres.
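Step 4 could be as small as this inside the new tap (the file path, column name, and use of the Geocoding API are all assumptions):

```python
import csv

import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def iter_place_ids(csv_path, api_key):
    # Consume the CSV written by target-csv and look up each entry.
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            resp = requests.get(
                GEOCODE_URL,
                params={"address": row["location"], "key": api_key},
            )
            resp.raise_for_status()
            results = resp.json().get("results", [])
            yield {
                "location": row["location"],
                "place_id": results[0]["place_id"] if results else None,
            }
```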
Ruben Vereecken
@Pat Nadolny (Arch) I like your last proposal the best. That's basically argument passing between programs. I'll see how that plays with Airflow etc. (completely new to that one). If I get anywhere with this, it feels like something that would be useful for Meltano to support out of the box at some point. [Checking out your mapper approach, which definitely sounds like the easiest way, though maybe slightly more coupled: harder to rerun in case of failures, etc.]
👍 1
Pat Nadolny (Arch)
In terms of coupling in the mapper approach: I'm not totally sure how your first analytics API works, but I'll note that the target only writes state out once data has arrived safely in the destination. So if the tap is built to run incrementally with state, a failure would still be bookmarked, and rerunning shouldn't be an issue; it will pick up where it left off.
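For the curious, that bookmark travels as a Singer STATE message, which the target only emits after the preceding records are committed. Roughly (the stream name and value here are made up):

```python
# What a bookmarked STATE message looks like on the wire (values made up).
state_message = {
    "type": "STATE",
    "value": {
        "bookmarks": {
            "analytics_rows": {"replication_key_value": "2024-07-01T00:00:00Z"}
        }
    },
}
```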