# getting-started
p
Hello, I want to use a geocoding service (Geocodio) to translate addresses to geocodes and enrich them with census blocks. I'll have a table/CSV with columns that represent address components; these are fed into Geocodio's service, and I'll store the results in a separate table for downstream transformation. What is the best way to do this in Meltano land? The Meltano SDK doesn't call out a "translation" capability (a combined tap + target). Is building a mapper the way to go? My stack is Snowflake/DuckDB/dbt, and as far as I can tell, DuckDB and dbt don't support arbitrary Python code to make HTTP requests. What's the best way to achieve this in a way that fits with Meltano?
v
A mapper would work, but I lean towards a utility; it turns into just a small Python project and looks something like
```
meltano run tap-name target-duckdb geocode_utility
```
Gives you a lot of flexibility.
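(For context, a custom utility is just declared in `meltano.yml`. A minimal sketch, assuming a local Python package; the plugin name, `pip_url`, and settings are all hypothetical:)
```yaml
# meltano.yml (sketch; names are hypothetical)
plugins:
  utilities:
    - name: geocode_utility
      namespace: geocode_utility
      pip_url: -e ./utilities/geocode_utility   # local Python project
      executable: geocode_utility               # console script it installs
      settings:
        - name: database_uri
          kind: password
        - name: api_key
          kind: password
```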
p
Additionally: I'd like to use batching with Geocodio to geocode thousands of records at a time, and I'd like to use batching from Snowflake to DuckDB. So I understand that I can't use inline stream mappers.
v
For the utility I tend to lean towards pandas; then it's a pretty simple SQL query, and you can push the data back to a table just like in dbt land 🤷
Follow the general principle that the utility is like a single model: it might pull from a bunch of places, but it only pushes back to one table 🤷
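A minimal sketch of that pattern, assuming a local DuckDB file, hypothetical `raw_addresses`/`geocoded_addresses` tables, and Geocodio's batch endpoint (which accepts a JSON array of addresses; check their docs for current batch limits and the exact response shape):
```python
import os

import duckdb
import pandas as pd
import requests

GEOCODIO_URL = "https://api.geocod.io/v1.7/geocode"
BATCH_SIZE = 1_000  # assumed to be well under Geocodio's batch limit; verify against their docs


def geocode_batch(addresses: list[str], api_key: str) -> list[dict]:
    """POST a JSON array of addresses to Geocodio's batch geocoding endpoint."""
    resp = requests.post(
        GEOCODIO_URL,
        params={"api_key": api_key, "fields": "census"},  # "census" turns on census enrichment
        json=addresses,
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["results"]


def main() -> None:
    con = duckdb.connect("warehouse.duckdb")  # hypothetical DuckDB file
    df = con.execute("SELECT id, full_address FROM raw_addresses").df()

    rows = []
    for start in range(0, len(df), BATCH_SIZE):
        chunk = df.iloc[start : start + BATCH_SIZE]
        results = geocode_batch(chunk["full_address"].tolist(), os.environ["GEOCODIO_API_KEY"])
        for record_id, item in zip(chunk["id"], results):
            matches = item["response"]["results"]
            if not matches:  # no geocode found for this address
                continue
            best = matches[0]  # top-ranked match
            rows.append(
                {
                    "id": record_id,
                    "lat": best["location"]["lat"],
                    "lng": best["location"]["lng"],
                    "census": best.get("fields", {}).get("census"),
                }
            )

    out = pd.DataFrame(rows)
    # Pull from wherever, but push back to exactly one table, like a single model
    con.execute("CREATE OR REPLACE TABLE geocoded_addresses AS SELECT * FROM out")


if __name__ == "__main__":
    main()
```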
p
I'm thinking the utility route. The utility will need to know how to access the underlying data source and write new data, right?
v
Yeah, pass it in via the config. pandas is nice as it uses SQLAlchemy, which a lot of targets already use, so you can just snag the same URI.
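(Sketch of what that looks like inside the utility, assuming Meltano exposes the hypothetical `database_uri` setting above to the subprocess as an env var:)
```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Meltano injects plugin settings as env vars, e.g. GEOCODE_UTILITY_DATABASE_URI
# (assumed name, derived from the hypothetical plugin/setting above)
engine = create_engine(os.environ["GEOCODE_UTILITY_DATABASE_URI"])

addresses = pd.read_sql("SELECT id, full_address FROM raw_addresses", engine)  # hypothetical table
# ... geocode ...
addresses.to_sql("geocoded_addresses", engine, if_exists="replace", index=False)
```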
dbt has Python capability, but similar to you, I prefer to just run Python code sometimes.
Also, for geocode data I'm kind of in love with SmartyStreets.
p
Okay, that could work. Yeah, I was thinking dbt to maintain dependencies and lineage, but it doesn't look like I can use Python models with DuckDB.
Using this method, can I pass in args?
```
meltano run tap-name target-duckdb geocode_utility
```
i.e.
```
meltano run tap-name target-duckdb geocode_utility --table orders
```
v
Do it via `config` in Meltano
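(e.g., assuming a hypothetical `table` setting on the utility:)
```
meltano config geocode_utility set table orders
# or override just for one run via the setting's env var:
GEOCODE_UTILITY_TABLE=orders meltano run tap-name target-duckdb geocode_utility
```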
Also, if you really want dependencies and lineage with Python, you could look at https://github.com/fal-ai/fal. Same idea, but it has a framework. I kinda shy away from `fal` just because of the smaller community.
p
Cool, I can pass along Smarty. The team has already done due diligence on quite a number of vendors, and Geocodio was chosen for a high-throughput/low-cost/acceptable-accuracy workload.
v
I've never used Geocodio so I have no idea! 😄
p
The main benefit in our particular pipeline is that Geocodio can enrich results with census data.
How do you mean? For instance, I'd want to have a job that looks like this:
```
tap-snowflake target-duckdb dbt-duckdb:dedupe geocodio:accounts geocodio:orders
```
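(For what it's worth, that `geocodio:accounts` syntax maps onto Meltano's plugin `commands` feature, where a utility defines named commands invoked as `plugin:command`. A sketch, with hypothetical command names and args:)
```yaml
# meltano.yml (sketch; names and args are hypothetical)
plugins:
  utilities:
    - name: geocodio
      namespace: geocodio
      pip_url: -e ./utilities/geocodio
      executable: geocode_utility
      commands:
        accounts:
          args: --table accounts
        orders:
          args: --table orders
```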
p
This has come up a few times and I don't think we've ever landed on a best way to do it, so share back how it goes. I've also used Geocodio for reverse geocoding and was happy with it. Other options:
1. Think of this process as extracting data from the Geocodio API. Create a `tap-geocodio` that accepts an input file as config, maybe a CSV with whatever inputs are needed (lat/long/unique identifier), and have it make a request for each row in the CSV. If I remember correctly, you can give them extra attributes and they're returned in the response for each record, so you can later map your results back to your input records. Maybe this looks like `tap-snowflake target-csv tap-geocodio target-snowflake` 🤷. The result being that you now have a table in Snowflake like `geocodio_addresses` that can be used as an address->geocode lookup. (See the tap sketch after this list.)
2. Potentially a mapper somehow 🤔 It seems like `tap-snowflake geocodio_enrich target-duckdb` would be ideal, appending attributes to the incoming records' schema on the fly and ultimately landing in DuckDB already enriched. But I've found mappers to be a little tricky to use in their current state, and this would be stretching pretty far beyond the current use cases, so I don't think I'd recommend it.
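A minimal sketch of what option 1's `tap-geocodio` could look like with the Meltano (Singer) SDK, assuming an `input_csv` config with hypothetical `id` and `full_address` columns and Geocodio's single-lookup endpoint; the stream name, schema, and the exact shape of the census fields are assumptions to check against the Geocodio docs:
```python
import csv

import requests
from singer_sdk import Stream, Tap
from singer_sdk import typing as th


class GeocodioAddresses(Stream):
    """One geocoded record per row of the input CSV (names are hypothetical)."""

    name = "geocodio_addresses"
    primary_keys = ["id"]
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("lat", th.NumberType),
        th.Property("lng", th.NumberType),
        th.Property("census", th.ObjectType()),  # shape varies; check Geocodio docs
    ).to_dict()

    def get_records(self, context):
        with open(self.config["input_csv"], newline="") as f:
            for row in csv.DictReader(f):
                resp = requests.get(
                    "https://api.geocod.io/v1.7/geocode",
                    params={
                        "q": row["full_address"],
                        "api_key": self.config["api_key"],
                        "fields": "census",  # the census enrichment discussed above
                    },
                    timeout=60,
                )
                resp.raise_for_status()
                matches = resp.json()["results"]
                if not matches:  # skip addresses with no geocode match
                    continue
                best = matches[0]
                yield {
                    "id": row["id"],
                    "lat": best["location"]["lat"],
                    "lng": best["location"]["lng"],
                    "census": best.get("fields", {}).get("census"),
                }


class TapGeocodio(Tap):
    name = "tap-geocodio"
    config_jsonschema = th.PropertiesList(
        th.Property("input_csv", th.StringType, required=True),
        th.Property("api_key", th.StringType, required=True, secret=True),
    ).to_dict()

    def discover_streams(self):
        return [GeocodioAddresses(self)]
```
In practice you'd want to batch the requests, like the utility sketch earlier in the thread, rather than make one GET per CSV row.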