# meltano-plugin-development
j
I'm trying to figure out a good structure for handling of incoming invoices within my Meltano project. We are reselling services from physiotherapists, so we're getting a large number of incoming invoices that need to be parsed before we can invoice upstream. I want to use an external OCR tool for that (no specific tool yet, but I'd appreciate suggestions if anyone has experience).

1. We can get URLs for the invoices using the API provided by our accounting software. That sounds like a good job for an extractor.
2. Then the OCR tool will provide information about the downstream service provider (tax ID etc.) as well as a list of lines on the invoice.
3. That data then needs to be loaded into our data warehouse.

Currently I'm implementing it as a `utility` in Meltano parlance, but would such a "heavy" operation be suited for the stream mapping API? If not, then what is the alternative? I suppose I could have the utility output something adhering to the Singer spec, but what is it then, if not a tap? Is `meltano run foo_utility target-postgres` a well-defined operation?
I should say I'd like to avoid writing it as a Python dbt model because we're using Postgres as our DW, which has no support for Python models.
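The three steps above could be sketched roughly like this; a minimal sketch, assuming made-up shapes for the accounting API payload, the OCR response, and the warehouse rows (every function and field name here is a placeholder, not a real API):

```python
import json

# Hypothetical shapes for the three steps; the accounting API, the OCR
# tool, and the warehouse loader are all stand-ins.

def fetch_invoice_urls(api_response: str) -> list[str]:
    """Step 1: extract PDF URLs from the accounting software's API payload."""
    return [row["pdf_url"] for row in json.loads(api_response)["invoices"]]

def parse_invoice(pdf_url: str) -> dict:
    """Step 2: stand-in for the external OCR call; returns provider info
    plus the invoice lines."""
    return {
        "pdf_url": pdf_url,
        "provider": {"tax_id": "UNKNOWN"},  # filled in by the real OCR tool
        "lines": [],
    }

def to_warehouse_rows(parsed: dict) -> list[dict]:
    """Step 3: flatten one OCR result into rows for the warehouse."""
    return [
        {"pdf_url": parsed["pdf_url"], "tax_id": parsed["provider"]["tax_id"], **line}
        for line in parsed["lines"] or [{"description": None, "amount": None}]
    ]
```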
e
> but would such a "heavy" operation be suited for the stream mapping API? If not, then what is the alternative? I suppose I could have the utility output something adhering to the Singer spec but what is it then, if not a tap?
The stream maps interface is really meant for lightweight on-the-fly transformations, but a mapper can really be anything, though it would of course have to be custom and tightly coupled to the output of the accounting software.
j
What would be a better approach, then, for my use case? I get PDF URLs in my DW but I'd like another layer of analysis. Currently I can invoke my utility and pipe it to `target-postgres` pretty easily, I suppose.
Considering using `SimpleSingerWriter` from the Meltano SDK to have my utility write something coherent that I can then pipe into a target, but somehow it also feels a little "dirty".
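For what it's worth, the output a target expects is just Singer messages as JSON lines on stdout (a `SCHEMA` message before the `RECORD`s for a stream), so a utility can emit them without any SDK class at all. A minimal sketch, where the `ocr_results` stream name and its columns are made-up examples:

```python
import json
import sys

# Hand-rolled Singer output: one SCHEMA message, then RECORD messages.
# The stream name and columns are invented for illustration.

OCR_SCHEMA = {
    "type": "object",
    "properties": {
        "pdf_url": {"type": "string"},
        "tax_id": {"type": ["string", "null"]},
    },
}

def write_message(msg, out=sys.stdout):
    """Write one Singer message as a JSON line."""
    out.write(json.dumps(msg) + "\n")

def emit_stream(records, out=sys.stdout):
    """Emit a SCHEMA message followed by one RECORD message per row."""
    write_message(
        {
            "type": "SCHEMA",
            "stream": "ocr_results",
            "schema": OCR_SCHEMA,
            "key_properties": ["pdf_url"],
        },
        out,
    )
    for record in records:
        write_message({"type": "RECORD", "stream": "ocr_results", "record": record}, out)
```

Piping that to `target-postgres` is then an ordinary Singer pipe, whatever we decide to call the thing producing it.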
I started out by trying to extend `Stream`, but I cannot do that unless I have a `Tap`, which I don't think is correct since I'm not tapping anything; I'm producing data with this utility. However, `Tap` implements the `SimpleSingerWriter` interface, which sounds handy.
But maybe I'm too much in love with the idea of feeding the data to the target. Maybe I should just connect to the DW (which is Postgres) manually and ignore the whole Singer thing. After all, I want to make one `SELECT` query targeting a dbt view, call an API and make a few `INSERT`s subsequently. Could be I'm making my life more miserable than it needs to be by trying to adhere to the Singer spec for this uncommon use case...
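The manual route really is just a loop around one `SELECT` and a few `INSERT`s. A minimal sketch, using `sqlite3` as a stand-in for Postgres (with psycopg2 the flow would be the same; all table and column names here are made up):

```python
import sqlite3

# Skip-Singer sketch: SELECT the PDF URLs from the dbt view, call OCR per
# row, INSERT the results. sqlite3 stands in for Postgres; invoice_pdfs
# and ocr_results are hypothetical names.

def process_invoices(conn, ocr_call):
    """Fetch PDF URLs, run each through OCR, and store the results."""
    urls = [row[0] for row in conn.execute("SELECT pdf_url FROM invoice_pdfs")]
    for pdf_url in urls:
        result = ocr_call(pdf_url)  # the external OCR API request goes here
        conn.execute(
            "INSERT INTO ocr_results (pdf_url, tax_id) VALUES (?, ?)",
            (pdf_url, result["tax_id"]),
        )
    conn.commit()
```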
On the other hand there are taps written by people unburdened by such semantic qualms... https://github.com/ericlebail/tap-test-data-generator
a
Thanks for this discussion; I had been wrapping my head around how I could make this work for Azure AI sentiment analysis. Still not certain; I think I will probably end up with a Dagster op. But the idea of using Meltano's state system to only extract and process new responses would have been quite a nice bonus if I could have made it work. https://learn.microsoft.com/en-us/azure/ai-services/language-service/sentiment-opinion[…]/quickstart?tabs=windows&pivots=programming-language-python
j
Yeah, I was thinking about something akin to a dbt incremental model, but, alas. For my case I'm going to construct a model that contains the PDF links, then write a program that `SELECT`s in that model, does the talking to the OCR API and then updates a table linking the PDF URL to the result. That way I can join on the PDF URL, and I can even modify the first model to exclude the PDF URLs that already have rows in the `ocr_results` table, so that the same PDF files don't get handled twice. At the end of the day it's not pretty and it's not failsafe, but I also have some work to do 😉
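The "don't handle the same PDF twice" filter is a plain anti-join, which the first model could use directly. A sketch (against `sqlite3` here, but the SQL is the same on Postgres; `invoice_pdfs` and `ocr_results` are hypothetical names):

```python
import sqlite3

# PDFs with no row in ocr_results yet: a LEFT JOIN anti-join.
# Table names are invented for illustration.

UNPROCESSED_SQL = """
    SELECT p.pdf_url
    FROM invoice_pdfs AS p
    LEFT JOIN ocr_results AS r ON r.pdf_url = p.pdf_url
    WHERE r.pdf_url IS NULL
    ORDER BY p.pdf_url
"""

def unprocessed_pdfs(conn):
    return [row[0] for row in conn.execute(UNPROCESSED_SQL)]
```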
d
I would split the problem up: write something that picks up invoices, parses them, and writes to a file. Then you can run a simple pipeline to load those files. It seems like more steps, but it's easier to monitor, easier to debug, and keeps the Meltano part closer to the intended use.
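The file-handoff step could look roughly like this: dump each parsed invoice to a JSONL file, and let a separate, simple pipeline load the files afterwards (the directory, file name, and fields here are invented for illustration):

```python
import json
from pathlib import Path

# Parse step writes one JSONL file per batch; a separate pipeline loads
# the files later. All names here are placeholders.

def write_batch(parsed_invoices, out_dir="parsed_invoices", name="batch.jsonl"):
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    target = path / name
    with target.open("w") as f:
        for invoice in parsed_invoices:
            f.write(json.dumps(invoice) + "\n")
    return target
```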
j
Thanks @dominic_parry! That's very actionable 🙂