# meltano-plugin-development
j
I'm trying to figure out a good structure for handling of incoming invoices within my Meltano project. We are reselling services from physiotherapists, so we're getting a large number of incoming invoices that need to be parsed before we can invoice upstream. I want to use an external OCR tool for that (no specific tool yet, but I'd appreciate suggestions if anyone has experience).

1. We can get URLs for the invoices using the API provided by our accounting software. That sounds like a good job for an extractor.
2. Then the OCR tool will provide information about the downstream service provider (tax ID etc.) as well as a list of lines on the invoice.
3. That data then needs to be loaded into our data warehouse.

Currently I'm implementing it as a `utility` in Meltano parlance, but would such a "heavy" operation be suited for the stream mapping API? If not, then what is the alternative? I suppose I could have the utility output something adhering to the Singer spec, but what is it then, if not a tap? Is `meltano run foo_utility target-postgres` a well-defined operation?
I should say I'd like to avoid writing it as a Python dbt model because we're using Postgres as our DW, which has no support for Python models.
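The three steps above could be sketched roughly like this; a minimal sketch, assuming made-up shapes for the accounting API payload, the OCR response, and the warehouse rows (every function and field name here is a placeholder, not a real API):

```python
import json

# Hypothetical shapes for the three steps; the accounting API, the OCR
# tool, and the warehouse loader are all stand-ins.

def fetch_invoice_urls(api_response: str) -> list[str]:
    """Step 1: extract PDF URLs from the accounting software's API payload."""
    return [row["pdf_url"] for row in json.loads(api_response)["invoices"]]

def parse_invoice(pdf_url: str) -> dict:
    """Step 2: stand-in for the external OCR call; returns provider info
    plus the invoice lines."""
    return {
        "pdf_url": pdf_url,
        "provider": {"tax_id": "UNKNOWN"},  # filled in by the real OCR tool
        "lines": [],
    }

def to_warehouse_rows(parsed: dict) -> list[dict]:
    """Step 3: flatten one OCR result into rows for the warehouse."""
    return [
        {"pdf_url": parsed["pdf_url"], "tax_id": parsed["provider"]["tax_id"], **line}
        for line in parsed["lines"] or [{"description": None, "amount": None}]
    ]
```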
e
> but would such a "heavy" operation be suited for the stream mapping API? If not, then what is the alternative? I suppose I could have the utility output something adhering to the Singer spec but what is it then, if not a tap?
The stream maps interface is really meant for lightweight on-the-fly transformations, but a mapper can really be anything, though it would of course have to be custom and tightly coupled to the output of the accounting software.
j
What would be a better approach, then, for my use case? I get PDF URLs in my DW but I'd like another layer of analysis. Currently I can invoke my utility and pipe it to `target-postgres` pretty easily, I suppose.
Considering using `SimpleSingerWriter` from the Meltano SDK to have my utility write something coherent that I can then pipe into a target, but somehow it also feels a little "dirty".
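For what it's worth, the output a target expects is just Singer messages as JSON lines on stdout (a `SCHEMA` message before the `RECORD`s for a stream), so a utility can emit them without any SDK class at all. A minimal sketch, where the `ocr_results` stream name and its columns are made-up examples:

```python
import json
import sys

# Hand-rolled Singer output: one SCHEMA message, then RECORD messages.
# The stream name and columns are invented for illustration.

OCR_SCHEMA = {
    "type": "object",
    "properties": {
        "pdf_url": {"type": "string"},
        "tax_id": {"type": ["string", "null"]},
    },
}

def write_message(msg, out=sys.stdout):
    """Write one Singer message as a JSON line."""
    out.write(json.dumps(msg) + "\n")

def emit_stream(records, out=sys.stdout):
    """Emit a SCHEMA message followed by one RECORD message per row."""
    write_message(
        {
            "type": "SCHEMA",
            "stream": "ocr_results",
            "schema": OCR_SCHEMA,
            "key_properties": ["pdf_url"],
        },
        out,
    )
    for record in records:
        write_message({"type": "RECORD", "stream": "ocr_results", "record": record}, out)
```

Piping that to `target-postgres` is then an ordinary Singer pipe, whatever we decide to call the thing producing it.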
I started out by trying to extend `Stream`, but I cannot do that unless I have a `Tap`, which I don't think is correct since I'm not tapping anything; I'm producing data with this utility. However, `Tap` implements the `SimpleSingerWriter` interface, which sounds handy.
But maybe I'm too much in love with the idea of feeding the data to the target. Maybe I should just connect to the DW (which is Postgres) manually and ignore the whole Singer thing. After all, I want to make one `SELECT` query targeting a dbt view, call an API and make a few `INSERT`s subsequently. Could be I'm making my life more miserable than it needs to be by trying to adhere to the Singer spec for this uncommon use case...
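The manual route really is just a loop around one `SELECT` and a few `INSERT`s. A minimal sketch, using `sqlite3` as a stand-in for Postgres (with psycopg2 the flow would be the same; all table and column names here are made up):

```python
import sqlite3

# Skip-Singer sketch: SELECT the PDF URLs from the dbt view, call OCR per
# row, INSERT the results. sqlite3 stands in for Postgres; invoice_pdfs
# and ocr_results are hypothetical names.

def process_invoices(conn, ocr_call):
    """Fetch PDF URLs, run each through OCR, and store the results."""
    urls = [row[0] for row in conn.execute("SELECT pdf_url FROM invoice_pdfs")]
    for pdf_url in urls:
        result = ocr_call(pdf_url)  # the external OCR API request goes here
        conn.execute(
            "INSERT INTO ocr_results (pdf_url, tax_id) VALUES (?, ?)",
            (pdf_url, result["tax_id"]),
        )
    conn.commit()
```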
On the other hand there are taps written by people unburdened by such semantic qualms... https://github.com/ericlebail/tap-test-data-generator
a
Thanks for this discussion; I had been wrapping my head around how I could make this work for Azure AI sentiment analysis. Still not certain; I think I will probably end up with a Dagster op. But the idea of using Meltano's state system to only extract and process new responses would have been quite a nice bonus if I could have made it work. https://learn.microsoft.com/en-us/azure/ai-services/language-service/sentiment-opinion[…]/quickstart?tabs=windows&pivots=programming-language-python
j
Yeah, I was thinking about something akin to a dbt incremental model, but, alas. For my case I'm going to construct a model that contains the PDF links, then write a program that `SELECT`s in that model, does the talking to the OCR API and then updates a table linking the PDF URL to the result. That way I can join on the PDF URL, and I can even modify the first model to exclude the PDF URLs that already have rows in the `ocr_results` table, so that the same PDF files don't get handled twice. At the end of the day it's not pretty and it's not failsafe, but I also have some work to do 😉
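The "don't handle the same PDF twice" filter is a plain anti-join, which the first model could use directly. A sketch (against `sqlite3` here, but the SQL is the same on Postgres; `invoice_pdfs` and `ocr_results` are hypothetical names):

```python
import sqlite3

# PDFs with no row in ocr_results yet: a LEFT JOIN anti-join.
# Table names are invented for illustration.

UNPROCESSED_SQL = """
    SELECT p.pdf_url
    FROM invoice_pdfs AS p
    LEFT JOIN ocr_results AS r ON r.pdf_url = p.pdf_url
    WHERE r.pdf_url IS NULL
    ORDER BY p.pdf_url
"""

def unprocessed_pdfs(conn):
    return [row[0] for row in conn.execute(UNPROCESSED_SQL)]
```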
d
I would split the problem up: write something that picks up invoices, parses them, and writes to a file. Then you can run a simple pipeline to load those files. It seems like more steps, but it's easier to monitor, easier to debug, and keeps the Meltano part closer to the intended use.
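The file-handoff step could look roughly like this: dump each parsed invoice to a JSONL file, and let a separate, simple pipeline load the files afterwards (the directory, file name, and fields here are invented for illustration):

```python
import json
from pathlib import Path

# Parse step writes one JSONL file per batch; a separate pipeline loads
# the files later. All names here are placeholders.

def write_batch(parsed_invoices, out_dir="parsed_invoices", name="batch.jsonl"):
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    target = path / name
    with target.open("w") as f:
        for invoice in parsed_invoices:
            f.write(json.dumps(invoice) + "\n")
    return target
```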
j
Thanks @dominic_parry! That's very actionable 🙂