I have a bunch of XML files containing purchase orders; each file contains information on the buyer, the seller, and the items ordered. I want to load these into a SQL database (DuckDB), where each entity should appear in its own table.
My thought was to use a custom stream mapper to split one source document into multiple records. Does anyone have experience with this, or examples / alternatives to share?
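A minimal sketch of that kind of one-document-to-many-records split, outside any particular framework; the element names (<Buyer>, <Seller>, <Item>) and the id attribute are assumptions about the schema, not the real one:
```python
import xml.etree.ElementTree as ET

def split_order(path):
    root = ET.parse(path).getroot()
    order_id = root.get("id")  # assumed attribute on the root element

    # Iterating over an element yields its children, so these build
    # flat dicts from the (assumed) <Buyer> and <Seller> subtrees.
    buyer = {c.tag: c.text for c in root.find("Buyer")}
    seller = {c.tag: c.text for c in root.find("Seller")}
    items = [
        {"order_id": order_id, **{c.tag: c.text for c in item}}
        for item in root.iter("Item")
    ]
    # One source document -> one batch of records per target table.
    return {"buyers": [buyer], "sellers": [seller], "items": items}
```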
Matt Menzenski
03/17/2023, 12:47 PM
Do you need to do this splitting during the initial load? We do something similar (JSON documents with arrays of objects on each record): we load them into the db first, then use dbt to split them into one row per array element after they are in the db.
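A hedged sketch of that load-first approach in DuckDB (dbt would wrap similar SQL); the file glob and column names are hypothetical, and it assumes JSON input, since DuckDB cannot read XML natively:
```python
import duckdb

con = duckdb.connect("orders.duckdb")

# Load the raw documents as-is; read_json_auto infers a LIST-of-STRUCT
# column for arrays of objects (path and column names are assumptions).
con.execute(
    "CREATE TABLE raw_orders AS SELECT * FROM read_json_auto('orders/*.json')"
)

# Split into one row per array element after the data is in the db;
# recursive := true also unpacks each struct into its own columns.
con.execute("""
    CREATE TABLE items AS
    SELECT order_id, UNNEST(items, recursive := true)
    FROM raw_orders
""")
```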
hinnerk_gnutzmann
03/18/2023, 7:55 AM
I prototyped this approach, but I was not satisfied: XML support in Postgres is limited (e.g. XPath 1.0 only), and absent in DuckDB. Also, the developer experience of using beautifulsoup + pytest to deal with the input docs looks more promising.
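For reference, a small sketch of that beautifulsoup + pytest route; the tag names are again schema assumptions, and the "xml" parser requires lxml to be installed:
```python
from bs4 import BeautifulSoup

def parse_order(xml_text: str) -> dict:
    soup = BeautifulSoup(xml_text, "xml")  # the "xml" feature needs lxml
    return {
        "buyer": soup.find("Buyer").get_text(strip=True),
        "seller": soup.find("Seller").get_text(strip=True),
        "items": [i.get_text(strip=True) for i in soup.find_all("Item")],
    }

def test_parse_order():  # discovered and run by pytest
    xml = "<Order><Buyer>ACME</Buyer><Seller>Globex</Seller><Item>Anvil</Item></Order>"
    assert parse_order(xml) == {"buyer": "ACME", "seller": "Globex", "items": ["Anvil"]}
```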
Henning Holgersen
03/18/2023, 12:42 PM
I haven't had to implement an XML tap, but my initial thinking would be to make a generic XML reader that accepts an XSLT mapping, and then read the same XML files multiple times with different XSLTs, creating different streams.
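A rough sketch of that idea using lxml, which supports XSLT 1.0; the stylesheet, file name, and element names are illustrative only:
```python
from lxml import etree

# One XSLT per target stream; this one pulls every (assumed) <Item> element.
ITEMS_XSLT = etree.XSLT(etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <items><xsl:copy-of select="//Item"/></items>
  </xsl:template>
</xsl:stylesheet>
"""))

def extract_stream(path: str, transform: etree.XSLT):
    # Reading the same file with a different XSLT yields a different stream.
    return transform(etree.parse(path))

# e.g. items_tree = extract_stream("order-001.xml", ITEMS_XSLT)
```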