# getting-started
h
I have a bunch of XML files containing purchase orders; each file contains information on the buyer, seller, and items ordered. I want to load these into a SQL database (duckdb), where each entity should appear in its own table. My thought was to use a custom stream mapper to split one source document into multiple records - does anyone have experience with this, or examples / alternatives to share?
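A minimal sketch of that splitting idea in plain Python, independent of any specific mapper API. The `order` / `buyer` / `seller` / `items/item` element names are an invented schema for illustration, not taken from the actual files:

```python
import xml.etree.ElementTree as ET
from typing import Iterator

def split_purchase_order(xml_text: str) -> Iterator[tuple[str, dict]]:
    """Yield (stream_name, record) pairs from one purchase-order document."""
    root = ET.fromstring(xml_text)
    order_id = root.attrib.get("id")  # hypothetical: order id as an attribute

    buyer = root.find("buyer")
    if buyer is not None:
        yield "buyers", {"order_id": order_id, "name": buyer.findtext("name")}

    seller = root.find("seller")
    if seller is not None:
        yield "sellers", {"order_id": order_id, "name": seller.findtext("name")}

    # One record per line item, keyed back to the parent order.
    for item in root.iterfind("./items/item"):
        yield "order_items", {
            "order_id": order_id,
            "sku": item.findtext("sku"),
            "quantity": item.findtext("quantity"),
        }

doc = '<order id="1"><buyer><name>Acme</name></buyer><items><item><sku>A1</sku><quantity>2</quantity></item></items></order>'
for stream, record in split_purchase_order(doc):
    print(stream, record)
```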
m
Do you need to do this splitting during the initial load? We do something similar (JSON documents with arrays of objects on each record): we load them into the db first, then use dbt to split them into one row per array element once they are in the db.
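For reference, the in-database splitting step can be done in DuckDB with `unnest()`; in a real project the SELECT would live in a dbt model. Table and column names here are hypothetical:

```python
import duckdb

con = duckdb.connect()
# A toy raw table: one row per document, items still nested as a list.
con.execute("""
    CREATE TABLE raw_orders AS
    SELECT 1 AS order_id, ['widget', 'gadget'] AS items
    UNION ALL
    SELECT 2, ['sprocket']
""")
# One row per array element, keyed back to the parent order.
rows = con.execute("""
    SELECT order_id, unnest(items) AS item
    FROM raw_orders
""").fetchall()
print(rows)  # e.g. [(1, 'widget'), (1, 'gadget'), (2, 'sprocket')]
```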
h
I prototyped this approach, but I was not satisfied: XML support in postgres is limited (e.g. XPath 1.0 only), and absent in duckdb. Also, the developer experience of using beautifulsoup + pytest to deal with the input docs looks more promising.
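A small sketch of what that beautifulsoup + pytest workflow could look like (requires `beautifulsoup4` plus `lxml` for the `"xml"` parser; the sample document is made up):

```python
from bs4 import BeautifulSoup

SAMPLE = """
<order id="42">
  <buyer><name>Acme</name></buyer>
</order>
"""

def parse_buyer(xml_text: str) -> dict:
    # The "xml" feature uses lxml's XML parser under the hood.
    soup = BeautifulSoup(xml_text, "xml")
    return {
        "order_id": soup.find("order")["id"],
        "buyer_name": soup.find("buyer").find("name").get_text(strip=True),
    }

def test_parse_buyer():
    assert parse_buyer(SAMPLE) == {"order_id": "42", "buyer_name": "Acme"}
```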
h
I haven’t had to implement an XML tap, but my initial thinking would be to make a generic XML reader that accepts an XSLT mapping, and then read the same XML files multiple times with different XSLTs, creating different streams.
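Something along these lines, using `lxml.etree.XSLT` (note that lxml only supports XSLT 1.0). The stylesheet and element names are placeholders; a real mapping would flatten each entity into rows for its stream:

```python
from lxml import etree

DOC = etree.fromstring(b"<order><buyer><name>Acme</name></buyer></order>")

# One stylesheet per target stream; each one extracts a different entity.
BUYERS_XSLT = etree.XSLT(etree.fromstring(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/order">
    <buyer_record><name><xsl:value-of select="buyer/name"/></name></buyer_record>
  </xsl:template>
</xsl:stylesheet>"""))

# Read the same document once per stream; adding a stream is just
# adding another stylesheet to the mapping.
streams = {"buyers": BUYERS_XSLT}
for name, transform in streams.items():
    print(name, etree.tostring(transform(DOC)))
```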