# plugins-general
d
Hello guys, I was syncing a table with 10M+ records, and I encountered an error:
```
target-postgres | psycopg2.errors.UntranslatableCharacter: unsupported Unicode escape sequence
target-postgres | DETAIL:  \u0000 cannot be converted to text.
target-postgres | CONTEXT:  JSON data, line 1: ...ams commmunication": "false", "Waiver Signature":...
target-postgres | COPY tmp_7e554454_d6ae_43b2_9169_b6f767a05767, line 18152, column custom_data: "{"Adult and Youth Waiver": "true", "Rams commmunication": "false", "Waiver Signature": "Kristina Bro..."
```
oh, this is a known issue in Postgres with `\0` as the end-of-string character. Any idea how I can add processing on the Singer JSON so I can replace all `\0`, or cast a specific column to json or jsonb?
d
@dejan_pejcev Hmm, transforming the tap's Singer messages before they make it to the target is not supported yet: https://gitlab.com/meltano/meltano/-/issues/2300
What tap are you using?
d
tap-postgres and target-postgres, pipelinewise variant
hmmmm, so no solution besides sanitizing the data? 😕
d
Ah, and the `\0` is in a JSON column on the source that is turned into a regular text column on the destination?
You can use https://meltano.com/docs/plugins.html#schema-extra to modify the schema that is passed to the tap in the catalog, but I'm not sure how you'd express there that you want a json(b) column in the destination. There's also no guarantee that the tap would respect that overridden schema and forward it to the target, instead of simply ignoring it and using its own "best" judgment to determine a type.
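For reference, a minimal sketch of what that override could look like in `meltano.yml`, with placeholder stream and column names (untested, and the tap may ignore it as noted above):
```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: transferwise
    schema:
      # Placeholder stream name (<schema>-<table>) and column name
      public-my_table:
        custom_data:
          # Ask for an object type instead of a plain string
          type: ["object", "null"]
```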
d
hmmm, it is a json type in the source database, so it should map to a json type in the target too; not clear why it casts it as text
d
What does the SCHEMA message for this table have as a type for this column?
d
where can I find the SCHEMA message?
d
If you run `meltano elt` in debug mode (https://meltano.com/docs/command-line-interface.html#debugging), you'll see each message with a `tap-postgres (out)` prefix, so you can find the SCHEMA message for the stream (table) in question.
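For example, assuming the `--log-level` flag from the linked docs:
```sh
# Run the pipeline with debug logging
meltano --log-level=debug elt tap-postgres target-postgres

# Each Singer message is echoed with a "tap-postgres (out)" prefix,
# so the SCHEMA messages can be fished out of the output:
meltano --log-level=debug elt tap-postgres target-postgres 2>&1 | grep 'SCHEMA'
```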
d
hmmm I need to rerun it and let it run for a while so it reaches that part. Hmm, if a column is of type `json` in the source db, should it also map to a `json` column in the target db?
d
I'd expect tap-postgres to expose its `jsonb` columns as `object` type properties on streams, which target-postgres would then map to `jsonb` again
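In other words, the SCHEMA message for that stream would be expected to contain a property along these lines (illustrative only; the stream name and key are placeholders, the column name is from the error above, and the overall shape follows the Singer spec):
```json
{
  "type": "SCHEMA",
  "stream": "public-my_table",
  "schema": {
    "properties": {
      "custom_data": {"type": ["null", "object"]}
    }
  },
  "key_properties": ["id"]
}
```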
d
This is from the CHANGELOG for Postgres 9.4.1:
```
...Therefore \u0000 will now also be rejected in json values when conversion to de-escaped form is required. This change does not break the ability to store \u0000 in json columns so long as no processing is done on the values...
```
d
What type does the column have in the created table? `jsonb` or `text`? 😬
d
`jsonb`
d
OK, so then it's ultimately a Postgres issue: it won't let you store a `\0` in a JSONB column using the type of COPY query target-postgres uses
d
yep, and they said somewhere that they do not plan on fixing that
actually, if you do any processing on a `json` value with a `\0` it will fail, so even converting from `json` to `jsonb` will error
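That's easy to reproduce directly in psql, and it matches the 9.4.1 change quoted above:
```sql
-- Plain json stores the text as-is, so the \u0000 escape is accepted:
SELECT '{"k": "\u0000"}'::json;           -- OK

-- jsonb de-escapes on input, so the same value is rejected:
SELECT '{"k": "\u0000"}'::jsonb;
-- ERROR:  unsupported Unicode escape sequence
-- DETAIL:  \u0000 cannot be converted to text.

-- Any processing that de-escapes, including a json -> jsonb cast, fails too:
SELECT '{"k": "\u0000"}'::json::jsonb;    -- same error
```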
d
Right...
With https://gitlab.com/meltano/meltano/-/issues/2300 or https://gitlab.com/meltano/meltano/-/issues/2299 it shouldn't be too hard to work around this, but that doesn't help much today 😄
Is sanitizing on the source an option?
d
no way to enforce `json` -> `json`?
JSONB is definitely preferable if you want to be able to run queries against the nested data
d
that can be a fix, but it is not wise to go down this road in the long run
I'll try to run a sanitize regexp_replace, but this is a json column with 10M+ records, it will take some time
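A sketch of that sanitize step, assuming a json column `custom_data` on a table `my_table` (placeholder names; worth testing on a copy first):
```sql
-- The stored JSON text contains the literal six characters \u0000.
-- With standard_conforming_strings on, the pattern '\\u0000' is the
-- regex \\u0000: a literal backslash followed by u0000.
UPDATE my_table
SET custom_data = regexp_replace(custom_data::text, '\\u0000', '', 'g')::json
WHERE custom_data::text LIKE '%\\u0000%';
```
Caveat: this would also mangle values that legitimately contain an escaped backslash followed by `u0000` (i.e. `\\u0000`), so it's a rough cleanup rather than a precise one.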
d
👍
I wonder if it would be worth creating an issue on https://github.com/transferwise/pipelinewise-tap-postgres to automatically strip out these `\0`s, but that may be unexpected and undesirable in some contexts
Or an issue on pipelinewise-target-postgres to add an option to use “json” instead of “jsonb”, if that turns out to be a feasible workaround
a
Hey, may I resurrect this thread now that mapper plugins exist?
I am dealing with nearly the same situation as @dejan_pejcev described originally (target-postgres choking on `\u0000` in a jsonb field). I'm trying to figure out how the mapper plugins can be used to either sanitize those records or just filter them out
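In case it helps anyone landing here later, a sketch of what that could look like with the MeltanoLabs `meltano-map-transform` mapper (stream, column, and mapping names are placeholders, and I haven't verified this against the affected pipeline):
```yaml
plugins:
  mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: meltano-map-transform
    mappings:
    - name: sanitize-custom-data
      config:
        stream_maps:
          public-my_table:        # placeholder stream name
            # Option 1: drop the offending column entirely
            custom_data: null
            # Option 2 (untested): keep only clean records; __filter__ keeps
            # records where the expression is true, but the exact NUL check
            # depends on how the tap represents the value
            # __filter__: "'\\u0000' not in str(custom_data)"
```
The mapping would then run between tap and target with something like `meltano run tap-postgres sanitize-custom-data target-postgres`.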