# troubleshooting
c
I’m using tap-postgres to extract a largish table (around 3 GB) and at some point along the way it encounters a record with an invalid year:
ValueError: year 11975 is out of range
… is there a straightforward way to just ignore rows like this? Obviously the data is no good, but I cannot delete the rows in the source system. I narrowed the call stack down to https://github.com/MeltanoLabs/tap-postgres/blob/main/tap_postgres/client.py#L324 and I don’t see any sort of “ignore bad rows” option, but maybe there is a pre-filter or some setting I’m missing. (I’d paste the entire stack trace here, but I’m running in MWAA - Amazon Managed Workflows for Apache Airflow, a feat of strength unto itself - and left colorized logging on, which CloudWatch proudly renders as illegible.)
e
You might wanna take a look at https://docs.meltano.com/guide/mappers/. There's a way to drop records based on a user-defined condition: https://sdk.meltano.com/en/latest/stream_maps.html#filtering-out-records-from-a-stream-using-filter-operation.
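For example, a minimal sketch of what that could look like in meltano.yml (the stream and column names here are made up, and it assumes the date column arrives as an ISO-8601 string):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        stream_maps:
          public-my_table:  # hypothetical stream name, <schema>-<table>
            # drop any record whose date column (as a string) is out of range;
            # __filter__ is the SDK's record-filtering operation
            __filter__: record['event_date'] < '9999-12-31'
```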
v
Curious to see if that solution works for you @Charles Feduke !
c
cool I was looking for something like that, will investigate
Filtering won’t work, because the filter is applied after mapping occurs, and Python’s datetime tops out at year 9999:
>>> import datetime
>>> datetime.MAXYEAR
9999
>>> datetime.datetime(11975, 2, 7, 0, 0, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year 11975 is out of range
I can’t create a view in the source system to cast the PG date field as varchar, or else I would. If I could otherwise define the column data type as varchar prior to mapping and have PG perform the cast, that would work, but I’m not sure that’s possible via the tap’s metadata configuration.
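(Roughly this, with made-up names, if the source system allowed it:)

```sql
-- Hypothetical view: expose the problem column pre-cast to text, so the tap
-- never has to build a Python datetime from it.
CREATE VIEW public.my_table_export AS
SELECT
    id,
    event_date::varchar AS event_date  -- '11975-02-07' survives as plain text
FROM public.my_table;
```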
I could leverage the dates_as_string configuration, as per the tap:
if self.config["dates_as_string"] is True:
    sqltype_lookup["date"] = th.StringType()
    sqltype_lookup["datetime"] = th.StringType()
… but this converts all dates, which is not desirable: there are a number of timestamp fields that are important, and I’d rather not import 3 GB of data just to build a replica where each of those varchar fields is cast back into timestamp or date as appropriate, if I don’t have to.
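(For reference, enabling it would just be a one-line config change in meltano.yml; the option name comes from the snippet above, and the rest of the config is elided:)

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        # converts EVERY date/datetime column to a string, not just the bad one
        dates_as_string: true
```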
e
Gotcha. Can you create an issue in the repo? I think the gist of the request is to skip extracting rows that fail deserialization from the db (?) for whatever reason.
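Something like this, conceptually; a rough sketch of the requested behavior against a generic DB-API cursor, not tap-postgres’s actual code:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_rows_skipping_bad(cursor):
    """Yield rows from a DB-API cursor, skipping any row whose values fail
    deserialization (e.g. ValueError: year 11975 is out of range) instead of
    aborting the whole sync. Hypothetical sketch only.
    """
    while True:
        try:
            row = cursor.fetchone()  # the driver converts DB types to Python here
        except ValueError as exc:
            logger.warning("Skipping row that failed deserialization: %s", exc)
            continue  # move on to the next row
        if row is None:
            return  # end of result set
        yield row
```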
c
will do