# troubleshooting
c
I’m using tap-postgres to extract a largish table (around 3 GB) and at some point along the way it encounters a record with an invalid year:
ValueError: year 11975 is out of range
… is there a straightforward way to just ignore rows like this? Obviously the data is no good, but I cannot delete the rows in the source system. I narrowed the call stack down to https://github.com/MeltanoLabs/tap-postgres/blob/main/tap_postgres/client.py#L324 and I don’t see any sort of “ignore bad rows” option, but maybe there is a pre-filter or some setting I’m missing. (I’d paste the entire stack trace here, but I’m running in MWAA - Amazon Managed Workflows for Apache Airflow, a feat of strength unto itself - and left colorized logging on, which CloudWatch proudly renders as illegible.)
e
You might wanna take a look at https://docs.meltano.com/guide/mappers/. There's a way to drop records based on a user-defined condition: https://sdk.meltano.com/en/latest/stream_maps.html#filtering-out-records-from-a-stream-using-filter-operation.
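For example, a minimal sketch of what that could look like in meltano.yml (the stream and column names here are made up, and it assumes the date column arrives as an ISO-8601 string):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        stream_maps:
          public-my_table:  # hypothetical stream name, <schema>-<table>
            # drop any record whose date column (as a string) is out of range;
            # __filter__ is the SDK's record-filtering operation
            __filter__: record['event_date'] < '9999-12-31'
```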
v
Curious to see if that solution works for you @Charles Feduke !
c
cool I was looking for something like that, will investigate
Filtering won’t work, because the filter is applied after mapping occurs, and Python’s datetime tops out at year 9999:
>>> import datetime
>>> datetime.MAXYEAR
9999
>>> datetime.datetime(11975, 2, 7, 0, 0, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year 11975 is out of range
I can’t create a view in the source system to cast the PG date field as varchar, or else I would. If I could otherwise define the column data type as varchar prior to mapping and have PG perform the cast, that would work, but I’m not sure that’s possible via the tap’s metadata configuration.
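(Roughly this, with made-up names, if the source system allowed it:)

```sql
-- Hypothetical view: expose the problem column pre-cast to text, so the tap
-- never has to build a Python datetime from it.
CREATE VIEW public.my_table_export AS
SELECT
    id,
    event_date::varchar AS event_date  -- '11975-02-07' survives as plain text
FROM public.my_table;
```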
I could leverage the dates_as_string configuration, as per the tap:
if self.config["dates_as_string"] is True:
    sqltype_lookup["date"] = th.StringType()
    sqltype_lookup["datetime"] = th.StringType()
… but this converts all dates, which is not desirable: there are a number of timestamp fields that are important, and I’d rather not import 3 GB of data just to build a replica where each of those varchar fields is cast back into timestamp or date as appropriate, if I don’t have to.
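(For reference, enabling it would just be a one-line config change in meltano.yml; the option name comes from the snippet above, and the rest of the config is elided:)

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        # converts EVERY date/datetime column to a string, not just the bad one
        dates_as_string: true
```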
e
Gotcha. Can you create an issue in the repo? I think the gist of the request is to skip extracting rows that fail deserialization from the db (?) for whatever reason.
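Something like this, conceptually; a rough sketch of the requested behavior against a generic DB-API cursor, not tap-postgres’s actual code:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_rows_skipping_bad(cursor):
    """Yield rows from a DB-API cursor, skipping any row whose values fail
    deserialization (e.g. ValueError: year 11975 is out of range) instead of
    aborting the whole sync. Hypothetical sketch only.
    """
    while True:
        try:
            row = cursor.fetchone()  # the driver converts DB types to Python here
        except ValueError as exc:
            logger.warning("Skipping row that failed deserialization: %s", exc)
            continue  # move on to the next row
        if row is None:
            return  # end of result set
        yield row
```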
c
will do