I have a bit of a curly one...re `tap-mssql`. Firs...
# singer-taps
d
I have a bit of a curly one...re
tap-mssql
. First of all, I have seen the discussions around
tap-mssql
(which relies on a JVM) and
pipelinewise-tap-mssql
which includes a Python wrapped version leveraging some C libs, and FreeTDS et al. The issue I am having is that I need to connect to an Azure SQL server.
pymssql
can do this IF the right steps are followed to ensure we have a version of FreeTDS that supports SSL encryption. The problem is that the bundled version of FreeTDS that comes with
pymssql
as part of the larger
pipelinewise-tap-mssql
package does not support this. On an ubuntu box, if I navigate specifically to the
site-packages
directory of that extractor, it will fail:
Copy code
cd /path/to/project/.meltano/extractors/tap-mssql/venv/lib/python3.8/site-packages
python
...
>>> import pymssql
>>> host = "<http://xxx.database.windows.net|xxx.database.windows.net>"
>>> port = 1433
>>> user = "abc"
>>> password = "def"
>>> db_name = "123"
>>> conn = pymssql.connect(f"{host}:{port}", f"{user}@{host.split('.')[0]}", password, db_name)
This gives us:
pymssql.OperationalError: (20002, b'DB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (<http://xxx.database.windows.net|xxx.database.windows.net>)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (<http://xxx.database.windows.net|xxx.database.windows.net>)\n')
If I am outside of that virtualenv, just straight on my ubunutu box where i have pip insalled both
pymssql
and apt installed
freetds-bin
Then it works. I'm not sure if it is because of an updated version of
pymssql
or it is somehow seeing the freetds libs on the system (but unsure why the virtual env would not also see the same libs...) Any ideas? Is there a way to update sub-packages in an installed tap?
I uninstalled
freetds-bin
and retried with the latest
pymssql
lib and it still works. So I've got my answer re: what the problem is. Just not sure how to upgrade a tap in-situ. I suppose I could fork the tap and stash it in the
extractors
folder and then manually upgrade the
pymssql
package, but wondered if there was a better way
This appears to have worked: • Download the
pipelinewise-tap-mssql
package with
git submodule
• Update
setup.py
and bump the version of
pymssql
from
2.1.5
to
2.2.1
(latest at time of writing) • Run:
meltano add --custom tap-mssql
• Set executable to
-e extract/pipelinewise-tap-mssql
Still curious if there is a better way
v
I'd submit a PR to the pipeline wise folks with the updated package. Then I'd point my Meltano version to the Fork you made that works until it gets merged in the upstream (hopefully)
I did make a target mssql that uses pyodbc instead of pymssql but I haven't made a tap that does the same. This is good to know though
d
Good idea on the PR. I've done that now. https://github.com/wintersrd/pipelinewise-tap-mssql/pull/8
s
@david_tout, thanks for the information about changing setup.py pymssql to 2.2.1. I found before the change it was not recognizing MSSQL dates as a datetime.date datatype rather it was a string.
d
I had to make a few more data-type changes as well. Can't recall if I PRd them. I remember
money
was not being handled correctly
s
Out of a matter of interest I'm working on a big PR to support MS SQL change data capture. I have most of it working now. Will submit a PR, it would be nice if it is merged in otherwise I will use my variant as we wish to handle incremental changes to handle things like deletes.
My variant that I'm working on is located here. https://github.com/s7clarke10/pipelinewise-tap-mssql
d
I'm not the maintainer of the original repository. I would also be interested in support for CDC. Maybe this tap could be adopted by meltano so that it can be maintained? @aaronsteers @taylor @douwe_maan
s
I'll let you know when I finish testing. I have just been adding support to state what the local timezone is for the MSSQL so it can emitted datetimes and dates with the correct UTC time offset.
p
@steve_clarke I just found this thread 👋. Do you think https://github.com/s7clarke10/pipelinewise-tap-mssql is in a good place? It looks like the upstream hasnt been accepting the open PRs and youve made a fair amount of commits ahead of it. Theres also an option to move one of these variants to the MeltanoLabs namespace so it gets more attention. I havent used any of them myself so I'd defer to you and @david_tout
d
Upstream is basically abandoned as far as I can tell. I have not made extensive modifications but I did resolve a few bugs I'm not confident how good my fixes are either. I'm happy to make the code base available back to the community, but am not sure how much time I have to support. We're also not using the tap in anger. It's part of a daily job, but we're not pushing it very hard.
s
Hi @pat_nadolny, @david_tout, I do think my variant on pipelinewise-tap-mssql is in a reasonably good place. Following suggestions from David, I updated tap-mssql to use the latest pymssql dependency and also brought in support for a lot of missing data types. I also added in a lot of additional code to allow
log_based
extraction of data using MS SQL Server Change Data Capture. So if you are looking to incremental extract data from MS SQL Server this could be quite useful. At the moment a work collogue is putting the tap through it's paces in terms of testing. I should have a good picture of if there are any issues with it in about a week.
d
It sounds like @steve_clarke’s version is a safer bet in terms of adopting one as a standard!
My latest code is here https://github.com/zendient/pipelinewise-tap-mssql Just in case any of it is useful
Apologies, there are some noisy commits in there thanks to code formatters
s
I should mention that there is another Meltano users who has created a variant which I believe supports Change Tracking which is an earlier version of Change Data Capture. So if a user wanted to support Change Tracking rather than Change Data Capture then his version should be used for
log-based
extracts. https://github.com/SteveDMurphy/pipelinewise-tap-mssql So there may be benefits for a few variants.
@david_tout, I could always do a bit of a compare between your variant and mine and potentially include some of your enhancements into my variant if I haven't supported them?
d
I'll see if I can put a cheatsheet here to save you time. There's probably only two or so that may be worth grabbing
This set of commits was all related to some issues I was facing with date fields not coming through with the right timezone, and it was causing duplicated records in my downstream database. In essence I tried to make it a bit smarter with timezones and date conversion, and set the incremental logic to only get NEWER data and not "greater than or equal to" to stop receiving the same data twice. The rest are trivial and you've already grabbed most of them
s
Out of a matter of interest, I added in support to emit datetimes with a datetime datetype so I could read them into Snowflake into a DateTime datatype without having to convert a string into a datetime datatype. There is a parameter to set if you want to use this feature.
d
That sounds helpful!
s
While this is slightly unrelated, I also needed to create a tap-sybase to connect to Sybase Databases so I converted my variant of pipeline-tap-mssql to work with Sybase databases here https://github.com/s7clarke10/tap-sybase. Because pymsql uses TDS under the hood it was fairly compatible.
d
We have an additional challenge because I'm based in Australia and some of our dates are in AEST/AEDT and some are UTC and not all have timezones specified. So we constantly have issues with bare dates... We use PostgreSQL as our main backend DB and it took us a little while to figure out that we could actually set the intended timezone for converting a bare date without timezone data to them allow comparison with datetimes that DO have timezones. Maybe MSSQL has something similar.
s
Hi @david_tout, @pat_nadolny, it has been great to see that Rob Winters has pushed through a number of Pull Request for requested features in the last week. I think given a number of requested changes have gone through, that pipeline-tap-mssql will be our best MS SQL TAP. I have reached out to Rob regarding some of my features and have agreed that I will refactor and push my enhancements from my fork to the tap over the next couple of weeks. The additional features will be. • Support for Log Based replication using the in-built MS SQL Server CDC. My setup for this is documented here: https://github.com/s7clarke10/pipelinewise-tap-mssql/blob/master/MS_CDC_SETUP.md and here https://github.com/s7clarke10/pipelinewise-tap-mssql/blob/master/README.md. • Support for some additional MS SQL Data Types e.g. Time, Numeric • Support to optionally emit Dates as Dates, this can be useful in databases like Snowflake. I put through an enhancement into pipelinewise-target-snowflake to accept Date datatypes. It allows the data to land as a native date. I mentioned changing the incremental loads to read > than the saved state rather than >= . In that way you don’t end up with the last record being re-sent. Rob has suggested with poor data sources or if the data type is date that could be problematic and you could miss records. I tend to agree with Rob's suggestion on this so I won't make a feature to support this.
As a FYI, tap-mssql has been updated now with the additional features included in the PR pushed into the main repository https://github.com/wintersrd/pipelinewise-tap-mssql . New Features include: • Support for Microsoft Change Data Capture (CDC) log based replication • Emitting date and time data types as date and time • Allowing the use of a different characterset when reading legacy on-premise MS SQL Server database e.g. CP1252 instead of UTF8, this fixes conversion issues. Updates will need to go into your meltano.yml file for additional parameters which have been introduced into this release. - name: use_date_datatype kind: boolean - name: characterset kind: string