# singer-tap-development
h
Hi! Instead of doing useful things, I have pursued an LLM-related idea for ingesting pretty much any unstructured document and chunking the data. Combined with something like map-embeddings and target-chroma, this might be useful for people who have a lot of files on a drive somewhere, and a data-science group yelling “we want to do LLMs!”. Lots of the code was stolen from the upcoming tap-file: https://github.com/radbrt/tap-text-anywhere. This is mainly a proof-of-concept to gauge interest; thoughts are welcome. The Meltano file in the repo is preconfigured with an s3 bucket with two PDF files, so anyone should be able to fork, install and load from s3.
a
Awesome. Definitely going to take a look, we had a similar idea and created a ‘tap-gpt’ 😁 https://www.matatika.com/articles/technology-using-gpt-3-large-language-model-to-extract-structured-information-documents/
h
Nice, we are obviously thinking in the same direction, and I like your blog post. The privacy thing reminds me of a map-transformer I have wanted to create, using spaCy and NER to cross out text recognized as person names, as well as likely emails and any string of digits longer than 8 (which are often SSNs, phone numbers etc). It would never be 100%, but with a pragmatic lawyer it might be good enough for some purposes.
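Roughly what I have in mind for the redaction part, as a minimal sketch and not an actual mapper: the names `redact`, `spacy_person_spans`, the two regexes and the mask character are all made up for illustration, and the `en_core_web_sm` model is just an assumption about which spaCy pipeline one would use. The regex part runs on its own; the NER part is passed in as a function so spaCy is only needed if you want names crossed out too.

```python
import re

# Deliberately loose patterns (assumptions for this sketch, not a standard).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS_RE = re.compile(r"\d{9,}")  # any run of digits longer than 8


def spacy_person_spans(text, model="en_core_web_sm"):
    """Return (start, end) character spans that spaCy labels as PERSON.

    Requires spaCy and the named model to be installed; imported lazily
    so the rest of the sketch works without it.
    """
    import spacy

    nlp = spacy.load(model)
    return [
        (ent.start_char, ent.end_char)
        for ent in nlp(text).ents
        if ent.label_ == "PERSON"
    ]


def redact(text, person_spans=None, mask="█"):
    """Cross out emails, long digit runs and (optionally) person names.

    `person_spans` is any callable taking the text and returning
    (start, end) spans, e.g. `spacy_person_spans`.
    """
    spans = list(person_spans(text)) if person_spans else []
    for pattern in (EMAIL_RE, LONG_DIGITS_RE):
        spans.extend(m.span() for m in pattern.finditer(text))

    # Masking character by character keeps the text length unchanged,
    # so overlapping spans are harmless.
    chars = list(text)
    for start, end in spans:
        chars[start:end] = mask * (end - start)
    return "".join(chars)
```

Usage would be something like `redact(record["text"], person_spans=spacy_person_spans)`, where `record["text"]` stands in for whatever field holds the chunk. Note that digit runs broken up by spaces or dashes slip through the `\d{9,}` pattern, which is part of why this would never be 100%.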
a
Yes, love that too. Was mainly thinking of prod to test use cases for that.
t
Love this - cross-sharing to #C04NT2FR1SL