Meltano #singer-tap-development

Henning Holgersen

06/29/2023, 11:54 AM

Hi! Instead of doing useful things, I have pursued an LLM-related idea for ingesting pretty much any unstructured document and chunking the data. Combined with something like map-embeddings and target-chroma, this might be useful for people who have a lot of files on a drive somewhere, and a data-science group yelling “we want to do LLMs!”. Lots of the code was stolen from the upcoming tap-file: https://github.com/radbrt/tap-text-anywhere. This is mainly a proof-of-concept to check the interest, thoughts are welcome. The Meltano file in the repo is preconfigured with an s3 bucket with two PDF files so anyone should be able to fork, install and load from s3.
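For readers wondering what "chunking the data" could look like in practice, here is a minimal sketch of fixed-size character chunking with overlap. It is not taken from tap-text-anywhere; the function name `chunk_text`, the record keys and the default sizes are all assumptions made for illustration.

```python
from typing import Dict, Iterator, Union


def chunk_text(
    text: str, chunk_size: int = 1000, overlap: int = 100
) -> Iterator[Dict[str, Union[int, str]]]:
    """Yield overlapping character chunks of `text` as record-like dicts.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    start = 0
    index = 0
    while start < len(text):
        yield {
            "chunk_index": index,
            "start": start,
            "text": text[start:start + chunk_size],
        }
        # Stop once this chunk reaches the end, so the tail is not repeated.
        if start + chunk_size >= len(text):
            break
        start += step
        index += 1


if __name__ == "__main__":
    for record in chunk_text("0123456789", chunk_size=4, overlap=1):
        print(record)
```

Each yielded dict could be emitted as one record, so that a downstream mapper such as map-embeddings sees one chunk at a time. The overlap keeps a sentence that straddles a boundary from being cut off in both neighbouring chunks.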

aaron_phethean

06/29/2023, 12:56 PM

Awesome. Definitely going to take a look, we had a similar idea and created a ‘tap-gpt’ 😁 https://www.matatika.com/articles/technology-using-gpt-3-large-language-model-to-extract-structured-information-documents/

Henning Holgersen

06/29/2023, 1:03 PM

Nice, we are obviously thinking in the same direction, and I like your blog post. The privacy thing reminds me of a map-transformer I have wanted to create, using spaCy and NER to cross out text recognized as person names, as well as likely emails and any string of digits longer than 8 (which are often SSNs, phone numbers, etc.). It would never be 100% but with a pragmatic lawyer it might be good enough for some purposes.
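A rough sketch of the redaction idea described above, using only the standard library for the email and digit rules. The person-name step is left as a pluggable `name_finder` callback, since no such mapper exists in this thread; `redact`, `name_finder`, the mask string and both regexes are assumptions for illustration, and the email pattern is deliberately loose.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

Span = Tuple[int, int]

# Loose email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# "Any string of digits longer than 8" read as 9 or more consecutive digits.
LONG_DIGITS_RE = re.compile(r"\d{9,}")


def redact(
    text: str,
    name_finder: Optional[Callable[[str], Iterable[Span]]] = None,
    mask: str = "[REDACTED]",
) -> str:
    """Replace emails, long digit runs and (optionally) names with `mask`."""
    spans: List[Span] = []
    for pattern in (EMAIL_RE, LONG_DIGITS_RE):
        spans.extend(m.span() for m in pattern.finditer(text))
    if name_finder is not None:
        # Hypothetical hook: returns (start, end) character offsets of names.
        spans.extend(name_finder(text))

    # Merge overlapping or touching spans so each region is masked once.
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Replace from the end so earlier offsets stay valid.
    for start, end in reversed(merged):
        text = text[:start] + mask + text[end:]
    return text


if __name__ == "__main__":
    print(redact("Mail bob@example.com or call 123456789, ref 1234"))
```

With spaCy installed, `name_finder` could return `(ent.start_char, ent.end_char)` for every entity in `doc.ents` whose `ent.label_` is `"PERSON"`. Note that the digit rule only catches unbroken runs, so a phone number written with spaces or dashes would slip through.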

aaron_phethean

06/29/2023, 1:06 PM

Yes, love that too. Was mainly thinking of prod to test use cases for that.

taylor

07/03/2023, 4:13 PM

Love this - cross-sharing to #C04NT2FR1SL
