# getting-started
b
Hey folks, I'm new to Meltano. My use case is to extract the content of GitHub repos, or even to extract content from Google Drive PDFs and Word docs. I wonder if Meltano is capable of doing that? From reading the docs, it seems that Meltano is designed more to get the metadata (in GitHub's case), and not the content of the repo. I think the Google Drive support is also only for Excel files. Is it possible to create a custom extractor that pulls the content of GitHub repos and Google Drive docs and Excel files? If not, I'm wondering if the community has done a similar-ish task, and if anyone has libraries or knows a third-party service that I can use for my use case?
p
@bonhomiegandalf for GitHub, what type of content are you looking for? I've used the GitHub tap to extract full README string blobs before, so it might already do what you're looking for. If it's accessible through the GitHub API, then we can always add it to the tap.
For Google Drive, I know the gdrive utility exists for retrieving files: https://hub.meltano.com/utilities/gdrive
Meltano has an SDK for building connectors https://github.com/meltano/sdk and an EDK (extension developer kit) https://github.com/meltano/edk for other non-connector utilities
b
Thanks so much for the Google Drive link, that looks a lot like what I need. For GitHub, I wanted to pull the code in the files.
Also, as an aside, my use case is to eventually load that content into an embedding database. I saw that ChromaDB is supported?
p
It looks like there is a way to retrieve raw content using the GitHub API https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28 but depending on how much you want to extract, it might be better to build a tap that does a git clone, then iterates the repo to get content.
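The clone-then-iterate idea could look roughly like the sketch below. This is only an illustration under assumptions, not how any existing tap works: the function names `clone_repo` and `iter_repo_files`, the record fields, and the `CODE_SUFFIXES` filter are all hypothetical, and it assumes the `git` CLI is installed.

```python
import subprocess
from pathlib import Path
from typing import Iterator

# Hypothetical filter: which file types count as "code" is an assumption.
CODE_SUFFIXES = {".py", ".md", ".js", ".ts", ".go", ".java"}


def clone_repo(url: str, dest: Path) -> Path:
    """Shallow-clone `url` into `dest` using the git CLI (must be installed)."""
    subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)
    return dest


def iter_repo_files(root: Path, suffixes=CODE_SUFFIXES) -> Iterator[dict]:
    """Yield one record per matching file under `root`, skipping .git."""
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip directories and anything inside git's own metadata folder.
        if not path.is_file() or ".git" in rel.parts:
            continue
        if path.suffix.lower() not in suffixes:
            continue
        yield {
            "path": rel.as_posix(),
            # Replace undecodable bytes rather than failing on odd encodings.
            "content": path.read_text(encoding="utf-8", errors="replace"),
        }
```

A caller would typically clone into a `tempfile.TemporaryDirectory()` and loop over `iter_repo_files(clone_repo(url, tmp_path))`, so the working copy is discarded once the records have been emitted.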
What's your use case? How many repos are you trying to extract content from?
I saw that ChromaDB is supported?
Yes, it's relatively new, but there is a target for writing to ChromaDB: https://hub.meltano.com/loaders/target-chromadb
b
Yep, I've built a custom GitHub connector on my own using that API, and I agree that git cloning and processing the temp files is a lot easier. We're probably thinking of scraping at a scale of 100s of repos, so nothing too crazy. Our use case is to extract GitHub code content and index it into a vector DB for code explanations. I'm pretty keen to use Meltano, and don't mind building a custom connector just to pull the content of the GitHub repos. Do you know what resources might be useful to get started? Also, what's the best deployment route right now? I would love to be on Meltano Cloud, although I'm waitlisted right now. In particular, I might want to use the Airbyte variant, as that has the connectors I need right out of the box.
In particular, do we have to have a separate scheduler to call Meltano?
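For anyone getting started on a custom connector like this, it may help to see what a tap ultimately produces: Singer messages, one JSON object per line on stdout. The sketch below uses only the standard library to show that format; it is not the SDK's API, and a real tap would more likely be built with the Meltano SDK linked earlier in the thread. The stream name `repo_files`, the schema fields, and the function `write_singer_messages` are hypothetical.

```python
import json
import sys
from typing import Iterable, TextIO

STREAM = "repo_files"  # hypothetical stream name

# Hypothetical schema: one record per file in a repository.
SCHEMA = {
    "type": "object",
    "properties": {
        "repo": {"type": "string"},
        "path": {"type": "string"},
        "content": {"type": "string"},
    },
}


def write_singer_messages(records: Iterable[dict], out: TextIO = sys.stdout) -> None:
    """Write one SCHEMA message, then one RECORD message per record."""
    schema_msg = {
        "type": "SCHEMA",
        "stream": STREAM,
        "schema": SCHEMA,
        "key_properties": ["repo", "path"],
    }
    out.write(json.dumps(schema_msg) + "\n")
    for record in records:
        record_msg = {"type": "RECORD", "stream": STREAM, "record": record}
        out.write(json.dumps(record_msg) + "\n")
```

A loader such as the ChromaDB target mentioned above reads these lines from stdin, which is why the extractor and the loader can be developed independently.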
a
Hey @bonhomiegandalf, I think this might be pretty close to what you want: https://www.matatika.com/articles/technology-using-gpt-3-large-language-model-to-extract-structured-information-documents/ I've been meaning to get onto the office hours call and demo it at some point!