
NAIRR Award
SSEC won a National AI Research Resource (NAIRR) award to build LLMaven, a tool library that takes a generative AI approach. We will use Retrieval-Augmented Generation (RAG) techniques to extend LLMs with privacy-sensitive data in a way that is safe and cost-effective for individual researchers who lack the resources to develop their own models or purchase expensive equipment. LLMaven will leverage diverse publicly available datasets and disparate academic knowledge bases.
RAG Office Hours
As part of the eScience Institute’s Office Hours program, SSEC offers office hours every Tuesday from 10 to 11 AM at the eScience Institute’s Data Science Studio on the UW campus to support the UW community with Retrieval-Augmented Generation (RAG) based workflows for generative AI. Researchers who are curious about leveraging generative AI tools with private or pre-publication data are welcome to sign up here and stop by with their questions.


Projects
AutoDoc: SSEC worked with researchers from Brown University and the University of Osnabrück to build a pipeline and train a freely available large language model (LLM) that translates research processes implemented in AutoRA (a collection of Python packages that together form a framework for closed-loop empirical research) into natural-language descriptions. Such descriptions provide the basis for automated, transparent documentation of the empirical research process. More details are available here.
Tutorials
SciPy2024 tutorial: The SSEC team presented a tutorial at the annual SciPy conference in Tacoma, WA on July 9, 2024, covering (1) the basics of language models, (2) setting up an environment for using open-source LLMs without the expensive compute resources needed for training or fine-tuning, (3) applying Retrieval-Augmented Generation (RAG) to improve LLM output, and (4) building an app that demonstrates how researchers can turn disparate knowledge bases into special-purpose AI-powered tools.
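The RAG workflow in steps (3) and (4) can be sketched in a few lines. This is a minimal illustration only, not the tutorial's actual code: it uses a toy bag-of-words cosine similarity for retrieval, where a real pipeline would use dense embeddings and a vector store, and it stops at prompt construction rather than calling an LLM. The corpus and query below are made up for the example.

```python
# Minimal RAG sketch: retrieve the passage most relevant to a question,
# then prepend it as context to the prompt sent to an LLM.
# Illustrative only: real pipelines use dense embeddings + a vector store.
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    """Return the k passages most similar to the query."""
    qv = vectorize(query)
    return sorted(corpus, key=lambda p: cosine(qv, vectorize(p)),
                  reverse=True)[:k]

def build_prompt(query, corpus):
    """Augment the query with retrieved context (the 'A' in RAG)."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Toy corpus standing in for a researcher's knowledge base.
corpus = [
    "RAG grounds model answers in retrieved documents.",
    "Fine-tuning updates model weights on new data.",
]
print(build_prompt("How does RAG ground answers?", corpus))
```

The augmented prompt, rather than the bare question, is what gets passed to the LLM, so the model can answer from the retrieved context instead of relying only on its training data.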
