
SuperLLaMA: Testing Generative AI on Open-Source Models

Behind Supernormal’s meeting notes is a world-class AI team laser-focused on helping everyone use AI to succeed at work. In addition to building the technology that powers the Supernormal products you depend on, our AI team works on cutting-edge research to discover new technologies to power the Supernormal of the future and contribute to the greater AI ecosystem. Over the last few months, our team has been experimenting with SuperLLaMA, a custom model built by Supernormal based on the LLaMA open-source model.  

How Supernormal Works Today

Like any meeting attendee, Supernormal's AI notetaker, Norm, joins the meeting. It transcribes the conversation and uses various LLM services (such as OpenAI's GPT models, Google's PaLM, and Anthropic's Claude) along with proprietary fine-tuned models to generate meeting notes, including a summary and action items. The two main ML components in this workflow are speech-to-text (for transcribing meetings) and LLMs (for generating meeting notes from the transcribed speech).

At Supernormal, we transcribe the meeting in real time, and we also generate notes in real time for every k minutes of the meeting. Each transcript chunk, along with an appropriate prompt (depending on the meeting type), is then used to generate the meeting notes.
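
As a rough illustration of that chunked workflow, here is a minimal sketch; the function names, the five-minute value of k, and the prompt wording are placeholders for illustration rather than Supernormal's production code, and `generate` stands in for whichever LLM service or local model produces the notes.

```python
CHUNK_MINUTES = 5   # stands in for "k" above; the real value isn't disclosed here

def build_notes_prompt(transcript_chunk: str, meeting_type: str) -> str:
    """Pair a transcript window with a meeting-type-specific instruction."""
    return (
        f"You are taking notes for a {meeting_type} meeting.\n"
        "Summarize the discussion below and list any action items.\n\n"
        f"Transcript:\n{transcript_chunk}"
    )

def notes_for_meeting(segments, meeting_type, generate):
    """segments: iterable of (minute, text) pairs from the live transcriber.
    generate: any text-generation callable (hosted LLM or local model)."""
    window, notes, last_flush = [], [], 0
    for minute, text in segments:
        window.append(text)
        if minute - last_flush >= CHUNK_MINUTES:
            notes.append(generate(build_notes_prompt(" ".join(window), meeting_type)))
            window, last_flush = [], minute
    if window:  # flush whatever is left at the end of the meeting
        notes.append(generate(build_notes_prompt(" ".join(window), meeting_type)))
    return notes
```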

Open-Source Model Advantages 

The state of the art in generative text applications like meeting summarization currently comes mostly from hosted providers such as OpenAI with GPT-4 and GPT-3.5-Turbo. These models are closed-source, accessed via an API, and served from fleets of dedicated servers. However, a recently leaked Google memo [Link] pointed to a possible future for generative text built on open-source models.

Recent research on instruction-tuned models such as Alpaca (LLaMA), Vicuna (LLaMA), and Dolly (Pythia) has shown great promise across a wide range of tasks. Many of these models perform on par with much larger models like GPT-3.5, GPT-4, and PaLM. Using open-source models has a number of advantages, including:

  • Costs - LLM services charge per token, and for startups built around LLMs, service costs are a major factor.
  • Customization - Open-source models can be trained on internal data and personalized to customer needs, providing a better customer experience.
  • Control - Downtime at an LLM provider directly impacts an LLM-dependent product; self-hosted models provide more control and more contingency options.

Zooming in on cost, summarizing an hour-long meeting is fairly expensive. If you want to extract high-quality action items, a summary, and other key details from a meeting, you can expect to spend more than $3 per meeting in compute costs.
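
That figure is easier to sanity-check with a quick back-of-envelope calculation. The numbers below are assumptions, not Supernormal's actual bill: GPT-4 list prices at the time of writing ($0.03/$0.06 per 1K tokens for the 8K-context model, $0.06/$0.12 for 32K), roughly 150 spoken words per minute, and about 1.3 tokens per word.

```python
# Back-of-envelope cost for one hour-long meeting (all figures are assumptions).
IN_8K, OUT_8K = 0.03 / 1000, 0.06 / 1000     # GPT-4 8K prices per token (USD)
IN_32K, OUT_32K = 0.06 / 1000, 0.12 / 1000   # GPT-4 32K prices per token (USD)

transcript_tokens = 60 * 150 * 1.3           # ~11,700 tokens for an hour of speech

# Real-time pass: each 5-minute chunk (plus prompt overhead) fits the 8K model.
chunks = 12
chunk_cost = chunks * ((transcript_tokens / chunks + 500) * IN_8K + 250 * OUT_8K)

# Final pass: the full transcript no longer fits in 8K tokens, so the summary,
# action items, and key details each need a call to the 32K model.
final_cost = 3 * (transcript_tokens * IN_32K + 700 * OUT_32K)

print(f"~${chunk_cost + final_cost:.2f} per meeting")   # roughly $3 with these numbers
```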

We began by asking what it would take to provide the same quality of output using only models that run on your own computer. Edge computing (on your laptop) avoids cloud compute costs entirely, but it requires a full machine learning pipeline with the following pieces (a minimal sketch follows the list):

  • A local speech-to-text model.
  • A local LLM model for generating the meeting notes summary and action items from the transcribed text.
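
Here is a minimal sketch of what such a local pipeline could look like, using the open-source openai-whisper package for transcription and a llama.cpp-compatible model via llama-cpp-python for note generation; the model paths, prompt, and context size are placeholders, not the setup described later in this post.

```python
import whisper                      # pip install openai-whisper
from llama_cpp import Llama         # pip install llama-cpp-python

# 1. Local speech-to-text: transcribe the meeting recording on-device.
stt = whisper.load_model("base")                       # small enough for a laptop
transcript = stt.transcribe("meeting.wav")["text"]

# 2. Local LLM: generate notes from the transcript with a quantized model.
#    A real pipeline would chunk the transcript, since a full hour of speech
#    will not fit in a small context window.
llm = Llama(model_path="path/to/llama-7b.Q4.gguf",     # placeholder model file
            n_ctx=2048)
prompt = ("Below is part of a meeting transcript. Write a short summary "
          f"and a list of action items.\n\nTranscript:\n{transcript}\n\nNotes:")
notes = llm(prompt, max_tokens=512)["choices"][0]["text"]
print(notes)
```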

Experiment Goals

Since LLaMA cannot yet be used for commercial purposes, the goal of this experiment was not to develop a parallel commercial model but to explore how the model performs on internal use cases. With a growing customer base and over 2 million notes per month, developing local models could help us address the issues outlined above. To evaluate the model, we set out to (1) integrate Whisper, speaker diarization, and audio input into an end-to-end demo, and (2) fine-tune the LLaMA model on anonymized notes.

LLM Background

Most LLMs are built on variations of the transformer architecture proposed in Attention is All You Need [Link]. They are trained with the objective of predicting the next word given the words that came before it, much like the suggestion feature on a mobile keyboard that proposes the next word as you type. Models trained this way are known as language models (LMs), and because these models have a very large number of parameters, they are referred to as large language models (LLMs).
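
To make the next-word objective concrete, here is a small example using Hugging Face's transformers library with GPT-2 (chosen only because it is small and public, not because it is part of Supernormal's stack); it extends a sentence one predicted token at a time, which is exactly what the pre-training objective optimizes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The model only ever learns P(next token | previous tokens); generation simply
# applies that prediction repeatedly, like a keyboard suggesting the next word.
inputs = tokenizer("The meeting was about", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```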

These LLMs differ in the training data and training objectives used. GPT-3, GPT-3.5, GPT-4, and LLaMA were trained on different datasets of varying sizes, and their performance varies with the pre-training data, the size of the training dataset, and the number of parameters. For instance, GPT-3 has 175B parameters trained on a roughly 500B-token dataset, while LLaMA-7B has 6.7B parameters trained on 1T tokens.

Instruction tuning is a fine-tuning paradigm applied on top of these LLMs. Instead of learning to predict the next word given the previous words in a sentence, the model is trained to follow an instruction and use the input sequence to generate the output tokens. Pre-training backpropagates through every word in the sequence, whereas instruction tuning backpropagates only through the words of the output sequence.

The instruction tuning dataset consists of 3 parts: instruction, input, and output. Training instances for instruction tuning look as follows:

Instruction: Generate a summary for the following input.
Input: Queen Elizabeth was the queen of England. She ruled over ….
Output: ….. (A summary of the above passage).

Training on instruction-input pairs makes the model generalize to natural language instructions: it becomes more robust to different phrasings of the instruction above, and more adaptable to entirely different instructions as well.
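
As a sketch of the loss masking described above: the instruction, input, and output are concatenated into a single token sequence, but only the output positions keep their labels, so backpropagation ignores the prompt. The prompt template and the GPT-2 tokenizer below are placeholders for illustration.

```python
from transformers import AutoTokenizer

IGNORE_INDEX = -100   # the label value PyTorch's cross-entropy loss ignores by default

def build_training_example(instruction, inp, output, tokenizer):
    prompt = f"Instruction: {instruction}\nInput: {inp}\nOutput: "
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(output + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + output_ids
    # Backpropagate only through the output tokens, not the instruction/input.
    labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids
    return {"input_ids": input_ids, "labels": labels}

tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer
example = build_training_example(
    "Generate a summary for the following input.",
    "Queen Elizabeth was the queen of England. She ruled over ...",
    "A short summary of the passage.",
    tok,
)
```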

Alpaca is an instruction-tuned model built on top of the open-source LM LLaMA (similar in spirit to GPT-3). The Alpaca authors gathered a set of 52K unique instruction-input-output tuples and used this dataset to fine-tune LLaMA following the instruction-tuning approach described above. There are two ways to generate the output for a given instruction-input pair: (1) ask a human annotator to write it, or (2) use a high-quality LLM like GPT-3.5, GPT-4, or Claude to generate it.

In the case of Alpaca, the authors used GPT-3.5 to generate the output sequences for each instruction-input pair. The wide variety of instructions made the model generalizable to instructions that were not part of the training set.
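
Here is a hedged sketch of option (2), using a hosted model through the OpenAI Python client to write the output field for one instruction-input pair; the prompt wording is ours, not the Alpaca authors'.

```python
from openai import OpenAI           # pip install openai

client = OpenAI()                   # expects OPENAI_API_KEY in the environment

def generate_output(instruction: str, inp: str) -> str:
    """Ask a stronger hosted model to write the 'output' field for one example."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"{instruction}\n\nInput:\n{inp}\n\nOutput:",
        }],
    )
    return response.choices[0].message.content

example_output = generate_output(
    "Generate a summary for the following input.",
    "Queen Elizabeth was the queen of England. She ruled over ...",
)
```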

Experiment Training Process

Alpaca was one of the first instruction-tuning approaches demonstrated on small models (<50B parameters) to show surprisingly good results at a low training cost, so we adopted a similar approach. But we wanted a model for processing meeting transcripts, which look very different from the Alpaca data, so we trained on a mixture of the Alpaca dataset (26k examples) and 26k anonymized internal meeting examples (a mixture of GPT-3, GPT-3.5, and GPT-4 outputs) covering both summaries and action items. A few of the major differences between the Alpaca dataset and our internal dataset:

  • Input lengths - The average Alpaca input was 4 words, versus roughly 900 words for our internal dataset.
  • Noisy data - Alpaca inputs contained very few words, whereas our transcripts were not only longer but also noisier, with frequent speaker changes.
  • Instructions - The Alpaca dataset had 52k unique instructions, whereas our internal dataset contains only a handful of unique instructions.

Due to these differences, our training cost was 20x higher than training on the Alpaca dataset alone. Like Alpaca, we fine-tuned the LLaMA-7B model.
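
The post does not spell out the exact training recipe, so the sketch below shows just one plausible way to fine-tune LLaMA-7B on such a mixture with Hugging Face transformers, datasets, and LoRA adapters from peft; the file names, hyperparameters, and the choice of LoRA itself are our assumptions, not a description of how SuperLLaMA was actually trained.

```python
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "path/to/llama-7b"                      # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token      # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA keeps the 7B base frozen and trains only small adapter matrices.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Mix an Alpaca-style subset with anonymized internal meeting examples.
# Each record is assumed to hold a "text" field with the instruction, transcript,
# and expected notes already concatenated.
alpaca = load_dataset("json", data_files="alpaca_subset.json")["train"]
meetings = load_dataset("json", data_files="internal_meetings.json")["train"]
train_data = concatenate_datasets([alpaca, meetings]).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="superllama-sketch",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=3),
    train_dataset=train_data,
    # mlm=False gives plain next-token labels; a real run would also mask the
    # prompt tokens as in the instruction-tuning sketch above.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```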

Experiment Results

From the examples below, we can draw the following conclusions: 

  • The large OpenAI models tend to be more specific: in summaries they include the names of the parties involved rather than generic terms like “the team.”
  • For action items, small models find it difficult to select the correct entities, which leads to somewhat ambiguous outputs. In a few other cases, the model simply predicts NONE when it is unsure what action items exist.
  • The limited context window of LLaMA-like models makes it difficult to pass in-context examples, which prevents us from fully leveraging the benefits of fine-tuning.


[Example figures: Meeting Summary; Action Items - Success Example; Action Items - Failure Example]



Future Directions

We believe that a larger language model trained on filtered data should help alleviate some of the issues we experienced, and we plan to continue the experiment in the following ways:

  • Experiment with larger model sizes.
  • Experiment with different base models other than LLaMA. 
  • Explore quantization approaches to reduce training cost (a brief sketch follows this list).
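
For the quantization bullet, one possible direction is QLoRA-style training: load the frozen base weights in 4-bit with bitsandbytes and train LoRA adapters on top. The sketch below only shows the quantized model load; the checkpoint path is a placeholder, and this is one option among several rather than a committed plan.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig  # needs bitsandbytes installed

# Load the frozen base weights in 4-bit to shrink GPU memory during fine-tuning;
# LoRA adapters would then be trained in higher precision on top of this model.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b",                  # placeholder checkpoint path
    quantization_config=quant_config,
    device_map="auto",
)
```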
