LLMs: our future overlords are hungry and thirsty
Since early this year, the news around generative AI technologies, such as ChatGPT, has been never-ending. Some have even suggested that humanity’s existence is at stake. While there’s a massive amount of hype, there’s also a lot of potential, as shown by tools such as ChatGPT and Copilot. Consequently, I decided to explore generative AI from the perspective of an enterprise software architect. This article is the first in a series about generative AI - specifically Large Language Models (LLMs) - and the microservice architecture.
A large language model is a function…
Large Language Models (LLMs) are a generative AI technology for natural language processing.
Simply put, an LLM is a function that takes a sequence of words as input (the prompt) and returns a sequence of words that’s the most likely completion of the prompt.
$ python3
Python 3.11.4 ..
>>> from langchain.llms import Ollama
>>> llm = Ollama(model="llama2")
>>> llm("who is Oppenheimer")
' J. Robert Oppenheimer was an American theoretical physicist and professor who made significant contributions to...
Not particularly threatening, right?
…that is implemented by a neural network…
An LLM is implemented by a neural network. The details are quite complicated, but the basic idea is that the neural network is trained on a large amount of text to predict the next word (or, more accurately, the next token, which is an encoded word fragment) given an input sequence of words. The entire completion is constructed by iteratively predicting the next token given the input sequence and the previously predicted tokens.
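To make the iterative prediction loop concrete, here is a minimal sketch. It is not how a real LLM works internally: the "model" is just a table of word-pair counts built from a tiny made-up corpus, and it looks only at the last token rather than the whole sequence. The names `CORPUS`, `train_bigram_model`, `predict_next_token` and `complete` are all invented for this illustration. The loop in `complete`, however, has the same shape as the one described above: predict a token, append it, and predict again.

```python
from collections import Counter, defaultdict

# Tiny made-up training text (an assumption for illustration only).
CORPUS = "the cat sat on the mat . the cat sat on the sofa ."


def train_bigram_model(text):
    """Count, for each token, which tokens follow it in the training text."""
    tokens = text.split()
    follower_counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        follower_counts[current][following] += 1
    return follower_counts


def predict_next_token(model, tokens):
    """Return the most frequent follower of the last token, or None if unseen.

    A real LLM conditions on the whole sequence; this toy uses only the last token.
    """
    candidates = model.get(tokens[-1])
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


def complete(model, prompt, max_new_tokens=3):
    """Build a completion by repeatedly predicting and appending the next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = predict_next_token(model, tokens)
        if next_token is None or next_token == ".":  # "." acts as an end marker
            break
        tokens.append(next_token)
    return " ".join(tokens)


model = train_bigram_model(CORPUS)
print(complete(model, "the cat"))
```

Each pass through the loop feeds the prompt plus everything generated so far back into the predictor, which is why long completions take longer to generate than short ones.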
… with numerous NLP use cases
LLMs have numerous use cases, including text generation, summarization, rewriting, classification, entity extraction and semantic search. To learn more about LLM use cases, see Large Language Models and Where to Use Them: Part 1 and Hugging Face Transformers.
LLMs are hungry
LLMs are rather large, with billions of parameters, which are the weights of the neural net’s neurons and connections. For example, the GPT-3 LLM has 175 billion parameters, and Meta’s LLaMA is a collection of language models ranging from 7B to 65B parameters. Consequently, LLMs involve lots of math and are hungry for computational resources, specifically expensive GPUs.
The resource requirements depend on the phase of the LLM’s lifecycle. The three phases of an LLM’s lifecycle are:
- training - creating the LLM from scratch
- fine-tuning - tailoring the LLM to specific tasks
- inferencing - performing tasks with the LLM
Training and inferencing both require a lot of compute resources. Let’s look at each of these in turn.
Training
Training an LLM from scratch is an extremely computationally intensive task. For example, training the GPT-3 LLM required an estimated 355 years of GPU time, and the training cost was estimated at $4.6 million. Training is so costly because it requires lots of GPUs that have large amounts of memory, which are expensive. For example, an AWS EC2 p5.48xlarge instance, which has 8 GPUs each with 80 GB of memory, costs $98.32 per hour. Consequently, most organizations will use a third-party, pretrained LLM.
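A little arithmetic helps put those numbers in perspective. The sketch below uses only the figures quoted above. The function names, the 1,024-GPU cluster size and the assumption that work spreads perfectly across GPUs are mine, not from any study. Note also that the GPT-3 estimate and the p5.48xlarge price probably refer to different GPU hardware, so the per-GPU-hour comparison is rough at best.

```python
HOURS_PER_YEAR = 365 * 24

# Figures quoted in the text above.
GPT3_GPU_YEARS = 355
GPT3_TRAINING_COST_USD = 4_600_000
P5_HOURLY_PRICE_USD = 98.32
P5_GPU_COUNT = 8


def implied_price_per_gpu_hour(total_cost_usd, gpu_years):
    """Cost per GPU-hour implied by a total cost and a GPU-year estimate."""
    return total_cost_usd / (gpu_years * HOURS_PER_YEAR)


def wall_clock_days(gpu_years, gpu_count):
    """Elapsed days if the work were spread perfectly over gpu_count GPUs.

    Perfect scaling is an optimistic simplification.
    """
    return gpu_years * 365 / gpu_count


gpt3_rate = implied_price_per_gpu_hour(GPT3_TRAINING_COST_USD, GPT3_GPU_YEARS)
p5_rate = P5_HOURLY_PRICE_USD / P5_GPU_COUNT

print(f"GPT-3 estimate implies ${gpt3_rate:.2f} per GPU-hour")
print(f"p5.48xlarge works out to ${p5_rate:.2f} per GPU-hour")
# 1,024 GPUs is a hypothetical cluster size chosen for illustration.
print(f"355 GPU-years on 1,024 GPUs: about {wall_clock_days(GPT3_GPU_YEARS, 1024):.0f} days")
```

Even with a thousand GPUs working in parallel, a training run of this size takes months, which is why few organizations train their own models.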
Fine-tuning
Fine-tuning an LLM, which tailors an LLM to a specific task by adjusting the parameters, is much less computationally intensive than training. As a result, it’s fairly inexpensive, but it still requires GPUs.
Inferencing
Inferencing with an LLM, which is using the LLM to perform tasks, is less computationally intensive than training, but typically still requires GPUs.
Moreover, the GPU memory must be large enough to hold all of the LLM’s billions of parameters, either on a single GPU or split across several.
By default, each parameter is 16 bits, so 2 bytes per parameter are required.
Sometimes, however, quantization can be applied to use fewer bits per parameter, although there’s a trade-off between accuracy and memory usage.
Machines with GPUs that can run LLMs are more expensive than machines without GPUs.
For example, an AWS g5.48xlarge instance, which has 8 GPUs each with 24 GB of memory, costs $16/hour.
A comparable non-GPU instance, such as an m7i.48xlarge, costs $9.6768/hour.
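The 2-bytes-per-parameter rule makes it easy to estimate how much GPU memory a model's weights need. The following sketch applies it to the model sizes mentioned earlier and checks each against a single 24 GB GPU. The function name is mine, and the 4-bit figure is simply one illustrative quantization level. The estimate covers the weights only; actual inference needs additional memory for intermediate data, so treat these numbers as lower bounds.

```python
def model_memory_gb(parameter_count, bits_per_parameter=16):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return parameter_count * bits_per_parameter / 8 / 1e9


# Parameter counts quoted in the text above.
MODELS = {"LLaMA 7B": 7e9, "LLaMA 65B": 65e9, "GPT-3 175B": 175e9}
GPU_MEMORY_GB = 24  # per-GPU memory of the g5.48xlarge

for name, parameter_count in MODELS.items():
    for bits in (16, 4):  # 4 bits is an illustrative quantization level
        size_gb = model_memory_gb(parameter_count, bits)
        verdict = "fits in" if size_gb <= GPU_MEMORY_GB else "exceeds"
        print(f"{name} at {bits} bits: {size_gb:.1f} GB, {verdict} one {GPU_MEMORY_GB} GB GPU")
```

By this estimate, only the smallest model fits on a single 24 GB GPU at 16 bits, which shows why quantization and multi-GPU machines matter for inference.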
LLMs are also thirsty
Since LLMs are computationally expensive, they are also thirsty for water. Lots of computation consumes a lot of electricity, which generates heat, and data centers use a lot of water for cooling. For example, studies estimate that a ChatGPT conversation consumes as much as 500 ml of water. So much for the environment!
LLMs and the microservice architecture
Let’s imagine that you want to add an LLM to your enterprise Java application. It’s quite likely that you will want to deploy the LLM inferencing code as a separate service for the following two reasons: efficient utilization of expensive GPU resources and the need to use a different technology stack. Let’s look at each reason in turn.
Efficient utilization of expensive GPU resources
There are two ways to run an LLM: self-hosted or SaaS. If you are self-hosting an LLM, then you are running software that has very distinctive resource requirements. LLMs must run on more expensive machines that have GPUs. Therefore, in order to utilize those resources efficiently, you will need to resolve the dark energy force “segregate by characteristics” by deploying your LLM as a separate service running on specialized infrastructure. You typically wouldn’t want to run non-GPU services on the same machines as your LLM services, since that could result in over-provisioning of GPUs.
Separate technology stack
The second reason to run your LLM-related code as a separate service is that it will likely use Python instead of Java. While you might run the LLM using a Java technology stack, it appears that Python has a much richer ecosystem. Furthermore, even if you are using a SaaS-based LLM, you will often write Python code to interact with the LLM. For example, the prompt tuning/engineering code that tailors an ‘off the shelf’ LLM to a specific task is often written in Python. Consequently, in order to resolve the dark energy force “multiple technology stacks”, you will need to deploy your Python-based LLM logic as a separate service.
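As a rough illustration of what such a service might look like, here is a minimal sketch using only Python's standard library. Everything specific in it is an assumption made for this example: the `/completions` path, the JSON request and response shapes, the port, and the `fake_llm` placeholder that stands in for a real model call. A production service would more likely use a web framework and add authentication, timeouts and so on.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def fake_llm(prompt):
    """Placeholder for a real model call (self-hosted or SaaS)."""
    return f"(completion of: {prompt})"


def handle_completion(body, llm=fake_llm):
    """Turn a JSON request body into an (HTTP status, JSON response bytes) pair."""
    try:
        prompt = json.loads(body)["prompt"]
    except (ValueError, KeyError, TypeError):
        prompt = None
    if not isinstance(prompt, str):
        error = {"error": "expected a JSON object with a string 'prompt' field"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"completion": llm(prompt)}).encode()


class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/completions":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_completion(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8080):
    """Create (but do not start) the HTTP server for the completion service."""
    return ThreadingHTTPServer(("127.0.0.1", port), CompletionHandler)
```

Calling `make_server().serve_forever()` starts the service. The Java application could then POST a body such as `{"prompt": "who is Oppenheimer"}` to `/completions` using whatever HTTP client it already has, without knowing anything about the Python stack behind the endpoint.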
What’s next?
In future articles, I’ll explore the topic of LLMs and microservices in more detail.