As demand for generative artificial intelligence continues to grow, developers and enterprises are looking for more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we're excited to announce the availability of G6e instances powered by NVIDIA L40S Tensor Core GPUs on Amazon SageMaker. You can configure your nodes with 1, 4, or 8 L40S GPUs, each providing 48 GB of GPU memory. This launch enables organizations to host powerful open source foundation models such as Llama 3.2 11B Vision, Llama 2 13B, and Qwen 2.5 14B on a single-GPU node using G6e instances. This makes them an ideal choice for those looking to optimize costs while maintaining high performance for inference workloads.
Key highlights of G6e instances include:
- Twice the GPU memory of G5 and G6 instances, enabling deployment of large language models in FP16, up to:
- 14B parameter model on a single-GPU node (G6e.xlarge)
- 72B parameter model on a 4-GPU node (G6e.12xlarge)
- 90B parameter model on an 8-GPU node (G6e.48xlarge)
- Up to 400 Gbps of network throughput
- Up to 384 GB of GPU memory
Use cases
G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e delivers higher performance and better cost-efficiency than G5 instances, making it ideal for low-latency, real-time use cases such as:
- Chatbots and conversational AI
- Text generation and summarization
- Image generation and vision models
We also observe that G6e performs well for inference at high concurrency and with longer context lengths. We provide full benchmarks in the following section.
Performance
In the following two figures, we see that for long context lengths of 512 and 1024, G6e.2xlarge delivers 37% better latency and 60% better throughput compared to G5.2xlarge for the Llama 3.1 8B model.
In the following two figures, we see that G5.2xlarge runs out of CUDA memory (OOM) when deploying the Llama 3.2 11B Vision model, whereas G6e.2xlarge delivers excellent performance.
In the following two figures, we compare G5.48xlarge (8 GPUs) with G6e.12xlarge (4 GPUs), which is 35% less expensive and more efficient. At higher concurrency, we observed a 60% reduction in latency and 2.5x higher throughput with G6e.12xlarge.
In the following chart, we compare the cost per 1,000 tokens when deploying Llama 3.1 70B, which further highlights the cost/performance advantage of using G6e instances compared to G5 instances.
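As a back-of-the-envelope check, cost per 1,000 tokens can be derived from an instance's hourly price and its sustained generation throughput. The sketch below is illustrative only; the hourly prices and throughput figures are placeholders, not the benchmarked values from the charts:

```python
# Illustrative cost-per-1,000-tokens calculation.
# The price and throughput numbers below are hypothetical placeholders.
def cost_per_1k_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Cost of generating 1,000 tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1000

# A cheaper instance with higher throughput wins on both factors:
print(round(cost_per_1k_tokens(10.0, 450.0), 4))  # hypothetical G6e-like figures
print(round(cost_per_1k_tokens(16.0, 300.0), 4))  # hypothetical G5-like figures
```

The calculation makes the comparison concrete: a lower hourly price and a higher token throughput both reduce the cost per 1,000 tokens, which is why the gap compounds in G6e's favor.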
Deployment walkthrough
Prerequisites
To try out this solution using SageMaker, you need the following prerequisites:
Deploy
You can clone the repository and use the notebook provided here.
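Deployment from the notebook follows the standard SageMaker Python SDK pattern. The sketch below is a minimal illustration, assuming the Hugging Face TGI container; the container version, model ID, and instance size are assumptions, so substitute the values used in the repository's notebook:

```python
# Illustrative deployment sketch using the SageMaker Python SDK.
# Container version, model ID, and instance type are assumptions;
# use the values from the accompanying notebook.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face TGI container for LLM hosting (version is an assumption)
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "SM_NUM_GPUS": "1",  # one L40S GPU on a G6e.2xlarge
    },
)

# Deploy to a single G6e instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
)

# Sample invocation
print(predictor.predict({"inputs": "What is Amazon SageMaker?"}))
```

Note that deployment requires an AWS account with a SageMaker execution role and, for gated models, Hugging Face access credentials.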
Clean up
To avoid unnecessary charges, we recommend cleaning up the deployed resources after use. You can delete the deployed resources using the following code:
predictor.delete_model()
predictor.delete_endpoint()
Conclusion
G6e instances on SageMaker let you cost-effectively deploy a wide variety of open source models. With greater memory capacity, enhanced performance, and cost-effectiveness, these instances provide an attractive solution for organizations looking to deploy and scale AI applications. The ability to handle larger models, support longer context lengths, and maintain high throughput makes G6e instances particularly valuable for modern AI applications. Try G6e instances for your next deployment.
About the authors
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he focuses on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Tan Allen is a Senior Product Manager at SageMaker, leading large model inference efforts. He is passionate about applying machine learning to analytics. Outside of work, he enjoys outdoor activities.
Pawan Kumar Madhuri is an Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions for generative AI and is passionate about helping customers harness the power of the cloud. He earned a master's degree in information technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.
Michael Nguyen is a Senior Innovation Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and build business solutions on AWS. Michael holds 12 AWS certifications and holds bachelor's/master's degrees in electrical/computer engineering and an MBA from Penn State, Binghamton University, and the University of Delaware.