By Petros Koutoupis, VDURA
With all the buzz around artificial intelligence and machine learning, it's easy to lose sight of which high-performance computing storage requirements are essential to deliver real, transformative value to your organization.
When evaluating a data storage solution, one of the most common performance metrics is input/output operations per second (IOPS). It has long been the standard for measuring storage performance, and depending on the workload, a system's IOPS can be critical.
In practice, when a vendor advertises IOPS, they are really showcasing how many discontiguous 4 KiB reads or writes the system can handle under the worst-case scenario of fully random I/O. Measuring storage performance by IOPS is only meaningful if the workloads are IOPS-intensive (e.g., databases, virtualized environments, or web servers). But as we move into the era of AI, the question remains: does IOPS still matter?
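The gap between an IOPS rating and delivered bandwidth comes down to I/O size, and a quick back-of-the-envelope conversion makes the distinction concrete (the figures below are illustrative, not measurements of any particular system):

```python
# Throughput delivered by a given IOPS figure depends entirely on I/O size.
# The IOPS and block-size values below are illustrative assumptions.

def throughput_gib_s(iops: int, block_size_kib: int) -> float:
    """Convert an IOPS figure at a given block size into GiB/s."""
    return iops * block_size_kib / (1024 * 1024)

# One million 4 KiB random IOPS moves under 4 GiB/s of data...
small_io = throughput_gib_s(1_000_000, 4)       # ~3.81 GiB/s
# ...while only 10,000 sequential 1 MiB operations/s moves nearly 10 GiB/s.
large_io = throughput_gib_s(10_000, 1024)       # ~9.77 GiB/s

print(f"4 KiB random @ 1M IOPS : {small_io:.2f} GiB/s")
print(f"1 MiB sequential @ 10k : {large_io:.2f} GiB/s")
```

This is why a headline IOPS number says little about the sequential streaming performance that AI pipelines actually depend on.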
A Breakdown of Your Typical AI Workload
AI workloads run across the entire data lifecycle, and each stage puts its own spin on GPU compute (with CPUs supporting orchestration and preprocessing), storage, and data management resources. Here are a few of the most common types you'll come across when building and rolling out AI solutions.
AI workflows (source: VDURA)
Data Ingestion & Preprocessing
During this stage, raw data is collected from sources such as databases, social media platforms, IoT devices, and APIs, then fed into AI pipelines to prepare it for analysis. Before that analysis can happen, however, the data must be cleaned: removing inconsistencies and corrupt or irrelevant entries, filling in missing values, and aligning formats (such as timestamps or units of measurement), among other tasks.
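As a rough illustration of those cleaning steps, here is a minimal sketch; the record layout, field names, and fill value are hypothetical, not a real pipeline:

```python
from datetime import datetime, timezone

# Hypothetical sensor records: one well-formed, one with a missing value,
# one with a corrupt timestamp.
RAW = [
    {"ts": "2024-01-05T10:00:00Z", "temp_c": 21.5},
    {"ts": "01/05/2024 10:05",     "temp_c": None},   # missing value
    {"ts": "not-a-date",           "temp_c": 22.0},   # corrupt entry
]

def parse_ts(value: str):
    """Try a few known timestamp formats; return None if all fail."""
    for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%m/%d/%Y %H:%M"):
        try:
            dt = datetime.strptime(value.replace("Z", "+0000"), fmt)
            return dt.replace(tzinfo=dt.tzinfo or timezone.utc)
        except ValueError:
            pass
    return None

def clean(records, fill_temp=20.0):
    out = []
    for rec in records:
        ts = parse_ts(rec["ts"])
        if ts is None:
            continue                      # drop entries we cannot repair
        temp = rec["temp_c"] if rec["temp_c"] is not None else fill_temp
        out.append({"ts": ts.isoformat(), "temp_c": temp})
    return out

print(clean(RAW))  # two cleaned, timezone-aligned records; the corrupt one is dropped
```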
Model Training
After the data is prepped, it's time for the most demanding phase: training. Here, large language models (LLMs) are built by processing data to spot patterns and relationships that drive accurate predictions. This stage leans heavily on high-performance GPUs, with frequent checkpoints to storage so training can quickly recover from hardware or job failures. In many cases, some degree of fine-tuning or similar adjustments may also be part of the process.
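The checkpoint-and-recover pattern can be sketched with a toy loop; the "model state" and step logic below are stand-ins, not how any particular framework checkpoints:

```python
import os
import pickle
import tempfile

# Toy training loop with periodic checkpointing. A real job would flush GPU
# state (weights, optimizer) from every node to a parallel file system;
# here the state is a trivial dict and the interval is illustrative.

def train(steps=10, checkpoint_every=3, ckpt_dir=None):
    ckpt_dir = ckpt_dir or tempfile.mkdtemp()
    state = {"step": 0, "weights": [0.0]}
    for step in range(1, steps + 1):
        state["weights"][0] += 0.1          # stand-in for one training step
        state["step"] = step
        if step % checkpoint_every == 0:
            # On failure, at most `checkpoint_every` steps of work are lost.
            path = os.path.join(ckpt_dir, f"ckpt_{step}.pkl")
            with open(path, "wb") as f:
                pickle.dump(state, f)
    return state, sorted(os.listdir(ckpt_dir))

final_state, checkpoints = train()
print(final_state["step"], checkpoints)   # checkpoints written at steps 3, 6, 9
```

The shorter the checkpoint interval, the less work is lost on failure, but the more often every node must write state to storage at once.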
Fine-Tuning
Model training typically involves building a foundation model from scratch on large datasets to capture broad, general knowledge. Fine-tuning then refines this pre-trained model for a specific task or domain using smaller, specialized datasets, improving its performance.
Model Inference
Once trained, the AI model can make predictions on new, rather than historical, data by applying the patterns it has learned to generate actionable outputs. For example, if you show the model a picture of a dog it has never seen before, it will predict: "This is a dog."
How High-Performance File Storage Is Affected
An HPC parallel file system breaks data into chunks and distributes them across multiple networked storage servers. This allows many compute nodes to access the data concurrently at high speeds. As a result, this architecture has become essential for data-intensive workloads, including AI.
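A simplified sketch of the striping idea, assuming fixed-size chunks assigned round-robin (real parallel file systems add metadata services, redundancy, and tunable layouts on top of this):

```python
# Stripe a file across storage servers: fixed-size chunks assigned
# round-robin, so many clients can read different chunks from different
# servers concurrently. Chunk size and server count are illustrative.

def stripe(data: bytes, chunk_size: int, num_servers: int):
    """Return a per-server list of (offset, chunk) pairs."""
    layout = [[] for _ in range(num_servers)]
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        layout[(offset // chunk_size) % num_servers].append((offset, chunk))
    return layout

def reassemble(layout) -> bytes:
    """Gather chunks from all servers and reorder them by offset."""
    pieces = sorted(p for server in layout for p in server)
    return b"".join(chunk for _, chunk in pieces)

data = bytes(range(256)) * 4                 # a 1 KiB "file"
layout = stripe(data, chunk_size=128, num_servers=4)
assert reassemble(layout) == data            # striping is lossless
print([len(server) for server in layout])    # chunks per server: [2, 2, 2, 2]
```

Because each server holds only every Nth chunk, a large sequential read fans out across all N servers at once, which is where the aggregate bandwidth comes from.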
During the data ingestion phase, raw data comes from many sources, and parallel file systems may play a limited role. Their importance increases during preprocessing and model training, where high-throughput systems are needed to quickly load and transform large datasets. This reduces the time required to prepare datasets for both training and inference.
Checkpointing during model training periodically saves the current state of the model to protect against progress loss due to interruptions. This process requires all nodes to save the model's state simultaneously, demanding high peak storage throughput to keep checkpointing time minimal. Insufficient storage performance during checkpointing can lengthen training times and increase the risk of data loss.
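A quick worked example shows why aggregate write bandwidth dominates checkpoint time; the state size and bandwidth figures below are illustrative assumptions:

```python
# Back-of-the-envelope checkpoint timing, assuming all nodes flush model
# state simultaneously to shared storage. All figures are illustrative.

def checkpoint_seconds(state_size_tib: float, aggregate_gib_s: float) -> float:
    """Time to write one full checkpoint at a given aggregate write bandwidth."""
    return state_size_tib * 1024 / aggregate_gib_s

# A 2 TiB checkpoint at 50 GiB/s aggregate stalls training for ~41 s;
# at 500 GiB/s it drops to ~4.1 s, keeping GPUs idle far less often.
print(checkpoint_seconds(2, 50))    # 40.96
print(checkpoint_seconds(2, 500))   # 4.096
```

Multiply that stall by hundreds or thousands of checkpoints over a long training run and the bandwidth of the storage tier directly sets how much GPU time is wasted.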
It is evident that AI workloads are driven by throughput, not IOPS. Training large models requires streaming massive sequential datasets, often gigabytes to terabytes in size, into GPUs. The real bottleneck is aggregate bandwidth (GB/s or TB/s), rather than handling millions of small, random I/O operations per second. Inefficient storage can create bottlenecks, leaving GPUs and other processors idle, slowing training, and driving up costs.
Requirements based solely on IOPS can significantly inflate the storage budget or rule out the best-suited architectures. Parallel file systems, on the other hand, excel at throughput and scalability. To meet specific IOPS targets, production file systems are often over-engineered, adding cost and unnecessary capabilities, rather than being designed for maximum throughput.
Conclusion
AI workloads demand high-throughput storage rather than high IOPS. While IOPS has long been a standard metric, modern AI, particularly during data preprocessing, model training, and checkpointing, relies on moving massive sequential datasets efficiently to keep GPUs and compute nodes fully utilized. Parallel file systems provide the scalability and bandwidth needed to handle these workloads effectively, while focusing solely on IOPS can lead to over-engineered, costly solutions that do not optimize training performance. For AI at scale, throughput and aggregate bandwidth are the true drivers of productivity and cost efficiency.
Author: Petros Koutoupis has spent more than two decades in the data storage industry, working for companies including Xyratex, Cleversafe/IBM, Seagate, Cray/HPE and, now, AI and HPC data platform company VDURA.