HPE has lifted the lid on two new AI products, one aimed at enterprises wanting to build and train machine learning (ML) models at scale, and a second that introduces a decentralized ML system to enable distributed or edge deployments to share updates to their models.
The first is built around the software HPE picked up with Determined AI, now rebadged as the HPE Machine Learning Development Environment and integrated with HPE compute infrastructure to deliver a system that HPE claims can cut the typical time-to-value for building and training machine learning models from weeks or months to days.
HPE ascribes the speedup to delivering the product as an integrated solution, with pre-configured infrastructure optimized for ML model development: users can get straight to training models rather than having to configure the infrastructure first.
“We have our end user who is a machine learning engineer or researcher leveraging tools and frameworks like PyTorch and TensorFlow to build at-scale deep learning models,” said Evan Sparks, VP for AI and HPC and lead for the Determined AI team.
“They need tools to help them accelerate this process, and these tools are not just the GPUs but also software tools to help them automatically scale out their workflows, and combining that with lower parts of the stack here, the hardware and also the services that HPE can provide to customers globally really makes this a compelling offering.”
The underlying infrastructure is based on HPE Apollo 6500 Gen10 server nodes, each fitted with eight Nvidia A100 80GB GPUs and interconnected using Nvidia Quantum InfiniBand networking. The Apollo nodes have up to 4TB of memory and 30TB of NVMe local scratch storage, with HPE Parallel File System Storage optional.
There are also ProLiant DL325 servers acting as service nodes to manage the system, with connection to the enterprise network via an Aruba CX 6300M switch.
The system is sold as a four-node configuration, with the option to scale up, and HPE Pointnext Services provides onsite installation and setup for the customer.
The software stack comprises the Machine Learning Development Environment itself and HPE Performance Cluster Manager, which handles provisioning, management, and monitoring of the server nodes, all running on Red Hat Enterprise Linux.
HPE said it has yet to evaluate the system with the widely used MLPerf benchmark suite, but claimed its own internal tests on customer workloads found that an HPE Machine Learning Development System with 32 GPUs was up to 5.7 times faster at natural language processing than a comparable platform with the same GPUs but without HPE's optimized interconnect.
The HPE Machine Learning Development System is available now worldwide. HPE did not say whether it will be offered through its GreenLake subscription-based purchasing model, but the ability to access a high-performance AI system such as this without incurring capex costs could make it more attractive.
HPE Swarm Learning
HPE’s other AI introduction is HPE Swarm Learning, a decentralized machine learning framework for the edge or distributed sites developed by Hewlett Packard Labs.
The idea behind Swarm Learning is that a bunch of distributed nodes can share any updated parameters that each individual system’s ML model may have learned while operating, rather than having to feed data back to some centralized location such as a datacenter, where a master ML model gets updated and the changes distributed from there.
This latter approach can be inefficient and costly if large volumes of data have to be transmitted back to the mothership, HPE said, and might also fall foul of data privacy and data ownership regulations that restrict data sharing. Such restrictions can in turn lead to inaccurate and biased models, because they are not trained on all the relevant data.
“The way that we largely work in model development and training, but also in operation, is we tend to take all of the data and collect it and bring it into one core location for model training,” said HPE EVP for HPC & AI Justin Hotard.
“That data is in many cases collected and gathered at the edge, and in some cases, moving that data from the edge to the core has implications for compliance and GDPR, so it’s not trivial to simply move everything to one central location.”
By contrast, HPE Swarm Learning allows models to be trained locally, and it is the learning from those models, not the data, that is shared across nodes.
This effectively involves creating a peer-to-peer network between the different nodes and ensuring that model parameters can be exchanged securely. The latter is achieved using blockchain technology, according to HPE; widely used in cryptocurrency systems, blockchain ensures that transactions cannot be tampered with, or that any tampering is immediately evident.
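The principle can be made concrete with a minimal sketch of swarm-style learning: each node updates its model on data that never leaves the site, then peers exchange and merge parameters. The class names and the plain parameter-averaging merge rule here are assumptions for illustration only; HPE has not published Swarm Learning's internals at this level of detail, and the real product also handles secure exchange and coordination, which this sketch omits.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """A peer holding its own model parameters, trained on local data only."""
    params: list[float]

    def local_update(self, gradients: list[float], lr: float = 0.1) -> None:
        # Simulate one local training step; a real node would compute
        # gradients from data that never leaves the site.
        self.params = [p - lr * g for p, g in zip(self.params, gradients)]

def swarm_merge(nodes: list[Node]) -> None:
    """Average parameters across peers, in place.

    Only the learned parameters travel between nodes, never the raw data.
    """
    n = len(nodes)
    merged = [sum(vals) / n for vals in zip(*(node.params for node in nodes))]
    for node in nodes:
        node.params = list(merged)

# Three sites train independently, then synchronize.
sites = [Node([1.0, 2.0]), Node([3.0, 4.0]), Node([2.0, 3.0])]
sites[0].local_update([1.0, 1.0])   # params become [0.9, 1.9]
swarm_merge(sites)
print(sites[0].params)  # every site now holds the averaged parameters
```

After the merge, all three sites hold identical parameters that reflect learning from every site's data, even though no training data was ever shared between them.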
A distributed machine learning system has applications beyond those that might immediately come to mind when thinking of edge deployments. There are many business cases where an ML model might be deployed at a number of widely distributed sites, and a simple way of keeping all the models updated in synchrony would prove valuable.
One such use case is in fraud detection for financial services, and HPE detailed one developer, TigerGraph, which has combined HPE Swarm Learning with its data analytics platform to detect unusual activity in credit card transactions. The two solutions together are able to increase accuracy when training machine learning models from large volumes of financial data from multiple bank branches across a wide area, HPE said.
A more typical edge use case is in manufacturing, where predictive maintenance using ML can avoid unexpected downtime of machinery. Swarm learning could improve the accuracy of such a system by pooling the learning gleaned from sensor data across multiple manufacturing sites, HPE said.
HPE Swarm Learning is provided as part of a Swarm Learning Library that is containerized and can run on Docker, inside virtual machines, or on bare metal, and is hardware agnostic, HPE claimed. The platform is available now in most countries. ®