Machine learning alumni from Google, Uber, and Apple have started a new company to address errors in unstructured data.
CEO Vikram Chatterji was previously product management lead for Google Cloud AI. CTO Atindriyo Sanyal was engineering leader for Uber AI’s Michelangelo platform and was a founding engineer for SiriKit at Apple. VP of Engineering Yash Sheth led Google’s speech recognition team.
Galileo, their new venture, was founded in November 2021 and operated in stealth until today’s announcement.
Chatterji said Galileo was inspired by conversations each of them had with machine learning professionals working with unstructured data, which they say accounts for 80 percent of the data being generated today.
“The biggest bottleneck and time sink for high-quality ML is always around fixing the data they work with.
“This is critical, but prohibitively manual, ad-hoc and slow, leading to poor model predictions and avoidable model biases creeping into production for the business,” Chatterji said.
“We are building Galileo with the goal of being the intelligent data bench for data scientists to systematically and quickly inspect, fix and track their ML data in one place.”
According to Galileo, data scientists waste more than 50 percent of their time tracking down data errors, a largely manual process.
Galileo aims to eliminate that wasted time by auto-logging all the data moving through an ML model and then surfacing what it believes to be failure points along with recommendations for correcting the problem.
Fixing problems with ML data is the most time-consuming part of training, but it also has the highest ROI. Galileo says it can save ML teams more than 100 hours a month and claims it can fix ML data problems 10 times faster than manual methods.
Galileo uses a consumption-based pricing model to keep costs down. It didn’t share how it scales its fees, however, so pricing could become prohibitive as model sizes and training costs grow.
Galileo described its internal operations as being based on “some advanced statistical algorithms the team has created.”
Chatterji told The Register that Galileo uses an ML model’s own reported understanding of data points to identify which ones the model found difficult and which it found easy. Galileo then suggests ways to address those difficulties with what Chatterji said is 95 percent accuracy.
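Galileo has not published its algorithms, so the following is only a rough illustration of the general idea of using a model's own outputs to separate hard data points from easy ones. It assumes a classifier that emits per-class probabilities and scores each sample as one minus the probability assigned to its labeled class. The function names (`difficulty_scores`, `hardest_samples`) and the sample data are hypothetical and are not drawn from Galileo's product.

```python
from typing import List, Sequence, Tuple


def difficulty_scores(
    probs: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[float]:
    """Score each sample as 1 minus the probability given to its labeled class.

    A high score means the model was unsure about, or disagreed with, the label.
    This is an assumed stand-in metric, not Galileo's undisclosed method.
    """
    return [1.0 - p[y] for p, y in zip(probs, labels)]


def hardest_samples(
    probs: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> List[Tuple[int, float]]:
    """Return (sample index, score) pairs for the k highest-scoring samples."""
    scores = difficulty_scores(probs, labels)
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical class probabilities for four samples over three classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
]
# Hypothetical labels; sample 3 is labeled class 0, which the model rates unlikely.
labels = [0, 1, 1, 0]

top = hardest_samples(probs, labels, k=2)
for index, score in top:
    print(f"sample {index}: difficulty {score:.2f}")
```

In this toy data, the samples the model is least confident about on their given labels rise to the top of the list, which is where a reviewer would look first for mislabeled or ambiguous data.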
Galileo presents everything in a GUI dashboard that points out differences between runs, lets users add or remove elements of data to see how error potentials adjust, and otherwise lets them tweak ML training without relying on “Python scripts and Excel sheets,” as Chatterji, Sanyal, and Sheth said in their post announcing the company.
When asked how Galileo is deployed in customer environments, Chatterji told The Register that all Galileo deployments are done inside the customer’s own cloud environments (Galileo itself is provider-agnostic), and that Galileo doesn’t send any data back to the company.
“We work with financial services and healthcare enterprises, among others. ML data privacy is critical here,” Chatterji said. ®