Tesla’s Dojo supercomputer cabinets have achieved densities of more than 200kW per cabinet and caused a local substation in California to trip during testing.
Since 2019 Tesla CEO Elon Musk has periodically tweeted about Dojo, a ‘neural network exaflop supercomputer,’ and last year detailed the custom silicon chips and hardware modules that will make up the system.
During its 2022 Tesla AI Day over the weekend, the company revealed more information about its upcoming Dojo supercomputer and its hardware.
Last year the company detailed its first working training tile. This week the company said it has started installing the first Dojo cabinets and is now building out at the rate of one training tile per day. Once complete, the entire system will house around 120 tiles, holding 3,000 custom D1 chips.
The company also noted it was using a custom cooling distribution unit (CDU) in its cabinets and had achieved a 2.3MW system test which caused the local substation in San Jose, California, to trip.
“We knew that we had to re-examine every aspect of the data center infrastructure in order to support our unprecedented power and cooling density. We brought in a fully-custom designed CDU to support Dojo's dense cooling requirements,” Rajiv Kurian, Principal Engineer at Tesla said during the presentation. “Since our Dojo cabinet integrates enough power and cooling to match an entire row of standard IT racks we need to carefully design our cabinet and infrastructure together and we've already gone through several iterations of this cabinet to optimize this.”
“Earlier in this year we started load testing our power and cooling infrastructure and we were able to push it over two megawatts before we tripped our substation and got a call from the city," he added.
The company also revealed its custom system rack to install the tiles in the custom cabinets as well as custom interface processor and host interface. Each tray consists of six training tiles; the company said each 135kg tray offers 54 petaflops (BF16/CFP8) and requires 100kW+ of power.
Each cabinet holds two trays and accompanying interface equipment. At full build-out, 10 cabinets will be connected into one ‘Exapod’ that will be the 1.1 exaflops (BF16/CFP8) Dojo system.
Dojo is currently expected to come into operation in Q1 2023. The company said it plans to install a total of seven ExaPods in Palo Alto, potentially offering 8.8 exaflops (BF16/CFP8).
With regards to the future, during a Q&A session Musk said Tesla may offer Dojo to companies as a service.
“It's probably going to make more sense to have Dojo operate in like an Amazon Web Services manner than to try to sell it to someone else,” he said. “The most efficient way to operate Dojo is to just have it be a service that you can use that's available online and that where you can train your models way faster and for less money.”
During testing, the company also noted some chip failures on the tiles. After debugging, the company found this was due to losing clock outputs from its oscillators, caused by vibrations on the module from piezoelectric effects from nearby capacitors. The solution was to use soft terminal caps on the capacitors to reduce vibrations and update the mems oscillator with a lower Q factor for the outer plane direction.
The company said some of its self-driving learning training sets can take up to a month to process, which is why it has spent so much effort creating a custom machine. It claims a single Dojo training tile offers the same compute performance as six GPU cluster boxes [though it doesn’t specify, some of the performance graphs make reference to Nvidia’s A100 GPUs) and can reduce that month's training time down to a week. The company went on to say 72 GPU racks could be replaced by just four Dojo cabinets.
Even without Dojo, Tesla's existing HPC capabilities are substantial. In June 2021, Andrej Karpathy, senior director of AI at Tesla, discussed the details of an unnamed pre-Dojo GPU supercomputer the company is using.
This cluster, one of three the company currently operates, has 720 nodes, each powered by eight 80GB Nvidia A100 GPUs totaling 5,760 A100s throughout the system. Based on previous benchmarking of A100 performance on Nvidia’s own 63 petaflops Selene supercomputer, 720 sets of eight-A100 nodes could yield around 81.6 Linpack petaflops; which would place the machine fifth on the most recent Top500 list.
At its AI Day last year, the company said it had around 10,000 GPUs across three HPC clusters; as well as the previously mentioned 5670 system which is used for training, it has a second, 4032 GPU system for training and a 1752 GPU system for auto-labeling.