The 451 Take
The details of the Google TPU demonstrate the potential of a specialized computational accelerator for machine-learning inference, and also substantiate the broad usage of machine learning (ML) within Google. It is more difficult, however, to generalize the lesson. Google optimized this design based on the actual use and operation of multiple high-value applications, and was able to move from project inception to product deployment in a remarkable 15 months – but only because of its ownership of the stack from applications down to the hardware, and also because Google had already re-hosted many key applications on its second-generation ML platform, TensorFlow. It is not clear what part of this process can be replicated in the traditional, sequential, multi-company innovation supply chain.
The TPU is an ASIC that Google has developed and deployed to accelerate production ML applications that depend on deep neural networks, such as speech recognition, image processing, machine translation and search. The term machine learning covers a broad range of techniques that are important for modern AI applications – computer programs that can do more with imagery, speech and language than was previously possible.
A tensor is a mathematical abstraction (generalization of a vector). Google's TensorFlow is a dataflow language for computing with tensors; the TPU is a computational accelerator for tensor data that is optimized for inference (short fixed-point numbers).
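As a rough illustration of the 'generalization of a vector' idea, the sketch below represents tensors as nested Python lists and reports their rank (number of axes) and shape. The helper name shape and the example data are assumptions made for illustration; this is not TensorFlow code.

```python
# Illustrative only: tensors modeled as regularly nested lists.
# The rank of a tensor is the number of axes; a vector has rank 1.

def shape(t):
    """Return the shape of a regularly nested, non-empty list (a scalar gives ())."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

scalar = 7                                   # rank 0
vector = [1, 2, 3]                           # rank 1
matrix = [[1, 2, 3], [4, 5, 6]]              # rank 2
# Hypothetical batch of 2 images, each 4x4 pixels with 3 color channels: rank 4
image_batch = [[[[0] * 3 for _ in range(4)] for _ in range(4)] for _ in range(2)]

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("image_batch", image_batch)]:
    print(name, "rank", len(shape(t)), "shape", shape(t))
```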
Google is unquestionably one of the most experienced practitioners of commercial ML. From the mid-2000s, it was a visible participant in comparative benchmarks for natural language processing and committed to a technology strategy based on the use of ML (rather than 'programmed' AI), assuming the continuing growth in available training sets that would in turn improve the models and application performance.
Since then, the use of ML and natural language processing has spread broadly across Google businesses. In 2011, Google started the Google Brain project to explore the use of very-large-scale deep neural networks, led by Jeff Dean, the single individual most associated with Google's innovative advances in software. The Google Brain project created DistBelief, Google's first-generation, scalable, distributed training and inference system that was deployed within Google Search, speech recognition, Google Photos, Google Maps and Street View, Google Translate and YouTube, among others.
Based on that experience, Google created TensorFlow, a second-generation software framework for the implementation and deployment of large-scale ML models. TensorFlow takes computations described using a dataflow-like model, and maps them onto a wide variety of different hardware platforms, ranging from running inference on smartphones, to modest-sized training and inference systems, to large-scale training systems running on hundreds of specialized machines with thousands of GPUs (and now very-large-scale inference systems using specialized hardware as well).
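The dataflow idea can be sketched in a few lines: the computation is first described as a graph of operations, and a separate evaluator decides how to run it. The Node class, constant helper and evaluate function below are hypothetical names invented for this sketch and are not part of TensorFlow's API; a real framework would place each node on a suitable device rather than evaluating it in plain Python.

```python
# A minimal, hypothetical dataflow graph: description is separate from execution.
import operator

class Node:
    """One operation in the graph, with edges to the nodes that feed it."""
    def __init__(self, op, *inputs, value=None):
        self.op = op            # callable applied to the input values, or None for a constant
        self.inputs = inputs    # upstream nodes
        self.value = value      # only used by constant nodes

def constant(v):
    return Node(None, value=v)

def evaluate(node, cache=None):
    """Evaluate a node by first evaluating its inputs, computing each node once."""
    if cache is None:
        cache = {}
    if id(node) in cache:
        return cache[id(node)]
    if node.op is None:
        result = node.value
    else:
        result = node.op(*(evaluate(i, cache) for i in node.inputs))
    cache[id(node)] = result
    return result

# Describe y = w * x + b as a graph; nothing is computed yet.
x, w, b = constant(3.0), constant(2.0), constant(1.0)
y = Node(operator.add, Node(operator.mul, w, x), b)

print(evaluate(y))  # 7.0
```

Because the graph is only a description, the same model could in principle be handed to different evaluators, which is the property that lets a framework target smartphones, GPUs or specialized accelerators.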
Neural nets and AI
Neural net AI applications use a computational structure very loosely inspired by biological brains (networks of neurons). A 'deep' neural net uses more layers of computation neurons in the model. Machine learning uses large volumes of training data to create the specific neural net solution. Use of the neural net model in an application to make predictions is referred to as 'inference.'
Neural net machine learning has received a great deal of attention recently because of the advances demonstrated in the last five years in prominent applications such as language translation, image understanding and game playing. AI, the larger category, has been pursued as long as commercial computers have existed, with generally disappointing results when compared to the aspirations. The recent advances grabbed attention because of the rate at which progress was occurring, and the degree to which the results approached or even exceeded human capabilities.
Neural nets – a specific computational structure – have also been studied for a long time because they are conceptually analogous to how the human brain works. But until about 10 years ago, neural net approaches were interesting lab toys, producing results inferior to 'engineered learning' alternatives using program logic. The eventual success and dominance of neural nets can be attributed to the declining cost of large-scale computing, the evolution of GPUs into more general-purpose numerical engines, and the growing availability of elastic, pay-for-what-you-use computing platforms.
Neural net and ML solutions process large volumes of data rather than attempting to structurally understand a problem and using that understanding to create the solution (i.e., the scientific method). They depend on the availability of large training data sets where each record has been annotated with the 'right' answer.
The designer picks the structure of the solution – the number of layers of computation neurons and the general form of the interconnection among the neurons within a layer and between layers. Then the model is 'trained' by having it process the training set, with the many variables in the model adjusted iteratively depending on whether the model gets the right or wrong answer for that specific training record.
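The iterative adjustment described above can be illustrated with the simplest possible case, a single artificial neuron trained with the classic perceptron rule on a tiny annotated data set (the logical OR function). This is a teaching sketch under those assumptions, not the training method used for Google's deep networks, which adjust many more variables using gradient-based methods.

```python
# A single neuron trained with the perceptron rule on annotated records.
# Each record pairs an input with the 'right' answer (here, logical OR).
training_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights = [0, 0]   # the variables of the model
bias = 0

def predict(x):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

for epoch in range(20):
    mistakes = 0
    for x, target in training_set:
        error = target - predict(x)       # 0 if right, +1 or -1 if wrong
        if error != 0:
            mistakes += 1
            # Nudge each variable in the direction that reduces the error.
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    if mistakes == 0:                     # every record answered correctly
        break

print("weights", weights, "bias", bias)
print([predict(x) for x, _ in training_set])
```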
The hyperscale service providers like Google played a key role in the process by accumulating large and complex training sets (e.g., search queries), by having unprecedented computational power to bring to bear for experimentation and training, and by having important applications that could be improved materially with cutting-edge ML techniques, such as natural language processing in support of search.
Finally, modern ML methods are computationally demanding. Neural net training – the processing of large data sets to refine the solution – requires large matrix computations with a great deal of floating-point computation. Running the resulting models at scale – calculating the output of the neural net when presented with new inputs (inference) – can be computation-intensive as well, but is done with short (8- and 16-bit) fixed-point computation.
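To make the contrast between floating-point and short fixed-point computation concrete, the sketch below quantizes a handful of weights and inputs to signed 8-bit integers, performs the multiply-accumulate entirely in integers, and rescales the result at the end. The symmetric scaling scheme, the quantize helper and the sample values are assumptions chosen for illustration; the source does not describe the exact quantization Google uses.

```python
# Simplified symmetric 8-bit quantization; illustrative, not the TPU's actual scheme.

def quantize(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] plus a shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

# Hypothetical weights of one neuron and one input record.
weights = [0.42, -1.30, 0.07, 0.88]
inputs = [1.5, 0.25, -2.0, 0.6]

q_w, scale_w = quantize(weights)
q_x, scale_x = quantize(inputs)

# Integer-only multiply-accumulate, the core operation of inference.
accumulator = sum(w * x for w, x in zip(q_w, q_x))

approx = accumulator * scale_w * scale_x           # rescale once at the end
exact = sum(w * x for w, x in zip(weights, inputs))

print("float result:    ", exact)
print("8-bit result:    ", approx)
print("absolute error:  ", abs(approx - exact))
```

The small loss of precision is the trade that the inference hardware accepts in exchange for simpler, denser and lower-power arithmetic units.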
In the same timeframe in which ML and neural nets were growing in importance, the GPUs that were initially developed to accelerate computer graphics evolved so they could be used for general-purpose scientific computing. GPUs are commonly used for neural net training because they significantly accelerate training computation, and make it more power- and cost-effective.
The Tensor Processing Unit
In 2013, the Google Brain project started the effort to create the Google TPU, which began production deployment a remarkably short 15 months later. Google said the call to action was the realization that the growing use of ML computation could double the aggregate computing power it needed, an impractical burden using conventional processors. Up until then, Google had run neural nets on its vast production resources; since then it has added conventional GPUs for training, and developed the TPU for inference.
The TPU is an ASIC fabricated in 28nm technology. Google says the die size is less than half that of the contemporary 22nm Haswell Xeon CPU, and power consumption less than one-half of the Haswell CPU or NVIDIA K80 GPU. The ASIC is packaged onto a small circuit board that can fit in the space of a SATA drive on a server, and uses PCIe Gen3 x16 to integrate with the server CPU.
Google says the lower power enables higher rack density of accelerated systems. Logically, the TPU is designed as a CPU accelerator (like a floating-point accelerator in the past) and depends on the server CPU to fetch instructions and sequence computation. Google says this design choice was in part pragmatic, in support of the rapid insertion of the TPU into production systems and software.
The system-level design of the TPU is quite different from most HPC architectures because of Google's longstanding emphasis on interactive application performance. Google was one of the first web services companies to focus on the long tail of interactive response rather than just average response. Google's inference applications (e.g., natural language processing in search) are optimized to minimize the latency in providing a response to an interactive request.
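The difference between average response and the long tail is easy to see with a small calculation. The sketch below uses made-up latency figures (an assumption for illustration, not Google data) and a nearest-rank percentile helper to show how a service can look healthy on average while a small fraction of requests is very slow.

```python
# Synthetic latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [500] * 2

def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # ceiling of pct * n / 100
    return ordered[rank - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print("mean:", mean_ms, "ms")
print("median (p50):", p50_ms, "ms")
print("tail (p99):", p99_ms, "ms")
```

A design tuned only for the mean would miss the slow requests entirely, which is why a latency-constrained inference system is typically sized against a high percentile rather than the average.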
The TPU also deviates from HPC because of the focus on 8- and 16-bit fixed-point computation and the abandonment of the traditional focus on computational accuracy (e.g., IEEE FP precise rounding) – it turns out that accuracy isn't useful for inference if it conflicts with performance.
Google claims that the TPU is 15-30x faster at inference than the NVIDIA K80 GPU and the Haswell CPU, and that the K80 GPU is underutilized in latency-constrained designs and just a little faster than the Haswell CPU. Google also says that if the TPU were revised to utilize the faster GDDR5 memory used on the K80, it would be about 30-50x faster than the contemporary GPU and CPU.
Just as important, the performance/watt of the TPU is 30-80x that of contemporary products; the revised TPU with K80 memory would reach 70-200x. The power efficiency is important both for the large (total power consumed in the datacenter) and the small (local cooling requirements and server rack density).
The reason for all this disclosure now is most likely Google's competition with Amazon and Microsoft as a public cloud provider, and Google's positioning as the best cloud-native platform and the leader in advanced AI applications and machine learning. Anecdotally, Google seems to be the clear leader in ML application deployment and value based on its disclosures over the last six months of the progress in natural language processing and image and video recognition and categorization.
The TPU is also an important surrogate in the debate about the end of Moore's Law progress, and the growing importance of specialized CPUs. The TPU certainly demonstrates the potential value, but leaves open how that success can translate outside of the hyperscale service providers that have the competence and commercial value to motivate and succeed in this effort because of their unique capabilities (e.g., the AI applications deployed, their large and growing training sets, and the 15-month timeframe of TPU design to deployment).
NVIDIA has made significant progress in positioning its GPUs as a platform for AI and machine learning. It has the advantage of being an established architecture with software tools and infrastructure, and is already widely deployed in public cloud infrastructure. However, it's still early. For context, in the last reported quarter, NVIDIA reported $296m in datacenter revenue, a segment that has grown to represent 15% of NVIDIA revenue.
Intel datacenter revenue in the same period was $4.7bn. AMD is further behind than NVIDIA, but is also looking at the new opportunity. Intel acquired FPGA vendor Altera for a record $16.7bn in 2015, and is now aiming it at deep learning. It also plans a specialized version of its Knights Landing Xeon Phi for AI workloads, and has the assets it acquired last year from startup Nervana Systems. There are also some independent startups such as BrainChip, Graphcore, Groq, Horizon Robotics, KnuEdge, TeraDeep and Wave Computing.
John Abbott covers systems, storage and software infrastructure topics for 451 Research, and over a career that spans more than 25 years has pioneered specialist technology coverage in such areas as Unix, supercomputing, system architecture, software development and storage.