This reference architecture series describes how to design and deploy a high-performance online inference system for deep learning models by using an NVIDIA® T4 GPU and Triton Inference Server. With this architecture, you can create a system that serves machine learning models and leverages GPU acceleration. Google Kubernetes Engine (GKE) lets you scale the system as the number of clients grows, and you can improve throughput and reduce latency by applying the optimization techniques described in this series.
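As a rough illustration of the online inference path this architecture serves, the following Python sketch sends a single request to a Triton Inference Server over HTTP using the `tritonclient` package. The server address, model name (`resnet50`), and tensor names (`input_1`, `predictions`) are placeholder assumptions, not values from this series; substitute whatever your deployed model actually exposes.

```python
# Minimal sketch of an online inference request to Triton Inference Server.
# Assumptions (not from the article): the server is reachable at localhost:8000,
# the model is named "resnet50", and its input/output tensors are named
# "input_1" and "predictions". Adjust these to match your deployment.
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (in this architecture, a GKE Service
# would typically expose this address).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single 224x224 RGB input tensor filled with dummy data.
image = np.random.rand(1, 224, 224, 3).astype(np.float32)
infer_input = httpclient.InferInput("input_1", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Ask the server to return the "predictions" output tensor.
requested_output = httpclient.InferRequestedOutput("predictions")

# Run the inference and read the result back as a NumPy array.
response = client.infer(
    model_name="resnet50",
    inputs=[infer_input],
    outputs=[requested_output],
)
predictions = response.as_numpy("predictions")
print("Top class index:", int(predictions.argmax()))
```

A client like this is also a convenient probe when measuring the throughput and latency improvements that later parts of the series discuss.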
https://cloud.google.com/architecture/scalable-tensorflow-inference-system