In my previous blog post “How I stopped worrying and embraced docker microservices” I talked about why Microservices are the bees knees for scaling Machine Learning in production. A fair amount of time has passed (almost a year ago, whoa) and it proved that building Deep Learning pipelines in production is a more complex, multi-aspect problem. Yes, microservices are an amazing tool, both for software reuse, distributed systems design, quick failure and recovery, yada yada. But what seems very obvious now, is that Machine Learning services are very stateful, and statefulness is a problem for horizontal scaling.
Context switching latency
An easy way to deal with this issue is understand that ML models are large, and thus should not be context switched. If a model is started on instance A, you should try to keep it on instance A as long as possible. Nginx Plus comes with support for sticky sessions, which means that requests can always be load balanced on the same upstream a super useful feature. That was 30% of the message of my Nginxconf 2017 talk.
— NGINX, Inc. (@nginx) September 6, 2017
The other 70% of my message was urging people to move AWAY from microservices for Machine Learning. In an extreme example, we announced WebTorch, a full-on Deep Learning stack on top of an HTTP server, running as a single program. For your reference, a Deep Learning stack looks like this.
Now consider the two extremes in implementing this pipeline;
- Every stage is a microservice.
- The whole thing is one service.
Both seem equally terrible for different reasons and here I will explain why designing an ML pipeline is a zero-sum problem.
If every stage of the pipeline is a microservice this introduces a huge communication overhead between microservices. This is because very large dataframes which need to be passed between services also need to be
- Compressed (+ Encrypted)
- Decompressed (+ Decrypted)
What a pain, what a terrible thing to spend cycles on. All of these actions need to be repeated every time the microservice limit is crossed. The horror, the terrible end-to-end performance horror!
In the opposite case, you’re writing a monolith which is hard to maintain, probably you’re either using uncomfortable semantics either for writing the HTTP server or the ML part, can’t monitor the in between stages etc. Like I said, writing a ML pipeline for production is a zero-sum problem.
An extreme example; All-in-one deep learning
That’s right, you’ll need to look at your use case and decide where you draw the line. Where does the HTTP server stop and where does the ML back-end start. If only there was a tool that made this decision easy and allowed you to even go to the extreme case of writing a monolith, without sacrificing either HTTP performance (and pretty HTTP server semantics) or ML performance and relevance in the rapid growing Deep Learning market. Now such a tool is here (in alpha) and it’s called WebTorch.
Now of course that doesn’t mean WebTorch is either the best performance HTTP server and/or the best performing Deep Learning framework, but it’s at least worth a look right? So I run some benchmarks, loaded the XOR neural network found at the torch training page. I used another popular Lua tool, wrk to benchmark my server. I’m sending serialized Torch 2D DoubleTensor tensors to my server using POST requests to train. Here’s the results:
So there, plug that into a CUDA machine and see how much performance you squeeze out of that bad baby. I hope I have convinced you that sometimes, mixing two great things CAN lead to something great and that WebTorch is an ambitious and interesting open source project! Check out the Github repo and give it a star if you like the idea.
And hopefully, in due time it will become a fast, production level server which makes it easy for Data Scientists to deploy their models in the cloud (do people still say cloud?) and devOps people to deploy and scale.
Possible applications of such a tool include, but not limited to:
- Classification of streaming data
- Adaptive load balancing
- DDoS attack/intrusion detection
- Detect and adapt to upstream failures
- Train and serve NNs
- Use cuDNN, cuNN and cuTorch inside NGINX
- Write GPGPU code on NGINX
- Machine learning NGINX plugins
- Easily serve GPGPU code
- Rapid prototyping Deep Learning solutions
Maybe your own?