Building AIs to Learn to Move
One of my dreams since a young age has been imagining the robotic world of the future. For some it might feel that the fabled robotic revolution never really arrived. People predicted that robots would be serving us everywhere by now; instead, the revolution has taken a very different form. Factories and supply chains are filled with automated, ultra-precise equipment programmed down to every single action, efficiently reproducing the exact same behaviour over and over, without pause and without thought. These robots are dumb. In fact, they are kept in cages to keep humans safe, because they hardly react to their environment. So where are the smart robots?
Dr. Andrew Ng (a professor at Stanford) gave a talk showing the performance of existing robots on challenging tasks. In it, his group's robot manipulates household items, stacks Lego blocks, and pours a glass of water. It looks like a very promising robot, but in reality it was all under the remote control of a student with a joystick. The message was fairly clear to me: much of the hardware is already at a decent level (although there are certainly exciting directions in which it can improve); it is the software that is severely lagging behind.
To this day, personal and real-world robots are mostly confined to research labs. Part of this is attributable to their price tags, but there is another reason: until very recently, these robots failed in the real world. If you haven’t seen videos of humanoids attempting to navigate the ‘real world’ in the DARPA Robotics Challenge, go and watch one right now.
There are some good moments where they work, but in all cases these robots seem disappointingly slow, requiring large processing computers and lots of thinking time to accomplish seemingly mundane tasks such as opening doors and stepping out of vehicles. Clearly, the controllers we design to work well in controlled lab environments, where we can remove all factors of variation, are not going to scale to the reality of the outside world with all its visual and physical noise.
Additionally, a hand-crafted controller that works well for one task might not translate to others. One might spend significant time and effort developing a controller that allows a robot to pour a glass of water, only to have to repeat much of that effort to develop controllers that can, say, open a door or walk up the stairs. The space of real-world scenarios is far too vast for us to hand-design controllers for every single case.
This is where machine learning can play a role. We as humans are not born with the ability to walk or pour glasses of water; we develop these skills as we gain experience in the world. Currently, most of the successes in applying machine learning to robotics stem from the breakthroughs in computer vision that deep learning has afforded. These systems achieve unmatched performance on visual perception tasks such as object detection and segmentation, and have filled in missing puzzle pieces for self-driving cars.
But we as humans use our learning capabilities for more than just perception. We also learn how to control, and that is still a very open challenge for machines. Deep reinforcement learning is a promising solution, but its data inefficiency limits its use on real robots. Another drawback of deep-learning-based methods is catastrophic forgetting: train a neural network policy on one task and then on a different task, and its performance on the first task will likely decline quickly and sharply. This greatly limits the effectiveness of these learning-based methods, as it impedes scaling to new environments and the long-term accumulation of skill. This problem was the focus of my research during an Undergraduate Student Research Award term with Professor Michiel van de Panne’s group here at UBC. We wanted to study ways to allow deep reinforcement learning agents to progressively accumulate skills and transfer them to new environments.
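To make the forgetting problem concrete, here is a minimal toy sketch in PyTorch (the tasks, targets, and network below are illustrative stand-ins I made up, not our actual setup): a small network is fit to one target function, then trained on a conflicting one, and its error on the first task is measured before and after.

```python
# Toy illustration of catastrophic forgetting (not our experimental setup).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small policy-like network: maps a 1-D "state" to a 1-D "action".
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.linspace(-3, 3, 256).unsqueeze(1)
task_a = torch.sin(x)   # stand-in for the first skill
task_b = -torch.sin(x)  # stand-in for a conflicting second skill

def train(target, steps=2000):
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(net(x), target)
        loss.backward()
        opt.step()

train(task_a)
print("task A loss after training on A:", loss_fn(net(x), task_a).item())

train(task_b)  # sequential training on task B, with no replay of task A
print("task A loss after training on B:", loss_fn(net(x), task_a).item())
# The task-A loss typically jumps by orders of magnitude: the weights
# that encoded the first skill have been overwritten.
```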
Our testbed problem for researching different methods was a physics-based simulation of a 2D humanoid walking on various types of terrain. The deep learning policy must learn to walk on flat, incline, stairs, slopes, and gaps terrains. Crucially, these tasks are presented to the agent sequentially, and by the end we want the policy to perform well on all of them, with little forgetting of earlier tasks. We performed some of the first tests of candidate algorithms for this problem in a fairly complex continuous-control environment, and developed a new method we named PLAID, which uses a brief self-replay period after learning each successive task to consolidate its knowledge. If you would like more details, you can see the paper here.
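The consolidation loop might look roughly like the sketch below. To be clear, this is my simplified illustration and not the paper's actual code: the network sizes, the randomly generated states, and the `distill()` helper are all hypothetical scaffolding, and I frame the self-replay consolidation as distillation-style imitation of per-task experts.

```python
# Rough sketch of sequential skill learning with a consolidation phase.
# Everything here is illustrative scaffolding, not the PLAID implementation.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 32, 8

def make_policy():
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, ACTION_DIM))

multi_task_policy = make_policy()

def distill(student, teachers_and_states, steps=500):
    """Train the student to imitate each teacher on that task's states."""
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for teacher, states in teachers_and_states:
            with torch.no_grad():
                target_actions = teacher(states)
            loss = loss + nn.functional.mse_loss(student(states), target_actions)
        loss.backward()
        opt.step()

replay = []  # (expert policy, replayed states) for every task seen so far
for task in ["flat", "incline", "stairs", "slopes", "gaps"]:
    expert = make_policy()
    # ... train `expert` on `task` with deep RL (omitted here) ...
    states = torch.randn(1024, STATE_DIM)  # stand-in for replayed states
    replay.append((expert, states))
    distill(multi_task_policy, replay)  # brief consolidation after each task
```

The point of the consolidation step is that the single multi-task policy revisits states from every task learned so far, rather than only the newest one, which is what counteracts the forgetting shown in the earlier sketch.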
Of course, this is only a small step towards the ultimate goal of truly scalable robotic learning. Much work is still left to be done, and I am very excited for what is to come.
During the summer I learned a lot about doing research. One of the big takeaways for me is to be smart in finding and designing the right problems to maximize understanding. Learning to recognize which questions are both interesting and tractable helped me a lot in building knowledge.
After the summer, our work was accepted as a workshop paper at NIPS 2017 in Long Beach and as a conference paper at the International Conference on Learning Representations (ICLR) 2018 here in Vancouver. NIPS and ICLR are both top machine learning conferences, and it was a great pleasure to view the great work being done and to contribute to it this year. Although a lot of the work at these conferences is theoretical, plenty was also presented addressing practical problems, ranging from autonomous cars to automatic protein synthesis to cancer screening and beyond.
I want to thank my collaborators Glen Berseth and Paul Cernek and advisor Professor Michiel van de Panne for their wonderful work, discussions and guidance.
I would also like to thank the Schulich Foundation for enabling me to pursue this opportunity.