Kerbal Space Program — complex environment for Reinforcement Learning
(How to fly to the sky with reinforcement learning)
In this post I would like to show you a ready-to-use, adaptable environment for reinforcement learning. At the end of this post you will find a short tutorial that will get you started with our interpreter for communicating with the Kerbal Space Program game.
No Kerbals died during the simulation
There was a time when OpenAI wanted to create environments for many modern games, and Kerbal Space Program was among them. The project was called Universe, but it was abandoned a year ago. Instead, they created simpler environments based mostly on Atari games. You have probably heard of the OpenAI Gym; it is a good starting point for your journey with reinforcement learning. You can learn more at gym.openai.com/docs
Basically, the goal of reinforcement learning is to find a sequence of actions, given the states received from the environment, that leads to the biggest cumulative reward.
We wanted to gain experience with reinforcement learning for our next commercial project at Whiteaster, so we started a small R&D project. We chose Kerbal Space Program because OpenAI had dropped its effort to bring AI to this game, and it sounded like so much fun. It did bring us a lot of joy, but there were also many challenges we had to face.
Kerbal Space Program
KSP for short is a game in which you build a ship and fly it into space with some friendly-looking aliens (Kerbals) inside. We also thought of using GANs to generate those ships, but we focused on teaching an already-built rocket how to fly.
There is a great community of KSP players, and thanks to them connecting to the game via the kRPC mod was pretty easy. You just need to download the mod and paste the kRPC folder into the GameData directory. kRPC creates a server through which we can send messages to and receive messages from the environment.
Thanks to this mod we can control our rocket via an existing unofficial library. We used Python, but the mod is available for other languages too (C-nano, C++, C#, Java, Lua). We receive states as arrays, so we can easily work with them, and we can tell the rocket what to do by sending actions. There is no reward function, so we had to develop one. We also had to specify the possible ending conditions so the game could be reloaded each time an epoch finished (whether it ended positively or negatively).
For the learning algorithm we chose between these 3 algorithms: Deep Q-Network (DQN), Augmented Random Search (ARS) and Asynchronous Advantage Actor Critic (A3C), because we had already tested them on OpenAI Gym and MuJoCo environments. DQN and ARS learn on only one client, so we dropped them quickly and focused mostly on A3C, which can handle many agents simultaneously.
A3C gains its efficiency from one additional global critic network that evaluates all the other networks; each of those networks is a separate worker/agent interacting with its own copy of the environment and holding its own weights. Each agent uses the value estimate from the critic to update its own policy for making decisions.
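The core of the update each worker performs can be sketched in a few lines. This is a conceptual illustration of the n-step advantage computation used by A3C, not the implementation we actually ran (we used Morvan Zhou's code); function and variable names are ours:

```python
def advantage(rewards, values, bootstrap, gamma=0.99):
    """n-step advantage estimates for one rollout.

    rewards   - rewards collected by the worker, in time order
    values    - the critic's value estimates for the visited states
    bootstrap - the critic's value of the state after the rollout
    """
    returns, R = [], bootstrap
    # Accumulate discounted returns backwards through the rollout.
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    # Advantage = observed return minus the critic's prediction;
    # the worker pushes gradients scaled by this to the global networks.
    return [G - v for G, v in zip(returns, values)]
```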
A3C with 4 workers
On Jaromír Janisch's website you can find a very good explanation of A3C theory together with an implementation. Our first choice was his implementation for discrete action spaces. In the end, after many tests, we chose Morvan Zhou's implementation of A3C for a continuous action space. Our main purpose was not to write our own algorithm; we wanted to build a learnable environment.
Some of the challenges were hardware and software problems. At first we used a cluster of 6 notebooks connected over WiFi, with one actor playing on each. The algorithm ran on another laptop, which caused some lag. We then switched to a wired Xeon server, but the games still ran on laptops connected over WiFi, which also introduced latency.
Finally we ran 4-6 actors on a PC with a 4-core, 8-thread i7 CPU, a 1080Ti GPU and 20 GB of RAM, connected by cable, while the algorithm ran on a 6-core, 12-thread Xeon CPU server. We know that using a CPU for deep learning is a shame, but we needed more power for running the games than for computing our neural networks. It was enough even for 7 separate TensorFlow networks of various sizes: 1 critic evaluating results and 6 networks for the agents playing the games. Of course, it is also possible to run both the algorithm and the games on the same computer.
There is also a problem with the game design itself, probably caused by garbage collection that is not functioning well. We are not judging anyone; it is not an easy task to anticipate every possible scenario when writing a game. Who takes into consideration that players will load the game a thousand times on one client, with 6 clients running simultaneously? After some 3500 epochs all RAM and swap space is consumed and the game slows down incredibly.
We ran the game on Ubuntu, but it is probably worth trying the Windows operating system to compare performance. We can also increase the game speed, which theoretically should help us, but it caused some problems with the game physics and with learning. The game is based on the Unity engine, so if you know how to make it run faster or how to start it without the GUI, let us and everybody else know in the comments. We did run the game without the GUI, but then we could not use the menu.
There are restrictions in the Python library through which we connected to the game: it can communicate with the game only in certain places. Once we are in one of these specific places, such as the Space Center, the VAB or the Launchpad, we can programmatically control what happens next, for example reloading the game. Unfortunately, we found no programmatic way to get there from the start menu without using a real mouse, a Python mouse library, or PyAutoGUI, which is also based on mouse movements.
RAM and swap usage after 2000 epochs while the algorithm was running
There are a lot of states available from the game. They describe flight parameters: various velocities, various kinds of angles, G-force, even humidity, wind, pressure and atmosphere density. The game itself is a pretty good physics simulator.
We started by feeding in the above states as they are, only normalized (divided by their maximum values):
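A minimal sketch of that normalization step. The state names and most of the maximum values here are placeholders (only the 175 km altitude cap comes from later in the post); the real state vector was larger:

```python
# Assumed per-state maxima; each raw telemetry value is divided by its
# maximum so every network input lands roughly in [-1, 1].
STATE_MAX = {
    "altitude": 175_000.0,       # m (our later experiments capped here)
    "vertical_speed": 2_000.0,   # m/s, illustrative
    "pitch": 90.0,               # degrees
}

def normalize_state(raw):
    """Return the normalized state vector in a fixed key order."""
    return [raw[key] / STATE_MAX[key] for key in STATE_MAX]

print(normalize_state({"altitude": 45_000.0,
                       "vertical_speed": 500.0,
                       "pitch": 45.0}))
```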
The actions we could take were similar. In this discrete action space our AI had the same actions available as a human player, with the restriction that it could "press" only one key per step, each for a duration of 0.1 second:
pitch(-1) -> “w”
pitch(1) -> “s”
yaw(-1) -> “a”
yaw(1) -> “d”
roll(-1) -> “q”
roll(1) -> “e”
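The mapping above could be driven through kRPC roughly as follows. This is a hedged sketch: `vessel` is assumed to be `conn.space_center.active_vessel`, whose `control.pitch`/`yaw`/`roll` attributes accept values in [-1, 1]; the helper names are ours:

```python
import time

# One entry per discrete action, mirroring the key mapping above.
ACTIONS = [("pitch", -1), ("pitch", 1),
           ("yaw", -1), ("yaw", 1),
           ("roll", -1), ("roll", 1)]

def apply_action(vessel, action_idx, hold=0.1):
    """'Press' one control for `hold` seconds, then release it."""
    axis, value = ACTIONS[action_idx]
    setattr(vessel.control, axis, value)  # press the key
    time.sleep(hold)                      # hold for 0.1 s, as described above
    setattr(vessel.control, axis, 0)      # release
    return axis, value
```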
Our main goal was to teach the rocket to reach a certain altitude (45 km) or to reach orbit. For arriving at MAX_ALT, 1000 "points" are added, and for losing, the episode ends with a reward of -500 or -100, depending on the reason.
There is also an additional reward given to the network at each step to speed up learning. Initially it was based on the change of altitude (we tested different approaches, such as velocity and G-force) multiplied by the angle our ship should have at the current altitude.
We start at 90 degrees, and to reach orbit we need a pitch angle of 0 at 45 km altitude. So along the way our pitch should change from 90 to 0.
Now we know the above step reward was not the best idea. At the time we created it everything looked fine to us, but with such a reward function we could not see a progressive learning curve over more than 2000 epochs. We had to find out what was wrong and learn faster.
We tested giving more states: different velocities and angles. Then we used quaternions instead of raw angles, because a heading of 0 was the same as a heading of 360 (normalized to 1), which misled the network. From the player's view you can only see the rocket and the Navball, which is shown below. That is why we also created a tracker to see and understand the current states. We also print the states and the reward on the game screen while it is played.
Quaternions did not improve learning either, so we tried a rotation matrix. This is a 9-value representation of 3-dimensional rotation, quite redundant, but it had worked for other 3D environments; unfortunately, not for ours.
Finally we thought of representing the deviation in a 2D space. The direction of the deviation is represented by the sine and cosine of the heading, and the intensity of the deviation by the pitch.
Now we have 3 states describing altitude and angles, which look like this:
Below you can see our simplified representation of 3-dimensional space, shown as a top view in a Cartesian coordinate system.
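The target pitch described above can be sketched as a simple linear schedule. The real profile we used may have been tuned differently; this only illustrates the 90-to-0 interpolation:

```python
def target_pitch(altitude_m, max_alt=45_000.0):
    """Linear pitch program: 90 degrees on the pad, 0 degrees at max_alt.

    A sketch of the schedule described above, assuming a straight-line
    interpolation between the two endpoints.
    """
    frac = min(max(altitude_m / max_alt, 0.0), 1.0)  # clamp to [0, 1]
    return 90.0 * (1.0 - frac)
```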
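A possible reconstruction of that 2D representation is below. The exact scaling is an assumption (we take deviation from a vertical 90-degree pitch as the intensity); the point is that sine/cosine of the heading removes the 0-vs-360 discontinuity that misled the network earlier:

```python
import math

def deviation_state(heading_deg, pitch_deg):
    """2D deviation: direction from sin/cos of heading, intensity from pitch.

    Returns (x, y); a perfectly vertical rocket (pitch 90) maps to (0, 0),
    and headings of 0 and 360 degrees map to the same point.
    """
    intensity = (90.0 - pitch_deg) / 90.0  # 0 when vertical, 1 when horizontal
    h = math.radians(heading_deg)
    return (intensity * math.sin(h), intensity * math.cos(h))
```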
Representation of states on Cartesian coordinate system
Reward function shaping
The reward could also be done better, but to do that we needed to drop two actions: throttle and roll. Learning throttle takes quite a few epochs, and roll often misleads the whole neural network.
A reward is given at each step for keeping the correct angle within an error margin of 10 degrees: the smaller the error, the higher the reward. The agent also gets an additional reward for moving closer to the desired angle and is punished (negative reward) for moving further away from it.
An additional reward is also given at the end of the epoch: for reaching the desired MAX_ALT we give 1000, and when the flight ends earlier we discount the reward by the fraction of the way travelled up, multiplied by 500.
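A minimal sketch of this shaping. The 10-degree margin, MAX_ALT and the terminal 1000 come from the description above; the shaping bonus of 0.5 and the exact discount formula are illustrative, not our tuned values:

```python
def step_reward(pitch, target, prev_err, margin=10.0):
    """Per-step reward: high within +/-margin of the target angle, plus a
    small bonus/penalty for moving toward/away from it (sketch)."""
    err = abs(pitch - target)
    reward = max(0.0, (margin - err) / margin)  # 1 at perfect angle, 0 at margin
    reward += 0.5 if err < prev_err else -0.5   # shaping: moving closer or away
    return reward, err

def terminal_reward(altitude, max_alt=45_000.0):
    """End-of-epoch reward: 1000 for reaching MAX_ALT, otherwise a penalty
    discounted by the fraction of the way travelled up (one reading of the
    description above)."""
    if altitude >= max_alt:
        return 1000.0
    return -500.0 * (1.0 - altitude / max_alt)
```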
Epoch ending conditions
An important point is setting the conditions for resetting the game. Depending on what we want to accomplish, we can change them. At the beginning we even allowed the rocket to fall to the ground, so it also learned from falling down, which was not good for learning progress. It should learn how to fly to the sky, not to the earth, so now we break the episode if the pitch drops below 60 degrees within the first 5000 m, and after that if the rocket starts facing down. After about 50-60 km we run out of fuel, so the direction of the rocket no longer matters much.
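The conditions above could be coded roughly like this. The thresholds are taken from the description (60 degrees below 5000 m, "facing down" read as pitch below -5 as mentioned later in the post); the real checks may have differed in detail:

```python
def episode_done(pitch_deg, altitude_m):
    """Reset conditions sketched from the description above."""
    if altitude_m < 5_000.0 and pitch_deg < 60.0:
        return True   # tipped over too early in the ascent
    if pitch_deg < -5.0:
        return True   # rocket is facing toward the ground
    return False
```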
Network and activation functions comparison
We tried bigger and smaller networks, and 2 hidden dense layers with 64 neurons each, for both the actors and the critic, turned out to be the best solution. The blue line in the picture below shows the maximum altitude reached by our spaceship in each epoch; the orange line represents the earned reward. Many flights ended just before 45 km. This is because we need a pitch of around 0 at 45 km, and the flight is stopped if we start heading toward the ground with a pitch below -5. Another reason is that at this altitude the air density is lower and the rocket starts to swing more.
2 hidden layers with 64 neurons each with tanh for actor and ReLu for critic with limit at 45km
The training above took about 24 hours. The longest single training took 5 days with 4 actors and is shown in the picture below. This was when we increased the maximum altitude from 45 km to 175 km to see what would happen; flying that high takes some time. You can see that the algorithm started gaining altitude above a few thousand metres from around the 800th epoch, and by the 2500th epoch it had learned to fly to 30,000 m perfectly. With tanh activation we can observe a consistent gain in altitude.
5 days training on 2 hidden layers with 64 neurons each with tanh activation for both — actor and critic
With 2 dense layers of 64 neurons we also varied the activation functions. Here the critic uses ReLU activation; it learned faster, but then the altitude began to drop. The cause could be that we did not take into consideration including some additional values, like velocity, in the reward. Our goal was then to reach 45 km, but we set the limit to 175 km to see what would happen. To go further, a better orbit reward should be written. We could also train another model to handle orbital or suborbital flights.
2 hidden layers with 64 neurons each with tanh for actor and ReLu for critic
With LSTM cells inside the actor's network we did not observe any progress.
One LSTM hidden layer with one Dense layer, tanh activation, 64 neurons each
Below is ReLU for both the critic and the actor; it started learning later, only from around the 2800th epoch.
2 hidden layers with 64 neurons each with ReLu activation
The chart below shows the progress without giving altitude as a state. Without that input, our algorithm cannot reason about the current state.
So this was our first environment for reinforcement learning; we hope you enjoyed the article. We certainly enjoyed making it. There is room for better results, but we had limited computing resources and time. We could get about 3500 epochs at most; increasing that number could make other algorithms work and improve the results of the working one. If you have comments or ideas, please leave them below, and feel free to commit better solutions on GitHub. Below you can find all the needed files and a quick guide on how to connect to the KSP environment.
TL;DR version with tutorial:
Install the game Kerbal Space Program (no extensions needed)
Unpack the zip, open it, go to GameData and copy only the kRPC folder into the GameData directory where you have installed the game; it will probably be:
Kerbal Space Program/game/GameData
Download: https://krpc.github.io/krpc/_downloads/51d10d60684108532ec1a5b93393faab/LaunchIntoOrbit.craft and put this file to:
Kerbal Space Program/game/Ships/VAB
Download our save “kill”, unpack it and paste the whole folder into:
Kerbal Space Program/game/saves
(if this save does not work, just create a Sandbox game named “kill”, go to the Launchpad with the LaunchIntoOrbit ship and save the game under the name “revivekerbals” while on the Launchpad)
Run the game -> Start game -> Resume Game ->
Select “kill” save -> Load -> Click on the Launchpad ->
-> click twice on Launch Into Orbit (Stock) -> when the “Launchpad not clear” message appears, select -> Go to LaunchIntoOrbit on Launchpad ->
You can now try to connect to the kRPC server in Python:
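A minimal connection sketch, assuming the game is running with the kRPC server started and using the mod's default address and ports:

```python
import krpc  # pip install krpc

# Connect to the kRPC server inside the game (defaults shown explicitly).
conn = krpc.connect(name="ksp-rl",
                    address="127.0.0.1",
                    rpc_port=50000,
                    stream_port=50001)

# Print the server's version to confirm the connection works.
print(conn.krpc.get_status().version)
```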
You should either allow connections from certain IPs or auto-accept new clients in the advanced server settings.
You should then see the version of your kRPC library.
Clone our repository:
If the connection succeeded, you can start a3c_continous.py
If you want to run more agents, you can add new servers. Remember to change the RPC and stream ports to differentiate the workers.
If you want to connect from other computers in your local network via IP, click Edit and change Address from “localhost” to “Any”.
Lowering the graphics settings as shown below is also a good idea to improve performance.
Our team at Whiteaster
Creating process automation systems for companies: that is our goal. We help our clients achieve their sales goals with efficiently working machine learning modules. Our team is made up of graphic designers, UX designers and young, highly motivated developers, led by programmers with many years of experience. We are open to various forms of cooperation and methods of project implementation, mostly Agile, as preferred by our clients.