OpenAI Five
Givenalearningalgorithmcapableofhandlinglonghorizons,westillneedtoexploretheenvironment.Evenwithour restrictions,therearehundredsofitems,dozensofbuildings,spells,andunittypes,andalongtailofgamemechanicstolearnabout—manyofwhichyieldpowerfulcombinations.It’snoteasytoexplorethiscombinatorially-vastspace efficiently.
OpenAIFivelearnsfromself-play(startingfromrandomweights),whichprovidesanaturalcurriculumforexploringtheenvironment.Toavoid“strategycollapse”,theagenttrains80%ofitsgamesagainstitselfandtheother20%againstitspastselves.Inthefirstgames,theheroeswalkaimlesslyaroundthemap.Afterseveralhoursoftraining,conceptssuchas laning, farming,orfightingover mid emerge.Afterseveraldays,theyconsistentlyadoptbasichumanstrategies:attempttosteal Bounty runesfromtheiropponents,walktotheir tierone towerstofarm,androtateheroesaroundthemaptogainlaneadvantage.Andwithfurthertraining,theybecomeproficientathigh-levelstrategieslike 5-hero push.
InMarch2017,ourfirst agent defeatedbotsbutgotconfusedagainsthumans.Toforceexplorationinstrategyspace,duringtraining(andonlyduringtraining)werandomizedtheproperties(health,speed,startlevel,etc.)oftheunits,anditbeganbeatinghumans.Lateron,whenatestplayerwasconsistentlybeatingour1v1bot,weincreasedourtrainingrandomizationsandthetestplayerstartedtolose.(Ourroboticsteamconcurrentlyappliedsimilarrandomizationtechniquesto physical robots totransferfromsimulationtothereal world.)
OpenAIFiveusestherandomizationswewroteforour1v1bot.Italsousesanew“laneassignment”one.Atthebeginningofeachtraininggame,werandomly“assign”eachherotosomesubsetof lanes andpenalizeitforstrayingfromthoselanesuntilarandomly-chosentimeinthe game.
Explorationisalsohelpedbyagoodreward. Ourreward consistsmostlyofmetricshumanstracktodecidehowthey’redoinginthegame:networth,kills,deaths,assists,lasthits,andthelike.Wepostprocesseachagent’srewardbysubtractingtheotherteam’saveragerewardtopreventtheagentsfromfindingpositive-sum situations.
Wehardcodeitemandskillbuilds(originallywrittenforour scripted baseline),andchoosewhichofthebuildstouseatrandom. Courier managementisalsoimportedfromthescripted baseline.