
OpenAI Five: OpenAI Five's level of play

OpenAI Five

Given a learning algorithm capable of handling long horizons, we still need to explore the environment. Even with our restrictions, there are hundreds of items, dozens of buildings, spells, and unit types, and a long tail of game mechanics to learn about, many of which yield powerful combinations. It’s not easy to explore this combinatorially vast space efficiently.

OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid “strategy collapse”, the agent trains 80% of its games against itself and the other 20% against its past selves. In the first games, the heroes walk aimlessly around the map. After several hours of training, concepts such as laning, farming, or fighting over mid emerge. After several days, they consistently adopt basic human strategies: attempt to steal Bounty runes from their opponents, walk to their tier one towers to farm, and rotate heroes around the map to gain lane advantage. And with further training, they become proficient at high-level strategies like 5-hero push.
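The 80/20 opponent split can be sketched as below. This is a minimal illustration, not OpenAI's implementation: the function name `sample_opponent`, the pool `past_params`, and the uniform draw among past selves are all assumptions, since the text does not say how past versions are stored or selected.

```python
import random


def sample_opponent(current_params, past_params, rng, self_play_prob=0.8):
    """Pick the opponent for one training game.

    With probability `self_play_prob` (0.8 in the text) the agent plays
    against its current self; otherwise it plays a past version.
    Choosing uniformly among past versions is an assumption.
    """
    # Fall back to pure self-play while no past snapshots exist yet.
    if not past_params or rng.random() < self_play_prob:
        return current_params
    return rng.choice(past_params)
```

Over many games, roughly four in five opponents are the current parameters and the rest come from the snapshot pool, which keeps older strategies in the training distribution.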

In March 2017, our first agent defeated bots but got confused against humans. To force exploration in strategy space, during training (and only during training) we randomized the properties (health, speed, start level, etc.) of the units, and it began beating humans. Later on, when a test player was consistently beating our 1v1 bot, we increased our training randomizations and the test player started to lose. (Our robotics team concurrently applied similar randomization techniques to physical robots to transfer from simulation to the real world.)
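A rough sketch of training-only randomization of unit properties follows. The `UnitProps` type, the function name, the ±20% range, and the start-level bonus are hypothetical values chosen for illustration; the text names the randomized properties but not their ranges.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class UnitProps:
    health: float
    speed: float
    start_level: int


def randomize_unit(base, rng, training, scale=0.2, max_level_bonus=2):
    """Return a perturbed copy of `base` during training only.

    `scale` and `max_level_bonus` are assumed values, not from the text.
    """
    if not training:
        # Evaluation games use the unmodified unit properties.
        return base
    return replace(
        base,
        health=base.health * rng.uniform(1 - scale, 1 + scale),
        speed=base.speed * rng.uniform(1 - scale, 1 + scale),
        start_level=base.start_level + rng.randint(0, max_level_bonus),
    )
```

Widening `scale` corresponds to the "increased our training randomizations" step described above: the agent sees a broader range of situations and is less likely to overfit to one narrow line of play.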

OpenAI Five uses the randomizations we wrote for our 1v1 bot. It also uses a new “lane assignment” one. At the beginning of each training game, we randomly “assign” each hero to some subset of lanes and penalize it for straying from those lanes until a randomly-chosen time in the game.
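The lane assignment randomization might look roughly like this. The lane names, the single per-game cutoff, the `max_time` bound, and the penalty size are assumptions made for the sketch; the text specifies only that subsets and the end time are random.

```python
import random

LANES = ("top", "mid", "bottom")


def assign_lanes(hero_ids, rng, max_time=600.0):
    """Assign each hero a random non-empty subset of lanes.

    Returns the assignment and the game time (in seconds, assumed)
    after which the penalty no longer applies.
    """
    assignment = {}
    for hero in hero_ids:
        k = rng.randint(1, len(LANES))  # subset size, at least one lane
        assignment[hero] = frozenset(rng.sample(LANES, k))
    cutoff = rng.uniform(0.0, max_time)
    return assignment, cutoff


def lane_penalty(hero, current_lane, game_time, assignment, cutoff, penalty=0.02):
    """Return a negative reward term while a hero strays before the cutoff."""
    if game_time >= cutoff:
        return 0.0  # restriction has expired
    if current_lane in assignment[hero]:
        return 0.0  # hero is in one of its assigned lanes
    return -penalty
```

Because the penalty is applied only in training and expires at a random time, the agents are pushed to try different lane setups early without being locked into them for the whole game.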

Exploration is also helped by a good reward. Our reward consists mostly of metrics humans track to decide how they’re doing in the game: net worth, kills, deaths, assists, last hits, and the like. We postprocess each agent’s reward by subtracting the other team’s average reward to prevent the agents from finding positive-sum situations.
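The postprocessing step can be illustrated with a short sketch. The function names and the weighted-sum form of the raw reward are assumptions; the text lists the tracked metrics but gives no weights, so any weights passed in are hypothetical.

```python
def raw_reward(stats, weights):
    """Weighted sum of tracked metrics (net worth, kills, deaths, ...).

    Both the linear form and the weights are hypothetical.
    """
    return sum(w * stats.get(name, 0.0) for name, w in weights.items())


def postprocess_rewards(team_a, team_b):
    """Subtract the other team's average reward from each agent's reward.

    `team_a` and `team_b` are lists of per-agent raw rewards.
    """
    mean_a = sum(team_a) / len(team_a)
    mean_b = sum(team_b) / len(team_b)
    adjusted_a = [r - mean_b for r in team_a]
    adjusted_b = [r - mean_a for r in team_b]
    return adjusted_a, adjusted_b
```

For example, raw rewards of `[4.0, 2.0]` against `[1.0, 1.0]` become `[3.0, 1.0]` and `[-2.0, -2.0]`. With equal team sizes the adjusted rewards always sum to zero across both teams, so the two sides cannot both profit from the same situation.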

We hardcode item and skill builds (originally written for our scripted baseline), and choose which of the builds to use at random. Courier management is also imported from the scripted baseline.
