Machine Learning Madness – Can the Bots Predict This Year’s Cinderella?

As long as I can remember, I have always loved March Madness. Selection Sundays came and I watched anxiously, as if my own season was riding on the committee’s announcement. As the teams were unveiled, I would comb through analyst articles, RPI rankings, and almanacs. To me the seedings were irrelevant; I tried to dive into as many objective factors as I could think of that would help me pick the outcome. Crunching through records against similar opponents and win-loss marks over the last 10 games, I would get a sense of who had the strongest ability to make a deep run to the Final Four. From there, I tried to get a sense of each team’s style of play. Were they a sharpshooting team that played small, like Duke with JJ Redick? Did they have an overpowering inside presence with a dominant big man, like Shaquille O’Neal and LSU? Perhaps a team had a little-known squad of also-rans with a superstar point guard, like Steph Curry and Davidson. The final X factor for me was coaching. As history shows, teams with tenured coaches who had led tournament teams in the past seemed to have an edge over their peers. At the time, I thought I was a budding Dick Vitale. Absent a few buzzer beaters (Villanova 2016 and Butler 2010), I felt pretty good about my tournament prognostication skills.

Cinderella teams from years past (L to R): Loyola Marymount, George Mason, North Carolina State University (Credit – CBS Sports)

This year I decided to take my strategy to the next level and work to understand how the power of machine learning could aid in the prognostication. It turns out that two Ph.D. students from Ohio State (Matthew Osborne and Kevin Nowland) had beaten me to the punch and written a machine learning program built around predicting more upsets than their human counterparts. Their program uses classification algorithms such as logistic regression, random forest models, and k-nearest neighbors. Each has its own way of trying to predict upsets by analyzing the same data set of 2001-2017 first-round games. Like humans, the machines were not infallible, but they provided some noteworthy results. The combined predictions of the models picked the correct outcome 75% of the time. That is not a number that will send people running to their nearest Best Buy to set up an NCAA bracketology mining rig, but you are rather likely to see this type of machine learning used to tweak the methodology of the selection committee moving forward.
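To make the idea of combining classifiers a little more concrete, here is a minimal sketch in plain Python. It is not Osborne and Nowland's program, which I have not seen. The training rows, the two features (seed gap and the underdog's win rate over its last 10 games), and the function names `knn_predict` and `majority_vote` are all made up for illustration. It shows a tiny k-nearest-neighbors classifier and a simple majority vote that could merge its pick with picks from other models.

```python
from collections import Counter
from math import dist

# Hypothetical training rows (invented for illustration, not real game data):
# (seed_gap, underdog_win_rate_last_10) -> 1 if the underdog won, else 0.
TRAIN = [
    ((1, 0.8), 1),
    ((2, 0.7), 1),
    ((3, 0.9), 1),
    ((9, 0.5), 0),
    ((11, 0.6), 0),
    ((13, 0.4), 0),
    ((15, 0.7), 0),
]


def knn_predict(features, k=3):
    """Label a game by majority vote among its k nearest training games."""
    nearest = sorted(TRAIN, key=lambda row: dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def majority_vote(predictions):
    """Combine 0/1 picks from several models; a tie counts as no upset."""
    return int(sum(predictions) * 2 > len(predictions))


game = (2, 0.75)  # hypothetical matchup: small seed gap, hot underdog
# The trailing 1 and 0 stand in for picks from two other (unshown) models.
votes = [knn_predict(game), 1, 0]
print(majority_vote(votes))
```

The features here are left unscaled, so the seed gap dominates the distance calculation. A real model would likely normalize its inputs and use far more data than seven rows.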

(Photo Credit – Cisco)

More and more, data scientists are competing to improve machine learning’s capabilities. In fact, since 2014, a throng of basketball enthusiasts has competed in Machine Learning Madness. This year 955 competitors are vying for a total of $25,000 in prize money that goes to the five most accurate brackets. Brackets are rated not only on their outcomes but also on their degree of certainty. Essentially, each correct pick is rewarded with more points the higher its confidence score. Doubly sure that Loyola Chicago will beat Illinois? (Don’t feel bad about this one; I don’t think IBM Watson or Stephen Hawking had it.) Put down a 1. Not feeling so lucky? Put down a zero. But beware: the leverage that confidence applies cuts both ways, and a confident miss is costly. To date the random forest algorithm, which combines the predictions of many decision trees, has been the most widely used.
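The exact scoring formula is not spelled out above, so treat the following as a sketch under an assumption: that confidence is scored with something like logarithmic loss, a common metric for probabilistic picks. The function name `log_loss` and the sample picks are hypothetical. Each pick is a probability that a given team wins, paired with 1 if that team won and 0 if it lost. Lower scores are better.

```python
from math import log


def log_loss(picks):
    """Average log loss over (probability, outcome) pairs; lower is better.

    probability: predicted chance that the team wins (between 0 and 1).
    outcome: 1 if that team actually won, 0 if it lost.
    """
    eps = 1e-15  # clip so a pick of exactly 0 or 1 never hits log(0)
    total = 0.0
    for p, outcome in picks:
        p = min(max(p, eps), 1 - eps)
        total += -(outcome * log(p) + (1 - outcome) * log(1 - p))
    return total / len(picks)


# Hypothetical picks on a single game the team went on to win.
print(log_loss([(0.9, 1)]))  # confident and right: small penalty
print(log_loss([(0.6, 1)]))  # cautious and right: larger penalty
# The same two picks if the team had lost instead.
print(log_loss([(0.9, 0)]))  # confident and wrong: heavy penalty
print(log_loss([(0.6, 0)]))  # cautious and wrong: milder penalty
```

Under this kind of metric, pushing a pick toward 1 pays off when you are right and hurts badly when you are wrong, which is the leverage the paragraph above warns about.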

With the popularity of machine learning rising, the NCAA has decided to remove RPI (Ratings Percentage Index) from its selection criteria in favor of NET (NCAA Evaluation Tool). NET aims to incorporate more variables into its system for calculating a team’s rating. It factors in strength of schedule, game location, scoring margin, and net offensive and defensive efficiency. Other factors include performance in late-season games, including tournament games. It will be interesting to see whether these criteria help create a more equitable field and fewer upsets. Only time and cheesy AT&T commercials will tell. If this year’s games are any indication, the algos have a way to go. It will be an exciting development to watch, and it will also likely be something that my Jay Bilas-like reflexes incorporate into my yearly March Madness efforts.



  1. ritellryan

    Perfect timing for this post! As my bracket is currently running on fumes, I have tried so many different theories to fill it out and am much closer to the bottom than the top. I have read great articles by ESPN on historical precedents by seed matchups and the characteristics of the teams that were able to pull off upsets. It seems like that would contain a lot of the info you would use for a program, but I guess I wasn’t smart enough to write it…

  2. abigailholler1

    March Madness has definitely benefitted from a more digital world. In 2017, the addition of livestreaming March Madness tournament games provided a more direct relationship with fans, further driving viewership. Additionally, growth in handheld device usage and streaming services meant that more consumers were watching March Madness games from their smartphones. This shift gave the March Madness streaming services the opportunity to use push notifications to more directly connect with their consumers. They used notifications to remind viewers to tune into games as they were beginning, or to alert users of live games that had a close ending. This tactic was really successful, and led to a doubling of the audience size in the final minutes of the National Championship game.


  3. Scott Siegler

    What a cool topic! I’m fascinated by simulations in sports. When I was younger and played sports video games, I didn’t even actually play the game. For instance, with the Madden NFL Football games, I would go into a “franchise mode,” draft a team, make any trades or acquisitions I thought made sense, and then I would just simulate the whole season. Then I would pore over the outcomes of games and stats of individual players and try to set my team up for a better outcome next season, and then simulate again. My hunch is that these games ran random forest simulations for this process. I could see real GMs of pro teams doing “dry runs” of their team’s seasons like this to identify potential weaknesses in their rosters/personnel, especially as AI improves and starts taking more and more factors into consideration. Honestly, it wouldn’t surprise me if teams were already doing things like this.

  4. therealerindee

    For betting purposes and bragging rights I really like the idea of ML getting smarter based off of characteristics that seem to reappear in the teams that go the farthest in March Madness, but for purposes of picking the Cinderella team I’m not as much of a fan. Maybe it’s just me, but there is something really cool about reading stats and then going off a gut hunch and having that hunch pay off. I feel like ML has the ability to “break the spell” of the magic that surrounds some of these teams like Sister Jean or a 15 seed making it to the Sweet Sixteen. Cool topic and I know ML is making its way into all things sports, Fantasy Football being my biggest exposure, but my emotional human brain wants to believe there’s a little more secret sauce than just patterns here.

  5. This is a really interesting topic. I’m someone who loves March Madness but does not pay enough attention during the regular season to understand which teams are actually good or overrated. I could use all the help I can get when filling out my bracket. This seems like a good alternative for casual fans who want help narrowing down their Final Four. It feels like in recent history (especially this year) there have been bigger upsets by lower-seeded teams (UMBC in 2018 as a 16 seed, Oral Roberts this year as a 15 seed). I wonder if these occurrences will alter how the algorithms function. I also think it will be interesting to see if it will flip the other way, toward fewer upsets, when the computers are more involved in seeding.

  6. Different data scientists have been at this for years. I confess that I may have won more than one pool by simply relying on their published predictions.

  7. changliu0601

    Interesting post. I know that baseball has long used statistics to evaluate players. Teams use AI to analyze radar gun data (throwing speed and spin), video tracking (how players move around the field), and swing speed and mechanics from sensor-studded bats. And they use the data to give players better training as well.

  8. courtneymba

    Great article! I also enjoy March Madness, although I know nothing about college basketball and the teams I’m betting on. For this reason and my general love of data, I like to look at different data scientists’ predictions and build from there, typically skewing it towards southern teams. It will be so cool to see how ML evolves and if it can get any better at predicting what seem to be random upsets.
