Predicting Movie Success with Data Analytics

How movie and television studios can use big data to improve greenlighting, budgeting, and marketing


Throughout this article, we will talk about the different ways that data analytics are used by movie and television studios to give each production its best chance at success, but let’s get one thing out of the way: it is not possible to “predict success” in the way that most people mean.  Just like weather models can be thrown off by anomalous meteorological events, so too can the performance and reception of a movie or television show.

Movie data analytics can be unbelievably powerful and can make educated guesses, but it cannot determine an individual project’s fate with absolute certainty. That is in no small part due to how we define “success.” For some, it will be ticket sales; for others, profit margin, reviews, social chatter, franchise options, or critical awards.

Research groups in both the private sector and academic institutions have used movies and television as source data to test social and cultural theories. In several studies, researchers are using data analytics to predict Hollywood blockbusters from years past.  In this article alone, we have linked to over 50 sources, including 15 technical papers, relating to emergent technology and predictive analytics in the entertainment industry.

What all of this research has found is that within certain thresholds, yes, you can use data analytics to predict movie success or performance, and there are better indicators of success for some movies than for others.

Because there are so many ways to define success, it is a bit naive to talk about predicting “movie success” without first saying which measure you mean. Before we get too far, though, let’s cover some basic definitions and concepts.

What are data analytics?

As defined by Techopedia, "data analytics refers to qualitative and quantitative actions used to enhance productivity, opportunity, and business gain. Data is extracted and categorized to identify and analyze behavioral data and patterns."

Several different types of data analytics apply to predictive analytics in movies and television:

Kernel Trick

Used in classification algorithms such as Support Vector Machines (SVM) and Logistic Regression to convert nonlinear classification problems into a linear classification problem. A ‘kernel’ transforms data from one space into another space (usually a higher dimensional space), so that data can be separated linearly according to their classes. It's basically a way to uncomplicate very complicated data and treat it as a math problem.

Check out Eric Kim's explainer, "Everything You Wanted to Know About the Kernel Trick (But Were Too Afraid to Ask)."
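To make the idea concrete, here is a minimal sketch in Python. It assumes a degree-2 polynomial kernel and a small set of made-up 2-D points (both are illustrative choices, not taken from any study discussed here). It shows that the kernel gives the same number as a dot product in the higher-dimensional space without ever building that space, and that points which cannot be split by a straight line in 2-D can be split by a flat plane once mapped.

```python
import math

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x, y = v
    return (x * x, math.sqrt(2) * x * y, y * y)

def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(a, b):
    """k(a, b) = (a . b)^2, computed without building phi(a) or phi(b)."""
    return dot(a, b) ** 2

# Hypothetical toy data: one class near the origin, one on an outer ring.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

# The kernel value matches the dot product in the mapped space.
a, b = inner[2], outer[2]
same = math.isclose(poly_kernel(a, b), dot(phi(a), phi(b)))

def linear_score(v):
    """A linear rule in the mapped space: first + third coordinate - 1.0.
    The threshold of 1.0 is an assumption chosen for this toy data."""
    f = phi(v)
    return f[0] + f[2] - 1.0

inner_side = [linear_score(v) < 0 for v in inner]
outer_side = [linear_score(v) > 0 for v in outer]
```

In the original 2-D space no straight line separates a ring from the points inside it; after the mapping, a single linear rule does.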

Textual Analysis

As defined by Duke University's library guide, text analysis is a “broad term covering various processes by which text and natural language are organized and described” in order to: categorize the texts into various subgroups, optimize for search, classify according to genres/subgenres, compare to another direct piece or set of content, identify trends in content, codify topics, categorize by place or significant figures, and visualize text.  

There are several methods of text-based analysis:

  • Word frequency
  • Collocation (words that often appear near one another)
  • Concordance (the contexts in which a given set of words appear)
  • N-grams (common two-word, three-word, and longer phrases)
  • Entity recognition (identifying names, places, time periods, etc.)
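A few of the methods listed above can be sketched with nothing more than the Python standard library. The snippet below is a simplified illustration; the sample text and the helper names (`tokenize`, `ngrams`, `concordance`) are hypothetical, and a production pipeline would use a proper NLP toolkit for tokenizing and entity recognition.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, word, width=2):
    """Return the words surrounding each occurrence of `word`."""
    return [tokens[max(0, i - width): i + width + 1]
            for i, t in enumerate(tokens) if t == word]

# Hypothetical snippet standing in for a script.
sample = "The police chase the car. The car crashes. The police fire."
tokens = tokenize(sample)

word_freq = Counter(tokens)                # word frequency
bigram_freq = Counter(ngrams(tokens, 2))   # two-word n-grams
car_contexts = concordance(tokens, "car")  # contexts around "car"
```

Calling `word_freq.most_common(3)` or `bigram_freq.most_common(3)` surfaces the dominant words and phrases in the sample.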

Network Analysis

Examines the structure of relationships between classified entities. It draws on graph theory: a given network can be defined as a graph and plotted with nodes named for the members or hierarchy of that network. It can identify key players, relationship structure and stability, and action characteristics, and it has applications in fields ranging from particle physics to social networks.
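As a rough illustration, the sketch below builds a character network from a list of scenes and finds the "key player" by counting connections. The scenes and character names are invented, and counting distinct neighbors (degree) is only one of many possible centrality measures.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical scenes, each listing the characters who appear together.
scenes = [
    ["Ana", "Ben"],
    ["Ana", "Cal", "Dee"],
    ["Ben", "Ana"],
    ["Ben", "Eli"],
]

# Undirected, weighted graph: edge weight = number of shared scenes.
edges = defaultdict(int)
for cast in scenes:
    for a, b in combinations(sorted(set(cast)), 2):
        edges[(a, b)] += 1

# Degree = number of distinct characters each character is linked to.
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

degree = {name: len(linked) for name, linked in neighbors.items()}
key_player = max(degree, key=degree.get)
```

Here the edge weights capture how strong each relationship is, while the degree counts capture how widely connected each character is.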

Sentiment Analysis

An application of natural language processing, text analysis, computational linguistics, and biometrics to determine the attitude, mood, or tone corresponding to speech and writing. At the most basic level, text is assigned a positive, negative, or neutral value to determine the polarity of expression--i.e., what is happy, sad, etc. in a given situation. Individual words can also be tagged with specific emotions if desired.

Most common applications are for customer service evaluations and pre-market testing of marketing materials. Also referred to as ‘opinion mining’ and ‘emotion AI.’
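The most basic version of this, assigning a positive, negative, or neutral value to a piece of text, can be sketched with a word list. The tiny lexicon and the sample reviews below are hypothetical; real systems rely on much larger curated lexicons or trained models, and handle negation and sarcasm, which this sketch does not.

```python
import re

# Hypothetical mini-lexicon; real systems use far larger, curated ones.
LEXICON = {"love": 1, "great": 1, "thrilling": 1,
           "boring": -1, "hate": -1, "awful": -1}

def polarity(text):
    """Sum the word scores and map the total to a polarity label."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Great cast, thrilling ending!",
    "Boring plot. I hate the villain.",
    "It opens on Friday.",
]
labels = [polarity(r) for r in reviews]
```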

Neural Network

A data model that can handle complex, varied information and produce sophisticated observations and solutions. It is designed to mimic human cognition in two essential functions: it acquires knowledge through observation and learning, and it stores that knowledge in the strengths of the connections between its nodes, much as the brain does with synaptic connections.

The most common method is the Multilayer Perceptron (MLP), a “supervised network,” which needs direct instruction of the desired output to learn and acquire knowledge. Its primary purpose is to use historical data to correctly correlate cause and effect or input and output so that when the output is actually unknown, the neural net can still yield results.
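To show what "supervised" means in practice, here is a minimal sketch of a single perceptron, the building block that an MLP stacks into layers. It is trained with the classic perceptron learning rule rather than the backpropagation a full MLP would use, and the feature names and labeled "history" are hypothetical toy data.

```python
def predict(weights, bias, x):
    """Fire (1) when the weighted sum of the inputs is positive."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=20, rate=1.0):
    """Perceptron learning rule: nudge weights toward each missed label."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in samples:
            error = label - predict(weights, bias, x)
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias

# Hypothetical labeled history: (has_star, is_sequel) -> hit (1) or not (0).
history = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(history)
predictions = [predict(weights, bias, x) for x, _ in history]
```

The network is never told the rule behind the labels; it only sees inputs paired with desired outputs and adjusts its connection weights until its answers match.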

Bag of Words

Pretty much how it sounds. A piece of text is taken as an unordered collection of words. The program throws out the structural rules of language, paying attention only to the actual words used and the number of times those words appear throughout the text, in order to find topic groups, significant themes, and ideas.
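A minimal sketch of the idea, using only the Python standard library, follows. The sample sentences are illustrative. Note how two sentences with opposite meanings produce identical bags, because word order is discarded.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count each word, ignoring order, case, and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

first = bag_of_words("The villain kills Superman.")
second = bag_of_words("Superman kills the villain.")
same_bag = first == second  # True: word order is thrown away

# Frequent words hint at themes in a (hypothetical) script excerpt.
excerpt = "Guns and blood. Police fight. More guns, more police, guns again."
top_word, top_count = bag_of_words(excerpt).most_common(1)[0]
```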


How can the movie industry put big data to use?


Each year, the entertainment industry produces more movies and bigger movies, while more types of media compete for audience attention. Between 2000 and 2010, just 36% of movies had profitable box-office returns. Missing the mark is more costly than ever, and every decision has to take more into account. Success hinges on your professional experience and the industry insights of those around you. In all, it is a pretty closed system with a lot of opportunity for subjective interpretation.

Analytics gives businesses the quantitative data necessary to make better, more informed decisions and improve the services they provide to their audience. Machine learning and artificial intelligence are becoming more and more common in many industries, including the entertainment industry. One of the most prominent examples of AI and film data analytics is at Netflix, which uses data analytics so thoroughly that it can offer “33 million different versions of Netflix” to its customers, as told to Kissmetrics. We will look at applications in streaming and web-based entertainment in Part IV, but for now, let’s stick to how the traditional production process can take advantage of movie industry analytics.

Like most industries today, Hollywood has access to more supportive data for decision-making than ever before. Ingesting and interpreting all of that information has, until very recently, been an intimidating task. Thankfully, the right visualization of that data is making analysis easier and faster. As discussed by panelists at a Hollywood analytics and data conference hosted by IBM, today’s analytics and listening tools mean that “it doesn't require specialized knowledge to achieve ‘data literacy.’”

For more realistic predictive analytics for movies, analyze as many films as you can get your hands on.

By analyzing a large group of movies and television shows and using specific rules to define parameters, patterns emerge. In data analytics, quantity is king, and historical data is essential to predicting future success. A study in the Harvard Business Review-Journal noted  that “to effectively leverage historical data, it is vital to look at the past performance of a large volume of films as the basis for revenue forecasts and in the development of any forward-looking financial plans for a particular film.”

By examining thousands of movies and television shows over several decades, analysts, marketers, and producers can detect anomalies and get statistically meaningful results that help provide recommendations. As the saying goes, knowledge is power, and data analytics gives you the ability to turn raw data into valuable knowledge. Those who take full advantage of the data available will gain an immediate edge over those who go with their gut or industry expertise alone.

Current and developing applications for predictive analytics for movie success combine developments in e-commerce, consumer behavior, digital advertising/SEO, and cognitive science.

Researchers, businesses, and studios have all performed individual studies and found promising correlations with:

  • Character types
  • Dialogue style
  • Plot complexity
  • Movie Distribution
  • Release Date/Seasonality
  • Genre
  • Sequel/Franchise status
  • Star power
  • Reviews
  • Rating
  • Awards/nominations of attached entities
  • Budget
  • ‘Buzz’ (combination of social chatter, reviews, impressions, editorial mention, etc.)

At the time of this writing, no research points to a single indicative measure, and it is important to acknowledge that these are all disparate studies without a common dataset or sample size (they range from 80 to 1,000 scripts). Nevertheless, in each instance, a set of points is measured and compared against a given outcome. This could mean the number of reviews of a given movie compared to its box office earnings, or it could mean the number of high-profile stars in a movie compared to its critical reception.
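Comparing a measured feature against an outcome often starts with a simple correlation. The sketch below computes a Pearson correlation coefficient by hand; the review counts and box-office figures are invented for illustration and are not drawn from any of the studies mentioned here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical figures: review counts and box office in millions of USD.
review_counts = [120, 340, 560, 800, 950]
box_office_musd = [12.0, 30.0, 41.0, 77.0, 85.0]

r = pearson(review_counts, box_office_musd)
```

A value of `r` near 1 or -1 signals a strong linear relationship and a value near 0 signals a weak one. Either way, it describes association in the sample, not cause.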

The capability of data analysis goes beyond just predicting blockbuster success or failure, though it can take a hyper-educated guess at that too. The best way to “predict” success is to integrate the right data analytics approach at each stage of a movie’s life cycle. The most immediately impactful applications are in development and distribution, though there are innovative applications in production and post-production as well.


Analytics in Movie Development


Data analytics can help development departments determine:

  • How topically relevant a story is right now, where geographically it is projected to be most well-received, and if there is a cyclical nature to the topic.
  • How to replace cliches in a script with more nuanced ideas.
  • How to find the right stories for specific audiences and how to best reach them (also useful to the marketing department).
  • What audiences are looking for by analyzing viewing histories, searches, clicks, engagement, and more.

To state the obvious: the movie business is the storytelling business, and the goal of storytelling is to convey a specific experience. Think of data analytics as your tool to focus on storytelling by using data as a driving force.

As any writer will tell you, the villain that looms largest in the creative process is the blinking cursor on a blank page. There are so many ways to introduce characters and setting, so many ways to get your characters to their end goal, and so many names that could lead your audience to think one way or another about a character, for better or for worse. The choices can be overwhelming, so a lot of the time decisions are made arbitrarily just to get some text on the page, and over time those choices grow so near and dear to the writer’s heart that changing them seems impossible and disastrous to the story.

It can be the same for producers and studios, except that instead of a blank page, they begin with thousands of possible pages, like some nightmarish “Choose Your Own Adventure” where the instruction is “choose which stack of paper you want to bet thousands of dollars on, and you only have about 10 minutes.”

In addition to the writer's narrative concerns, the studio professional must also consider the competitive development landscape, whether their productions will resonate with the viewing public, what kind of talent will be best, and how to allocate the budget, just to name a few.

Again, the choices can be overwhelming, and pitfalls no less damaging.

Methods like text-based predictive data analytics give studios more clarity and context for each of those problems at the development stage. Text mining can reveal hidden structures of the story, tell us what the story is about, and explain the mechanics of the actual storytelling. A data scientist or predictive model can correlate these mechanics to box office receipts, critical reception, budget, etc. From this, studios can find the kinds of stories that will resonate and, importantly, which elements of those stories will provide the maximum return on their investment.

Instead of looking for a needle in a haystack, producers can use data analytics to interpret viewing histories, web searches, or user engagement to identify the kinds of content people actively seek. Each element of a script can be measured and compared to external trends like social media engagement to gauge topic relevance and audience size through methods like sentiment analysis.

Could predictive analytics help producers pick another Moonlight, winner of the 2017 Academy Award for Best Picture?

From all of this data, studios can build custom models to cherry-pick or highlight scripts to review and fast-track through the greenlighting process. If a director is looking to pick up the next Moonlight, they may want scripts that are similar to it. But similar to it how? Because they both have three stages in the development of a young protagonist? Because they both explore budding sexuality in an oppressive social environment? Multiple methods of data analysis give producers full-scale comps based on any given metric (location-based, character-based, final length, shooting script vs. screenplay, etc.).

StoryFit data scientists have designed such experiments based on gender-based stereotypes and performances in movies. In 2018, the company presented its research, “Sex & Text: Breaking Down Movie Stereotypes with AI,” at the South By Southwest Film Festival, and made samples of that research available online.

Several text-based analytics methods apply to development and greenlight decisions. Even a relatively simple approach like the Bag of Words model--counting how often each word appears--detects essential factors for an audience based on the script alone, as illustrated by Wharton School researchers Jehoshua Eliashberg, Sam K. Hui, and Z. John Zhang in “From Storyline to Box Office: A New Approach for Green-Lighting Movie Scripts”:

“The bag-of-word approach will help us pick up the themes, scenes, and emotions in a script. For instance, the frequent appearance of words such as “guns,” “blood,” “fight,” “car crashes,” and “police” may indicate that the script contains a crime story with action sequences. When this information is coupled with known box office receipts for the movies already made in the recent past, we would know if the movies of this type tend to sell well or not in theatres.”

This method can also give a producer a rough approximation of the MPAA rating the script would receive should it be produced. Some of the insights available from text-only analysis at StoryFit include scene highlights, character relationships, and pacing, all of which can be correlated to success.


The research is certainly compelling, but most data scientists--including those at StoryFit--ardently agree that without the human element, analytics are just expensive numbers. When coupled with production staff expertise, predictive analytics provide studios with informed forecasts of both performance and reception at the earliest stages of development, but the human element is still critical:

“To take a simple example,” explained Eliashberg et al., “the plot ‘the villain kills Superman’ and ‘Superman kills the villain’ will clearly trigger a different emotional response from the audience, even though both sentences contain exactly the same words with the same frequency. Therefore, specific to the analysis of stories, we need to incorporate domain knowledge from screenwriting experts in evaluating a movie script to exhaust the potential of using scripts to predict movie successes.”

Movies with higher counts of distinct topics, characters, and concepts are statistically more successful and praised than more simplistic movies.

In 2016, researchers from Carnegie Mellon, the American University of Sharjah, and the School of Visual Arts furthered this work in “Predicting Box Office from the Screenplay: A Text Analytical Approach,” a textual network analysis of 170 American shooting scripts for movies released between 2010 and 2011. Their study sought to predict opening-weekend box-office outcomes and found that the most important indicator of success was the size and complexity of the script’s network, i.e., how many individual relationships, topics, and concepts were identifiable from the script itself. Their work also supports results from a 2014 network text analysis study of 150 screenplays, which found that award nominees and winners had text networks over 33% larger than those of “amateurs.”

The study used a regression model to compare box-office performance to the complexity (a.k.a. network strength) of each script. Its findings support those of earlier research that found content and genre to be the most reliable predictors of success, with positive correlations for films in the thriller or romance genres and for movies with early exposition and a strong nemesis. It found negative correlations with vulgar language, an R MPAA rating, and films in the drama genre.
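For readers unfamiliar with regression, the sketch below fits a one-variable least-squares line. It is only an illustration of the general technique: the study's actual model, variables, and data are not reproduced here, and the `network_size` and `opening_musd` values are deliberately tidy, made-up numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Hypothetical data: script network size vs. opening weekend (millions USD).
network_size = [10, 20, 30, 40, 50]
opening_musd = [8.0, 13.0, 18.0, 23.0, 28.0]

intercept, slope = fit_line(network_size, opening_musd)

# Forecast for a new, unseen script with a network size of 35.
forecast = intercept + slope * 35
```

Real data never falls this neatly on a line, which is why studies report how much of the variation the model explains rather than a single confident forecast.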

Beyond the Script

University of Iowa researchers Michael T. Lash and Kang Zhao used similar techniques in text mining as well as social network analysis to extract both predictive and prescriptive insights based on the crew, cast, story, and release timing. Their study, “Early Predictions of Movie Success: the Who, What, and When of Profitability,”   used net profit as their benchmark as opposed to strictly box-office receipts.

Lash and Zhao found an exciting nuance for casting consideration: while they saw a moderate correlation of cast “star power” with revenue, the correlation to profit is significantly weaker. “In other words,” wrote Lash and Zhao, “having actors who have earned big box-office revenues in a movie does not necessarily mean more profit for the movie.”

Harvard Business Review example of social analysis for casting purposes, a form of data analytics used to predict movie success or failure.

Their study found fascinating profit correlations throughout the filmmaking process, such as average genre expertise (how much expertise a cast has in a specific movie’s genre), and average actor-director collaboration.


Analytics in Movie Promotion and Distribution


We have all heard stories about those famous scripts and books that companies passed on and thought, “what chumps.” But really, they were probably shrewd. They knew their team, they knew their reach, and they knew what they could do well. A good story without a way to reach the right audience is not going to get made. But with data analytics, marketers can predict the ideal audience down to precise detail and understand the best approach and timing to reach them.

Data analytics can help marketers tailor campaigns based on geographic data; there is even research suggesting a correlation between the number of Facebook likes a given film gets in a location and its chances of selling out at a nearby movie theater. Marketers can also use social listening to optimize release date based on geography, subject matter, or topic. The most successful example of this is 2016: Obama’s America, which debuted on different dates in different regions based on localized political trends.

Integrating Audience Data


Movies are all about storytelling. Storytelling is all about conveying a specific experience, and successful analytics have to work to drive that experience, not push an agenda or message down someone’s throat. “The crux of predictive analytics,” says Target Marketing Magazine, “is understanding how the customer will respond to your offer.”

We’ve got to understand how each element of a movie impacts the viewer. Take a moment to think about how you as a consumer--not you as a media professional--interact with any given movie. Each action gives something away about how you are interacting with that story and the way that story was marketed to you. The delivery method, when and where you buy your ticket, when you pause, if you keep watching, if you skip ahead, if you read the description before hitting play: each is a point that tells a story about you. And the facts of those moments--the music, the dialogue, the way the scene was cut--possibly correlate to performance. By collecting data to understand those driving forces, marketers can identify and target even the most specific groups.

Even the old standard

AdAge reported in 2015 on a new spin to the “For Your Consideration” print ads seen before awards season: data-driven social targeting of members of the Academy of Motion Picture Arts and Sciences. Through third-party companies, studios use geographic and demographic information to find those members and send them tailored messages about given movies. Disney takes it even further by using AI to analyze viewers in real time. And by analyzing the social behavior of those users, the messages themselves can be customized based on their style and engagement.

Post Production

At the movie and television post-production stage, data analytics can improve elements of editing and budgeting, turning eye-crossing spreadsheets into useful information and actionable insight.

We’ve all been to a movie that gave too much away in the trailer and left us feeling like we’ve wasted money and time. Odds are you’ve also been to a movie expecting one kind of experience based on the promotional trailers only to realize the movie was not at all accurately represented. Sometimes this can be a pleasant surprise, but it can also be a huge disappointment.

Taking advantage of sentiment analysis, predictive analytics, and visual analytics, editors can create trailers that are more enticing and more representative of their films. More satisfied viewers are more likely to recommend the films to their friends, increasing your movie’s chance for success.

The Wolf Creek reboot used social analytics to help predict how successful the show would be with new audiences.

By using data analytics platforms to test trailers and promos in advance, studios learn more about market reception and can adapt accordingly or validate a gamble, as Stan Entertainment did in anticipation of its 2016 release of the Wolf Creek remake. Stan knew that name recognition alone wouldn’t carry the ten-year-old Australian B-movie slasher, so they cast Lucy Fry as the female lead, gambling that she would broaden the show’s appeal. At that point, Fry's most prominent role was her portrayal of Lee Harvey Oswald’s wife in Hulu’s adaptation of 11.22.63 by Stephen King.

To test their theory, before release they distributed a version of the trailer that placed more emphasis on Fry than other characters in a 20-second spot. The results supported their casting and developmental decisions.

Turn to Netflix and a similar story emerges. When embarking on their first attempt at original content production, the streaming giant was not going to leave anything to chance. From casting to the concept and promotional trailers, data led the charge. Netflix made five times as many unique trailers for their debut original series House of Cards as most studios make for prime-time television shows or feature films. The series starred Kevin Spacey and Robin Wright and was directed by auteur filmmaker David Fincher, and each trailer targeted a specific audience with particular preferences, all based on historical viewer behavior. “If you watched a lot of Kevin Spacey films, you saw a trailer featuring him,” explained Kissmetrics. “Those who watched a lot of movies starring females saw a trailer featuring the women in the show. And David Fincher fans saw a trailer featuring his touch.”

Learning from Past Mistakes

One of the most significant strengths data analytics gives decision-makers is the ability to take a comprehensive but detailed view of the past and apply the lessons found to the present and future. A script could test positively, and its story and topics could correspond with trends observed in social analytics, and the movie could still fail. However, by using the savant-like recall that data analysis gives us, a movie can be compared to the past performance of similar movies based on storyline, budget, star power, release timing, etc. With those features identified, comparable distribution can be analyzed to surface both successful strategies to replicate and pitfalls to avoid.


Analytics in Streaming Services

Online content providers leading the charge


Predicting an outcome accurately, with any statistical significance, requires regression models. Regression models are entirely dependent on vast amounts of varied, clean, high-quality data. Web-based companies are inherently data-rich. Streaming companies like Amazon, Hulu, and Netflix capture and track information about their users on everything from location and browser preference to behavior and attention span. And media companies are intrinsically content-rich, filled to the brim with variations of media to engage their audience with. With data analytics, these streams of information become roadmaps to successful conversion.

There’s no lack of content for producers and distributors to work with: the age-old problem is finding the right person for each piece of content at the time that they are most likely to convert. With Netflix leading the pack, streaming services are using their immense amount of data and resources to power algorithms that solve for timing, and to develop original content and re-package licensed content.

Netflix was founded in 1997 by Reed Hastings and Marc Randolph as a DVD home-delivery subscription service. Ten years later, it was one of the first companies to offer an affordable web streaming subscription for premium and high-demand content. Just six years after that, in 2013, Netflix entered the arena as a bona fide producer with the hit series House of Cards. With the Kevin Spacey/Robin Wright political thriller, Netflix created a fundamental shift in the way audiences find and experience narrative media and, critically, in what they expect from it.

Netflix algorithms know where you are in House of Cards and how likely you are to keep watching based on other viewing habits.

The creation of House of Cards was not a happy accident: it was the result of analyzing Netflix’s vast array of subscriber data alongside the content they had to provide. Using machine learning and artificial intelligence, Netflix scours all the data they have to find patterns with meaningful output to their customers. “The scope and scale of AI allow them to do this in an unprecedented fashion,” explained MarketWatch in September of 2017. “Netflix can detect and analyze underlying scene elements that drive viewer engagement.”

And they use that data in every stage of the consumer experience, from content development and acquisition to user interface and engagement. Features such as autoplay, skip the intro, and Netflix recommendations complete with a match score are all provided by Netflix’s algorithms.

But before they were able to sit in the producer’s chair, Netflix had to create a captive audience by delivering the right content made by Hollywood’s best studios and most beloved indie filmmakers. Licensing deals for such films are expensive, so to prioritize their acquisition budget, they turned to data.  Think of the “because you watched” features on Netflix. There are often movies you’ve never heard of. These are movies that are cheaper to acquire but look similar from a data analytics level to the movies their subscribers already love. Netflix also uses web scraping to track what movies and television shows are being pirated online to make strategic acquisition decisions, using the actors, directors, story types, and kinds of programs most common as a guiding force.
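One common way to judge whether two titles "look similar" in data terms is to describe each as a vector of features and compare the vectors. The sketch below uses cosine similarity for this. It is a generic illustration, not Netflix's actual method, and the titles and feature values are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical features: (thriller, romance, political, ensemble cast).
catalog = {
    "Well-known hit": (1.0, 0.0, 1.0, 1.0),
    "Obscure title A": (1.0, 0.0, 1.0, 0.0),
    "Obscure title B": (0.0, 1.0, 0.0, 1.0),
}

# Rank the lesser-known titles by similarity to the hit.
target = catalog["Well-known hit"]
ranked = sorted(
    (name for name in catalog if name != "Well-known hit"),
    key=lambda name: cosine(target, catalog[name]),
    reverse=True,
)
```

The top of `ranked` is the lesser-known title whose feature profile sits closest to the title a subscriber already loves.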

Many users are not aware that algorithms inform each piece of their interaction with Netflix: two subscribers may see different trailers or cover images for the same title based on their viewing history (taking into account both videos watched and videos abandoned) and rating history.

They work this data-driven magic frame by frame and pixel by pixel by measuring each meaningful consumer action, like:

  • At which point you pause, rewind, or skip ahead.
  • What date and day you watch (fun fact! TV shows are mostly watched during the week, while movies are a weekend activity)
  • Your zip code
  • What kind of device you use to view
  • Browsing and scrolling behavior

And more.

As told to Wired by Netflix engineer Charles Smith and Data Architecture Platform Manager Jeff Magnusson, each aspect of a video adds analytic value: even the color spectrum of the cover image. “Analyzing colors allows the company to measure the distance between customers,” they noted. The company can record the specific range of colors that a given user engages with over a set period. And that’s just one of the things they measure to make bets on success, a practice that has earned them 91 Emmy nominations and a renewal rate three times that of the typical subscription network.

Netflix uses questions to find the data that they believe will drive innovation of experience and creation,  like, “Are some customers leaning towards specific types of covers? If they are, should their recommendations adjust?”

Netflix’s success may be in part due to its core mission, which is guided by three deceptively simple tenets, as explained by Netflix engineer Charles Smith and Data Architecture Platform Manager Jeff Magnusson at the Hadoop Summit in the fall of 2017:

  • Data should be accessible, easy to discover, and easy to process for everyone.
  • Whether your dataset is large or small, being able to visualize it makes it easier to explain.
  • The longer you take to find the data, the less valuable it becomes.

This foundation guides the company, but they do not own the ideas inherent in it. With the right tools, there is no reason that original content production companies cannot do the same or similar things with data analytics to game the success of their productions.



Key Takeaways

  • Not All Data is Created Equal: clean, consistent data is the easiest to work with and produces the best results
  • More data = better outcomes
  • Bring data in early, and you will be more nimble throughout the process.
  • Here at StoryFit, content-based feature analysis and insights are our bread and butter--just one of the many tools in a savvy producer’s analytics toolbox.





