Analyze Data Strategy of Spotify Podcast

Dec 29, 2021

Spotify Podcast Introduction

In its latest earnings report, Spotify told investors that podcast growth helped bring in all-time-high ad revenue of $375 million, a 75% year-over-year increase from Q3 2020. Spotify now has 165 million premium subscribers and 365 million monthly active users (MAUs), and more than 25% of MAUs engage with podcasts. Spotify’s interest in podcasts has shown no signs of slowing, as it continues to make acquisitions and investments in both podcast content and technology.

In this article, we look into the data strategies of Spotify Podcast, as well as the data management techniques and infrastructure behind them. This is a group project for the Introduction to Data Management course of the MSBA program at UT Austin.

Spotify Podcast Ecosystem

Spotify monetizes podcasts the same way it has monetized music: through subscriptions and ad revenue. Under this logic, the audience, podcast creators, and advertisers become the essential roles in this ecosystem. Spotify aims to increase revenue by placing advertisements within podcasts, since podcast listeners spend more time on the app than regular music listeners, and by generating additional income from new subscriptions.

Current Strategies of Spotify Podcast

Spotify is highly motivated to leverage customer data in a way that enhances and further monetizes the user experience and enables advertisers to better target potential customers. Its experience in building community and recommendations around content streaming further differentiates it.

Spotify Audience Network

Earlier this year, Spotify launched the ‘Spotify Audience Network’, which has been described as a potential game changer for podcast monetization: it serves as an audio advertising marketplace in which advertisers of all sizes can connect with listeners consuming a broad range of content. It builds on Spotify’s earlier launches of Streaming Ad Insertion and Ad Studio and on its acquisition of Megaphone. Since the initial launch in April 2021, the total number of podcasts in the network has grown by more than 50%.

With Spotify Audience Network, advertisers can buy podcast ads based on their target audience, going beyond classic title-by-title podcast buys to audience-based buying. The Network offers a broad range of easy-to-use targeting tools, including demographic targeting, behavioral targeting, and custom targeting. In addition to targeting on customer attributes, advertisers can use contextual targeting, in which ads run alongside topics relevant to their business, to reach the right listeners at the right moments. Powered by NLP, this offering goes beyond keywords to sentiments and contexts.

Originals & Exclusives (‘O&E’)

Spotify has been working to embrace the podcast creator community through its acquisitions of podcast-related companies, including several studios and creation tools such as Anchor. In addition to The Joe Rogan Experience, Spotify continued to launch exclusive partnerships with well-known podcast shows and artists this year, including Armchair Expert with Dax Shepard and Call Her Daddy. Spotify said these Originals have been effective in stimulating new user acquisition in emerging markets such as India and Latin America.

Data Management for Spotify Podcast

Based on its current strategies, we designed a simple data management plan for Spotify Podcast. The main goals are to benefit stakeholders of the Spotify Podcast Ecosystem:

Audience: Increase audience retention by offering exclusive membership titles or events to top listeners
Podcast Creators: Help podcast creators create content more easily by giving them more customer insights
Advertisers: Help advertisers target audiences accurately using customer attributes and creator performance

Data Warehouse

Data warehousing is a collection of methods, techniques, and tools that support knowledge workers (senior managers, directors, and analysts) in conducting data analyses that inform decision making and improve information resources. Data warehouses are best suited to storing structured data. Here we illustrate two suitable schemas in which the Spotify Podcast data could be stored.

Entity Relationship Diagram Example

General Podcast Tables:

Podcast
Episode
Podcast_Genre
Genre
Artist

User-Related Tables:

User_Info
Search_History
Listen_History
Visit_History
Share_History
Following_Podcast
User_Episodes
Artists_Following
Users_Following

Relational models are optimized for adding, updating, and deleting data in a real-time online transaction system. The ER diagram is designed around the strategy of helping Spotify expand its podcast ecosystem and provide more customer insights. The first part of the diagram consists of general podcast entities that store general information about podcasts. The other part consists of user-related entities, which allow us to track how users interact with podcasts on Spotify. From the user-related tables, we can see what content a user searches for and how many times a user visits or shares a specific podcast. We can also dive deeper into details such as how long a user listens to an episode and at which points they start and stop. Moreover, tables recording the podcasts, episodes, artists, and other users a person is following will help enhance customer segmentation and recommendation.
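To make the relationships a little more concrete, here is a minimal sketch of four of the tables above using Python's built-in sqlite3 module. Only the table names come from our ER diagram; the column names (podcast_id, start_s, stop_s, and so on) and the sample rows are assumptions made for illustration and do not reflect Spotify's actual schema.

```python
import sqlite3

# Hypothetical columns: only the table names come from the ER diagram.
SCHEMA = """
CREATE TABLE Podcast (
    podcast_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL
);
CREATE TABLE Episode (
    episode_id INTEGER PRIMARY KEY,
    podcast_id INTEGER NOT NULL REFERENCES Podcast(podcast_id),
    title      TEXT NOT NULL,
    duration_s INTEGER NOT NULL
);
CREATE TABLE User_Info (
    user_id INTEGER PRIMARY KEY,
    country TEXT
);
CREATE TABLE Listen_History (
    listen_id  INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES User_Info(user_id),
    episode_id INTEGER NOT NULL REFERENCES Episode(episode_id),
    start_s    INTEGER NOT NULL,  -- position (seconds) where playback started
    stop_s     INTEGER NOT NULL   -- position (seconds) where playback stopped
);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Podcast VALUES (1, 'Demo Show')")
    conn.execute("INSERT INTO Episode VALUES (10, 1, 'Pilot', 1800)")
    conn.execute("INSERT INTO User_Info VALUES (100, 'US')")
    conn.executemany(
        "INSERT INTO Listen_History (user_id, episode_id, start_s, stop_s) "
        "VALUES (?, ?, ?, ?)",
        [(100, 10, 0, 600), (100, 10, 600, 1500)],
    )
    return conn


def seconds_listened(conn, user_id, episode_id):
    """Total seconds a user has played of one episode, across all sessions."""
    row = conn.execute(
        "SELECT COALESCE(SUM(stop_s - start_s), 0) FROM Listen_History "
        "WHERE user_id = ? AND episode_id = ?",
        (user_id, episode_id),
    ).fetchone()
    return row[0]
```

Calling seconds_listened(build_demo_db(), 100, 10) adds up the two listening sessions of the sample user. Storing start and stop positions per session, rather than a single total, is what makes the "which parts they start and stop" question answerable.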

Dimensional Modeling Example

Fact Tables:

Customer Retention
Customer Engagement
Customer Segmentation

Dimension Tables:

User
Podcast
Episode
Artist
Country
Time
Genre

A dimensional model in a data warehouse is designed for reading, summarizing, and analyzing numeric information such as values, balances, and counts. Fact tables contain measurements, metrics, or facts about a business process, while dimension tables are companion tables to the fact tables and contain descriptive attributes used to constrain queries. With the model above, users can query the data they are interested in to get metrics such as customer bounce rate (the number of visits lasting less than 1 minute divided by the total number of visits), customer active rate, and podcast market share. Users can analyze these metrics along different dimensions (e.g., countries, regions, podcast genres) to obtain information that supports their decision making.
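As an illustration of querying a fact table by a dimension, the sketch below computes the bounce rate per country with Python's built-in sqlite3 module. The table names (fact_engagement, dim_country), the visit_seconds column, and the sample rows are hypothetical; we simply treat a visit shorter than 60 seconds as a bounce, following the definition above.

```python
import sqlite3


def build_star_schema():
    """Create a tiny in-memory star schema with made-up engagement data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_country (
            country_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE fact_engagement (
            visit_id      INTEGER PRIMARY KEY,
            country_id    INTEGER NOT NULL REFERENCES dim_country(country_id),
            visit_seconds INTEGER NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_country VALUES (?, ?)", [(1, "US"), (2, "IN")]
    )
    conn.executemany(
        "INSERT INTO fact_engagement (country_id, visit_seconds) VALUES (?, ?)",
        [(1, 30), (1, 300), (1, 45), (1, 900),
         (2, 20), (2, 600), (2, 700), (2, 800)],
    )
    return conn


def bounce_rate_by_country(conn):
    """Share of visits under 60 seconds, grouped by the country dimension."""
    rows = conn.execute("""
        SELECT c.name,
               AVG(CASE WHEN f.visit_seconds < 60 THEN 1.0 ELSE 0.0 END)
        FROM fact_engagement AS f
        JOIN dim_country AS c ON c.country_id = f.country_id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()
    return dict(rows)
```

Swapping dim_country for another dimension table (genre, time, and so on) changes the grouping without touching the fact table, which is the main appeal of the dimensional layout.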

Spotify’s Event Delivery & ETL Process

ETL stands for Extraction, Transformation, and Loading. The ETL process, which is often described as a reconciliation process, takes data from the operational databases, transforms it into an appropriate format, and loads it into the data warehouse. With ETL, data from different sources can be grouped into a single data warehouse for analytics programs to act on and realize key insights for Spotify Podcast.

The ETL process is crucial for Spotify’s Event Delivery system. The Event Delivery system is one of the foundational pieces of Spotify’s data infrastructure. Every event containing data about users, the actions they take, or operational logs from hundreds of systems is a key component for Spotify to understand its users and serve them the personally tailored content they love.

Extraction

In the extraction phase, relevant data is obtained from multiple sources. For example, logs and data can be extracted through the Spotify API from AWS or from multiple partitions of a SQL database.

Transformation

Transformation is the core of the reconciliation phase. Data extracted from Spotify’s operational databases has to be converted to a specific data warehouse format before it is loaded into the data warehouse. For instance, PySpark, Luigi, and Python are common tools for performing transformations.

Loading

Loading is the process of writing transformed data to the data warehouse. For instance, transformed data can be loaded into PostgreSQL using a SQLAlchemy engine.
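The three phases can be sketched end to end in a few lines of Python. This is a toy example under stated assumptions, not Spotify's pipeline: the raw events, their field names (ms_played, ts), and the listen_fact table are all made up, and we use the standard-library sqlite3 module as a stand-in for PostgreSQL with SQLAlchemy so that the example runs without extra dependencies.

```python
import json
import sqlite3

# Hypothetical raw events, standing in for logs pulled from an API
# or an operational database during extraction.
RAW_EVENTS = [
    '{"user_id": 1, "episode_id": 10, "ms_played": 90000, "ts": "2021-12-01T10:00:00"}',
    '{"user_id": 2, "episode_id": 10, "ms_played": 30000, "ts": "2021-12-01T11:30:00"}',
    '{"user_id": 1, "episode_id": 11, "ms_played": null, "ts": "2021-12-02T09:15:00"}',
]


def extract(lines):
    """Extraction: parse each raw JSON line into a dict."""
    return [json.loads(line) for line in lines]


def transform(events):
    """Transformation: drop incomplete events, convert ms to seconds,
    and keep only the date part of the timestamp."""
    rows = []
    for event in events:
        if event["ms_played"] is None:
            continue  # skip events with no playback duration
        rows.append((
            event["user_id"],
            event["episode_id"],
            event["ms_played"] / 1000,
            event["ts"][:10],
        ))
    return rows


def load(conn, rows):
    """Loading: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listen_fact ("
        "user_id INTEGER, episode_id INTEGER, "
        "seconds_played REAL, listen_date TEXT)"
    )
    conn.executemany("INSERT INTO listen_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, lines=RAW_EVENTS):
    """Run extraction, transformation, and loading in sequence."""
    load(conn, transform(extract(lines)))
```

Running run_etl(sqlite3.connect(":memory:")) loads the two complete events and skips the one with a missing duration. Keeping each phase in its own function makes it easy to swap the source or the target, for example replacing sqlite3 with a real warehouse connection.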

Spotify’s Event Delivery

Since 2017, Spotify has moved its Event Delivery system from an Apache Kafka-based system to Google Cloud Platform (GCP). On the Kafka-based system, an ETL job transformed data into Avro format and loaded it into HDFS (Hadoop Distributed File System) and Hive. On GCP, Cloud Storage replaces HDFS and BigQuery replaces Hive. Moreover, the ETL job is now written with Dataflow, whereas Apache Crunch was used before.

Spotify’s Event Delivery System Flow with Google Cloud Platform (source: Spotify)

Recently, Spotify redesigned its Event Delivery Infrastructure (EDI) to solve some problems the initial EDI encountered (see ‘Spotify’s Event Delivery Migration’).

Conclusion

The proposed data model will contribute to the growth in profitability of Spotify podcasts. An offensive strategy will best benefit Spotify, as MVOTs (Multiple Versions of Truth) will quickly support the required analyses. Nevertheless, a modicum of defense with a relatively robust SSOT (Single Source of Truth) is also critical to ensure data integrity. Spotify is also very aware of the need to keep its systems and products up to date with technology trends. Its decision to migrate its data infrastructure to Google Cloud in 2016–2017 clearly set it on the path towards using data-focused tools to improve customer service.

Next Steps

For Spotify to continue expanding its podcast business, we suggest a few ideas it can implement. First, Spotify can introduce audience reviews, which would enable a ‘two-way feedback loop’ between creators and listeners while helping creators generate interest and attract new listeners through ratings. Furthermore, Spotify could pursue more exclusive partnerships with both up-and-coming and popular artists to produce more high-quality podcasts, which would be a good way to improve reach.

References

[1] Spotify Technology S.A. Announces Financial Results for Third Quarter 2021
[2] Spotify’s Event Delivery: Life in the Cloud
[3] Spotify’s Event Delivery: The Road to the Cloud, Part I, Part II, Part III

ABOUT ME

Thank you so much for reading my article! You are welcome to follow me and give me claps if you find it inspiring :) I am Branda, currently a student in the MSc Business Analytics program at UT Austin. Don’t hesitate to email me at branda.huang@utexas.edu or connect with me on LinkedIn to discuss more interesting ideas!