

Data Analytics Case Study Guide
What are data analytics case study interviews?
When you’re trying to land a data analyst job, the final hurdle standing in your way is often the data analytics case study interview.
One reason they’re so challenging is because case studies don’t typically have a right or wrong answer.
Instead, case study interviews require you to come up with a hypothesis for an analytics question, and then produce data to support or validate your hypothesis. In other words, it’s not just about your technical skills; you’re also being tested on creative problem-solving and your ability to communicate with stakeholders.
How to Solve Data Analytics Case Questions
With data analyst case questions, you will need to answer two key questions:
- What metrics should I propose?
- How do I write a SQL query to get the metrics I need?
In short, to ace a data analytics case interview you not only need to brush up on case questions, but you also should be adept at writing all types of SQL queries and have strong data sense.
These questions are especially challenging if you don’t have a framework for answering them. To help you prepare, we created this step-by-step guide to answering data analytics case questions.
We show you how to use a framework to answer case questions, provide example analytics questions, and help you understand the difference between analytics case studies and product metrics case studies.
Data Analytics Cases vs Product Metrics Questions
Product case questions sometimes get lumped in with data analytics cases.
Ultimately, the type of case question you are asked will depend on the role. Product analysts for example will likely face more product-oriented questions.
Product metrics cases tend to focus on a hypothetical situation. You might be asked to:
Investigate Metrics - One of the most common types will ask you to investigate a metric, usually one that’s going up or down. For example, “Why are Facebook friend requests falling by 10 percent?”
Measure Product/Feature Success - A lot of analytics cases revolve around measurement of product success and feature changes. For example, “We want to add X feature to product Y. What metrics would you track to make sure that’s a good idea?”
With product data cases, the key difference is that you may or may not be required to write the SQL query to find the metric.
Instead, these interviews are more theoretical and are designed to assess your product sense and ability to think about analytics problems from a product perspective. Product metrics questions may also show up in the data analyst interview, but likely only for product data analyst roles.
Data Analytics Case Study Question: Sample Solution

Let’s start with an example data analytics case question:
You’re given a table that represents search results from searches on Facebook. The query column is the search term, the position column is the position the search result appeared in, and the rating column is a human relevance rating from 1 (low relevance) to 5 (high relevance).
Each row in the search_events table represents a single search, with the has_clicked column representing whether a user clicked on a result. We have a hypothesis that the click-through rate (CTR) depends on the search result rating.
Write a query to return data to support or disprove this hypothesis.
search_results table:
search_events table:
Step 1: With Data Analytics Case Studies, Start by Making Assumptions
Hint: Start by making assumptions and thinking out loud. With this question, focus on coming up with a metric to support the hypothesis. If the question is unclear or if you think you need more information, be sure to ask.
Answer: The hypothesis is that CTR depends on search result rating. Therefore, we want to focus on the CTR metric, and we can assume:
- If CTR is high when search result ratings are high, and CTR is low when the search result ratings are low, then the hypothesis is correct.
- If CTR is low when the search ratings are high, or there is no proven correlation between the two, then our hypothesis is not proven.
Step 2: Provide a Solution for the Case Question
Hint: Walk the interviewer through your reasoning. Talking about the decisions you make and why you’re making them shows off your problem-solving approach.
Answer: One way we can investigate the hypothesis is to look at the results split into different search rating buckets. For example, if we measure the CTR for results rated at 1, then those rated at 2, and so on, we can identify if an increase in rating is correlated with an increase in CTR.
First, I’d write a query to get the number of results for each query in each bucket. We want to look at the distribution of results that are less than a rating threshold, which will help us see the relationship between search rating and CTR.
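The original query isn’t reproduced in the text, so below is a sketch of the approach it describes, runnable against SQLite. The table and column names come from the prompt; the rating thresholds and the tiny sample rows are illustrative assumptions.

```python
# Sketch of the described bucketing CTE against SQLite. Table/column names
# come from the prompt; thresholds and sample data are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE search_results (query TEXT, position INTEGER, rating INTEGER);
CREATE TABLE search_events (search_id INTEGER, query TEXT, has_clicked INTEGER);
INSERT INTO search_results VALUES
  ('cats', 1, 1), ('cats', 2, 1),   -- all results rated <= 1
  ('dogs', 1, 3), ('dogs', 2, 2),   -- all results rated <= 3
  ('fish', 1, 5), ('fish', 2, 4);   -- well-rated query
INSERT INTO search_events VALUES
  (1, 'cats', 0), (2, 'cats', 0),
  (3, 'dogs', 1), (4, 'dogs', 0),
  (5, 'fish', 1), (6, 'fish', 1);
""")

QUERY = """
WITH rating_buckets AS (
    -- For each query, check whether ALL of its results fall under a threshold:
    -- total count minus the count under the threshold equals 0 only when
    -- every result is under it (the subtraction trick described in Step 3).
    SELECT
        query,
        CASE
            WHEN COUNT(*) - SUM(CASE WHEN rating <= 1 THEN 1 ELSE 0 END) = 0 THEN 'all <= 1'
            WHEN COUNT(*) - SUM(CASE WHEN rating <= 2 THEN 1 ELSE 0 END) = 0 THEN 'all <= 2'
            WHEN COUNT(*) - SUM(CASE WHEN rating <= 3 THEN 1 ELSE 0 END) = 0 THEN 'all <= 3'
            ELSE 'above 3'
        END AS bucket
    FROM search_results
    GROUP BY query
)
-- Re-join to the search events and compute CTR per bucket.
SELECT b.bucket, AVG(e.has_clicked) AS ctr
FROM search_events e
JOIN rating_buckets b ON e.query = b.query
GROUP BY b.bucket;
"""
ctr_by_bucket = dict(cur.execute(QUERY).fetchall())
print(ctr_by_bucket)  # e.g. {'above 3': 1.0, 'all <= 1': 0.0, 'all <= 3': 0.5}
```

On this toy data, CTR rises with the rating bucket, which is the pattern that would support the hypothesis.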
This CTE aggregates the number of results that fall below a certain rating threshold. Later, we can use this to see the percentage in each bucket. If we re-join to the search_events table, we can calculate the CTR by grouping by each bucket.
Step 3: Use Analysis to Back Up Your Solution
Hint: Be prepared to justify your solution. Interviewers will follow up with questions about your reasoning, and ask why you make certain assumptions.
Answer: Using a CASE WHEN statement, I assigned each query to a ratings bucket by checking whether all of its search results were rated at or below 1, 2, or 3: subtracting the count within the bucket from the total count and checking whether the difference equals 0.
I did that to get away from averages in our bucketing system, since outliers would make it harder to measure the effect of bad ratings. For example, if one of a query’s results were rated 1 and another rated 5, the average would be 3. In my solution, by contrast, a query whose results are all rated at or below 1, 2, or 3 is known to genuinely have bad ratings.
Product Data Case Question: Sample Solution

In product metrics interviews, you’ll likely be asked about analytics, but the discussion will be more theoretical. You’ll propose a solution to a problem, and supply the metrics you’ll use to investigate or solve it. You may or may not be required to write a SQL query to get those metrics.
We’ll start with an example product metrics case study question:
Let’s say you work for a social media company that has just done a launch in a new city. Looking at weekly metrics, you see a slow decrease in the average number of comments per user from January to March in this city.
The company has been consistently growing new users in the city from January to March.
What are some reasons why the average number of comments per user would be decreasing and what metrics would you look into?
Step 1: Ask Clarifying Questions Specific to the Case
Hint: This question is very vague. It’s all hypothetical, so we don’t know very much about users, what the product is, and how people might be interacting. Be sure you ask questions upfront about the product.
Answer: Before I jump into an answer, I’d like to ask a few questions:
- Who uses this social network? How do they interact with each other?
- Have there been any performance issues that might be causing the problem?
- What are the goals of this particular launch?
- Have there been any changes to the comment features in recent weeks?
For the sake of this example, let’s say we learn that it’s a social network similar to Facebook with a young audience, and the goals of the launch are to grow the user base. Also, there have been no performance issues and the commenting feature hasn’t been changed since launch.
Step 2: Use the Case Question to Make Assumptions
Hint: Look for clues in the question. For example, this case gives you a metric, “average number of comments per user.” Consider if the clue might be helpful in your solution. But be careful, sometimes questions are designed to throw you off track.
Answer: From the question, we can hypothesize a little bit. For example, we know that user count is increasing linearly. That means two things:
- The decreasing comments issue isn’t a result of a declining user base.
- The cause isn’t an overall loss of traffic on the platform.
We can also model out the data to help us get a better picture of the average number of comments per user metric:
- January: 10000 users, 30000 comments, 3 comments/user
- February: 20000 users, 50000 comments, 2.5 comments/user
- March: 30000 users, 60000 comments, 2 comments/user
One thing to note: Although this is an interesting metric, I’m not sure if it will help us solve this question. For one, average comments per user doesn’t account for churn. We might assume that during the three-month period users are churning off the platform. Let’s say the churn rate is 25% in January, 20% in February and 15% in March.
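The modeled numbers above can be checked with a few lines of Python. The churn adjustment here (active users = total users minus the churned share) is a simplifying assumption built on the churn rates assumed in the text:

```python
# Modeled monthly figures from the answer above; churn rates are the
# assumed values from the text, and the adjustment is a simplification.
months = {
    "January":  {"users": 10_000, "comments": 30_000, "churn": 0.25},
    "February": {"users": 20_000, "comments": 50_000, "churn": 0.20},
    "March":    {"users": 30_000, "comments": 60_000, "churn": 0.15},
}

# Naive metric: comments divided by every signed-up user.
naive = {m: d["comments"] / d["users"] for m, d in months.items()}

# Churn-adjusted metric: divide by an estimate of users still active.
adjusted = {m: d["comments"] / (d["users"] * (1 - d["churn"]))
            for m, d in months.items()}

print(naive)     # {'January': 3.0, 'February': 2.5, 'March': 2.0}
print(adjusted)  # January rises to 4.0 once churned users are excluded
```

Even this crude adjustment shows why the headline metric can mislead: the per-active-user figure moves very differently from the naive average.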
Step 3: Make a Hypothesis About the Data
Hint: Don’t worry too much about making a correct hypothesis. Instead, interviewers want to get a sense of your product intuition and see that you’re on the right track. Also, be prepared to measure your hypothesis.
Answer: I would say that average comments per user isn’t a great metric to use, because it doesn’t reveal insights into what’s really causing this issue.
That’s because it doesn’t account for active users, the users who are actually commenting. Better metrics to investigate would be retained users and monthly active users.
What I suspect is causing the issue is that active users are commenting frequently and are responsible for the increase in comments month-to-month. New users, on the other hand, aren’t as engaged and aren’t commenting as often.
Step 4: Provide Metrics and Data Analysis
Hint: Within your solution, include key metrics that you’d like to investigate that will help you measure success.
Answer: I’d say there are a few ways we could investigate the cause of this problem, but the one I’d be most interested in would be the engagement of monthly active users.
If the growth in comments is coming from active users, that would help us understand how we’re doing at retaining users. Plus, it will also show if new users are less engaged and commenting less frequently.
One way that we could dig into this would be to segment users by their onboarding date, which would help us to visualize engagement and see how engaged some of our longest-retained users are.
If engagement of new users is the issue, that will give us some options in terms of strategies for addressing the problem. For example, we could test new onboarding or commenting features designed to generate engagement.
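The segmentation idea above can be sketched in a few lines, assuming made-up activity rows: cohort users by onboarding month and compare comments per user across cohorts.

```python
# A minimal sketch of cohorting by onboarding date; all data is invented
# purely to illustrate the comparison across cohorts.
from collections import defaultdict

# (user_id, onboarding_month, comments_this_month) -- hypothetical rows
activity = [
    (1, "Jan", 9), (2, "Jan", 7), (3, "Jan", 8),   # long-retained, engaged
    (4, "Feb", 4), (5, "Feb", 3),
    (6, "Mar", 1), (7, "Mar", 0), (8, "Mar", 1),   # new users, barely commenting
]

totals = defaultdict(lambda: [0, 0])  # cohort -> [comment_sum, user_count]
for _, cohort, comments in activity:
    totals[cohort][0] += comments
    totals[cohort][1] += 1

comments_per_user = {c: s / n for c, (s, n) in totals.items()}
print(comments_per_user)  # {'Jan': 8.0, 'Feb': 3.5, 'Mar': 0.666...}
```

A steep drop-off across cohorts like this would point at low new-user engagement rather than churn among early users.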
Step 5: Propose a Solution for the Case Question
Hint: In the majority of cases, your initial assumptions might be incorrect, or the interviewer might throw you a curveball. Be prepared to make new hypotheses or discuss the pitfalls of your analysis.
Answer: If the cause wasn’t due to a lack of engagement among new users, then I’d want to investigate active users. One potential cause would be active users commenting less. In that case, we’d know that our earliest users were churning out, and that engagement among new users was potentially growing.
Again, I think we’d want to focus on user engagement since the onboarding date. That would help us understand if we were seeing higher levels of churn among active users, and we could start to identify some solutions there.
Tip: Use a Framework to Solve Data Analytics Case Questions
Analytics case questions can be challenging, but they’re much more challenging if you don’t use a framework. Without one, it’s easy to get lost in your answer, get stuck, and lose the confidence of your interviewer. Find a helpful framework for data analytics questions in our data science course.
Once you have the framework down, what’s the best way to practice? Mock interviews with coaches are very effective, as you’ll get feedback and helpful tips as you answer.
Finally, if you’re looking for sample data analytics case questions and other types of interview questions, see our guide on the top data analyst interview questions.

10 Real World Data Science Case Studies Projects with Example
Top 10 Data Science Case Studies Projects with Examples and Solutions in Python to inspire your data science learning in 2021. Last Updated: 02 Feb 2023
Data science has been a trending buzzword in recent times. With wide applications in sectors like healthcare, education, retail, transportation, media, and banking, data science is at the core of pretty much every industry out there. The possibilities are endless: fraud analysis in the finance sector or personalized recommendations for eCommerce businesses. We have developed ten exciting data science case studies to explain how data science is leveraged across various industries to make smarter decisions and develop innovative, personalized products tailored to specific customers.


So, without much ado, let's get started with data science business case studies !
1) Walmart

With humble beginnings as a simple discount retailer, today Walmart operates 10,500 stores and clubs in 24 countries, plus eCommerce websites, employing around 2.2 million people around the globe. For the fiscal year ended January 31, 2021, Walmart’s total revenue was $559 billion, a growth of $35 billion driven by the expansion of its eCommerce sector. Walmart is a data-driven company that works on the principle of ‘Everyday low cost’ for its consumers. To achieve this goal, it depends heavily on the advances of its data science and analytics department, also known as Walmart Labs, for research and development. Walmart is home to the world’s largest private cloud, which can manage 2.5 petabytes of data every hour! To analyze this humongous amount of data, Walmart created ‘Data Café,’ a state-of-the-art analytics hub located within its Bentonville, Arkansas headquarters. The Walmart Labs team invests heavily in building and managing technologies like cloud, data, DevOps, infrastructure, and security.

Walmart is experiencing massive digital growth as the world’s largest retailer. Walmart has been leveraging big data and advances in data science to build solutions that enhance, optimize, and customize the shopping experience and serve its customers better. At Walmart Labs, data scientists are focused on creating data-driven solutions that power the efficiency and effectiveness of complex supply chain management processes. Here are some of the applications of data science at Walmart:
i) Personalized Customer Shopping Experience
Walmart analyzes customer preferences and shopping patterns to optimize the stocking and display of merchandise in its stores. Analysis of big data also helps it understand sales of new items, decide which products to discontinue, and track the performance of brands.
ii) Order Sourcing and On-Time Delivery Promise
Millions of customers view items on Walmart.com, and Walmart provides each customer a real-time estimated delivery date for the items purchased. Walmart runs a backend algorithm that estimates this based on the distance between the customer and the fulfillment center, inventory levels, and shipping methods available. The supply chain management system determines the optimum fulfillment center based on distance and inventory levels for every order. It also has to decide on the shipping method to minimize transportation costs while meeting the promised delivery date.
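As a toy illustration of the sourcing logic described above (Walmart’s real system is far more involved, and every name and number here is invented): pick the nearest center with stock, then the cheapest shipping option that still meets the promised date.

```python
# A toy version of the order-sourcing choice described above; all centers,
# distances, costs, and transit times are invented for illustration.
def choose_fulfillment(order_qty, centers, promise_days):
    """centers: list of (name, distance_km, stock, shipping_options),
    where shipping_options is a list of (cost, transit_days)."""
    in_stock = [c for c in centers if c[2] >= order_qty]       # enough inventory
    name, _, _, options = min(in_stock, key=lambda c: c[1])    # nearest such center
    on_time = [o for o in options if o[1] <= promise_days]     # meets promised date
    cost, days = min(on_time)                                  # cheapest on-time option
    return name, cost, days

centers = [
    ("Dallas",  120, 5, [(9.0, 2), (4.0, 5)]),
    ("Chicago", 400, 9, [(8.0, 3)]),
]
print(choose_fulfillment(order_qty=3, centers=centers, promise_days=3))
# ('Dallas', 9.0, 2): nearest center has stock, but the cheap 5-day option misses the date
```

Note the trade-off the text describes: the slow option is cheaper, but the promised delivery date forces the costlier shipping method.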
iii) Packing Optimization
Box recommendation is a daily occurrence in the shipping of items in retail and eCommerce. Whenever the items of an order, or of multiple orders placed by the same customer, are picked from the shelf and are ready for packing, Walmart’s recommender system picks the best-sized box that holds all the ordered items with the least wasted in-box space, within a fixed amount of time. This is the classic Bin Packing Problem, an NP-hard problem familiar to data scientists.
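Since bin packing is NP-hard, production systems rely on heuristics. A classic one, first-fit decreasing, can be sketched in a few lines; the item sizes and capacity below are illustrative, not Walmart’s actual parameters.

```python
# First-fit decreasing heuristic for bin packing: sort items largest-first,
# place each into the first bin with room, opening new bins as needed.
def first_fit_decreasing(items, bin_capacity):
    """Pack item sizes into bins of the given capacity; returns the bins."""
    bins = []  # each bin is a list of item sizes
    for item in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + item <= bin_capacity:  # first bin with room
                b.append(item)
                break
        else:
            bins.append([item])                # open a new bin
    return bins

packed = first_fit_decreasing([4, 8, 1, 4, 2, 1], bin_capacity=10)
print(packed)  # [[8, 2], [4, 4, 1, 1]]
```

The heuristic isn’t guaranteed optimal, but it runs fast and is provably close to the optimum, which is the kind of trade-off a time-boxed box recommender needs.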
Here is a link to a sales prediction project to help you understand the applications of data science in the real world. The Walmart Sales Forecasting Project uses historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and the goal is to build a predictive model that projects the sales for each department in each store. You can also try your hand at the Inventory Demand Forecasting Data Science Project to develop a machine learning model that forecasts inventory demand from historical sales data.
2) Amazon

Amazon is an American multinational technology company based in Seattle, USA. It started as an online bookseller, but today it focuses on eCommerce, cloud computing, digital streaming, and artificial intelligence. It hosts an estimated 1,000,000,000 gigabytes of data across more than 1,400,000 servers. Through its constant innovation in data science and big data, Amazon is always ahead in understanding its customers. Here are a few data science applications at Amazon:
i) Recommendation Systems
Data science models help Amazon understand customers’ needs and recommend products before a customer even searches for them; these models use collaborative filtering. Amazon uses data from 152 million customer purchases to help users decide which products to buy. The company generates 35% of its annual sales through its recommendation-based system (RBS).
Here is a Recommender System Project to help you build a recommendation system using collaborative filtering.
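To make the idea concrete, here is a toy user-based collaborative filter, in no way Amazon’s actual system: it recommends the items bought by the most similar user, with similarity measured on purchase sets. All users and products are invented.

```python
# Toy user-based collaborative filtering on purchase sets; purely
# illustrative, with invented users and products.
def jaccard(a, b):
    """Similarity between two purchase sets (intersection over union)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target, purchases):
    """Suggest items the most similar other user bought that target hasn't."""
    others = {u: s for u, s in purchases.items() if u != target}
    best = max(others, key=lambda u: jaccard(purchases[target], others[u]))
    return sorted(purchases[best] - purchases[target])

purchases = {  # hypothetical purchase histories
    "alice": {"book", "lamp", "mug"},
    "bob":   {"book", "lamp", "desk"},
    "carol": {"sofa"},
}
print(recommend("alice", purchases))  # ['desk']
```

Real systems replace the exhaustive similarity scan with precomputed item-item similarities and implicit-feedback models, but the core "similar users bought this" logic is the same.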
ii) Retail Price Optimization
Amazon’s product prices are optimized by a predictive model that determines the best price so that users do not refuse to buy based on price. The model weighs the customer’s likelihood of purchasing the product at a given price and how that price will affect future buying patterns. The price of a product is determined according to your activity on the website, competitors’ pricing, product availability, item preferences, order history, expected profit margin, and other factors.
Check Out this Retail Price Optimization Project to build a Dynamic Pricing Model.
iii) Fraud Detection
Being a significant eCommerce business, Amazon remains at high risk of retail fraud. As a preemptive measure, the company collects historical and real-time data for every order and uses machine learning algorithms to find transactions with a higher probability of being fraudulent. This proactive measure has helped the company restrict customers with an excessive number of product returns.
You can look at this Credit Card Fraud Detection Project to implement a fraud detection model to classify fraudulent credit card transactions.

3) Netflix

Netflix started as a DVD rental service in 1997 and has since expanded into the streaming business. Headquartered in Los Gatos, California, Netflix is the largest content streaming company in the world. Currently, Netflix has over 208 million paid subscribers worldwide, and with streaming supported on thousands of smart devices, it logs around 3 billion hours watched every month. The secret to this massive growth and popularity is Netflix’s advanced use of data analytics and recommendation systems to provide personalized, relevant content recommendations to its users. Netflix collects data from over 100 billion events every day. Here are a few examples of how data science is applied at Netflix:
i) Personalized Recommendation System
Netflix uses over 1,300 recommendation clusters based on consumer viewing preferences to provide a personalized experience. The data Netflix collects from its users includes viewing time, keyword searches, and metadata related to content abandonment, such as pauses, rewinds, and rewatches. Using this data, Netflix can predict what a viewer is likely to watch and give each user a personalized watchlist. Some of the algorithms used by the Netflix recommendation system are Personalized Video Ranking, the Trending Now ranker, and the Continue Watching ranker.
ii) Content Development using Data Analytics
Netflix uses data science to analyze the behavior and patterns of its users to recognize themes and categories that the masses prefer to watch. This data is used to produce shows like The Umbrella Academy, Orange Is the New Black, and The Queen’s Gambit. Such shows may seem like huge risks, but they are greenlit based on data analytics that assured Netflix they would succeed with its audience. Data analytics is helping Netflix come up with content that its viewers want to watch even before they know they want to watch it.
iii) Marketing Analytics for Campaigns
Netflix uses data analytics to find the right time to launch shows and ad campaigns for maximum impact on the target audience. Marketing analytics also helps tailor different trailers and thumbnails to different groups of viewers. For example, the House of Cards Season 5 trailer with a giant American flag was launched during the American presidential elections, as it would resonate well with the audience.
Here is a Customer Segmentation Project using association rule mining to understand the primary grouping of customers based on various parameters.
4) Spotify

In a world where purchasing music is a thing of the past and streaming is the current trend, Spotify has emerged as one of the most popular streaming platforms. With 320 million monthly users, around 4 billion playlists, and approximately 2 million podcasts, Spotify leads the pack among well-known streaming platforms like Apple Music, Wynk, Songza, and Amazon Music. The success of Spotify has depended largely on data analytics: by analyzing massive volumes of listener data, Spotify provides real-time, personalized services to its listeners. Most of Spotify’s revenue comes from paid premium subscriptions. Here are some of the data science models used by Spotify to provide enhanced services to its listeners:
i) Personalization of Content using Recommendation Systems
Spotify uses BaRT (‘Bandits for Recommendations as Treatments’) to generate music recommendations for its listeners in real time. BaRT ignores any song a user listens to for less than 30 seconds, and the model is retrained every day to provide updated recommendations. A patent granted to Spotify covers an AI application that identifies a user’s musical tastes based on audio signals and speech attributes like gender, age, and accent to make better music recommendations.
Spotify creates daily playlists for its listeners based on their taste profiles, called ‘Daily Mixes,’ which contain songs the user has added to playlists or songs by artists the user has included in playlists. They also feature new artists and songs the user might be unfamiliar with but that might fit the playlist. Similar are the weekly ‘Release Radar’ playlists, which contain newly released songs by artists the listener follows or has liked before.
ii) Targeted Marketing through Customer Segmentation
Beyond enhancing personalized song recommendations, Spotify uses this massive dataset for targeted ad campaigns and personalized service recommendations for its users. Spotify uses ML models to analyze listener behavior and group listeners by music preferences, age, gender, ethnicity, etc. These insights help create ad campaigns for specific target audiences. One of its well-known campaigns, the meme-inspired ads aimed at potential target customers, was a huge success globally.
iii) CNN's for Classification of Songs and Audio Tracks
Spotify builds audio models to evaluate songs and tracks, which helps develop better playlists and recommendations for its users. These models allow Spotify to filter new tracks based on their lyrics and rhythms and recommend them to users who like similar tracks (collaborative filtering). Spotify also uses NLP (natural language processing) to scan articles and blogs and analyze the words used to describe songs and artists. These analytical insights help group and identify similar artists and songs and can be leveraged to build playlists.
Here is a Music Recommender System Project for you to start learning. We have listed another music recommendations dataset for you to use in your projects: Dataset1 . You can use this dataset of Spotify metadata to classify songs based on artist, mood, and liveliness. Plot histograms and heatmaps to get a better understanding of the dataset, then use algorithms like logistic regression and SVM for classification, and principal component analysis for dimensionality reduction, to generate valuable insights from it.
5) Airbnb

Airbnb was born in 2007 in San Francisco and has since grown to 4 million hosts and 5.6 million listings worldwide, welcoming more than 1 billion guest arrivals in almost every country across the globe. Airbnb is active in every country on the planet except Iran, Sudan, Syria, and North Korea, which is around 97.95% of the world. Using data as the voice of its customers, Airbnb draws on its large volume of customer reviews and host inputs to understand trends across communities, rate user experiences, and make informed decisions that build a better business model. The data scientists at Airbnb are developing exciting new solutions to boost the business and find the best match between guests and hosts for a supreme customer experience. Airbnb’s data servers serve approximately 10 million requests a day and process around one million search queries.
i) Recommendation Systems and Search Ranking Algorithms
Airbnb helps people find ‘local experiences’ in a place with the help of search algorithms that make searches and listings precise. Airbnb uses a ‘listing quality score’ that draws on proximity to the searched location and previous guest reviews to rank homes. Airbnb uses deep neural networks to build models that take the guest’s earlier stays and area information into account to find a perfect match. The search algorithms are optimized based on guest and host preferences, rankings, pricing, and availability to understand users’ needs and provide the best match possible.
ii) Natural Language Processing for Review Analysis
Airbnb characterizes data as the voice of its customers. Customer and host reviews give direct insight into the experience, and star ratings alone cannot capture it quantitatively. Hence Airbnb uses natural language processing to understand reviews and the sentiment behind them. The NLP models are developed using convolutional neural networks.
Practice this Sentiment Analysis Project for analyzing product reviews to understand the basic concepts of natural language processing.
iii) Smart Pricing using Predictive Analytics
Many in the Airbnb host community use the service as supplementary income. The vacation homes and guest houses rented to customers raise local community earnings, as Airbnb guests stay 2.4 times longer and spend approximately 2.3 times the money compared to hotel guests, a significant positive impact on the local neighborhood. Airbnb uses predictive analytics to predict listing prices and help hosts set a competitive and optimal price. The overall profitability of an Airbnb host depends on factors like the time invested by the host and responsiveness to changing demand across seasons. The factors that feed real-time smart pricing are the location of the listing, proximity to transport options, season, and amenities available in the neighborhood.
Here is a Price Prediction Project to help you understand the concept of predictive analysis.
6) Uber

Uber is the biggest taxi service provider in the world. As of December 2018, Uber had 91 million monthly active consumers and 3.8 million drivers, completing 14 million trips each day. Uber uses data analytics and big-data-driven technologies to optimize its business processes and provide enhanced customer service. The data science team at Uber has been constantly exploring futuristic technologies to provide better service. Machine learning and data analytics help Uber make data-driven decisions that enable benefits like ride-sharing, dynamic price surges, better customer support, and demand forecasting. Here are some of the data-science-driven products used by Uber:
i) Dynamic Pricing for Price Surges and Demand Forecasting
Uber’s prices change at peak hours based on demand. Uber uses surge pricing to encourage more cab drivers to sign up with the company and meet passenger demand. When prices increase, both the driver and the passenger are informed about the surge. Uber uses a patented predictive model for price surging called ‘Geosurge,’ based on the demand for rides and the location.
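The mechanics of demand-based surging can be illustrated with a toy multiplier; the ratio rule, cap, and numbers below are invented and bear no relation to the patented Geosurge model.

```python
# Toy surge rule: scale price with the demand/supply ratio, floored at the
# base fare and capped. All thresholds are invented for illustration.
def surge_multiplier(ride_requests, available_drivers, cap=3.0):
    """Return a price multiplier from current demand and supply."""
    if available_drivers == 0:
        return cap                        # no supply at all: max surge
    ratio = ride_requests / available_drivers
    return min(max(1.0, ratio), cap)      # never below base fare, never above cap

print(surge_multiplier(50, 100))  # 1.0: supply exceeds demand, base fare
print(surge_multiplier(90, 60))   # 1.5: mild surge
print(surge_multiplier(180, 60))  # 3.0: demand spike hits the cap
```

The cap mirrors a real design concern: unbounded multipliers maximize short-term revenue but damage rider trust, so surge systems bound the multiplier.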
ii) One-Click Chat
Uber has developed a machine learning and natural language processing solution called one-click chat, or OCC, for coordination between drivers and riders. This feature anticipates responses to commonly asked questions, making it easy for drivers to respond to customer messages with the click of just one button. One-click chat is built on Uber’s machine learning platform, Michelangelo, to perform NLP on rider chat messages and generate appropriate responses.
iii) Customer Retention
Failure to meet customer demand for cabs could lead users to opt for other services. Uber uses machine learning models to bridge this demand-supply gap. By using prediction models to forecast demand in any location, Uber retains its customers. Uber also uses a tier-based reward system, which segments customers into levels based on usage: the higher the level a user achieves, the better the perks. Uber also provides personalized destination suggestions based on the user's history and frequently traveled destinations.
You can take a look at this Python Chatbot Project and build a simple chatbot application to better understand the techniques used for natural language processing. You can also practice the workings of a demand forecasting model with this project using time series analysis, or look at this project, which uses time series forecasting and clustering on a dataset containing geospatial data to forecast customer demand for Ola rides.
7) LinkedIn
LinkedIn is the largest professional social networking site with nearly 800 million members in more than 200 countries worldwide. Almost 40% of the users access LinkedIn daily, clocking around 1 billion interactions per month. The data science team at LinkedIn works with this massive pool of data to generate insights to build strategies, apply algorithms and statistical inferences to optimize engineering solutions, and help the company achieve its goals. Here are some of the products developed by data scientists at LinkedIn:
i) LinkedIn Recruiter Implements Search Algorithms and Recommendation Systems
LinkedIn Recruiter helps recruiters build and manage a talent pool to optimize the chances of hiring candidates successfully. This sophisticated product is built on search and recommendation engines. LinkedIn Recruiter handles complex queries and filters on a constantly growing, large dataset, and the results delivered have to be relevant and specific. The initial search model was based on linear regression but was eventually upgraded to gradient boosted decision trees to capture non-linear correlations in the dataset. In addition to these models, LinkedIn Recruiter also uses a Generalized Linear Mixed model to improve prediction quality and deliver personalized results.
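To see why boosted trees help here, the sketch below trains a boosted-tree ranker on synthetic candidate data where relevance depends on a non-linear AND of two match scores, something a single linear model cannot represent. All feature names and numbers are invented:

```python
# Toy recruiter-search ranker: gradient boosted trees on synthetic
# candidate features, where relevance is a non-linear AND of two scores.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([
    rng.uniform(0, 1, n),  # skill-match score
    rng.uniform(0, 1, n),  # seniority-match score
])
# A candidate is relevant only when BOTH scores are high
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.5)).astype(int)

ranker = GradientBoostingClassifier(random_state=0).fit(X, y)
candidates = np.array([[0.9, 0.8], [0.2, 0.9], [0.7, 0.1]])
scores = ranker.predict_proba(candidates)[:, 1]
order = np.argsort(-scores)  # index of the best candidate first
```

In production, scores like these would be combined with many more features and evaluated with ranking metrics such as NDCG.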
ii) Recommendation Systems Personalized for News Feed
The LinkedIn news feed is the heart and soul of the professional community. A member's newsfeed is a place to discover conversations among connections, career news, posts, suggestions, photos, and videos. Every time a member visits LinkedIn, machine learning algorithms identify the best exchanges to be displayed on the feed by sorting through posts and ranking the most relevant results on top. The algorithms help LinkedIn understand member preferences and help provide personalized news feeds. The algorithms used include logistic regression, gradient boosted decision trees and neural networks for recommendation systems.
iii) CNNs to Detect Inappropriate Content
Providing a professional space where people can trust and express themselves safely has been a critical goal at LinkedIn. LinkedIn has invested heavily in building solutions to detect fake accounts and abusive behavior on its platform. Any form of spam, harassment, or inappropriate content is immediately flagged and taken down; these can range from profanity to advertisements for illegal services. LinkedIn uses a convolutional neural network-based machine learning model. This classifier trains on a dataset containing accounts labeled as either "inappropriate" or "appropriate." The inappropriate list consists of accounts containing "blocklisted" phrases or words and a small portion of manually reviewed accounts reported by the user community.
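Since LinkedIn's CNN is proprietary, the sketch below uses a simple bag-of-words classifier as a stand-in to show the labeled "appropriate"/"inappropriate" training setup. The phrases and labels are invented:

```python
# Toy "appropriate vs. inappropriate" text classifier. The phrases and
# labels are invented; a bag-of-words model stands in for LinkedIn's CNN.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great networking event", "buy followers cheap now",
         "hiring a data engineer", "cheap pills buy now",
         "conference talk slides", "followers for sale cheap"]
labels = [0, 1, 0, 1, 0, 1]  # 1 = inappropriate

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
pred = clf.predict(["buy cheap followers", "sharing my conference slides"])
```

A production classifier would train on millions of labeled accounts and use learned text representations rather than raw word counts.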
Here is a Text Classification Project to help you understand NLP basics for text classification. You can find a news recommendation system dataset to help you build a personalized news recommender system. You can also use this dataset to build a classifier using logistic regression, Naive Bayes, or Neural networks to classify toxic comments.
Pfizer is a multinational pharmaceutical company headquartered in New York, USA, and one of the largest pharmaceutical companies globally, known for developing a wide range of medicines and vaccines in disciplines like immunology, oncology, cardiology, and neurology. Pfizer became a household name in 2020 when it was the first company to receive FDA emergency use authorization for a COVID-19 vaccine. In early November 2021, the CDC approved the Pfizer vaccine for kids aged 5 to 11. Pfizer has been using machine learning and artificial intelligence to develop drugs and streamline trials, which played a massive role in developing and deploying the COVID-19 vaccine. Here are a few applications of data science used by Pfizer:
i) Identifying Patients for Clinical Trials
Artificial intelligence and machine learning are used to streamline and optimize clinical trials and increase their efficiency. Natural language processing and exploratory data analysis of patient records can help identify suitable patients for clinical trials, including patients with distinct symptoms. They can also help examine interactions of potential trial members' specific biomarkers and predict drug interactions and side effects, which helps avoid complications. Pfizer's AI implementation helped rapidly identify signals within the noise of millions of data points across its 44,000-candidate COVID-19 clinical trial.
ii) Supply Chain and Manufacturing
Data science and machine learning techniques help pharmaceutical companies better forecast demand for vaccines and drugs and distribute them efficiently. Machine learning models can help identify efficient supply systems by automating and optimizing production steps, which will make it possible to supply drugs customized to small pools of patients in specific gene pools. Pfizer uses machine learning to predict the maintenance cost of the equipment it uses; predictive maintenance using AI is the next big step for pharmaceutical companies looking to reduce costs.
iii) Drug Development
Computer simulations of proteins, tests of their interactions, and yield analysis help researchers develop and test drugs more efficiently. In 2016, Watson Health and Pfizer announced a collaboration to utilize IBM Watson for Drug Discovery to help accelerate Pfizer's research in immuno-oncology, an approach to cancer treatment that uses the body's immune system to help fight cancer. Deep learning models have recently been used for bioactivity and synthesis prediction for drugs and vaccines, in addition to molecular design. Deep learning has been a revolutionary technique for drug discovery as it factors in everything from new applications of medications to possible toxic reactions, which can save millions in drug trials.
You can create a machine learning model to predict molecular activity to help design medicine using this dataset. You may build a CNN or a deep neural network for this task.
Shell is a global group of energy and petrochemical companies with over 80,000 employees in around 70 countries. Shell uses advanced technologies and innovations to help build a sustainable energy future, and it is going through a significant transition, aiming to become a clean energy company by 2050 as the world needs more and cleaner energy solutions. This requires substantial changes in the way energy is used. Digital technologies, including AI and machine learning, play an essential role in this transformation, enabling more efficient exploration and energy production, more reliable manufacturing, more nimble trading, and a personalized customer experience. Using AI across the organization will help Shell achieve this goal and stay competitive in the market. Here are a few applications of AI and data science used in the petrochemical industry:
i) Precision Drilling
Shell is involved in the entire oil and gas supply chain, from extracting hydrocarbons to refining fuel and retailing it to customers. Recently, Shell has adopted reinforcement learning to control the drilling equipment used in extraction. Reinforcement learning works on a reward system based on the outcome of the AI model's actions. The algorithm is designed to guide the drills as they move through the subsurface, based on historical data from drilling records, including the size of drill bits, temperatures, pressures, and knowledge of seismic activity. This model helps the human operator understand the environment better, leading to better and faster results with minimal damage to the machinery used.
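The reward-based loop can be illustrated with a toy Q-learning agent that learns to steer toward a target position. The five-state world, actions, and rewards below are invented and vastly simpler than a real drilling controller:

```python
# Toy Q-learning sketch: an agent adjusts position step by step and is
# rewarded for reaching a target. Invented, vastly simplified example.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2         # positions 0..4; actions: left, right
Q = np.zeros((n_states, n_actions))
alpha, gamma, target = 0.5, 0.9, 4

for _ in range(2000):
    s = rng.integers(0, n_states)
    a = rng.integers(0, n_actions)                    # explore at random
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r = 1.0 if s2 == target else 0.0                  # reward at the target
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

policy = Q.argmax(axis=1)  # learned action for each position
```

The learned policy moves right toward the rewarded position; a real controller would work over a far richer state (sensor readings, geology) and action space.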
ii) Efficient Charging Terminals
Due to climate changes, governments have encouraged people to switch to electric vehicles to reduce carbon dioxide emissions. However, the lack of public charging terminals has deterred people from switching to electric cars. Shell uses AI to monitor and predict the demand for terminals to provide efficient supply. Multiple vehicles charging from a single terminal may create a considerable grid load, and predictions on demand can help make this process more efficient.
iii) Monitoring Service and Charging Stations
Another Shell initiative, trialed in Thailand and Singapore, is the use of computer vision cameras to watch for potentially hazardous activities, like lighting cigarettes in the vicinity of the pumps while refueling. The model processes the content of captured images, then labels and classifies it; the algorithm can then alert the staff and hence reduce the risk of fires. The model can be further trained to detect rash driving or thefts in the future.
Here is a project to help you understand multiclass image classification. You can use the Hourly Energy Consumption Dataset to build an energy consumption prediction model. You can use time series with XGBoost to develop your model.
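As a hedged sketch of the lag-feature approach to consumption forecasting, the code below forecasts synthetic hourly load, with scikit-learn's GradientBoostingRegressor standing in for XGBoost. The lags and the synthetic daily cycle are illustrative choices:

```python
# Lag-feature forecasting sketch on synthetic hourly consumption.
# GradientBoostingRegressor stands in for XGBoost here.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
hours = np.arange(24 * 60)                 # 60 days of hourly timestamps
load = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)

lags = [1, 24]                             # last hour, same hour yesterday
X = np.column_stack([np.roll(load, lag) for lag in lags])[max(lags):]
y = load[max(lags):]

split = -48                                # hold out the last two days
model = GradientBoostingRegressor(random_state=0).fit(X[:split], y[:split])
mae = np.abs(model.predict(X[split:]) - y[split:]).mean()
```

On the real Hourly Energy Consumption dataset you would add calendar features (hour, weekday, holiday) alongside the lags.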
Zomato was founded in 2010 and is currently one of the most well-known food tech companies. Zomato offers services like restaurant discovery, home delivery, online table reservation, and online payments for dining. Zomato partners with restaurants to provide tools to acquire more customers while also providing delivery services and easy procurement of ingredients and kitchen supplies. Currently, Zomato has over 2 lakh (200,000) restaurant partners and around 1 lakh (100,000) delivery partners, and it has completed over 10 crore (100 million) delivery orders to date. Zomato uses ML and AI to boost its business growth, drawing on the massive amount of data collected over the years from food orders and user consumption patterns. Here are a few applications developed by the data scientists at Zomato:
i) Personalized Recommendation System for Homepage
Zomato uses data analytics to create personalized homepages for its users. Zomato uses data science to provide order personalization, like giving recommendations to the customers for specific cuisines, locations, prices, brands, etc. Restaurant recommendations are made based on a customer's past purchases, browsing history, and what other similar customers in the vicinity are ordering. This personalized recommendation system has led to a 15% improvement in order conversions and click-through rates for Zomato.
You can use the Restaurant Recommendation Dataset to build a restaurant recommendation system to predict what restaurants customers are most likely to order from, given the customer location, restaurant information, and customer order history.
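One simple way to start on such a recommender is item-based collaborative filtering: score restaurants a customer has not yet ordered from by their cosine similarity to restaurants they have. The order-count matrix below is made up for illustration; this is not Zomato's production system:

```python
# Item-based collaborative filtering on a made-up customer x restaurant
# order-count matrix; not Zomato's production system.
import numpy as np

orders = np.array([
    [5, 3, 0, 0],   # customer 0
    [4, 0, 0, 1],   # customer 1
    [0, 0, 4, 5],   # customer 2
    [0, 1, 5, 4],   # customer 3
], dtype=float)

# Cosine similarity between restaurants (columns)
unit = orders / np.linalg.norm(orders, axis=0, keepdims=True)
sim = unit.T @ unit

def recommend(customer: int) -> int:
    """Score unordered restaurants by similarity to past orders."""
    scores = sim @ orders[customer]
    scores[orders[customer] > 0] = -np.inf   # hide already-ordered ones
    return int(scores.argmax())
```

A real system would add location, price, and browsing-history features on top of this purely order-based signal.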
ii) Analyzing Customer Sentiment
Zomato uses natural language processing and machine learning to understand customer sentiment from social media posts and customer reviews. These help the company gauge the inclination of its customer base towards the brand. Deep learning models analyze the sentiment of brand mentions on social networking sites like Twitter, Instagram, LinkedIn, and Facebook. These analytics give the company insights that help build the brand and understand the target audience.
iii) Predicting Food Preparation Time (FPT)
Food preparation time is an essential variable in the estimated delivery time of an order placed on Zomato. It depends on numerous factors, like the number of dishes ordered, the time of day, footfall in the restaurant, and the day of the week. Accurate prediction of food preparation time enables a better estimated delivery time, making delivery partners less likely to breach it. Zomato uses a bidirectional LSTM-based deep learning model that considers all these features and predicts the food preparation time for each order in real time.
Data scientists are companies' secret weapons for analyzing customer sentiment and behavior and leveraging it to drive conversion, loyalty, and profits. These 10 data science case study projects with examples and solutions show how various organizations use data science technologies to succeed and stay at the top of their field. To summarize, data science has not only accelerated the performance of companies but has also made it possible to manage and sustain that performance with ease.

Data Analysis Case Study: Learn From These Winning Data Projects
Lillian Pierson, P.E.
Got data? Great! Looking for that perfect data analysis case study to help you get started using it? You’re in the right place.
If you’ve ever struggled to decide what to do next with your data projects, to actually find meaning in the data, or even to decide what kind of data to collect, then KEEP READING…
Deep down, you know what needs to happen. You need to initiate and execute a data strategy that really moves the needle for your organization. One that produces seriously awesome business results.
But how? You're in the right place to find out.
As a data strategist who has worked with 10 percent of Fortune 100 companies, today I’m sharing with you a case study that demonstrates just how real businesses are making real wins with data analysis. In the post below, we’ll look at:
- A shining data success story;
- What went on ‘under-the-hood’ to support that successful data project; and
- The exact data technologies used by the vendor, to take this project from pure strategy to pure success
If you prefer to watch this information rather than read it, it’s captured in the video below:
Here’s the url too: https://youtu.be/xMwZObIqvLQ
An Inspirational Data Case Collection
This collection includes 21 business use cases offering nitty-gritty specifics on the processes and people that took their programs from data project to data success story.
But more on this resource a little later.
Right now, I want to share with you the winning data case collection – aka the tip of the iceberg when it comes to my signature methodology that will take your data program from ineffective to successful – from losing you money to runaway profits.
Yes, it's proven.
My signature data strategy method (of which I'm sharing a portion with you today) has already been a roaring success for my data strategy clients here at Data-Mania. Clients like Vince Lee, from the Central Bank of Malaysia…
– who used it to improve return on CBM’s investments by successfully (and strategically) employing data to predict financial distress in institutions – BEFORE handing out loans to them!
Of course, it’s not just the Central Bank of Malaysia that has used my data strategy method to achieve incredible results.
I’ve shared my signature data strategy method in training courses and workshops across the world. In doing so, I’ve been able to help such varied initiatives as a micro-lending organization in Uganda and the largest oil company in Saudi Arabia.
And I can help you, too.
3 action items you need to take.
To actually use the data analysis case study you’re about to get – you need to take 3 main steps. Those are:
- Reflect upon your organization as it is today (I left you some prompts below – to help you get started)
- Review winning data case collections (starting with the one I'm sharing here) and identify 5 that seem the most promising for your organization given its current set-up
- Assess your organization AND those 5 winning case collections. Based on that assessment, select the "QUICK WIN" data use case that offers your organization the most bang for its buck
Step 1: Reflect Upon Your Organization
Whenever you evaluate data case collections to decide if they’re a good fit for your organization, the first thing you need to do is organize your thoughts with respect to your organization as it is today.
Before moving into the data analysis case study, STOP and ANSWER THE FOLLOWING QUESTIONS – just to remind yourself:
- What is the business vision for our organization?
- What industries do we primarily support?
- What data technologies do we already have up and running, that we could use to generate even more value?
- What team members do we have on hand to support a new data project? And what are their data skillsets like?
- What type of data are we mostly looking to generate value from? Structured? Semi-Structured? Un-structured? Real-time data? Huge data sets? What are our data resources like?
Jot down some notes while you're here. Then keep them in mind as you read on to find out how one company, Humana, used its data to achieve a 28 percent increase in customer satisfaction and a 63 percent increase in employee engagement. (That's a seriously impressive outcome, right?!)
Or feel free to allow me to serve it to you via video presentation below:

Data Analysis Case Study - How Humana Scored This Major DATA WIN!
Step 2: review winning data case collections (starting with this one…).
Here we are, already at step 2. It's time for you to start reviewing winning data case collections (starting with the one I'm sharing here). Identify 5 that seem the most promising for your organization given its current set-up.
Humana’s Automated Data Analysis Case Study
That's why my series of 21 case collections is highly targeted and grouped by field. This way, you can easily find an approach that offers a useful model for your own business.
Let’s start with one to demonstrate the kind of value you can glean from these kinds of success stories.
Humana has provided health insurance to Americans for over 50 years. It is a service company focused on fulfilling the needs of its customers. A great deal of Humana’s success as a company rides on customer satisfaction, and the frontline of that battle for customers’ hearts and minds is Humana’s customer service center.
Call centers are hard to get right. A lot of emotions can arise during a customer service call, especially one relating to health and health insurance. Sometimes people are frustrated. At times, they’re upset. Also, there are times the customer service representative becomes aggravated and the overall tone and progression of the phone call goes downhill. This is of course very bad for customer satisfaction.
Humana wanted to find a way to use artificial intelligence to monitor their phone calls and help their agents do a better job connecting with their customers in order to improve customer satisfaction (and thus, customer retention rates & profits per customer ).
In light of their business need, Humana worked with a company called Cogito, which specializes in voice analytics technology.
Cogito offers a piece of AI technology called Cogito Dialogue. It’s been trained to identify certain conversational cues as a way of helping call center representatives and supervisors stay actively engaged in a call with a customer.
The AI listens to cues like the customer’s voice pitch.
If it’s rising, or if the call representative and the customer talk over each other, then the dialogue tool will send out electronic alerts to the agent during the call.
Humana fed the dialogue tool customer service data from 10,000 calls and allowed it to analyze cues such as keywords, interruptions, and pauses, and these cues were then linked with specific outcomes. For example, if the representative is receiving a particular type of cue, they are likely to get a specific customer satisfaction result.
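The cue-to-outcome linking step can be sketched as a supervised model on per-call cue counts. Everything below is synthetic, including the rule that generates the labels; it only illustrates the shape of the analysis, not Cogito's actual models:

```python
# Sketch of linking call cues to a satisfaction outcome. All data and
# the label-generating rule are synthetic; not Cogito's actual models.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
interruptions = rng.poisson(2, n)
pitch_rise = rng.uniform(0, 1, n)   # normalized rise in voice pitch
long_pauses = rng.poisson(1, n)
# Invented rule: interruptions and rising pitch hurt satisfaction
satisfied = (interruptions + 3 * pitch_rise + long_pauses
             + rng.normal(0, 1, n)) < 4

X = np.column_stack([interruptions, pitch_rise, long_pauses])
clf = LogisticRegression().fit(X, satisfied)

p_calm = clf.predict_proba([[0, 0.1, 0]])[0, 1]   # calm, flat-pitch call
p_tense = clf.predict_proba([[6, 0.9, 3]])[0, 1]  # tense, interrupted call
```

A model like this, scoring cues as the call unfolds, is what makes real-time alerts to the representative possible.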
The Outcome
Customers were happier, and customer service representatives were more engaged.
This automated solution for data analysis has now been deployed in 200 Humana call centers and the company plans to roll it out to 100 percent of its centers in the future.
The initiative was so successful, Humana has been able to focus on next steps in its data program. The company now plans to begin predicting the type of calls that are likely to go unresolved, so they can send those calls over to management before they become frustrating to the customer and customer service representative alike.
What does this mean for you and your business?
Well, if you’re looking for new ways to generate value by improving the quantity and quality of the decision support that you’re providing to your customer service personnel, then this may be a perfect example of how you can do so.
Humana’s Business Use Cases
Humana’s data analysis case study includes two key business use cases:
- Analyzing customer sentiment; and
- Suggesting actions to customer service representatives.
Analyzing Customer Sentiment
First things first, before you go ahead and collect data, you need to ask yourself who and what is involved in making things happen within the business.
In the case of Humana, the actors were:
- The health insurance system itself
- The customer, and
- The customer service representative

As you can see in the use case diagram above, the relational aspect is pretty simple. You have a customer service representative and a customer. They are both producing audio data, and that audio data is being fed into the system.
Humana focused on collecting the key data points, shown in the image below, from their customer service operations.

By collecting data about speech style, pitch, silence, stress in customers' voices, length of call, speed of customers' speech, intonation, articulation, and representatives' manner of speaking, Humana was able to analyze customer sentiment and introduce techniques for improved customer satisfaction.
Having strategically defined these data points, the Cogito technology was able to generate reports about customer sentiment during the calls.
Suggesting Actions to Customer Service Representatives
The second use case for the Humana data program follows on from the data gathered in the first case.
In Humana’s case, Cogito generated a host of call analyses and reports about key call issues.
In the second business use case, Cogito was able to suggest actions to customer service representatives in real time, making use of incoming data to help improve customer satisfaction on the spot.
The technology Humana used provided suggestions via text message to the customer service representative, offering the following types of feedback:
- The tone of voice is too tense
- The speed of speaking is high
- The customer representative and customer are speaking at the same time
These alerts allowed the Humana customer service representatives to alter their approach immediately, improving the quality of the interaction and, subsequently, the customer satisfaction.
The preconditions for success in this use case were:
- The call-related data must be collected and stored
- The AI models must be in place to generate analysis on the data points that are recorded during the calls
Evidence of success can subsequently be found in a system that offers real-time suggestions for courses of action that the customer service representative can take to improve customer satisfaction.
Thanks to this data-intensive business use case, Humana was able to increase customer satisfaction, improve customer retention rates, and drive up profits per customer.
The Technology That Supports This Data Analysis Case Study
I promised to dip into the tech side of things. This is especially for those of you who are interested in the ins and outs of how projects like this one are actually rolled out.
Here’s a little rundown of the main technologies we discovered when we investigated how Cogito runs in support of its clients like Humana.
- For cloud data management Cogito uses AWS, specifically the Athena product
- For on-premise big data management, the company used Apache HDFS – the distributed file system for storing big data
- They utilize MapReduce, for processing their data
- And Cogito also has traditional systems and relational database management systems such as PostgreSQL
- In terms of analytics and data visualization tools, Cogito makes use of Tableau
- And for its machine learning technology, these use cases required people with knowledge in Python, R, and SQL, as well as deep learning (Cogito uses the PyTorch library and the TensorFlow library)
These data science skill sets support the effective computing, deep learning, and natural language processing applications employed by Humana for this use case.
If you’re looking to hire people to help with your own data initiative, then people with those skills listed above, and with experience in these specific technologies, would be a huge help.

Step 3: Select The “Quick Win” Data Use Case
Still there? Great!
It’s time to close the loop.
Remember those notes you took before you reviewed the study? I want you to STOP here and assess: does this Humana case study seem applicable and promising as a solution, given your organization's current set-up?
YES ▶ Excellent!
Earmark it and continue exploring other winning data use cases until you've identified 5 that seem like great fits for your business's needs. Evaluate those against your organization's needs, and select the very best fit to be your "quick win" data use case. Develop your data strategy around that.
NO , Lillian – It’s not applicable. ▶ No problem.
Discard the information and continue exploring the winning data use cases we've categorized for you according to business function and industry. Save time by dialing down into the business function you know your business really needs help with now. Identify 5 winning data use cases that seem like great fits for your business's needs. Evaluate those against your organization's needs, and select the very best fit to be your "quick win" data use case. Develop your data strategy around that data use case.
What’s Next?
This post is merely a taste of the inspiration you can enjoy while evaluating which successful data use case you should use to orient your upcoming data strategy plans.
So many of my clients and students were inspired by, and learned from, the success of other real-life companies and their data projects, that I finally brought them all together into one (online) bundle o' goodness!
Maybe your business is far removed from the call center example, and you’d like to read more about data projects in a different field.
My 21 case collections in the Winning with Data product suite are geared toward finding the best route for you and your company – whether that be:
Operational improvements.
If you need to increase the efficiencies within the operations of your business (in order to spend less – and increase profits), you’re going to want to check out these case collections first.
MARKETING IMPROVEMENTS.
If you know your organization could enjoy some major windfalls by using data to improve profits generated from marketing efforts, these cases are for you.
DECISION-SUPPORT.
This is where you need to look first if the business leaders at your organization are not getting regular and reliable reporting on the key metrics they need to see in order to make the best decisions possible on behalf of the business.
Check out this section if you want to use data and data innovation to strategically improve returns within your finance department.
DATA MONETIZATION.
This is for you if you’d like to learn new ways to generate new revenue streams from data that your organization already owns.
Each of the 21 cases in this collection is supported by a business use case that provides detailed specifics about the processes, people, and technologies that made each project a success.
Winning With Data Case Collections
So, if you're ready to move your company into a new wave of success and enjoy all the accolades that come with a data initiative done well, then sign up for this value-packed digital resource and make a start!
In just a few strategic steps (and implementation to back it up), your organization could be raking in the kinds of profits of which it’d previously only dreamed.



Case Studies
Data analysis is a process
PART 1: DATA EXPLORATION
PART 2: REGRESSION ANALYSIS
PART 3: PREDICTION
PART 4: CAUSAL ANALYSIS
PART I: DATA EXPLORATION
CH01A Finding a good deal among hotels: data collection
Vienna, Austria is a popular tourist destination for business and leisure. From the hundreds of places that offer accommodation, we want to pick a hotel that is underpriced relative to its location and quality for a weekday in November 2017. Can we use data to help this decision? What kind of data would we need, and how could we get it?
This case study illustrates how to collect appropriate data from the web on multiple offers. It describes what we want from such data and what data source we would need. The data is collected by web scraping , and it results in a single data table. The case study discusses data quality from the perspective of the question to answer and how data quality is determined by the way the data was born. There is no dataset to analyze in this case study. Subsequent case studies (2A, 3A, 7A, 8A, 9B, 10B) will use the data described here to illustrate steps of data analysis that ultimately lead to answering the main question.

CH01B Comparing online and offline prices: data collection
Do online and offline prices of the same products tend to be the same? To answer that question, we need data on both the online and offline (in store) price of many products. Such data was collected as part of the Billion Prices Project (BPP; http://www.thebillionpricesproject.com), an umbrella of multiple projects that collect price data for various purposes using various methods.
This case study illustrates how to combine different data collection methods and what the challenges are with such data collection. It discusses how products were selected and how prices were measured, and what those methods imply for coverage of observations and reliability of variables. There is no dataset to analyze in this case study. Case study 6A will use the data described here to investigate whether online and offline prices tend to be the same.
CH01C Management quality: data collection
How different are firms and other organizations in terms of their management practices? Is the quality of management related to how large the firms are? Is it affected by whether the owners are the company founders or their families? To answer these, and many related, questions, we need data on management quality. Such data was collected by the World Management Survey (WMS; https://worldmanagementsurvey.org/), an international research initiative to measure the differences in management practices across organizations and countries.
This case study illustrates how to collect data by surveys . It discusses sampling and its practical issues, and how to use a set of survey questions to measure an abstract concept such as the quality of management. This case study, similarly to the other case studies in this chapter, illustrates the choices and trade-offs data collection involves, practical issues that may arise during implementation, and how all that may affect data quality. There is no dataset to analyze in this case study. Case studies 4A and 21A will use the data described here to investigate how management quality is related to firm size and how it is affected by ownership.
CH02A Finding a good deal among hotels: data preparation
Continuing with our search for a hotel that is underpriced relative to its location and quality in Vienna, we have scraped data from the web, and we’ve got a data table. But how should we start working with this data? In particular, how should we identify hotels, how should we make sure each hotel features only once in the data, and how should we select the variables we would consider for our future analysis?
This case study uses the hotels-vienna dataset to illustrate how to find problems with observations and variables. It illustrates the various types of variables . It shows how to create a tidy data table and how to deal with missing values and duplicates . It allows instructors to demonstrate the importance of data cleaning and the common steps of data wrangling . We described data collection and quality in case study 1A, and we will use the data in case studies 3A, 7A, 8A, 9B, and 10B to illustrate steps of data analysis that lead to finding good deals.
Code : Stata or R or Python or ALL . Data : hotels-vienna . Graphs : .png or .eps
CH02B Displaying immunization rates across countries
Immunization against measles is an effective way to prevent the disease and may save the lives of children. But how do various countries fare in terms of their immunization rates? In particular, how should we structure and use data from many countries and many years to analyze immunization rates across countries and years?
This short case study illustrates how to store multi-dimensional data . It uses the world-bank-immunization dataset with data from the World Development Indicators data website maintained by the World Bank to look at countries’ annual immunization rate and GDP per capita. The case study illustrates the structure of xt panel data with a cross-sectional and time series dimension (country and year), with two corresponding ID variables and two other variables (immunization rate and GDP per capita). It allows instructors to demonstrate xt panel data tables in long format and wide format . Case study 23B will use the data described here to investigate the effect of immunization on the survival chances of children.
Code : Stata or R or Python or ALL . Data : world-bank-immunization . Graphs : .png or .eps
Ch02C Identifying successful football managers
The English Premier League (EPL) is the top football (soccer) division in England. Team managers, as coaches are known in football, arguably play a very important role in the success of their teams. How can we use two separate data tables on games and managers to identify the most successful football manager in the EPL?
This case study uses the football dataset that covers all games played in the EPL and data on managers, including which team they worked at and when. We create a data table by joining two different data tables, define the measure of success as average points per game, and identify the most successful managers. This case study illustrates how to prepare data for analysis and illustrates linking data tables with different kinds of observations and common problems that can arise while doing so. It is a good example of entity resolution , and how to work with relational data . Case study 24B will use this data to uncover the effect of replacing managers of underperforming teams on subsequent team performance.
Code : Stata or R or Python or ALL . Data : football . Graphs : .png or .eps
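The join-then-aggregate logic described above can be sketched in a few lines of Python. This is a toy illustration with two made-up mini-tables (the teams, names, and dates are only placeholders), not the case study's own code:

```python
from collections import defaultdict

# Hypothetical mini-tables standing in for the football dataset
games = [
    {"team": "Arsenal", "date": "2016-08-14", "points": 3},
    {"team": "Arsenal", "date": "2016-08-20", "points": 0},
    {"team": "Chelsea", "date": "2016-08-15", "points": 3},
]
managers = [
    {"team": "Arsenal", "manager": "A. Wenger", "start": "1996-10-01", "end": "2018-05-13"},
    {"team": "Chelsea", "manager": "A. Conte", "start": "2016-07-01", "end": "2018-07-13"},
]

# Join each game to the manager whose spell at that team covers the game date
# (ISO date strings compare correctly as plain strings), then average points
points = defaultdict(list)
for g in games:
    for m in managers:
        if m["team"] == g["team"] and m["start"] <= g["date"] <= m["end"]:
            points[m["manager"]].append(g["points"])

avg_points = {mgr: sum(p) / len(p) for mgr, p in points.items()}
best = max(avg_points, key=avg_points.get)
```

Real entity resolution is harder than this sketch suggests: team and manager names must first be standardized so the join keys actually match.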

Ch03A Finding a good deal among hotels: data exploration
Further continuing our search for a good deal (a hotel in Vienna that is underpriced for its location and quality), we’ve got a clean data table and identified the variables we want to analyze. How should we start the analysis? In particular, how should we explore the most important variables, why should we do that, and what conclusions can we draw from such exploratory analysis?
This case study uses the hotels-vienna dataset to illustrate how to describe the distribution of variables and how to use the findings to identify potential problems in the data, such as extreme values . The case study also illustrates how to make decisions about extreme values , guided by the ultimate question of the analysis. Along the way, it introduces guidelines for data visualization in general, and the design of histograms in particular. Case studies 1A and 2A describe data collection and cleaning, and we will use the data in case studies 7A, 8A, 9B, and 10B to illustrate further steps of data analysis that lead to finding good deals.
Ch03B Comparing hotel prices in Europe: Vienna vs London
How can we compare hotel markets over Europe and learn about characteristics of hotel prices? Can we visualize two distributions on one graph? What descriptive statistics would best describe each distribution and their differences? Can we visualize descriptive statistics?
This case study uses the hotels-europe dataset and selects 3-4 star hotels in Vienna and London to compare the distribution of prices for a weekday in November 2017. It illustrates the comparison of distributions and the use of histograms and density plots . It illustrates the use of some of the most important descriptive statistics for quantitative variables and their visualizations, box plots and violin plots .
Code : Stata or R or Python or ALL . Data : hotels-europe . Graphs : .png or .eps
Ch03C Measuring home team advantage in football
Is there such a thing as home team advantage in professional football (soccer)? That is, do teams that play in their home stadium tend to perform better? And how should we measure better performance?
This case study uses the football dataset, with data on the games played in the English Premier League (EPL) during the 2016/17 season. The case study shows the use of exploratory data analysis to answer a substantive question and introduces guidelines to present statistics in a good table.
Code : Stata or R or Python or ALL . Data : football . Graphs : .png or .eps
Ch03D Distributions of body height and income
Are the distributions of body height and family income well approximated by theoretical distributions? Answering this question can help characterize their distributions and provide guidance for future analysis on how to use these variables.
In this very short case study, we examine survey data collected by the Health and Retirement Study in the U.S.A. in 2014 ( height-income-distributions dataset). We show that the height of women aged 55-60 can be described by the normal distribution , whereas the income of their households is reasonably well characterized by the lognormal distribution .

Code : Stata or R or Python or ALL . Data : height-income-distributions . Graphs : .png or .eps
Ch03U1 Size distribution of Japanese cities
What is the size distribution of Japanese cities? Looking at cities with at least 150,000 inhabitants, it follows a power law.
Code : Stata or R or Python or ALL . Data : height-income-distributions .
Ch04A Management quality and firm size: describing patterns of association
Are larger companies better managed? We want to explore the association between management quality and firm size in a particular country (Mexico). To answer this question we need to define the y and x variables in this comparison. In particular, we need to assess how the variables in the dataset correspond to the abstract concepts of management quality and firm size.
This case study uses the Mexican subsample of the World Management Survey dataset ( wms-management-survey ) from 2013. It illustrates how we can measure latent variables by proxy variables in the data and uncover patterns of association between those variables. It also illustrates the concepts of conditional probability , conditional distribution , and joint distribution . The case study introduces informative ways to visualize various aspects of patterns of association, such as the stacked bar chart , the scatterplot , the bin scatter , and comparing box plots and violin plots . We have introduced the data used here in case study 1C.

Code : Stata or R or Python or ALL . Data : wms-management-survey . Graphs : .png or .eps
CH05A What likelihood of loss to expect on a stock portfolio?
Can we find out the future likelihood of a large loss on a stock portfolio based on data from the past? We choose the S&P 500 stock market index as our investment portfolio, and we define a large loss as a drop in returns of at least 5% from one day to the next. We can easily calculate the proportion of such days in the data, but we are interested in future losses, not past ones. To answer our question we need to make generalizations from our data. Such generalizations are bound to bring uncertainty, and we would like to quantify that uncertainty, too.
This case study uses the sp500 dataset that covers day-to-day returns on the S&P 500 stock market index for 11 years to illustrate how we can generalize an estimated statistic from a particular dataset to the population , or general pattern , it represents, and beyond, to the general pattern we are interested in. The case study illustrates the concept of repeated samples . It shows how to estimate the standard error by bootstrap or using a formula, and how to construct and interpret a confidence interval . It also illustrates how to think about external validity . Case study 6B will use the same data to answer a related, but slightly different question.

Code : Stata or R or Python or ALL . Data : sp500 . Graphs : .png or .eps
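The estimation idea can be sketched without the actual sp500 data. Below is a minimal Python illustration: synthetic 0/1 indicators stand in for large-loss days, and the bootstrap standard error of the estimated proportion is compared with the analytic formula before building a 95% confidence interval.

```python
import math
import random

def bootstrap_se(sample, stat, reps=500, seed=42):
    """Standard error of a statistic via bootstrap: resample with
    replacement many times and take the standard deviation of the
    statistic across the resamples."""
    rng = random.Random(seed)
    n = len(sample)
    stats = [stat([sample[rng.randrange(n)] for _ in range(n)])
             for _ in range(reps)]
    mean = sum(stats) / reps
    return math.sqrt(sum((s - mean) ** 2 for s in stats) / (reps - 1))

# Synthetic stand-in for the data: 1 = a large-loss day (>= 5% drop)
rng = random.Random(0)
losses = [1 if rng.random() < 0.02 else 0 for _ in range(2500)]

p_hat = sum(losses) / len(losses)                          # estimated proportion
se_boot = bootstrap_se(losses, lambda s: sum(s) / len(s))  # bootstrap SE
se_formula = math.sqrt(p_hat * (1 - p_hat) / len(losses))  # analytic SE
ci = (p_hat - 1.96 * se_boot, p_hat + 1.96 * se_boot)      # 95% CI
```

The two standard errors should be very close; the confidence interval quantifies the uncertainty of generalizing the sample proportion to the general pattern.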
CH06A Comparing online and offline prices: testing the difference
Do online and offline prices of the same products tend to be the same? Answering this question can help make better purchase choices, understand the business practices of retailers, and it can inform whether we can use online data in approximating offline prices for policy analysis.
This case study uses the billion-prices dataset. We examine online and offline prices of retail products in the U.S. in 2015-16. The case study illustrates how to translate a more abstract question into an inquiry about a statistic (here the average difference). It shows how to formulate a null hypothesis and an alternative hypothesis and how to carry out a hypothesis test in two ways, by calculating the t-statistic and comparing it to an appropriate critical value , or, alternatively, by using the p-value . The case study also illustrates the perils of testing multiple hypotheses and p-hacking . We have introduced the data used here in case study 1B.
Code : Stata or R or Python or ALL . Data : billion-prices . Graphs : .png or .eps
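The test itself is a short calculation. The sketch below uses a handful of made-up per-product price differences (not the billion-prices data) to compute the t-statistic for the null hypothesis of a zero average difference:

```python
import math

def t_statistic(diffs):
    """t-statistic for H0: the mean difference is zero
    (sample mean divided by its estimated standard error)."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)
    return mean / se

# Hypothetical per-product price differences (online minus offline, dollars)
diffs = [0.1, 0.5, -0.4, 0.2, -0.2, 0.1, -0.1, 0.0, 0.3, -0.3]
t = t_statistic(diffs)
```

Here |t| is well below the conventional 1.96 critical value, so with this toy sample we could not reject that online and offline prices are equal on average.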
CH06B Testing the likelihood of loss on a stock portfolio
Will our investment portfolio suffer a large loss with a higher chance than what we can accept? When we want to know the likelihood of large future losses on our portfolio, we can use the confidence interval to quantify the uncertainty from estimating it from data on past returns. But we can ask a more pointed question, too: whether our stock portfolio will suffer large future losses more often than we can accept. To answer that question we need a different procedure: testing a hypothesis.
This case study uses the sp500 dataset that covers day-to-day returns for 11 years to illustrate how we can test whether a likelihood is greater or less than a specified value. It illustrates testing proportions and how to formulate and carry out a one-sided hypothesis test . The case study is a continuation of case study 5A, using the same data.
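A one-sided test of a proportion can be sketched in a few lines; the counts below are invented for illustration, and the p-value uses the normal approximation:

```python
import math

def one_sided_proportion_test(successes, n, p0):
    """One-sided z-test of H0: p = p0 against HA: p > p0.
    Returns the z-statistic and the upper-tail p-value from the
    normal approximation."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under the null
    z = (p_hat - p0) / se
    # Upper-tail probability of a standard normal via the complementary
    # error function: P(Z > z) = erfc(z / sqrt(2)) / 2
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Do large-loss days occur more often than the 1% we can accept?
# Suppose 40 large-loss days were observed in 2500 trading days.
z, p = one_sided_proportion_test(successes=40, n=2500, p0=0.01)
```

With z around 3, the one-sided p-value is far below 5%, so in this toy example we would reject the null that large losses occur on only 1% of days.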
PART II: REGRESSION ANALYSIS
CH07A Finding a good deal among hotels with simple regression
How can we find the hotels that are underpriced relative to their distance from the city center? Continuing the previous case studies that resulted in a clean data table ready for analysis and explored the main variables, we need to uncover how hotel price is related to distance to the city center to know what price to expect at what distances. Then we can identify the hotels that are most underpriced compared to their expected price.
This case study uses the hotels-vienna dataset to illustrate regression analysis with one right-hand-side variable. It shows the use of bin scatters and lowess non-parametric regressions that reveal qualitative patterns of association. In order to find out the quantitative relationship between distance and average price, we apply simple linear regression . The case study illustrates the use of predicted values and regression residuals .
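The mechanics of the final step (regress, predict, inspect residuals) fit in a short Python sketch. The (distance, price) pairs below are made up; the most underpriced hotel is the one with the most negative residual:

```python
def ols_simple(x, y):
    """Slope and intercept of a simple linear regression by least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
           sum((xi - mx) ** 2 for xi in x)
    alpha = my - beta * mx
    return alpha, beta

# Hypothetical (distance in km, price in euros) pairs standing in for hotels
data = [(0.5, 120), (1.0, 105), (1.5, 110), (2.0, 80), (3.0, 75), (4.0, 70)]
x, y = [d for d, _ in data], [p for _, p in data]
alpha, beta = ols_simple(x, y)

predicted = [alpha + beta * xi for xi in x]           # expected price at each distance
residuals = [yi - pi for yi, pi in zip(y, predicted)]
best_deal = residuals.index(min(residuals))           # most negative residual
```

With these toy numbers the slope is negative (price falls with distance) and the fourth hotel, priced well below its predicted value, is the best deal.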
CH08A Finding a good deal among hotels with non-linear function
Continuing our search for the best hotel deals in Vienna, we would like to uncover the shape of the price-distance association to get at the best estimates of expected prices at various distances. But what’s the best way to compare prices? Should we compare their absolute values, or should we aim for a relative comparison, such as percent differences? And how can we do the latter in a regression using cross-sectional data?
This short case study again uses the hotels-vienna dataset, to illustrate linear regression analysis with the use of logarithms . It shows whether and why it may make sense to take logs of the variables in the regression, and how to estimate, interpret, and choose among level-log regressions, log-level regressions, and log-log regressions.
CH08B How is life expectancy related to the average income of a country?
People tend to live longer in richer countries. How long people live is usually measured by life expectancy; how rich a country is, is usually captured by its yearly income, measured by GDP. But should we use total GDP or GDP per capita? And what’s the shape of the pattern of association? Is the same percent difference in income related to the same difference in how long people live among richer countries and poorer countries? Finding the shape of the association helps benchmark life expectancy among countries with similar levels of income and identify countries where people tend to live especially long or especially short lives for their income.
This case study uses the worldbank-lifeexpectancy dataset based on the World Development Indicators database available at the World Bank website. It examines cross-sectional data from a single year, 2017, for 182 countries. The case study illustrates the choice between total and per capita measures (here GDP), regressions with variables in logs , and two ways to model nonlinear patterns in the framework of the linear regression: piecewise linear splines , and polynomials . It also illustrates whether and how to use weights in regression analysis, and what that choice implies for the correct interpretation of the results. The case study also shows how to use informative visualization to present the results of regressions.
Code : Stata or R or Python or ALL . Data : worldbank-lifeexpectancy . Graphs : .png or .eps
CH08C Measurement error in hotel ratings
When we search for a good deal among hotels, we care about hotel quality as well as distance to the city center. Online price comparison websites collect customer ratings and publish the average of those ratings, which can serve as a measure of quality. But some averages are based on very few ratings while others are based on hundreds or thousands of ratings. Should we be concerned about ratings coming from very few customers? In particular, what are the consequences of that feature of the data on the results of regression analysis?
This short case study again uses the hotels-vienna dataset, to illustrate the consequences of measurement error for regression analysis. In particular, it shows the effect of classical measurement error in the right-hand-side variable on the estimated slope of a simple linear regression.
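The attenuation result is easy to demonstrate by simulation. In the sketch below (synthetic data, true slope of 2), adding classical measurement error to the right-hand-side variable roughly halves the estimated slope, because the noise variance equals the signal variance:

```python
import random

def slope(x, y):
    """OLS slope of a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

rng = random.Random(1)
true_quality = [rng.gauss(0, 1) for _ in range(5000)]
price = [2.0 * q + rng.gauss(0, 1) for q in true_quality]   # true slope = 2

# Observed rating = quality + classical measurement error
# (an average based on few raters is a noisy measure of quality)
noisy_rating = [q + rng.gauss(0, 1) for q in true_quality]

b_true = slope(true_quality, price)    # close to the true slope of 2
b_noisy = slope(noisy_rating, price)   # attenuated toward zero
```

The attenuation factor is var(quality) / (var(quality) + var(error)), here about one half, so the noisy-regressor slope lands near 1 instead of 2.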
CH09A Estimating gender and age differences in earnings
Do women working in the same occupation tend to earn the same as men? And what are the differences in earnings by age? Understanding these differences may help students know what to expect when choosing a particular career.
This case study uses the cps-morg dataset, a cross-section based on the Current Population Survey (CPS) of the U.S. in 2014. It focuses on a single occupation potentially relevant for many students of data analysis, “Market research analysts and marketing specialists”. The case study illustrates how to estimate the standard error of regression coefficients and how to construct and interpret confidence intervals . It also shows how to test hypotheses about regression coefficients and the standard way of presenting regression results in tables. We will use a larger subsample of the same data in case study 10A to understand the sources of gender difference in earnings.
Code : Stata or R or Python or ALL . Data : cps-morg . Graphs : .png or .eps
CH09B How stable is the hotel price–distance to center relationship?
We have uncovered the average price - distance association among hotels in a particular city on a particular date. How generalizable is this pattern to other dates, to other cities, and to other types of accommodations?
This case study uses the hotels-europe data from Vienna, Amsterdam and Barcelona. It illustrates the various kinds of issues with external validity , first focusing on time (different dates), then space (different cities), and groups of observations (different kinds of accommodations).
CH10A Understanding the gender difference in earnings
Women earn less, on average, than men with similar qualifications. How large is that difference among employees with a graduate degree? How does that difference vary with age? And how much of the difference do characteristics of the employers and family circumstances of the employees explain? Understanding the magnitude, patterns, and causes of gender differences in earnings is important from the viewpoint of social equity as well as efficient allocation of labor.
This short case study uses the cps-morg dataset to illustrate the use of multiple regression analysis to help understand the sources of differences between groups of observations. The data is a cross-section based on the Current Population Survey (CPS) of the U.S. in 2014, and the sample is restricted to employees with a graduate degree. The case study illustrates how to estimate and interpret the results of a multiple regression . It shows how to include qualitative right-hand-side variables and interactions in the regression, how to interpret their results, and how to use visualization to present estimates of nonlinear patterns. The case study illustrates the difficulty of uncovering causal relationships from the results of multiple regression analysis using cross-sectional observational data.
CH10B Finding a good deal among hotels with multiple regression
We return to finding a good deal among hotels for the last time. We want to find the hotels that are underpriced for their quality and distance to the city center. To do so we first need to uncover expected prices at various levels of distance and quality in a way that reflects all important patterns in the data. Then we can look for the hotels that are most underpriced relative to their expected price.
This case study uses the hotels-vienna dataset to illustrate the use of multiple regression analysis for prediction within a sample and residual analysis . It uses the subsample of 3-4 star hotels for a single night in Vienna in November 2017. It illustrates the use of a nonlinear specification within a multiple regression and how to identify observations with the largest negative residuals . It also illustrates the use of the y -hat - y plot to visualize the prediction within the sample and the residuals from the predicted values.
CH11A Does smoking pose a health risk?
Are smokers less likely to remain healthy than non-smokers? How about former smokers who quit?
This case study uses the share-health data from the SHARE survey (Survey of Health, Ageing and Retirement in Europe). We focus on people who were 50 to 60 years old and reported being in good health in 2011. We look at how they rated their health in 2015 and see who remained healthy and who changed their answer to not healthy. This case study illustrates probability models. It shows how to estimate and interpret the results of a linear probability model and the uses of logit and probit models. It compares the linear probability estimates to the estimated marginal differences from logit and probit. Finally, it illustrates when and how the different models may result in different predicted probabilities and how to compare their fit using the Brier score and other measures of fit.
Code : Stata or R or Python or ALL . Data : share-health . Graphs : .png or .eps
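The linear probability model itself is just OLS with a binary outcome. A minimal sketch with invented smoker/health counts (not the share-health data) shows that, with a single binary regressor, the LPM reproduces the two group shares:

```python
def lpm(x, y):
    """Linear probability model: OLS of a binary outcome on one regressor.
    The slope is the difference in P(y=1) associated with a one-unit
    difference in x."""
    n, mx, my = len(x), sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

# Hypothetical data: x = 1 for smokers, y = 1 if stayed healthy.
# 40 of 50 non-smokers stayed healthy; 25 of 50 smokers did.
smoker = [0] * 50 + [1] * 50
healthy = [1] * 40 + [0] * 10 + [1] * 25 + [0] * 25

intercept, slope = lpm(smoker, healthy)
# Predicted P(healthy): intercept for non-smokers, intercept + slope for smokers
```

With a binary regressor the intercept equals the non-smoker share (0.80) and the slope equals the smoker/non-smoker difference (-0.30), which is why the LPM coefficients read directly as probability differences.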
CH11B Are Australian weather forecasts well-calibrated?
Should we take an umbrella when the weather forecast predicts rain? In particular, how much should we trust the forecast when it predicts a certain likelihood of rain? For example, is it true that it rains on 20 percent of the days when it says the likelihood is 20 percent?
This short case study uses the australia-weather-forecast data covering 350 days in 2015/16 and looks at rain forecast and actual rain for the Northern Australian city of Darwin. The case study illustrates how to construct and interpret a calibration curve .
Code : Stata or R or Python or ALL . Data : australia-weather-forecast . Graphs : .png or .eps
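A calibration curve is simple to compute once days are grouped by forecast probability. The sketch below uses invented forecasts and rain outcomes rather than the australia-weather-forecast data:

```python
def calibration_curve(forecast_probs, outcomes, bins):
    """For each forecast probability bin, compute the observed share of
    rainy days among the days with that forecast."""
    curve = {}
    for b in bins:
        days = [o for f, o in zip(forecast_probs, outcomes) if f == b]
        if days:
            curve[b] = sum(days) / len(days)
    return curve

# Hypothetical forecasts (probability of rain) and outcomes (1 = it rained)
forecasts = [0.2] * 10 + [0.8] * 10
rained = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0] + [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

curve = calibration_curve(forecasts, rained, bins=[0.2, 0.8])
```

A well-calibrated forecast has curve[b] close to b in every bin, so plotting observed shares against forecast probabilities should trace the 45-degree line.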
CH12A Returns on a company stock and market returns
How do monthly returns on a company stock move together with monthly market returns? The strength of this association is a good measure of how risky the company stock is.
This case study uses the stocks-sp500 dataset covering 21 years of daily data of many company stocks, focusing on the Microsoft stock and the S&P 500 stock market index. We construct monthly time series of percent returns as the percent change in closing price on the last day of each month. The case study illustrates the use of a simple time series regression in changes, focusing on the interpretation and visualization of the results.
Code : Stata or R or Python or ALL . Data : stocks-sp500 . Graphs : .png or .eps
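Constructing the monthly return series is the key data step here. A minimal sketch with made-up end-of-month closing prices:

```python
def percent_returns(monthly_close):
    """Percent change in closing price from one month to the next."""
    return [100 * (b - a) / a for a, b in zip(monthly_close, monthly_close[1:])]

# Hypothetical end-of-month closing prices
close = [100.0, 105.0, 102.9, 113.19]
returns = percent_returns(close)
```

The same transformation applied to both the company stock and the market index yields the two monthly return series whose regression slope measures the stock's riskiness.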
CH12B Electricity consumption and temperature
How does temperature affect residential electricity consumption? Answering this question can help plan electricity production and assess the potential effects of climate on electricity use.
This case study uses the arizona-electricity dataset that covers 17 years of monthly electricity consumption data from the state of Arizona in the USA and monthly temperature data from a weather station in its largest city, Phoenix. Using transformed variables of average “cooling degrees” and average “heating degrees” per month, we estimate time series regressions in changes, with and without season dummies. This case study illustrates how to estimate and interpret the results of time series regressions specified in changes . It shows how to handle and interpret seasonality and lagged associations , and how to use Newey-West standard errors or include lagged dependent variables to estimate standard errors that are robust to serial correlation in time series regressions.
Code : Stata or R or Python or ALL . Data : arizona-electricity . Graphs : .png or .eps
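The cooling-degree and heating-degree transformation can be sketched directly; the 65°F base used below is the common convention, and both the base and the temperatures are assumptions for illustration:

```python
def degree_measures(daily_mean_temps, base=65.0):
    """Average cooling and heating degrees for a month (Fahrenheit).
    Cooling degrees: how far the day's mean is above the base;
    heating degrees: how far below. At most one is nonzero per day."""
    n = len(daily_mean_temps)
    cooling = sum(max(t - base, 0.0) for t in daily_mean_temps) / n
    heating = sum(max(base - t, 0.0) for t in daily_mean_temps) / n
    return cooling, heating

# A hypothetical four-day "month" of daily mean temperatures
cooling, heating = degree_measures([90.0, 95.0, 60.0, 65.0])
```

These transformed variables let one regression capture that electricity use rises both when it is hot (air conditioning) and when it is cold (heating).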
PART III: PREDICTION
CH13A Predicting used car value with linear regressions
For how much can we expect to sell our used car? And what price could we expect if we waited a year or more? With appropriate data on similar used cars we can estimate various regression models to predict expected price as a function of its features. But how should we select the best regression model for prediction?
This case study uses the used-cars dataset with data from classified ads of used cars from various cities of the U.S.A. in 2018. We select a single model and a single city. The variables include the ask price and various features (age, odometer, cylinders, condition, etc.). We specify several linear regression models to predict the expected price as a function of car features. This case study illustrates the basic logic of carrying out predictive data analysis and model selection , emphasizing the need to achieve a good fit in the live data by selecting a model using the original data and avoiding both underfitting and overfitting the data. It illustrates the use of a loss function such as mean squared error (MSE) as a measure of fit, and it discusses alternative model selection strategies such as the BIC , the training-test split , and its improved version, k-fold cross-validation .
Code : Stata or R or Python or ALL . Data : used-cars . Graphs : .png or .eps
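The model-selection workflow (fit on training folds, evaluate MSE on held-out folds, pick the model with the lowest cross-validated MSE) can be sketched in plain Python. The data and the two candidate models below are toy stand-ins for the used-cars regressions:

```python
def kfold_mse(x, y, fit, predict, k=5):
    """Average held-out MSE across k folds: fit on k-1 folds, evaluate on the rest."""
    n = len(x)
    fold_mse = []
    for i in range(k):
        test_idx = set(range(i, n, k))   # every k-th point held out
        xtr = [x[j] for j in range(n) if j not in test_idx]
        ytr = [y[j] for j in range(n) if j not in test_idx]
        model = fit(xtr, ytr)
        errs = [(y[j] - predict(model, x[j])) ** 2 for j in test_idx]
        fold_mse.append(sum(errs) / len(errs))
    return sum(fold_mse) / k

# Two candidate "models": the sample mean, and a simple linear regression
def fit_mean(x, y):
    return sum(y) / len(y)

def fit_line(x, y):
    n, mx, my = len(x), sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

x = list(range(20))
y = [3.0 + 2.0 * xi + ((-1) ** xi) * 0.5 for xi in x]   # linear + small noise

mse_mean = kfold_mse(x, y, fit_mean, lambda m, xi: m)
mse_line = kfold_mse(x, y, fit_line, lambda m, xi: m[0] + m[1] * xi)
```

Here cross-validation correctly prefers the linear model: its held-out MSE is far below that of the mean-only model, mirroring how the case study chooses among regressions of different complexity.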
CH14A Predicting used car value: log prices
Continuing with our example of predicting used car prices, how should we decide whether to transform our target variable? In particular, we can specify regression models with log price instead of price as the target variable. How should we make predictions about price when the target variable is in logs, and how should we choose between models with log price versus price as the target variable?
This short case study uses the same used-cars dataset as case study 13A with used car data from several cities in the USA in 2018. The case study illustrates prediction with a target variable in logs . In particular, it shows how to apply log correction to predict a y variable when the model is specified in ln(y) and how to construct appropriate prediction intervals . The case study is a continuation of case study 13A, using the same data, and case study 15A uses the same data, too, to illustrate an alternative predictive model.
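The log correction itself is a one-line adjustment. The sketch below assumes we already have a predicted ln(price) and the residuals from the log model, both invented here for illustration:

```python
import math

def predict_from_log_model(log_pred, residuals_ln):
    """Predict y when the regression was estimated in ln(y).
    Naive exp(log_pred) underestimates E[y|x]; the simple log correction
    multiplies by the sample mean of exp(residual) from the log model."""
    correction = sum(math.exp(e) for e in residuals_ln) / len(residuals_ln)
    return math.exp(log_pred) * correction

# Hypothetical log-model output: predicted ln(price) and log-scale residuals
resids = [-0.2, -0.1, 0.0, 0.1, 0.2]
naive = math.exp(9.2)                             # ignores the correction
corrected = predict_from_log_model(9.2, resids)
```

Because mean(exp(e)) exceeds 1 whenever the residuals are spread around zero, the corrected prediction is always at least as large as the naive back-transformation.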
CH14B Predicting AirBnB apartment prices: selecting a regression model
London, UK is a popular tourist destination for business and leisure. We want to predict the rental price of an apartment offered by AirBnB in Hackney, a London borough. The results of this prediction can help tourists choose an offer that is underpriced for its features, or help apartment owners decide what price they could expect if they rented out their apartment on AirBnB.
This case study uses the airbnb dataset that includes rental prices for one night in March 2017 in greater London, and selects a specific borough. After sample design, we specify linear regressions of varying complexity and a model with LASSO. The case study illustrates the various methods of building regression models , including LASSO , and the use of a holdout sample for evaluating the prediction using the best model.
Code : Stata-prep , Stata-study or R-prep , R-study or Python or ALL . Data : airbnb . Graphs : .png or .eps
CH15A Predicting used car value with regression trees
Further continuing with our example of predicting used car prices, is there a better method for prediction than regression? Ideally, such a method would be better than linear regression at capturing the most important nonlinear patterns and interactions between feature variables and arrive at better predictions. The regression tree promises to be such an alternative, but how does it compare to linear regression in an actual prediction?
This case study uses the used-cars dataset from 2018 and its combined Chicago and Los Angeles subsamples on a specific model, to illustrate regression trees. We grow several regression trees and compare their predictive performance with the performance of linear regressions. This case study illustrates how we can grow a regression tree with the help of the CART algorithm , why we can think of a regression tree as a nonparametric regression , and how such a regression tree could overfit the original data even with stopping rules or pruning . The case study is a continuation of case studies 13A and 14A, using the same data source but a larger subsample of the observations.
CH16A Predicting apartment prices with random forest
Continuing with our question of how to predict AirBnB apartment prices in London, UK, we want to build the best model for prediction. In particular, we want to see how two different methods that combine many regression trees compare to each other, to the single regression tree, and to linear regressions.
We use the airbnb dataset that includes rental prices for one night in March 2017 from the area of Greater London. Using apartment location and various features of the accommodation as predictors, we carry out feature engineering and build random forest models and gradient boosting machine (GBM) models, both **ensemble methods** that use many regression trees. This case study illustrates prediction with random forest and boosting and the evaluation of such predictions. It shows how to carry out the necessary feature engineering, how to set the various tuning parameters for the different methods, and how those affect the predictions. It also illustrates the use of variable importance plots and partial dependence plots to help understand the patterns of association that drive the predictions in these black box models. The case study is a continuation of case study 14B, using the same data source but the entire London sample instead of a single borough.
Code : Stata or R-prep , R-study or Python or ALL . Data : airbnb . Graphs : .png or .eps
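A minimal illustration of a random forest and its variable importances, on synthetic data rather than the airbnb sample (scikit-learn's impurity-based importances are one of several ways to compute the "variable importance" idea mentioned above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in: the outcome depends strongly on x0,
# more weakly on x1, and not at all on x2.
X = rng.normal(size=(600, 3))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=600)

# An ensemble of 200 regression trees, each grown on a bootstrap sample.
rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Impurity-based variable importance: each feature's share of the
# total reduction in mean squared error across all splits.
importances = rf.feature_importances_
print([round(v, 2) for v in importances])
```

The importances sum to one and rank the predictors in the expected order, which is exactly the kind of "black box interpretation" aid a variable importance plot provides.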
CH17A Predicting firm exit: probability and classification
Many companies have relationships with other companies, as suppliers or clients. Whether those other companies stay in business in the future or exit is an important question for them. How can we use data on many companies across the years to predict the probability of their exit? And can we classify them into two groups, companies that are likely to exit and companies that are likely to stay in business?
This case study uses the bisnode-firms dataset, a panel dataset with a large number of companies from specific industries in a European country, to illustrate probability prediction and classification. After a good deal of feature engineering we estimate several logit models to predict the probability of firm exit and compare their performance by 5-fold cross-validation, choose the best model and describe how well it predicts the probabilities on a holdout sample, and use the predicted probabilities and two alternative methods for classification. This case study illustrates how to carry out probability predictions, how to evaluate their goodness of fit and other aspects of predictive performance, how to find an optimal classification threshold with the help of a loss function using a formula or model-dependent cross-validation, and how to use expected loss and the confusion table to evaluate classifications. It illustrates how the ROC curve visualizes the trade-offs of false positive and false negative decisions at various classification thresholds, and how to use random forest for probability prediction and classification. The case study is also a good example of potential issues with the external validity of predictions and how we may detect the possibility of such issues in the original data.
Code : Stata or R-prep , R-study or Python or ALL . Data : bisnode-firms . Graphs : .png or .eps
CH18A Forecasting daily ticket sales for a swimming pool
How can we use transaction data to predict the daily volume of sales? In particular, how can we use sales-terminal data on tickets sold to a swimming pool to predict the number of tickets sold on each day next year?
This case study uses the swim-transactions dataset with transaction-level data from all swimming pools in Albuquerque, New Mexico, USA for many years, and selects a single swimming pool. The case study illustrates long-term forecasts. We aggregate the data to daily frequency, discuss data issues and how to solve them, specify several regression models, and select the best by cross-validation. The case study illustrates the use of transaction data in predictive analytics, cross-validation with time series data, the use of trend and, especially, seasonality in making long-term predictions, and the use of the automated Prophet algorithm. It is an example of how evaluating predictions can detect problems that further data work and analysis may solve.
Code : Stata or R or Python or ALL . Data : swim-transactions . Graphs : .png or .eps
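Cross-validation with time series data cannot shuffle observations: each fold must train on the past and test on the future. A rolling-origin sketch on synthetic daily sales (the deliberately simple day-of-week model below is illustrative, not the case study's actual model):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic daily ticket sales with a mild trend and weekly seasonality.
days = pd.date_range("2015-01-01", periods=730, freq="D")
sales = (100 + 0.05 * np.arange(730)
         + 20 * (days.dayofweek >= 5)          # weekend bump
         + rng.normal(scale=5, size=730))
df = pd.DataFrame({"date": days, "sales": sales})

# Rolling-origin cross-validation: three expanding training windows,
# each evaluated on the following 90 days.
fold_errors = []
for cutoff in (365, 456, 547):
    train, test = df.iloc[:cutoff], df.iloc[cutoff:cutoff + 90]
    # Toy model: mean sales by day-of-week on the training window.
    dow_means = train.groupby(train["date"].dt.dayofweek)["sales"].mean()
    preds = test["date"].dt.dayofweek.map(dow_means)
    rmse = float(np.sqrt(np.mean((test["sales"] - preds) ** 2)))
    fold_errors.append(rmse)

print([round(e, 1) for e in fold_errors])
```

Averaging the fold RMSEs gives an honest basis for comparing candidate models before making the final long-term forecast.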
CH18B Forecasting a house price index
How can we use data on past home prices, and possibly other variables, to predict how home prices will change in a particular city in the next months?
This case study uses the case-shiller-la dataset with monthly observations on the Case-Shiller home price index for the city of Los Angeles, California, USA between 2000 and 2017. The dataset also contains monthly time series of the unemployment rate and employment rate. After exploratory data analysis we estimate various ARIMA time series models that use the price index, as well as VAR models that use the unemployment and employment rates as well, and we use appropriate cross-validation to select the best model. The case study illustrates how to make use of serial correlation to make short-term forecasts with the help of ARIMA models , how to use other variables and their forecasted values in a vector autoregression (VAR) model, and how to select the best model by cross-validation with time series data that preserves the serial correlation in the data.
Code : Stata or R or Python or ALL . Data : case-shiller-la . Graphs : .png or .eps
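The core of any ARIMA model is exploiting serial correlation. As a stripped-down sketch (an AR(1) estimated by OLS on a simulated series, not a full ARIMA and not the case study's code), the snippet below shows how the lagged value of a series feeds a one-step-ahead forecast:

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulate a serially correlated AR(1) series, e.g. monthly index growth.
n, phi_true = 240, 0.7
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal(scale=1.0)

# Estimate the AR(1) coefficient by regressing y_t on y_{t-1} (OLS, no constant).
y_lag, y_now = y[:-1], y[1:]
phi_hat = float(np.sum(y_lag * y_now) / np.sum(y_lag ** 2))

# The one-step-ahead forecast uses only the last observed value.
forecast = phi_hat * y[-1]
print(round(phi_hat, 2), round(forecast, 2))
```

A VAR model extends the same idea by letting each series depend on lags of all the series in the system, which is how the unemployment and employment rates enter the case study's forecasts.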
PART IV: CAUSAL ANALYSIS
CH19A Food and health
Does eating a lot of fruit and vegetables help people remain healthy? Can we use available data on people’s eating habits and health to uncover those effects? What are the most important problems with using such data to answer our question, and can we do anything about them?
This case study uses the food-health dataset, cross-sectional data collected on the health and eating habits of people as part of the National Health and Nutrition Examination Survey (NHANES, USA); we use data from years 2009-2013. We focus on the subsample of people aged 30-59 years old. The case study illustrates how to define an effect using the potential outcomes framework , how to use causal maps to visualize our assumptions about the causal relationships between variables, how to translate latent variables into their measured proxy variables that can be used in actual analysis, how to think about the sources of variation in the causal variable, and what variables we should condition on in an analysis that attempts to uncover the effect. The case study also illustrates the difficulty of uncovering effects from cross-sectional observational data.
Code : Stata-prep , Stata-study or R-prep , R-study or Python-prep , Python-study or ALL . Data : food-health . Graphs : .png or .eps
CH20A Working from home and employee performance
What is the effect of working from home on employee performance? How can we design an experiment that could measure this effect? Once the data is collected from the experiment, how should we assess its quality, estimate the effect, and evaluate the internal and external validity of the results?
This case study uses the working-from-home data, from an experiment that was carried out at a large travel agency in China. The case study illustrates how to design a field experiment , what are potential issues with internal validity and how to address them in the design or the analysis of the experiment, and how to analyze experimental data. It shows how to check covariate balance and how to interpret its results, how to assess compliance , and how to use regression analysis to estimate the effects of the experiment. The case study also illustrates how the results of the experiment can be used in business decisions , and what issues may arise with the external validity of the results.
Code : Stata or R or Python or ALL . Data : working-from-home . Graphs : .png or .eps
CH20B Fine tuning social media advertising
There are many choices to make when designing an online advertisement, including text content and details of appearance. Having alternative versions of these details, how can we select the version that would yield the most return?
This case study describes an A/B test that we carried out on a social media platform. We tested two versions of a text advertising a data analysis program and measured the number of clicks on the ad and the number of actions (leaving one’s email address). The case study illustrates the steps of designing an A/B test in general, and power calculation or sample size calculation in particular. There is no dataset for this case study.
Code : Stata or R or Python or ALL . Data : ab-test-social-media .
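A standard two-proportion power calculation can be sketched with the usual normal-approximation formula; the baseline rate and lift below are made-up numbers, and this uses only Python's standard library (`statistics.NormalDist`):

```python
import math
from statistics import NormalDist

def required_n_per_arm(p_base, mde, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect an absolute lift `mde`
    over a baseline conversion rate `p_base` (two-sided z-test)."""
    p_alt = p_base + mde
    p_bar = (p_base + p_alt) / 2
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))) ** 2
    return math.ceil(num / mde ** 2)

# E.g. a 2% baseline click rate and a hoped-for 0.5pp lift:
n = required_n_per_arm(0.02, 0.005)
print(n)
```

Note how quickly the required sample grows as the minimum detectable effect shrinks; this is why the sample size calculation has to happen before the test is launched.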
CH21A Founder/family ownership and quality of management
Many firms are owned by their founder or by family members of their founder. Are such founder/family owned firms as well managed as other kinds of firms and, if there is a difference, how much of that is due to their ownership as opposed to something else? Can we uncover that effect using cross-sectional observational data on firms and their management practices?
This case study uses the wms-survey-management dataset that we introduced in case study 1C. It is a large multi-country, multi-sector survey of companies, measuring their management practices and other company characteristics. We use the cross-sectional sample collected from 24 countries between 2004 and 2015. The case study illustrates the use of thought experiments to clarify what effect we want to measure, how to think about what variables to condition on, and how we may sign the omitted variables bias. Besides multiple regression, it illustrates exact matching and matching on the propensity score, discussing their feasibility, advantages and disadvantages, and comparing their results. The case study is another example illustrating the difficulty of uncovering an effect using cross-sectional observational data.
Code : Stata or R-prep , R-study or Python-prep , Python-study or ALL . Data : wms-survey-management . Graphs : .png or .eps
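A bare-bones sketch of propensity score matching on synthetic data (one confounder, nearest-neighbor matching with replacement; real applications, including this case study, involve many more variables and diagnostics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Synthetic firms: founder/family ownership (treated) is more likely for
# small firms, and the management score depends on size plus a true
# "ownership effect" of -0.3 that we want to recover.
n = 2000
size = rng.normal(size=n)
treated = rng.binomial(1, 1 / (1 + np.exp(size)))   # smaller firms more often treated
score = 0.5 * size - 0.3 * treated + rng.normal(scale=0.5, size=n)

# Step 1: estimate propensity scores from the confounder.
ps = LogisticRegression().fit(size.reshape(-1, 1), treated)
ps = ps.predict_proba(size.reshape(-1, 1))[:, 1]

# Step 2: match each treated firm to the control with the closest score.
t_idx = np.where(treated == 1)[0]
c_idx = np.where(treated == 0)[0]
matches = c_idx[np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)]

# Step 3: average treated-minus-matched-control difference (the ATT).
att = float(np.mean(score[t_idx] - score[matches]))
print(round(att, 2))
```

A naive treated-vs-control mean difference would mix the ownership effect with the size effect; matching on the propensity score removes (most of) that confounding under the usual selection-on-observables assumption.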
CH22A How does a merger between airlines affect prices?
When two companies merge, the new firm has more market power, and it may use that power to increase price or decrease quality. How can we measure the effect of a merger between two firms on the price they charge? How can we use panel data from many markets to uncover this effect?
This case study uses the US-airlines dataset that is based on 10 percent of all tickets sold on the U.S. market, collected and maintained by the U.S. Department of Transportation. We use this data to evaluate the effect of the merger of American Airlines and US Airways. We define markets, aggregate the data to market-year level, and compare price changes across markets with and without the two airlines before the merger. The case study illustrates the use of transaction data to carry out a market-level analysis, the difficulties of defining markets, and the use of difference-in-differences analysis to estimate an effect. It shows how to examine pre-intervention trends to assess the parallel trends assumption, and how to estimate generalized versions of difference-in-differences analysis, adding covariates or using a quantitative treatment variable.
Code : Stata or R-prep , R-study or Python-prep , Python-study or ALL . Data : US-airlines . Graphs : .png or .eps
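In its simplest 2x2 form, difference-in-differences is just a comparison of two before/after changes. A toy illustration with made-up fares (the real case study works with many markets and a regression formulation):

```python
import pandas as pd

# Toy market-level panel: average fares before/after a merger, for
# markets affected by the merger and unaffected comparison markets.
df = pd.DataFrame({
    "market":   ["A", "A", "B", "B", "C", "C", "D", "D"],
    "affected": [1, 1, 1, 1, 0, 0, 0, 0],
    "period":   ["before", "after"] * 4,
    "fare":     [200, 230, 180, 212, 150, 160, 210, 218],
})

means = df.groupby(["affected", "period"])["fare"].mean()
# Diff-in-diffs: the (after - before) change in affected markets,
# minus the same change in unaffected markets.
did = (means[1, "after"] - means[1, "before"]) \
    - (means[0, "after"] - means[0, "before"])
print(did)
```

Subtracting the control-market change nets out aggregate price trends, so `did` isolates the merger effect under the parallel trends assumption.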
CH23A Import demand and industrial production
How does import demand of a large country affect the industrial production of a medium-sized open economy? With time series data on imports of the large receiving country and industrial production of the smaller country, we can estimate a time series regression to uncover the effect. But the typical time series we can use are not very long, leading to uncertain estimates with wide confidence intervals. How can we use comparable data from other, similar countries to get more precise estimates?
This case study uses the asia-industry dataset with monthly time series of imports to the USA and industrial production in several Asian countries. The case study illustrates the use of time series regression to uncover an effect, including contemporaneous effects, lagged effects and their sum, cumulative effects. It then shows how we can use pooled time series, time series of the same variables from similar subjects (here countries), to arrive at more precise estimates of the same effect.
Code : Stata or R or Python or ALL . Data : asia-industry . Graphs : .png or .eps
CH23B Immunization against measles and saving children
Immunization against measles is an effective way to prevent the disease and may save the lives of children. How can we use data from many countries and several years with immunization and child mortality rates to uncover the effect of immunization on the survival chances of children?
This case study uses the world-bank-immunization dataset with data from the World Development Indicators data website maintained by the World Bank to look at countries’ annual immunization rate and GDP per capita. The case study illustrates panel data regressions with fixed effects (FE) and estimated in first differences (FD). It shows how the inclusion of time dummies can condition on aggregate trends of any form, and the need to estimate clustered standard errors that are robust to heteroskedasticity as well as serial correlation. It shows that the inclusion of lagged right-hand-side variables can help capture lagged effects and, in the case of FD models, estimate cumulative effects, and that the inclusion of lead terms of the right-hand-side variables can capture pre-intervention trends. It also shows how including unit-specific constants in an FD model can help capture time trends specific to cross-sectional units. The case study compares the results of FE and FD regressions and discusses their differences.
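The key mechanics of a first-differences panel regression can be sketched on simulated data: differencing each country's series removes the country fixed effect, after which a simple OLS on the differences recovers the true coefficient. (This toy version omits time dummies, leads/lags, and clustered standard errors, all of which the case study adds.)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)

# Synthetic country-year panel: mortality falls with immunization
# (true effect -2), plus a country fixed effect and noise.
countries, years = 50, 20
cid = np.repeat(np.arange(countries), years)
imm = rng.uniform(50, 99, size=countries * years)
country_fe = np.repeat(rng.normal(scale=10, size=countries), years)
mort = 100 - 2.0 * imm + country_fe + rng.normal(scale=3, size=countries * years)
df = pd.DataFrame({"country": cid, "imm": imm, "mort": mort})

# First differences within country wipe out the time-invariant fixed effect.
d = df.groupby("country")[["imm", "mort"]].diff().dropna()

# OLS slope of d(mort) on d(imm): should recover the true effect of -2.
beta_fd = float(np.sum(d["imm"] * d["mort"]) / np.sum(d["imm"] ** 2))
print(round(beta_fd, 2))
```

A within (FE) transformation, demeaning by country instead of differencing, would target the same parameter; the case study discusses when the two estimators' answers differ.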
CH24A Estimating the effect of the 2010 Haiti earthquake on GDP
In January 2010, a strong earthquake hit the Caribbean island country of Haiti, with an epicenter very close to the country’s capital. What was the effect of the earthquake on the Haitian economy in the short and the longer run? We can easily measure how total GDP changed in the year of the earthquake and how it evolved in the following years. However, to estimate the effect of the earthquake we need to estimate the counterfactual: how total GDP would have changed if Haiti hadn’t experienced an earthquake. How can we estimate such a counterfactual?
This case study uses the haiti-earthquake dataset with yearly observations of several macro variables for many countries. The case study illustrates comparative case studies and how to construct a synthetic control observation (here, a country) from data on other countries to estimate the counterfactual. It shows how to select a donor pool of observations similar to the case study observation (Haiti), how to select the variables whose pre-intervention values we want to be similar between the case study observation and the synthetic control observation, and how to use the algorithm of the synthetic control method to assign weights to each observation in the donor pool to construct the synthetic control observation. The case study also illustrates the visualization of the results of synthetic control analysis and the potential issues with the method to uncover the counterfactual.
Code : Stata or R or Python or ALL . Data : haiti-earthquake . Graphs : .png or .eps
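A crude sketch of the synthetic control idea on simulated data: find non-negative donor weights that reproduce the treated unit's pre-intervention path, then normalize them to sum to one. (The full method also matches on predictor variables and chooses variable weights; the non-negative least squares shortcut below is an approximation, not the actual algorithm.)

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)

# Pre-intervention GDP-like paths: 1 treated country, a donor pool of 8.
T_pre, n_donors = 10, 8
donors = rng.normal(loc=100, scale=5, size=(T_pre, n_donors)).cumsum(axis=0) / 10 + 100
# The treated path is (by construction) close to a convex combination
# of the first three donors, plus noise.
treated = donors[:, :3] @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.3, size=T_pre)

# Non-negative weights that best fit the pre-intervention path,
# normalized to sum to one (as synthetic control weights must).
w, _ = nnls(donors, treated)
w = w / w.sum()

synthetic = donors @ w
pre_gap = float(np.sqrt(np.mean((treated - synthetic) ** 2)))
print([round(x, 2) for x in w], round(pre_gap, 2))
```

Post-intervention, the gap between the treated path and `donors @ w` serves as the estimate of the effect; a small pre-intervention gap is what makes that comparison credible.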
CH24B Estimating the impact of replacing football team managers
Success in team sports depends on many things, and the work of the coach, or manager, is likely one of them. When a team performs below expectations, replacing the manager is one of the options teams can consider. How can we use data on all games for several seasons from a professional football (soccer) league and their managers to show how team performance tends to change after a manager is replaced? And how can we use the same data to estimate the counterfactual: how the performance of low-performing teams would have changed if the manager hadn’t been replaced?
This case study uses the football dataset with all games of the English Premier League (EPL) in 11 seasons and who the team manager was at each game. It illustrates the event study method to estimate contemporaneous and lagged effects with panel data. It shows how we can select a control group from all observations that is similar, on average, in pre-intervention variables (here, team performance) to estimate the counterfactual post-intervention outcomes, and how to define and select pseudo-interventions that are necessary to define the control group. We used the same dataset in case study 2B.
Case Study Example 1: An eCommerce Company Evaluation
Why did we do this? A lot of interviews have a take-home case component. This is different from a lot of the prep material we provide because it challenges you to think end-to-end about what is probably a real business problem the organization you're interviewing with is facing.
How can I use this? We have provided the data (downloaded from Kaggle, thank you Kaggle!), a mock prompt (similar to ones experienced at top-tier companies), a potential presentation (which could be thought of as a solution), and the work we did to get to the solution. We hope that if you run into a similar scenario or want extra practice, you can use our work to guide you in the right direction!
The data: It was really hard to find a real dataset that we could use as a "practice case." We ended up finding data on Kaggle! You can download the data there, or you can visit the links below. We didn't use all of the data provided in this practice set by Kaggle, so below we only include the files we used to get to the solution.
- Orders dataset : Provides information for each order placed.
- Order items dataset : Information for the items within each order, with the shipping cost and price broken out for each item.
- Order payments dataset : Provides information regarding payments made on each order. Make sure you aggregate total payments for each order to get a unique order price.
- Product dataset : Provides information about the product.
- Product category name translated dataset : The source data is Brazilian, so all categories are in Portuguese; join the category name on this table to get the translated category.
- Order reviews dataset : This table has review information for each order.
- Customers dataset : This dataset has information regarding customer_id, which links directly to order_id in the orders dataset. However, to get the unique customer id for each order you need to link to this table.
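The "aggregate payments before joining" advice above can be sketched in pandas. The tiny tables and column names below are illustrative, not the exact Kaggle schema:

```python
import pandas as pd

# Toy versions of the orders and order-payments tables described above.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["c1", "c2", "c1"],
})
payments = pd.DataFrame({
    "order_id": [1, 1, 2, 3],           # order 1 was paid in two installments
    "payment_value": [50.0, 25.0, 40.0, 30.0],
})

# Aggregate payments first so each order appears exactly once, then join.
order_totals = payments.groupby("order_id", as_index=False)["payment_value"].sum()
merged = orders.merge(order_totals, on="order_id", how="left")
print(merged["payment_value"].tolist())
```

Joining the raw payments table without the `groupby` would duplicate orders paid in installments and inflate any revenue totals computed downstream.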
You’re a Data Scientist / Business Analyst working for a new eCommerce company called A&B Co. (similar to Amazon) and you’ve been asked to prepare a presentation for the Vice President of Sales and the Vice President of Operations that summarizes sales and operations thus far. The summary should include (at a minimum) a summary of the current state of the business, current customer satisfaction, and a proposal of 2-3 areas where the company can improve. Here are some facts:
- It’s currently September 2018 (i.e., you can ignore all data after September 2018)
- The company’s inception was January 2017 (so you can ignore all data before January 2017)
- Company is US-based, but launched in Brazil (which is why some information is in Portuguese)
- You can assume all orders are delivered (so ignore the order state field)
Your presentation should not have more than 10 slides of content, and the presentation itself should only take ~15 minutes.
Note all data was provided by Kaggle . Feel free to read about each file on Kaggle; you can download the data by clicking on each link below.
- Orders dataset
- Order items dataset
- Order payments dataset
- Product dataset
- Product category name translated dataset
- Order reviews dataset
- Customers dataset
This is a potential solution to address the problems outlined in the prompt. Note this is not the only solution, nor is it necessarily the best. The goal of providing you this is to give you an idea of how we would go about solving this problem and how we would present it to an interview panel.
Below you'll find the code from our iPython notebook, the work we did in Google Sheets (how we created most of the charts), and some thoughts we had on solving the case.
Python notebook: You can view the Jupyter notebook we used here.
Google sheets: A link to the spreadsheet .
Most of the charts and quick side analysis I did were done in Google Sheets. Python charts are nice, but I find Google Sheets generally quicker and more flexible for basic charts. For a case, you typically get 24-48 hours, and you might end up spending a lot of time trying to make charts look "pretty". I found it's best to use Python for data aggregation/manipulation and a spreadsheet for the simpler charts (unless otherwise specified in the prompt).
Our thought process
Build a framework to answer the questions. If you’re not sure what the questions are, create questions for yourself to answer; it makes the process of digging through the data so much easier. It’s hard to come up with an answer if you don’t know what the questions are. This point seems like a no-brainer, but it’s good to make sure that you’ve created a structure to answer the key points. If you’re not sure what the question at hand is, you might need to play with the data a bit to understand what the problem seems to be. For this particular case the ask is really clear: we need to create a summary covering the current state of the business and current customer satisfaction, plus a proposal for an area where the company can improve. The questions we came up with (and some answers listed below) given the ask are:
What should be shown to reflect current state of business? (Probably something to do with product growth and amount of money)
Revenue → how much money are we making?
Volume of sales → how many orders are we getting?
Customer summary based on spend + behavior
How should we present customer satisfaction?
Is customer satisfaction a problem? Do initial analysis to figure it out?
Does customer satisfaction reflect our current business state? (e.g. if business is down, does customer satisfaction also go down?, if business is boomin’ is customer satisfaction higher?)
Should I focus on sales area of improvement or customer satisfaction area of improvement?
This depends on what bullet points 1 + 2 yield, also depends on customer churn
Questions don’t always have analytical answers; sometimes questions lead to more questions. This is of course a high-level framework, but the above should give you an idea of how we went through solving this! (The details are provided in the code/presentation material.)
© 2023 Data Interview Questions.

Top 8 Data Science Case Studies for Data Science Enthusiasts
- 8 Data Science Case Studies
- Data Science in Hospitality Industry
- Data Science in Healthcare
- Covid 19 and Data Science
- Data Science in Ecommerce
- Data Science in Supply Chain Management
- Data Science in Meteorology
- Data Science in Entertainment Industry
- Data Science in Banking and Finance
- Where to Find Full Data Science Case Studies?
- What Are the Skills Required for Data Scientists?
- Frequently Asked Questions (FAQs)

Data science has become popular in the last few years due to its successful application in making business decisions. Data scientists have been using data science techniques to solve challenging real-world issues in healthcare, agriculture, manufacturing, automotive, and many more. For this purpose, a data enthusiast needs to stay updated with the latest technological advancements in AI. An excellent way to achieve this is through reading industry case studies. Check out Knowledgehut Data Science With Python course syllabus to start your data science journey.
Let’s discuss some case studies that contain detailed and systematic data analysis of people, objects, or entities, focusing on multiple factors present in the dataset. Case studies like these can motivate aspiring and practising data scientists to learn more about a sector, explore alternative ways of thinking, or find methods to improve their organization based on comparable experiences. Almost every industry uses data science in some way. You can learn more about data science fundamentals in this data science course content . Data scientists may use it to spot fraudulent conduct in insurance claims. Automotive data scientists may use it to improve self-driving cars. In contrast, e-commerce data scientists can use it to add more personalization for their consumers: the possibilities are unlimited and largely unexplored.
We will take a look at the top eight data science case studies in this article so you can understand how businesses from many sectors have benefitted from data science to boost productivity, revenues, and more. Read on to explore more, or use the following links to go straight to the case study of your choice.

Hospitality
- Airbnb focuses on growth by analyzing customer voice using data science
- Qantas uses predictive analytics to mitigate losses
Healthcare
- Novo Nordisk is Driving innovation with NLP
- AstraZeneca harnesses data for innovation in medicine
Covid 19
- Johnson and Johnson uses data science to fight the Pandemic
Ecommerce
- Amazon uses data science to personalize shopping experiences and improve customer satisfaction
Supply chain management
- UPS optimizes supply chain with big data analytics
Meteorology
- IMD leveraged data science to achieve a record 1.2m evacuation before cyclone Fani
Entertainment Industry
- Netflix uses data science to personalize the content and improve recommendations
- Spotify uses big data to deliver a rich user experience for online music streaming
Banking and Finance
- HDFC utilizes Big Data Analytics to increase income and enhance the banking experience
8 Data Science Case Studies
1. Data Science in Hospitality Industry
In the hospitality sector, data analytics assists hotels with better pricing strategies, customer analysis, brand marketing, tracking market trends, and much more.
Airbnb focuses on growth by analyzing customer voice using data science.
A famous example in this sector is the unicorn Airbnb, a startup that focused on data science early to grow and adapt to the market faster. The company witnessed hypergrowth of 43,000 percent in as little as five years using data science. They used data science techniques to process the data, better understand the voice of the customer, and use the insights for decision making. They also scaled the approach to cover all aspects of the organization. Airbnb uses statistics to analyze and aggregate individual experiences to establish trends throughout the community. These trends, analyzed using data science techniques, inform their business choices while helping them grow further.
Travel industry and data science
Predictive analytics benefits many parameters in the travel industry. These companies can use recommendation engines with data science to achieve higher personalization and improved user interactions. They can study and cross-sell products by recommending relevant products to drive sales and increase revenue. Data science is also employed in analyzing social media posts for sentiment analysis, bringing invaluable travel-related insights. Whether these views are positive, negative, or neutral can help these agencies understand the user demographics, the expected experiences by their target audiences, and so on. These insights are essential for developing aggressive pricing strategies to draw customers and provide better customization to customers in the travel packages and allied services. Travel agencies like Expedia and Booking.com use predictive analytics to create personalized recommendations, product development, and effective marketing of their products. Not just travel agencies but airlines also benefit from the same approach. Airlines frequently face losses due to flight cancellations, disruptions, and delays. Data science helps them identify patterns and predict possible bottlenecks, thereby effectively mitigating the losses and improving the overall customer traveling experience.
How Qantas uses predictive analytics to mitigate losses
Qantas, one of Australia's largest airlines, leverages data science to reduce losses caused by flight delays, disruptions, and cancellations. They also use it to provide a better traveling experience for their customers by reducing the number and length of delays caused by heavy air traffic, weather conditions, or operational difficulties. Back in 2016, when heavy storms badly struck Australia's east coast, only 15 out of 436 Qantas flights were cancelled thanks to their predictive analytics-based system, against their competitor Virgin Australia, which saw 70 of its 320 flights cancelled.
2. Data Science in Healthcare
The healthcare sector is benefiting immensely from the advancements in AI. Data science, especially in medical imaging, has been helping healthcare professionals come up with better diagnoses and effective treatments for patients. Similarly, several advanced healthcare analytics tools have been developed to generate clinical insights for improving patient care. These tools also assist in defining personalized medications for patients, reducing operating costs for clinics and hospitals. Apart from medical imaging or computer vision, Natural Language Processing (NLP) is frequently used in the healthcare domain to study published textual research data.
Pharmaceutical
Driving innovation with NLP: Novo Nordisk
Novo Nordisk uses the Linguamatics NLP platform to mine text from internal and external data sources, including scientific abstracts, patents, grants, news, tech transfer offices from universities worldwide, and more. These NLP queries run across sources for the key therapeutic areas of interest to the Novo Nordisk R&D community. Several NLP algorithms have been developed for the topics of safety, efficacy, randomized controlled trials, patient populations, dosing, and devices. Novo Nordisk employs a data pipeline to extend the tools' success to real-world data, and uses interactive dashboards and cloud services to visualize the standardized, structured information from the queries for exploring commercial effectiveness, market situations, potential, and gaps in the product documentation. Through data science, they are able to automate the process of generating insights, saving time and providing better insights for evidence-based decision making.
How AstraZeneca harnesses data for innovation in medicine
AstraZeneca is a globally known biotech company that leverages data and AI technology to discover and deliver new, effective medicines faster. Within their R&D teams, they are using AI to decode big data to better understand diseases like cancer, respiratory disease, and heart, kidney, and metabolic diseases so they can be treated more effectively. Using data science, they can identify new targets for innovative medications. In 2021, they selected their first two AI-generated drug targets, in Chronic Kidney Disease and Idiopathic Pulmonary Fibrosis, in collaboration with BenevolentAI.
Data science is also helping AstraZeneca design better clinical trials, achieve personalized medication strategies, and innovate the process of developing new medicines. Their Center for Genomics Research uses data science and AI with the aim of analyzing around two million genomes by 2026. For imaging, they are training their AI systems to check medical images for disease and for biomarkers that point to effective medicines. This approach helps them analyze samples accurately and more effortlessly; moreover, it can cut analysis time by around 30%.
AstraZeneca also uses AI and machine learning to optimize different stages of clinical trials and shorten their overall duration by analyzing trial data. In summary, the company applies data science to design smarter clinical trials, develop innovative medicines, and improve its drug development and patient care strategies.
Wearable Technology
Wearable technology is a multi-billion-dollar industry. With an increasing awareness about fitness and nutrition, more individuals now prefer using fitness wearables to track their routines and lifestyle choices.
Fitness wearables are convenient to use, help users track their health, and encourage a healthier lifestyle. Medical devices in this domain are valuable because they monitor a patient's condition and can send alerts in an emergency. Popular fitness trackers and smartwatches from companies like Garmin, Apple, and Fitbit continuously collect physiological data from their wearers, and these providers offer user-friendly dashboards that let customers analyze and track progress in their fitness journey.
3. COVID-19 and Data Science
In the past two years of the pandemic, the power of data science has been more evident than ever. Pharmaceutical companies across the globe were able to develop COVID-19 vaccines quickly by analyzing data to understand the trends and patterns of the outbreak. Data science made it possible to track the virus in real time, predict its spread, and devise effective strategies to fight the pandemic.
How Johnson & Johnson uses data science to fight the pandemic
The data science team at Johnson & Johnson leverages real-time data to track the spread of the virus. They built a global surveillance dashboard (granular down to the county level) that helps them track the pandemic's progress, predict potential hotspots, and narrow down the most suitable locations to test their investigational COVID-19 vaccine candidate. The team works with in-country experts to determine whether official numbers are accurate and to find the most reliable information about case numbers, hospitalizations, mortality and testing rates, social compliance, and local policies to populate the dashboard. They also use the data to build models that identify groups of individuals at risk of infection and to explore effective treatments that improve patient outcomes.
4. Data Science in Ecommerce
In the e-commerce sector, big data analytics can support customer analysis, reduce operational costs, forecast trends for better sales, and provide personalized shopping experiences.
Amazon uses data science to personalize shopping experiences and improve customer satisfaction. As a globally leading e-commerce platform offering a wide range of online shopping services, Amazon generates a massive amount of data that can be leveraged to understand consumer behavior and gain insight into competitors' strategies. Amazon uses this data to recommend products and services to its users, nudging consumers toward additional purchases; recommendations are reported to drive around 35% of Amazon's annual revenue. Additionally, Amazon uses consumer data to enable faster order tracking and better deliveries.
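Amazon's production recommender is proprietary, but the core idea behind item-to-item collaborative filtering can be sketched in a few lines. The following is a minimal, hypothetical illustration (made-up purchase data, not Amazon's actual method): count how often pairs of items are bought together, then recommend the items most frequently co-purchased with a given item.

```python
from collections import defaultdict
from itertools import combinations

# Toy purchase history: user -> set of items bought (hypothetical data).
purchases = {
    "alice": {"keyboard", "mouse", "monitor"},
    "bob":   {"keyboard", "mouse"},
    "carol": {"mouse", "monitor", "webcam"},
}

# Count how often each pair of items appears in the same basket.
co_counts = defaultdict(int)
for items in purchases.values():
    for a, b in combinations(sorted(items), 2):
        co_counts[(a, b)] += 1

def recommend(item, k=2):
    """Return up to k items most often co-purchased with `item`."""
    scores = defaultdict(int)
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [i for i, _ in ranked][:k]

print(recommend("keyboard"))  # -> ['mouse', 'monitor']
```

Real systems normalize these counts (e.g., by item popularity) and compute them at scale, but the ranking logic is the same in spirit.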
Similarly, Amazon's virtual assistant, Alexa, can converse in different languages and uses speakers and a camera to interact with users. Amazon analyzes users' audio commands to improve Alexa and deliver a better user experience.
5. Data Science in Supply Chain Management
Predictive analytics and big data are driving innovation in the supply chain domain. They offer greater visibility into company operations, reduce costs and overheads, improve demand forecasting, enable predictive maintenance, support product pricing and route optimization, minimize supply chain interruptions, strengthen fleet management, and drive better overall performance.
Optimizing supply chain with big data analytics: UPS
UPS is a renowned package delivery and supply chain management company. With thousands of packages delivered every day, a UPS driver makes about 100 deliveries each business day on average, so on-time, safe delivery is crucial to UPS's success. UPS therefore built an optimized navigation tool, ORION (On-Road Integrated Optimization and Navigation), which uses highly advanced big data processing algorithms to give drivers routes optimized for fuel, distance, and time. UPS applies supply chain data analysis across its entire shipping process: data about packages and deliveries are captured through sensors, and deliveries and routes are optimized using big data systems. Overall, this approach is reported to save UPS 1.6 million gallons of fuel in transportation every year, significantly reducing delivery costs.
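ORION's algorithms are proprietary and far more sophisticated, but the flavor of route optimization can be shown with a simple greedy nearest-neighbor heuristic: always drive to the closest unvisited stop. The stops and coordinates below are hypothetical.

```python
import math

# Hypothetical delivery stops as (x, y) coordinates; the depot is the start.
stops = {"depot": (0, 0), "A": (2, 1), "B": (5, 4), "C": (1, 3)}

def dist(p, q):
    """Straight-line distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def greedy_route(stops, start="depot"):
    """Visit the nearest unvisited stop next -- a simple heuristic,
    not the optimal tour and nothing like ORION's real algorithm."""
    route, remaining = [start], set(stops) - {start}
    while remaining:
        here = stops[route[-1]]
        nxt = min(remaining, key=lambda s: dist(here, stops[s]))
        route.append(nxt)
        remaining.remove(nxt)
    return route

print(greedy_route(stops))  # -> ['depot', 'A', 'C', 'B']
```

Production routing adds real road networks, traffic, time windows, and vehicle constraints, which is why tools like ORION need big data systems rather than a ten-line heuristic.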
6. Data Science in Meteorology
Weather prediction is an interesting application of data science. Businesses in aviation, agriculture and farming, construction, consumer goods, sporting events, and many other fields depend on climatic conditions. Their success is closely tied to the weather, and decisions are made after considering the forecasts issued by the meteorological department.
Weather forecasts are also extremely helpful for individuals managing allergic conditions. Another crucial application of weather forecasting is natural disaster prediction and risk management.
Weather forecasting begins with collecting large amounts of data on current environmental conditions (wind speed, temperature, humidity, and cloud cover at a specific location and time) using sensors on IoT (Internet of Things) devices and satellite imagery. This data is then analyzed in light of atmospheric processes, and machine learning models are built to predict upcoming conditions such as rainfall or snow. Although data science cannot prevent natural calamities like floods, hurricanes, or forest fires, it can track these phenomena well ahead of their arrival, giving governments enough time to take the measures needed to keep the population safe.
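As a toy illustration of the modeling step, here is an ordinary least-squares fit predicting rainfall from a single feature (relative humidity). The numbers are invented for the example; real forecast models use many features and far more sophisticated methods.

```python
# Fit a least-squares line: rainfall (mm) as a function of humidity (%).
# The data points are made up for illustration.
humidity = [60, 70, 80, 90]
rainfall = [1.0, 2.1, 2.9, 4.0]

n = len(humidity)
mean_x = sum(humidity) / n
mean_y = sum(rainfall) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(humidity, rainfall))
         / sum((x - mean_x) ** 2 for x in humidity))
intercept = mean_y - slope * mean_x

def predict(h):
    """Predicted rainfall in mm for a given humidity percentage."""
    return slope * h + intercept

print(round(predict(85), 2))  # -> 3.48
```

In practice this single-feature regression would be replaced by numerical weather models combined with machine learning on many sensor and satellite inputs, but the workflow (collect data, fit a model, predict) is the same.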
IMD leveraged data science to achieve a record 1.2 million evacuations before cyclone Fani
Meteorologists rely on satellite images to make short-term forecasts, judge whether a forecast is correct, and validate models. Machine learning is also used for pattern matching here: if it recognizes a past pattern, it can forecast future weather conditions. With dependable equipment, sensor data helps produce local forecasts grounded in actual weather models. The IMD (India Meteorological Department) used satellite pictures to study the low-pressure zones forming off the Odisha coast of India. In April 2019, thirteen days before cyclone Fani reached the area, the IMD warned that a massive storm was underway, and the authorities began preparing safety measures.
Fani was one of the most powerful cyclones to strike India in the last 20 years, and a record 1.2 million people were evacuated in less than 48 hours, thanks to the power of data science.
7. Data Science in Entertainment Industry
Due to the pandemic, demand for OTT (over-the-top) media platforms has grown significantly. People prefer watching movies and web series, or listening to the music of their choice, at leisure in the comfort of their homes. This sudden growth in demand has led to stiff competition, and every platform now uses data analytics to provide better-personalized recommendations to its subscribers and improve the user experience.
How Netflix uses data science to personalize the content and improve recommendations
Netflix is an extremely popular internet television platform that streams content in several languages and caters to varied audiences. In 2006, soon after entering the media streaming market, Netflix offered a $1 million prize to any team that could improve the accuracy of its existing "Cinematch" recommendation engine by 10%. The approach paid off: at the end of the competition, the BellKor team delivered a solution that improved prediction accuracy by 10.06%, the result of a huge amount of work and an ensemble of 107 algorithms. These winning algorithms became part of the Netflix recommendation system.
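The winning entry blended the outputs of many models rather than relying on one. A minimal sketch of that blending idea (with made-up predictions and fixed weights; in the real competition the weights themselves were learned) looks like this:

```python
# Ensemble blending: combine per-model rating predictions with weights.
# Model names, predictions, and weights are all hypothetical.
predictions = {
    "model_a": [3.5, 4.0, 2.0],
    "model_b": [3.0, 4.5, 2.5],
    "model_c": [4.0, 4.0, 3.0],
}
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

def blend(predictions, weights):
    """Weighted average of each model's prediction, item by item."""
    n = len(next(iter(predictions.values())))
    return [sum(weights[m] * predictions[m][i] for m in predictions)
            for i in range(n)]

print(blend(predictions, weights))
```

Blending works because different models make different mistakes; averaging them cancels some of the individual errors, which is why the 107-algorithm ensemble outperformed any single model.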
Netflix also employs ranking algorithms to generate personalized recommendations of movies and TV shows appealing to its users.
Spotify uses big data to deliver a rich user experience for online music streaming
Personalized online music streaming is another area where data science shines. Spotify, a well-known on-demand music service launched in 2008, has effectively leveraged big data to create a personalized experience for each user. It is a huge platform with more than 24 million subscribers and a catalog of nearly 20 million songs, and it uses this data, together with various algorithms, to train machine learning models that serve personalized content. Spotify's "Discover Weekly" feature generates a weekly playlist of fresh, unheard songs matching the user's taste, while the "Wrapped" feature gives users a December overview of their favorite and most frequently played songs of the year. Spotify also leverages the data to run targeted ads and grow its business. In short, Spotify combines its user data with some external data to deliver a high-quality user experience.
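One common building block for this kind of personalization is cosine similarity between a user's taste profile and candidate songs' audio-feature vectors. The sketch below is hypothetical (invented feature values, not Spotify's actual pipeline), but it shows the basic matching step.

```python
import math

# Hypothetical audio-feature vectors (e.g., energy, danceability, valence).
songs = {
    "song_a": [0.9, 0.8, 0.7],
    "song_b": [0.2, 0.3, 0.9],
    "song_c": [0.3, 0.4, 0.8],
}
# Taste profile, e.g. the average feature vector of tracks the user likes.
user_taste = [0.9, 0.8, 0.65]

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Recommend the song whose features best match the user's taste.
best = max(songs, key=lambda s: cosine(user_taste, songs[s]))
print(best)  # -> song_a
```

Real systems combine signals like this with collaborative filtering and listening history at massive scale, but the "compare a taste vector to item vectors" step is a standard core.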
8. Data Science in Banking and Finance
Data science is extremely valuable in the banking and finance industry. Several high-priority aspects of the sector rely on it: credit risk modeling (estimating the likelihood that a loan will be repaid), fraud detection (spotting malicious or irregular transaction patterns using machine learning), customer lifetime value (predicting bank performance based on existing and potential customers), and customer segmentation (profiling customers by behavior and characteristics to personalize offers and services). Finally, data science also powers real-time predictive analytics (computational techniques for predicting future events).
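To make the fraud-detection idea concrete, here is a deliberately crude anomaly rule on hypothetical transaction amounts: flag anything more than three standard deviations above a customer's typical spend. Production fraud models use many features and supervised learning, not a single z-score, so treat this purely as a sketch.

```python
import statistics

# Hypothetical transaction history; the last amount is an outlier.
amounts = [42.0, 55.0, 38.0, 60.0, 47.0, 52.0, 49.0, 2500.0]

# Baseline statistics from historical spending (excluding the new txn).
mean = statistics.mean(amounts[:-1])
stdev = statistics.stdev(amounts[:-1])

def is_suspicious(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations above mean."""
    return (amount - mean) / stdev > threshold

flags = [a for a in amounts if is_suspicious(a)]
print(flags)  # -> [2500.0]
```

Even this toy version captures the key design choice: the model learns what "normal" looks like per customer, then scores new transactions against that baseline in real time.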
How HDFC utilizes Big Data Analytics to increase revenues and enhance the banking experience
One of the major private banks in India, HDFC Bank, was an early adopter of AI. It began working with big data analytics in 2004, intending to grow its revenue and understand its customers and markets better than its competitors. The bank was a trendsetter at the time, setting up an enterprise data warehouse that let it differentiate customers based on their relationship value with HDFC Bank. Data science and analytics have been crucial in helping HDFC Bank segment its customers and offer customized personal and commercial banking services. Its analytics engine and use of SaaS have helped the bank cross-sell relevant offers to its customers; beyond routine fraud prevention, analytics also keeps track of customer credit histories and underpins the bank's speedy loan approvals.
Where to Find Full Data Science Case Studies?
Data science is a rapidly evolving domain with many practical applications and a huge open community, so the best way to stay current is to read case studies and technical articles. Companies often share success stories of how data science helped them achieve their goals, both to showcase their capabilities and to benefit the wider community. Such case studies are available online on the companies' websites and on dedicated technology forums like Towards Data Science and Medium.
Additionally, we can get some practical examples in recently published research papers and textbooks in data science.
What Are the Skills Required for Data Scientists?
Data scientists play an important role in the data science process because they work on the data end to end. Working on a data science case study calls for several skills: a good grasp of data science fundamentals, deep knowledge of statistics, excellent programming skills in Python or R, experience with data manipulation and analysis, the ability to create compelling data visualizations, and solid knowledge of big data, machine learning, and deep learning concepts for model building and deployment. Beyond these technical skills, data scientists also need to be good storytellers with analytical minds and strong communication skills.
Conclusion
These were some interesting data science case studies across different industries. There are many more domains where data science has exciting applications, such as education, where data can be used to monitor student and instructor performance and to develop innovative curricula in sync with industry expectations.
Almost all companies looking to leverage the power of big data begin with a SWOT analysis to narrow down the problems they intend to solve with data science. They then assess their competitors in order to develop relevant data science tools and strategies to address those challenges. This approach allows them to differentiate themselves and offer something unique to their customers.
With data science, companies have become smarter and more data-driven, bringing about tremendous growth; it has also made these organizations more sustainable. The utility of data science across sectors is clearly visible, yet much is left to be explored and more is yet to come. Data science will continue to boost the performance of organizations in this age of big data.

Devashree Madhugiri
Devashree holds an M.Eng degree in Information Technology from Germany and a background in Data Science. She likes working with statistics and discovering hidden insights in varied datasets to create stunning dashboards. She enjoys sharing her knowledge in AI by writing technical articles on various technological platforms. She loves traveling, reading fiction, solving Sudoku puzzles, and participating in coding competitions in her leisure time.
Frequently Asked Questions (FAQs)
A data science case study requires a systematic and organized approach to solving the problem. Generally, four main steps are needed to tackle any data science case study:
- Define the problem statement and the strategy to solve it
- Gather and pre-process the data, making relevant assumptions
- Select tools and appropriate algorithms to build machine learning or deep learning models
- Make predictions, accept or reject the solution based on evaluation metrics, and improve the model if necessary
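The four steps above can be sketched end to end on a toy problem. Everything here is hypothetical (invented data, and a nearest-mean classifier standing in for a real algorithm); it only illustrates how the steps connect.

```python
# 1. Define the problem: classify a transaction as "high" or "low" value.

# 2. Gather / pre-process: toy labelled data (amount, label).
train = [(10, "low"), (12, "low"), (90, "high"), (95, "high")]

# 3. Select an "algorithm": here, the mean amount per class,
#    predicting whichever class mean is closest.
means = {}
for label in {"low", "high"}:
    vals = [x for x, lbl in train if lbl == label]
    means[label] = sum(vals) / len(vals)

def predict(x):
    return min(means, key=lambda label: abs(x - means[label]))

# 4. Predict and evaluate on held-out points; iterate if accuracy is poor.
test = [(11, "low"), (88, "high")]
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)  # -> 1.0 on this toy data
```

A real case study swaps each piece for something heavier (actual data collection, feature engineering, a proper model, cross-validated metrics), but the loop of define, gather, model, evaluate stays the same.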
Getting data for a case study starts with a reasonable understanding of the problem, which clarifies what we expect the dataset to include. Finding relevant data takes some effort: although it is possible to collect data using traditional techniques like surveys and questionnaires, good-quality datasets are also available online on platforms such as Kaggle, the UCI Machine Learning Repository, Azure Open Datasets, government open-data portals, Google Public Datasets, and data.world.
Data science projects involve multiple steps to process the data and deliver valuable insights: defining the problem statement, gathering the data needed to solve it, data pre-processing, data exploration and analysis, algorithm selection, model building, prediction, model optimization, and communicating the results through dashboards and reports.
