
SQL practice on a sample IMDB database.

BhargavTumu/SQL-Practice-Questions


This repository contains a sample IMDB database along with a set of SQL questions and their solutions. Working through them is a quick SQL refresher, since the questions cover the most commonly used scenarios.

SQL is one of the most important and most underrated skills for any data scientist or machine learning practitioner, as we spend roughly 75% of our time cleaning and analyzing data.

One of the challenges with this database is that the data is not completely clean, so please inspect it before writing your queries.

Please refer to the solutions if you get stuck anywhere. I have tried to explain the thought process behind each solution; hopefully that helps, and it is even better if you can come up with a more efficient approach.

Please create a pull request if you find a more efficient solution or if any of the solutions need correcting.

List of files :

  • Database Schema Diagram - Provides a schematic of all the tables in the database and their relationships.
  • Db-IMDB.db - Sample IMDB database that we will be using.
  • Questions.pdf - Provides the list of questions to be solved.
  • Solutions-Jupyter notebook.ipynb - IPython notebook with all the solutions.
  • Solutions-PDF.pdf - PDF version of the IPython notebook.
  • Solution.sql - List of all the solutions saved as a SQL file.

We will use the Python pandas library in an IPython notebook to connect to the given database and run our SQL queries. The installation process and how to run queries using pandas are described below.

Required Software :

  • Python 3 - Please install the latest version of Python 3 from here . At the end of the installation, don't forget to tick "Add Python to PATH".
  • Anaconda - Anaconda is an open source distribution of Python that bundles the frequently used Python packages we need. Install it from here . Please choose the Python 3 version.

That's it, you should now have all the software you need.

If you have never used a Jupyter notebook, don't worry, it's pretty straightforward; you can find a quick overview here .

Steps to connect to the database using pandas :

  • Create a new Jupyter notebook, preferably in the same folder where you put the Db-IMDB.db file.
  • Import a couple of libraries: pandas and sqlite3.
  • Make a connection to the sample IMDB database.
  • Once we have the connection, we can use pandas to run SQL queries and see the results. A good first query lists all the tables in the database.
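The connection steps above can be sketched as follows. This is a minimal illustration, not the notebook's own code: it uses only the standard-library sqlite3 module so it runs anywhere, the helper name list_tables and the Movie table are made up for the demo, and an in-memory database stands in for Db-IMDB.db. In the notebook, pd.read_sql_query(sql, conn) accepts the same connection and returns the rows as a DataFrame.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# In the notebook you would connect to the real file instead:
#   conn = sqlite3.connect("Db-IMDB.db")
# An in-memory database with one hypothetical table keeps this sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movie (MID TEXT, title TEXT, year TEXT)")

tables = list_tables(conn)
print(tables)
```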

Contributions :

The sample IMDB database and the questions are provided by the Applied AI Team as part of their machine learning course.


SOLVING QUERIES ON IMDB DATASET USING SQL

select   (select count(*) from directors) as directors_rows_count,   (select count(*) from directors_genres) as director_genres_rows_count,   (select count(*) from movies) as movies_rows_count,   (select count(*) from movies_directors) as movie_directors_rows_count,   (select count(*) from roles) as roles_rows_count;


select * from directors;


-- Count the NULL values in each column: each CASE yields 1 for a NULL and 0 otherwise, and sum() adds them up.
select sum(case when id is null then 1 else 0 end) as id,
  sum(case when name is null then 1 else 0 end) as name,
  sum(case when year is null then 1 else 0 end) as year,
  sum(case when rankscore is null then 1 else 0 end) as rankscore
from movies;

select name as movie, year from movies order by year;

select year,count(name) from movies group by year order by year;


select year,count(name) as movies_count from movies group by year order by movies_count desc limit 10;


select movies.name, movies.rankscore from movies inner join (select max(rankscore) as maxrank from movies) movies2 on movies.rankscore = movies2.maxrank;

create temporary table table1 (select md.director_id, md.movie_id , dr.director_name ,mv.movie_name from movies_directors as md inner join   (select id as director_id,concat(first_name," ",last_name) as director_name from directors) dr   on md.director_id=dr.director_id   inner join (select id as movie_id, name as movie_name from movies) mv on md.movie_id = mv.movie_id);   select * from table1;


select director_name , count(movie_name) as movies_count from table1 group by director_name order by movies_count desc limit 20 ;


select genre, count(movie_id) as movies_count from movies_genres group by genre order by movies_count desc;


with directors_with_most_genres as (select director_id, count(genre) as genre_count from directors_genres group by director_id)
select concat(directors.first_name, ' ', directors.last_name) as director_name, directors.id as directors_id, directors_with_most_genres.genre_count
from directors inner join directors_with_most_genres on directors.id = directors_with_most_genres.director_id
order by directors_with_most_genres.genre_count desc;

with rolescount as   (select movie_id,count(role) as number_of_roles from roles group by movie_id )   select movies.name, rolescount.number_of_roles from movies inner join rolescount   on movies.id=rolescount.movie_id order by number_of_roles desc;


select * from movies where name like 'An%' and rankscore > 9;

select * from movies where name like 'Fig%ub' and length(name) = 10 and year = 1999;
-- Note: a space (" ") also counts as a character.

select * from movies where year between 1800 and 2000 and rankscore > 9.5 order by year;

select lower(name), rankscore, year from movies where lower(name) in ('top gun', 'blade runner', 'border');

IMDB-Data-Analysis-in-SQL

This project answers a set of analytical questions in order to advise a movie production house on which actors, directors, and production houses would be the best fit for a super hit commercial movie.


Table of Content (TOC)

  • Database Creation for the Project
  • Table Creation
  • Data Insertion

Data Analysis

  • EXECUTIVE SUMMARY AND RECOMMENDATIONS

1. Overview

This analysis is carried out to support RSVP Movies with a well-analyzed list of global stars to plan a movie for the global audience in 2022.

With this, we will be able to answer a set of analytical questions to suggest RSVP Production House on which set of actors, directors, and production houses would be the best fit for a super hit commercial movie.

IMDB Data Analysis in MySQL

RSVP Movies is an Indian film production company that has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.

Why this Analysis?

The production company wants to plan its every move analytically, based on data, and has approached us for help with this new project.

We have been provided with the data of the movies that have been released in the past three years. Let’s analyze the data set and draw meaningful insights that can help them start their new project.

We will use SQL to analyze the given data and give recommendations to RSVP Movies based on the insights.

We will carry out the entire analytics process in four segments, where each segment leads to significant insights from different combinations of tables.

2. Database Creation for the Project

a. Check the List of Databases

  • The very first step of any MySQL analysis is to access the database and check if related data is available or not.
  • Use show databases; to access the list of databases:

b. Create Database

  • Create a new database for this project.
  • Use create database IMDB;
  • Use show databases; to confirm the list of databases:

c. Use Database

  • Instruct the system to use *IMDB Database* by running use imdb;

3. Table Creation

Steps to follow before creating the tables:

  • Download the IMDb dataset and try to understand every table and its importance.
  • Understand the ERD and the table details. Study them carefully and understand the relationships between the tables.


  • Inspect each table given in the subsequent tabs and understand the features associated with each of them.
  • Draft your tables with the correct data types and constraints on paper or in a note file.
  • Open your MySQL Workbench and start writing the DDL and DML commands to create the database.

Create Table

For this project we need a total of 6 tables:

a. Create Table Movie

b. Create Table genre

c. Create Table director_mapping

| Table Name: director_mapping | Column Description |
| ----------- | ----------- |
| movie_id | Movie Id of the movie directed by a director |
| name_id | Name ID of the director |

d. Create Table role_mapping

e. Create Table names

f. Create Table ratings

Now, Run show tables; to ensure that all the six tables are created.

4. Data Insertion

In the previous steps, we created six tables. Now we will insert the data into these tables. Here we show the syntax for inserting 5 rows into each table. (The complete data insertion syntax is available in the repository.)

a. Inserting data into Movie Table

b. Inserting data into genre Table

c. Inserting data into director_mapping Table

d. Inserting data into role_mapping Table

e. Inserting data into names Table

f. Inserting data into ratings Table

Checking tables for inserted values:

Select * from Movie;

Select * from Genre;

Select * from Director_Mapping;

Select * from Role_Mapping;

Select * from Names;

Select * from Ratings;

All the sample data inserted looks good, so we can go ahead and insert the complete data. For the insertion to work smoothly, let's first remove all rows from the tables using TRUNCATE :

Insert Complete data

Run the command to insert complete data: IMDB File 3 Insert all data

1. Find the total number of rows in each table of the schema?

Alternative 1:

Number of Rows after ignoring the Null Rows

Alternative 2:

Rows count inclusive of Null Rows:

| Tables_in_imdb | Rows |
| ----------- | ----------- |
| director_mapping | 3867 |
| genre | 14662 |
| movie | 8519 |
| names | 23714 |
| ratings | 8230 |
| role_mapping | 15173 |

2. Which columns in the movie table have null values?

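| id_null | title_null | year_null | date_null | duration_null | country_null | world_null | language_null | production_null |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0 | 0 | 0 | 0 | 0 | 20 | 3724 | 194 | 528 |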
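A result of this shape can be produced by summing one CASE expression per column. The sketch below is an illustration under assumptions, not the project's original query: it runs against an in-memory SQLite database from Python rather than MySQL, the movie table is cut down to four columns whose names are assumed, and the three rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, assumed version of the movie table with made-up rows.
conn.execute("CREATE TABLE movie (id TEXT, title TEXT, country TEXT, languages TEXT)")
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("m1", "A", "India", "Hindi"),
    ("m2", "B", None, "English"),
    ("m3", "C", None, None),
])

# Each CASE yields 1 for a NULL and 0 otherwise; SUM adds them up per column.
null_counts = conn.execute("""
    SELECT SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END)        AS id_null,
           SUM(CASE WHEN title IS NULL THEN 1 ELSE 0 END)     AS title_null,
           SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END)   AS country_null,
           SUM(CASE WHEN languages IS NULL THEN 1 ELSE 0 END) AS languages_null
    FROM movie
""").fetchone()
print(null_counts)
```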

3.1. Find the total number of movies released each year?

Movies per year:

3.2. Find the total number of movies released each month

Movies per month:

4.1 Find the count of Indian movies.

4.2 Find the count of movies from USA

4.3 Find the count of movies which are either from India or USA

4.4 Find the count of movies that are either from India or USA and released in 2019.

5. Find the unique list of the genres present in the data set

6.1 Find the movies count for each genre.

6.2 Find the genre with the maximum number of movies.

6.3 Find the genre with minimum number of movies.

6.4 Find the top-3 genre with the maximum number of movies.

6.4 Find the movies count for Action genre.

6.5 Find the genre count for each movie.

6.6 Find the list of Indian movies that belong to 3 genres.

6.7 Longest Indian movie tagged with 3 genres.

‘tt6200656’, ‘Kammara Sambhavam’, ‘182’, ‘3’

6.8 Which genres are tagged with ‘Kammara Sambhavam’ movie.

genre Action Comedy Drama

7.1. How many movies belong to only one genre?

Create a list of Movies with a genre count
Restrict the list to Genre count = 1
Count the total number of rows
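The three steps above map directly onto a CTE followed by a filter and a count. The following is a sketch under assumptions rather than the project's own query: it runs in SQLite from Python, the genre table layout (movie_id, genre) is assumed, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genre (movie_id TEXT, genre TEXT)")
conn.executemany("INSERT INTO genre VALUES (?, ?)", [
    ("m1", "Drama"),
    ("m2", "Drama"), ("m2", "Comedy"),
    ("m3", "Thriller"),
    ("m4", "Action"), ("m4", "Comedy"), ("m4", "Drama"),
])

single_genre_movies = conn.execute("""
    WITH genre_counts AS (              -- step 1: movies with a genre count
        SELECT movie_id, COUNT(genre) AS genre_count
        FROM genre
        GROUP BY movie_id
    )
    SELECT COUNT(*)                     -- step 3: count the remaining rows
    FROM genre_counts
    WHERE genre_count = 1               -- step 2: keep genre count = 1
""").fetchone()[0]
print(single_genre_movies)
```

Changing the filter to genre_count = 2 or 3 answers the two- and three-genre variants of the question.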

7.2. How many movies belong to two genres?

7.3. How many movies belong to three genres?

8.1. What is the average duration of movies in each genre?

8.2. Rank the genre by the average duration of movies in each genre.

9. What is the rank of the ‘thriller’ genre of movies among all the genres in terms of the number of movies produced?

10. Find the minimum and maximum values in each column of the rating table except the movie_id column

11. Which are the top 10 movies based on average rating?

12. Summarize the ratings table based on the movie counts by median ratings.

13. Which production house has produced the most number of hit movies (average rating > 8)?

Create a list of production houses with their count of movies where average rating > 8, ranked by movie count
Use a CTE to pull the production house with rank = 1
NOTE: the filter (production_company IS NOT NULL) is applied because a few movies have no production house name
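As a sketch of this approach (not the project's original query), the CTEs below count hit movies per production house, rank the counts, and keep rank 1. The assumptions are loud ones: SQLite run from Python instead of MySQL, window functions available (SQLite 3.25 or later), assumed column names on movie and ratings, and made-up rows and studio names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id TEXT, production_company TEXT);
    CREATE TABLE ratings (movie_id TEXT, avg_rating REAL);
""")
conn.executemany("INSERT INTO movie VALUES (?, ?)", [
    ("m1", "Studio P"), ("m2", "Studio P"), ("m3", "Studio Q"),
    ("m4", None), ("m5", "Studio Q"),
])
conn.executemany("INSERT INTO ratings VALUES (?, ?)", [
    ("m1", 8.5), ("m2", 9.0), ("m3", 8.2), ("m4", 9.1), ("m5", 7.0),
])

top_production_houses = conn.execute("""
    WITH hit_counts AS (
        SELECT m.production_company, COUNT(*) AS movie_count
        FROM movie AS m
        JOIN ratings AS r ON r.movie_id = m.id
        WHERE r.avg_rating > 8
          AND m.production_company IS NOT NULL   -- skip movies with no studio
        GROUP BY m.production_company
    ),
    ranked AS (
        SELECT production_company, movie_count,
               RANK() OVER (ORDER BY movie_count DESC) AS prod_rank
        FROM hit_counts
    )
    SELECT production_company, movie_count
    FROM ranked
    WHERE prod_rank = 1
""").fetchall()
print(top_production_houses)
```

RANK() is used instead of LIMIT 1 so that production houses tied for first place would all be returned.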

14. How many movies released in each genre during March 2017 in the USA had more than 1,000 votes?

15. Find movies of each genre that start with the word ‘The’ and which have an average rating > 8

16. Of the movies released between 1 April 2018 and 1 April 2019, how many were given a median rating of 8?

17. Do German movies get more votes than Italian movies?

18. Which columns in the names table have null values?

19. Who are the top three directors in the top three genres whose movies have an average rating > 8?

Pull the Top three Genre by Movie count where avg_rating > 8

Pull the Directors with Movie count where avg_rating > 8

Keeping “top_3_genres” as CTE, restrict the 2nd code to avg_rating > 8 and directors of top_3_genre

Trying Row_Number() function:

20. Who are the top two actors whose movies have a median rating >= 8?

21. Which are the top three production houses based on the number of votes received by their movies?

22. Rank actors with movies released in India based on their average ratings. Which actor is at the top of the list?

– Note: The actor should have acted in at least five Indian movies.

Alternative 1 (using Rank window function):
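A possible shape for the window-function variant is sketched below. It is not the project's query: it uses a plain AVG of avg_rating (the source does not say whether ratings are weighted by votes), assumes the column names shown, runs on SQLite 3.25 or later from Python, and the three actors are invented so that one of them falls below the five-movie threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id TEXT, country TEXT);
    CREATE TABLE ratings (movie_id TEXT, avg_rating REAL);
    CREATE TABLE names (id TEXT, name TEXT);
    CREATE TABLE role_mapping (movie_id TEXT, name_id TEXT, category TEXT);
""")

# Made-up cast: name -> (number of Indian movies, rating given to each movie).
cast = {"Actor A": (5, 8.0), "Actor B": (5, 6.0), "Actor C": (4, 9.0)}
for n, (name, (film_count, rating)) in enumerate(cast.items()):
    conn.execute("INSERT INTO names VALUES (?, ?)", (f"n{n}", name))
    for k in range(film_count):
        movie_id = f"m{n}_{k}"
        conn.execute("INSERT INTO movie VALUES (?, 'India')", (movie_id,))
        conn.execute("INSERT INTO ratings VALUES (?, ?)", (movie_id, rating))
        conn.execute("INSERT INTO role_mapping VALUES (?, ?, 'actor')",
                     (movie_id, f"n{n}"))

ranked_actors = conn.execute("""
    WITH actor_stats AS (
        SELECT n.name AS actor_name,
               COUNT(*) AS movie_count,
               AVG(r.avg_rating) AS actor_avg_rating
        FROM role_mapping AS rm
        JOIN names AS n ON n.id = rm.name_id
        JOIN movie AS m ON m.id = rm.movie_id
        JOIN ratings AS r ON r.movie_id = rm.movie_id
        WHERE rm.category = 'actor'
          AND m.country LIKE '%India%'
        GROUP BY n.id, n.name
        HAVING COUNT(*) >= 5            -- at least five Indian movies
    )
    SELECT actor_name, movie_count, actor_avg_rating,
           RANK() OVER (ORDER BY actor_avg_rating DESC) AS actor_rank
    FROM actor_stats
    ORDER BY actor_rank
""").fetchall()
print(ranked_actors)
```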

Alternative 2 (using CTE):

23. Find out the top five actresses in Hindi movies released in India based on their average ratings.

– Note: The actresses should have acted in at least three Indian movies.

24. Select thriller movies as per avg rating and classify them in the following category:

Rating > 8: Superhit movies
Rating between 7 and 8: Hit movies
Rating between 5 and 7: One-time-watch movies
Rating < 5: Flop movies
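The four bands translate naturally into a CASE expression. The sketch below is an illustration, not the project's query: table and column names are assumed, the rows are made up, and it runs in SQLite from Python. Because the stated bands overlap at exactly 7, the sketch makes the assumption that a rating of 7 counts as a hit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id TEXT, title TEXT);
    CREATE TABLE genre (movie_id TEXT, genre TEXT);
    CREATE TABLE ratings (movie_id TEXT, avg_rating REAL);
""")
conn.executemany("INSERT INTO movie VALUES (?, ?)", [
    ("t1", "Demo One"), ("t2", "Demo Two"), ("t3", "Demo Three"),
    ("t4", "Demo Four"), ("d1", "Demo Drama"),
])
conn.executemany("INSERT INTO genre VALUES (?, ?)", [
    ("t1", "Thriller"), ("t2", "Thriller"), ("t3", "Thriller"),
    ("t4", "Thriller"), ("d1", "Drama"),
])
conn.executemany("INSERT INTO ratings VALUES (?, ?)", [
    ("t1", 8.4), ("t2", 7.0), ("t3", 5.5), ("t4", 3.9), ("d1", 9.0),
])

thriller_categories = dict(conn.execute("""
    SELECT m.title,
           CASE
               WHEN r.avg_rating > 8  THEN 'Superhit movies'
               WHEN r.avg_rating >= 7 THEN 'Hit movies'            -- assumption: 7 is a hit
               WHEN r.avg_rating >= 5 THEN 'One-time-watch movies'
               ELSE 'Flop movies'
           END AS category
    FROM movie AS m
    JOIN genre AS g ON g.movie_id = m.id
    JOIN ratings AS r ON r.movie_id = m.id
    WHERE g.genre = 'Thriller'
""").fetchall())
print(thriller_categories)
```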


EXECUTIVE SUMMARY AND RECOMMENDATIONS

1. Insights

Based on 7,997 movies released and recorded on IMDB between 2017 and 2019, a summary of audience interest and recommendations follows:

  • Average Duration: 103.89359 minutes
  • Total number of Actors: 12611 (7445 actors and 5166 actresses)

1. Year and Month wise Movie Release Pattern:

  • A year-wise record of movies shows a decrease in the number of movies, from 3052 in 2017 to 2001 in 2019.
  • The most movies were released in March, followed by September, October, and January. More interesting is that the fewest movies were released in the mid-year and end-of-year months, possibly because more people prefer vacation and family time at those times of year.

2. Geographical Region Distribution

  • USA and India together produced 1059 movies in 2019 alone, just over half of the 2001 movies released that year.

3. Genre Popularity

  • Movies were tagged with genre tags as Drama, Fantasy, Thriller, Comedy, Horror, Family, Romance, Adventure, Action, Sci-Fi, Crime, and Mystery.
  • Drama is the most popular genre, with 4285 tags across the three years, followed by Comedy and Thriller.
  • There were 3289 movies with only one genre tag, while the rest were tagged with multiple genres.

4. The average duration of movies is around 103.89 minutes, and the genre-wise averages hover around the same figure.

5. Top Production Houses

  • Marvel Studios leads the production house category with 551245 votes received by the movies it has produced, followed by Syncopy and New Line Cinema.
  • Star Cinema and Twentieth Century Fox are the top 2 multilingual production houses based on the number of superhit movies.

6. Top Director

  • James Mangold has delivered the most superhit movies, followed by Soubin Shahir, Joe Russo, and Anthony Russo.
  • A.L. Vijay, Andrew Jones, and Chris Stokes are the top directors based on number of movies.

7. Top Actors and Actresses

  • Mammootty, with 8 superhit movies, is the most successful actor, followed by Mohanlal with 5 superhits.
  • Quite a few actors have 4 superhit movies to their name, including Amrinder Gill, Amit Sadh, Johnny Yong Bosch, Tovino Thomas, Dulquer Salmaan, Siddique, Rajkummar Rao, Fahadh Faasil, Pankaj Tripathi, Dileesh Pothan, Joju George, and Ayushmann Khurrana.
  • Vijay Sethupathi, Fahadh Faasil, and Yogi Babu are the top three Indian actors who have acted in at least five movies.
  • Taapsee Pannu, Divya Dutta, and Kriti Kharbanda are the top three Hindi-speaking actresses who have acted in at least three movies.
  • Parvathy Thiruvothu, Susan Brown, and Amanda Lawrence are the best rated actresses in the Drama genre.

8. Top-10 movies based on average rating are: Kirket, Love in Kilnerry, Gini Helida Kathe, Runam, Fan, Android Kunjappan Version 5.25, Yeh Suhaagraat Impossible, Safe, The Brighton Miracle, and Shibu

  • Based on median rating counts, most of the movies are rated between 5 and 8 and fall under the hit movie categories.

9. Top Grossing Movies

The highest-grossing movies of each year are:

i. Thank You for Your Service, a comedy movie released in 2017

ii. The Villain, a thriller movie released in 2018

iii. Joker, a drama movie released in 2019

2. Recommendation:

Based on these insights, the recommendations for RSVP are as follows:

  • Concentrate on multi-genre drama-comedy movies with a pinch of thriller, keeping an average duration of around 104 minutes.
  • Plan to release the movie between January and March. Focus on multilingual movies which can be launched in India and the USA as the preferred audience markets.
  • Rope in either Star Cinema or Twentieth Century Fox as the production house, under the direction of James Mangold with assistance from A.L. Vijay.
  • Mammootty and Mohanlal can be the lead actors, supported by other side actors. Including Vijay Sethupathi would add star power to the movie's promotion.
  • Parvathy Thiruvothu is one of the best rated drama actresses and should be brought in.

Use SQL on a Movie Database to Decide What to Watch


We’ll demonstrate how to use SQL to parse large datasets and gain valuable insights, in this case, to help you choose what movie to watch next using an IMDb dataset.

In this article, we’ll be downloading a dataset directory from IMDb. Not sure what to watch tonight? Are you browsing Netflix endlessly? Decide what to watch using the power of SQL! We’ll be loading an existing movie IMDb dataset into SQL. We’ll analyze the data in different ways like sorting movies by their rating, by what actors star in the movie, or by other similar criteria.

As mentioned in this blog post on how to practice SQL , the best way to practice SQL is by gaining hands-on experience in solving real-world problems, which is exactly what we’ll be doing.

If you have a basic knowledge of SQL, you should be able to follow this article easily. If you have no IT experience whatsoever, consider starting with this SQL A to Z Learning Track designed for people who have no experience in IT and want to start their adventure with SQL.

Let’s get started by learning how to get the movie data into our SQL database.

Completing the SQL Movie Database Download

Let’s walk through the process of downloading our data and loading it into a database management system (DBMS), step by step. Common DBMSs include MySQL, Oracle DB, PostgreSQL, and SQL Server.

Although this article focuses on movie data, you can choose an entirely different dataset. Check out this list of free online datasets you can use and find the one you are interested in. The import of these datasets will be similar regardless of what dataset you use.

Open whatever variety of SQL you are using. For this example, I’ll be using SQL Server Management Studio, but the steps should be similar for all of the other varieties of SQL out there. Let’s get started:

  • The dataset files can be accessed and downloaded from https://datasets.imdbws.com/ . The data is refreshed daily.
  • basics.tsv.gz
  • akas.tsv.gz
  • crew.tsv.gz
  • episode.tsv.gz
  • principals.tsv.gz
  • ratings.tsv.gz
  • Extract the downloaded .gz archives. The end result will be a TSV (tab-separated) file for each table.
  • Open each file in a spreadsheet application like Google Sheets or Microsoft Excel.
  • Find and replace all occurrences of “\N” with an empty cell.
  • Save the file as a CSV file. This will make it easier to import into the DBMS of your choice.
  • Open your DBMS.
  • Create a new schema or table by right-clicking on the left pane and selecting “New Database.” I’ve named my new database “imdb.”


  • Set valid data types for each column you are importing. I recommend using nvarchar(MAX) for string columns, since you do not know how long the strings will be for each field. You can change the column datatype later if required.


  • Repeat this process for each of the files you have downloaded.

After completing these steps, your SQL movie database will be in place! You are now ready to start analyzing and querying the data.
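If opening the larger files in a spreadsheet is impractical, the "\N" replacement and CSV conversion can also be scripted. The sketch below is an alternative to the spreadsheet steps, not part of the original walkthrough; the function name tsv_to_csv and the sample row are made up for illustration.

```python
import csv
import io


def tsv_to_csv(tsv_file, csv_file):
    """Copy tab-separated rows to CSV, turning the "\\N" null marker into an empty cell."""
    # QUOTE_NONE: treat quote characters in titles as ordinary text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(csv_file)
    for row in reader:
        writer.writerow(["" if cell == r"\N" else cell for cell in row])


# Demo on an in-memory sample. With real files, open both paths with
# newline="" and encoding="utf-8" and pass the file objects instead.
sample = "tconst\ttitleType\tendYear\nttDEMO1\tmovie\t\\N\n"
output = io.StringIO()
tsv_to_csv(io.StringIO(sample), output)
converted = output.getvalue()
print(converted)
```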

SQL Exercises on a Movie Database

Thankfully, this dataset came with some descriptive documentation . To get an even better idea of the data, you can quickly select the top 1000 rows from each table.

Let’s start looking for our first movie. Imagine you want to watch a horror movie. How can we isolate only the horror movies? Fortunately, this task is frighteningly simple.

If this query causes any confusion, open this SQL cheat sheet to refresh your knowledge. Have this cheat sheet open for the rest of the tutorial to help you along!

What if we wanted to refine this horror movie list further? We could restrict the results to horror movies created after 1990, with an average rating above 9.0 and at least 10,000 votes.

This will involve getting data from multiple tables. Opening each table and taking a look at the column headers, we can see the following tables will be involved:

  • title_basics : handles the genre of movie and the release year (represented by the column startYear ).
  • title_ratings : handles the rating ( averageRating ) and votes ( numVotes ).

The two tables can be joined on the shared column, tconst . As explained in the IMDb documentation here , tconst is an alphanumeric unique identifier of the title. Let’s write our query:

Executing this query returns a single result, but not the result we want! On closer inspection, we can see that this title is a video game, not a movie. Let’s alter our query to include only movies, and expand the search by reducing the minimum number of votes required to 1,000 and the minimum rating required to 8.0.

Executing this query also yields a single result! Looks like we won’t have to decide what to watch anymore, since there’s only one option that fits our criteria!
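The two searches described above can be sketched as one parameterised query. This is an approximation rather than the article's original queries, which are not reproduced here: it runs on an in-memory SQLite database from Python instead of SQL Server, stores startYear, averageRating and numVotes as numbers (columns imported as nvarchar would need casting first), and uses three made-up rows. The column names follow the IMDb dataset documentation; the helper find_horror is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE title_basics (tconst TEXT, titleType TEXT, primaryTitle TEXT,
                               startYear INTEGER, genres TEXT);
    CREATE TABLE title_ratings (tconst TEXT, averageRating REAL, numVotes INTEGER);
""")
conn.executemany("INSERT INTO title_basics VALUES (?, ?, ?, ?, ?)", [
    ("ttDEMO1", "videoGame", "Demo Horror Game", 2013, "Action,Horror"),
    ("ttDEMO2", "movie", "Demo Horror Film", 1999, "Horror"),
    ("ttDEMO3", "movie", "Demo Drama", 2005, "Drama"),
])
conn.executemany("INSERT INTO title_ratings VALUES (?, ?, ?)", [
    ("ttDEMO1", 9.4, 50000), ("ttDEMO2", 8.1, 2000), ("ttDEMO3", 9.3, 90000),
])


def find_horror(conn, min_rating, min_votes, title_type=None):
    """Horror titles after 1990 above the given rating and vote thresholds."""
    sql = """
        SELECT b.primaryTitle
        FROM title_basics AS b
        JOIN title_ratings AS r ON r.tconst = b.tconst
        WHERE b.genres LIKE '%Horror%'
          AND b.startYear > 1990
          AND r.averageRating > :min_rating
          AND r.numVotes >= :min_votes
          AND (:title_type IS NULL OR b.titleType = :title_type)
        ORDER BY b.primaryTitle
    """
    params = {"min_rating": min_rating, "min_votes": min_votes,
              "title_type": title_type}
    return [title for (title,) in conn.execute(sql, params)]


first_search = find_horror(conn, 9.0, 10000)            # any title type
second_search = find_horror(conn, 8.0, 1000, "movie")   # movies only
print(first_search, second_search)
```

With the made-up rows, the first call finds only the video game and the second finds only the film, mirroring the situation described in the text.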

Finding all the Movies for a Given Director

Let’s run through another scenario. What if we want to see all of the movies Steven Spielberg has directed? How would this work?

By looking through the tables, we can determine the following:

  • name_basics : It contains the names of all actors, writers, directors, and others involved in the creation of film and TV titles.
  • title_crew : It acts as a linking table for titles, directors, and writers. We’ll use this table to connect Steven Spielberg to the titles he’s involved with.
  • title_basics : We have already used this table. It contains title information like name, type, release year, genres, etc.

Let’s get to work! Let’s write a query for the name_basics table to try and find the famous director Steven Spielberg.

Executing this query yields a single result:

This gives us the important value of nconst . From the documentation, we know that nconst is the alphanumeric unique identifier of the name/person.

We can feed this value into the title_crew table, which contains the director and writer information for all the titles in IMDb, and match Steven Spielberg to all the titles he’s involved with.

Executing this query results in a list of 45 titles. You can see from the value of the directors column that Steven Spielberg was the director of them all.

We need a way of using this list of titles alongside the title_basics table to get the name of the movies instead of just the tconst. Let’s use a subquery for this!
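One way to write such a subquery is sketched below. It is not the article's original query: it runs in SQLite from Python, looks the director up by name instead of pasting in the nconst value, and uses invented identifiers and titles. Because the directors column holds a comma-separated list of nconst values, the sketch wraps both sides in commas so that one identifier cannot match as a prefix of another. The helper titles_directed_by is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE name_basics (nconst TEXT, primaryName TEXT);
    CREATE TABLE title_crew (tconst TEXT, directors TEXT, writers TEXT);
    CREATE TABLE title_basics (tconst TEXT, primaryTitle TEXT);
""")
# Made-up identifiers and names, for illustration only.
conn.executemany("INSERT INTO name_basics VALUES (?, ?)", [
    ("nmDEMO1", "Demo Director"), ("nmDEMO10", "Other Person"),
])
conn.executemany("INSERT INTO title_crew VALUES (?, ?, ?)", [
    ("ttDEMO1", "nmDEMO1", None),
    ("ttDEMO2", "nmDEMO10,nmDEMO1", None),
    ("ttDEMO3", "nmDEMO10", None),
])
conn.executemany("INSERT INTO title_basics VALUES (?, ?)", [
    ("ttDEMO1", "Film One"), ("ttDEMO2", "Film Two"), ("ttDEMO3", "Film Three"),
])


def titles_directed_by(conn, director_name):
    """Titles whose directors list contains the person with this name."""
    sql = """
        SELECT primaryTitle
        FROM title_basics
        WHERE tconst IN (
            SELECT tconst
            FROM title_crew
            WHERE ',' || directors || ',' LIKE
                  '%,' || (SELECT nconst FROM name_basics
                           WHERE primaryName = ?) || ',%'
        )
        ORDER BY primaryTitle
    """
    return [title for (title,) in conn.execute(sql, (director_name,))]


# With the real dataset loaded: titles_directed_by(conn, "Steven Spielberg")
directed = titles_directed_by(conn, "Demo Director")
print(directed)
```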

Execute this query to see the result:

There we have it, all of the Steven Spielberg movie titles from our database!

Don’t stop here! Write your own custom queries to extract more insights from this large dataset. There are many ways to practice SQL. If you feel like you’ve had enough of working with this dataset, check out this post on 12 Ways to Learn SQL Online for more excellent learning resources.

Using SQL on a Large Existing Movie Database

You have learned how to import large existing datasets into the DBMS of your choice and how to use SQL to analyze a movie database. This is a powerful tool in your SQL arsenal. Not to mention, you’ll never have to worry about not being able to choose a movie to watch again! Completing SQL exercises on movie databases is a helpful way to learn, but if you would like more structure, check out this SQL Practice Set from LearnSQL.com .


Overview - datagrad/IMDB-Data-Analysis-in-SQL GitHub Wiki


IMDb 2: Designing a MySQL database and performing ETL for IMDb dataset using python


Designing the relational database

In this series of blog posts we will present an end-to-end database project using MySQL and the IMDb dataset. This is the second in the series of posts on my database project. In this post we will present the Entity-Relationship ( ER ) and logical schema diagrams for our relational database. We will also discuss the Extract-Transform-Load ( ETL ) tasks we performed using python. Today we will not go into the details of how to design a normalised database, as this can take up a large portion of a university level course on databases. Instead we refer the interested reader to the fantastic book by Ramakrishnan and Gehrke [1] and the incredibly useful set of video lectures for the course CMPSC431W : Database Management Systems taught at Penn State by Yu-San Lin [2]. We will now present our design.

For this project we used yEd to create our Entity-Relationship ( ER ) and logical schema diagrams.

“ yEd is a powerful desktop application that can be used to quickly and effectively generate high-quality diagrams. Create diagrams manually, or import your external data for analysis. Our automatic layout algorithms arrange even large data sets with just the press of a button.” “ yEd is freely available and runs on all major platforms: Windows, Unix/Linux, and macOS.”

We really liked this software. It was very easy to use, with a simple GUI .

Entity-Relationship ( ER ) diagram

The IMDb data as provided is not normalised. We designed the entity-relationship diagram for our IMDb relational database. This was created using yEd and is shown below.

As one can see, it is a reasonably complex database, but not as complex as one may find in production business systems. However, it will suffice for our purposes.

Logical schema

We then normalise our ER diagram and obtain the logical schema illustrated below. Note the following:

New tables were created for multi-valued attributes, such as Title_genres.

We pulled the rating information attributes from the Titles entity, because many titles didn’t have a rating. If we were to store them in the Titles table, then we would have stored many NULL values. Instead we decided to separate this information, by putting it into the table Title_ratings.
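To make the first point concrete, the sketch below turns a multi-valued genres attribute into one row per (title, genre) pair, which is the shape a table like Title_genres needs. It is plain Python written for illustration, not the author's make_Title_genres function (that one works with pandas); the function name split_genres and the sample rows are made up.

```python
def split_genres(title_rows):
    """Turn (title_id, "Genre1,Genre2") rows into one (title_id, genre) row per genre."""
    title_genres = []
    for title_id, genres in title_rows:
        # Titles without genres (empty or the IMDb "\N" marker) produce no rows.
        if genres in (None, "", r"\N"):
            continue
        for genre in genres.split(","):
            title_genres.append((title_id, genre.strip()))
    return title_genres


# Made-up sample rows in the style of the raw title data.
raw_titles = [
    ("ttDEMO1", "Action,Horror"),
    ("ttDEMO2", r"\N"),
    ("ttDEMO3", "Drama"),
]
title_genre_rows = split_genres(raw_titles)
print(title_genre_rows)
```

The same idea explains the second point: a title with no value simply contributes no row to the separate table, so no NULLs need to be stored.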

This diagram was also created using yEd . Note that in this diagram we denote primary keys by PK and foreign keys by FK .

Extract-Transform-Load: Preparing the IMDb data

The ETL part of this project was done using a single python script. The script imdb_converter.py reads in the 7 data files, then cleans and normalises the IMDb data. The desired set of tables is then output as tab-separated-value (tsv) files. It was written to be as modular as possible, with separate functions written to perform specific tasks. The functions that were implemented are:

  • unzip_files(folder)
  • make_Aliases(title_akas)
  • make_Alias_types(title_akas)
  • make_Alias_attributes(title_akas)
  • make_Directors_and_Writers(title_crew)
  • make_Episode_belongs_to(title_episode)
  • make_Names_(name_basics)
  • make_Name_worked_as(name_basics)
  • make_Known_for(name_basics)
  • make_Principals(title_principals)
  • make_Had_role(title_principals)
  • make_Titles(title_basics)
  • make_Title_genres(title_basics)
  • make_Title_ratings(title_ratings)

From the names and the comments their purpose should be clear. The python script is shown below

The python script makes use of the pandas and numpy libraries. Running this python script in the terminal we get the following output:

In this post we designed a relational database to store the IMDb dataset, used yEd to create the ER and logical schema diagrams and used python to perform ETL tasks. In the next post, the third in the series, we will create the database we designed in MySQL using SQL scripts, load the data into the database, add primary and foreign key constraints and index the database. The code and images for this project are shared in the GitHub repository .

Further Reading

R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd edition, McGraw-Hill. Companion site

Yu-San Lin, CMPSC431W: Database Management Systems, Fall 2015 video lectures, Penn State. Course site


COMMENTS

  1. GitHub

    Assignment on IMDB database using sqlite3 and pandas. This repository contains the Db-IMDB database, and its schema is in the db_schema file. The required SQL commands are in the mySql Commands file, which serves as my notes on SQL. The assignment questions are in the sql_questions file and the solutions are in solutions.ipynb.

  2. BhargavTumu/SQL-Practice-Questions

    Steps to connect to the database using pandas: Create a new Jupyter notebook, preferably in the same folder where you put the Db-IMDB.db file. We need to import a couple of libraries, pandas and sqlite3: import pandas as pd; import sqlite3 as sql (sqlite3 is included in the Python standard library). Make a connection to the sample IMDB database.

  3. SOLVING QUERIES ON IMDB DATASET USING SQL

    Use imdb; show tables; select * from information_Schema.columns where table_schema='imdb'; -- this will give us an overview of what's in our dataset and tables. 1. Count the number of rows present in each table. select

  4. IMDB-Data-Analysis-in-SQL

    4. Data Insertion. In the previous steps, we created six tables. Now, we will insert the data into these tables. Here, we will be showing the syntax of 5 rows insertion into each table. (The complete data insertion syntax is available in the Repository) a. Inserting data into Movie Table

  5. SQL Notes

    result-set: a set of rows that form the result of a query along with column-names and meta-data. SELECT name,year FROM movies; SELECT rankscore,name FROM movies; row order same as the one in the table, in case of (*) or order in SELECT query.

  6. Use SQL on a Movie Database to Decide What to Watch

    Open your DBMS. Create a new schema or table by right-clicking on the left pane and selecting "New Database". I've named my new database "imdb". Right-click on the database → Tasks → Import Flat File and follow the Import Wizard to create a table for each file: Set valid data types for each column you are importing.

  7. IMDb Project (SQL)


  8. Exploring IMDb Data Through SQL Queries

    Sep 3, 2023. The IMDb dataset is a treasure trove of information for movie enthusiasts and data analysts alike. In this article, we'll embark on a journey through the IMDb dataset using SQL ...

  9. Overview

    IMDB-Analysis-in-SQL. This analysis is carried out to support RSVP Movies with a well-analyzed list of global stars to plan a movie for the global audience in 2022. With this, we will be able to answer a set of analytical questions to suggest RSVP Production House on which set of actors, directors, and production houses would be the best fit ...

  10. Assignment-22: SQL Assignment on IMDB data


  11. IMDB SQL Data Analysis : PART I

    This is part of my IMDB Data Analysis Project using SQL, this article covers the Database part. Next part is about Data Exploration, solidifying assumptions etc., it's now available at the link ...

  12. IMDb 1: Introduction to the IMDb dataset and our end-to ...

    Introduction to the IMDb dataset. The IMDb dataset consists of 7 compressed tab-separated-value (*.tsv) files, which are explained and available for download from here. The data is refreshed daily, although the data used in this project was obtained on 29/11/2019. Each of these gzipped tab-separated-values (TSV) formatted files in the UTF-8 ...

  13. PDF CS 327E Lab 1: Exploring the IMDB dataset through SQL

    called imdb_tables.sql. Add this file to your git repo. Step 5. Explore the imdb data through SQL. First, preview each table to get a basic understanding of the data. Next, run some queries to understand what the primary key is for each table, and how the tables relate to each other via primary and foreign keys.

  14. Preparing IMDB Movie Review Data for NLP Experiments

    The IMDB movie review data consists of 50,000 reviews -- 25,000 for training and 25,000 for testing. The training and test files are evenly divided into 12,500 positive reviews and 12,500 negative reviews. Negative reviews are those reviews associated with movies that the reviewer rated as 1 through 4 stars.

  15. SQL tutorial using IMDb Database

    Set-like operations using SQL statements. They come in very handy when data needs to be retrieved from two or more databases (say, the current database and a backup or historical database) in a single SQL query. INTERSECT returns the rows common to the SELECT statements; UNION eliminates duplicates whilst UNION ALL does not.

  16. IMDb 2: Designing a MySQL database and performing

    Extract-Transform-Load: Preparing the IMDb data. The ETL part of this project was done using a single Python script. The script imdb_converter.py reads in the 7 data files and cleans and normalises the IMDb data, after which the desired set of tables is output as tab-separated-value (tsv) files. It was written to be as modular as possible with ...

  17. SQL Solutions.ipynb at master · GopiSumanth SQL.pdf

    A decade is a sequence of 10 consecutive years. For example, say in your database you have movie information starting from 1965. Then the first decade is 1965, 1966, ..., 1974; the second one is 1967, 1968, ..., 1976 and so on. Find the decade D with the largest number of films and the total number of films in D.

  18. SQL: Importance and Sample Problems

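The pandas connection steps from item 2 and the decade question from item 17 can be sketched together as follows. This is an illustration under stated assumptions, not one of the repository solutions: an in-memory SQLite database stands in for Db-IMDB.db, the table Movie(title, year) and its sample rows are hypothetical, and year is assumed to be a clean integer. A decade is taken to be any window of 10 consecutive years that starts at a release year present in the data.

```python
import sqlite3 as sql
import pandas as pd

# Assumption: a real notebook would call sql.connect("Db-IMDB.db") instead.
conn = sql.connect(":memory:")
conn.execute("CREATE TABLE Movie (title TEXT, year INTEGER)")  # hypothetical table
sample_years = [1965, 1966, 1970, 1974, 1975, 1975, 1980]
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?)",
    [(f"Film {i}", year) for i, year in enumerate(sample_years)],
)
conn.commit()

# Each distinct release year is a candidate decade start; the self-join counts
# the films released in the 10-year window [start, start + 9].
decade_query = """
SELECT s.year AS decade_start, COUNT(*) AS num_films
FROM (SELECT DISTINCT year FROM Movie) AS s
JOIN Movie AS m ON m.year BETWEEN s.year AND s.year + 9
GROUP BY s.year
ORDER BY num_films DESC, decade_start
LIMIT 1
"""

# pandas runs the query over the open connection and returns a DataFrame.
best_decade = pd.read_sql_query(decade_query, conn)
conn.close()
print(best_decade)
```

With these sample years, the window starting in 1966 (1966 to 1975) contains five films, more than any other window. On the real database the year column may need cleaning before a numeric comparison like this is reliable.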