Using PostgreSQL in Python¶

A database is an organized collection of structured data, typically stored electronically in a computer system.
Despite the recent development of new types of databases (designed to address the increasing volume and variability of data), a considerable amount of data around the world is still stored in what are known as relational databases.
- Relational databases store data as collections of predefined tables with rows and columns that are connected through one or more relationships.
- The standard way to create and manage databases is SQL (Structured Query Language). SQL is the building block for some of the most popular relational databases on the market, such as PostgreSQL, Microsoft SQL Server, MySQL, and SQLite

Reading Material¶

Understanding PostgreSQL¶

PostgreSQL is a lightweight, free, and open-source relational database.
Check PostgreSQL under Database notes for more information

Using PostgreSQL in python¶

Prerequisites¶

Installation of PostgreSQL.
Installation of Python.
Setting up a PostgreSQL database and user.
Installation of psycopg2 library (pip install psycopg2)

Part 1: Connecting to PostgreSQL in Python¶

Importing the Library: import psycopg2

Establishing a Connection:

conn = psycopg2.connect(
    dbname="your_dbname", 
    user="your_username", 
    password="your_password", 
    host="your_host"
)

Creating a Cursor Object: cur = conn.cursor()

In [1]:

Copied!

import psycopg2
import psycopg2

In [2]:

Copied!





try:
    conn = psycopg2.connect(
        dbname="project_movie", 
        user="postgres", 
        password="****", 
        host="localhost"
    )
    cur = conn.cursor()
except Exception as e:
    print(e)
try:
    conn = psycopg2.connect(
        dbname="project_movie", 
        user="postgres", 
        password="****", 
        host="localhost"
    )
    cur = conn.cursor()
except Exception as e:
    print(e)

connection to server at "localhost" (::1), port 5432 failed: FATAL:  password authentication failed for user "postgres"

Part2: Basic Data Manipulation¶

Creating a Table:

cur.execute("CREATE TABLE IF NOT EXISTS test (id SERIAL PRIMARY KEY, data VARCHAR);")

Inserting Data:

cur.execute("INSERT INTO test (data) VALUES (%s)", ("Some data",))

Querying Data:

cur.execute("SELECT * FROM test;")
rows = cur.fetchall()
for row in rows:
    print(row)

Updating Data:

cur.execute("UPDATE test SET data = %s WHERE id = %s", ("New data", 1))

Deleting Data:

cur.execute("DELETE FROM test WHERE id = %s", (1,))

Committing Changes and Closing the Connection:
```
conn.commit()
cur.close()
conn.close()
```

IMDB Top250 Movies Project¶

The Excel file contains the following columns, which represent key attributes of the top 250 movies:

Title: The name of the movie.
Rank: The rank of the movie in the top 250 list.
Length: The length of the movie.
Rating: The IMDb rating of the movie.
Number of Rating: The number of ratings the movie has received.
Year: The release year of the movie.
Category: The category or rating of the movie (e.g., R, PG-13).
Link: A link to the IMDb page of the movie.
Budget: The estimated budget of the movie.
Worldwide Revenue: The worldwide revenue of the movie.
Cast: The main cast of the movie, listed as a string with names separated by semicolons.
Director: The director(s) of the movie.
Writer: The writer(s) of the movie.

In [3]:

Copied!

import pandas as pd
import pandas as pd

In [4]:

Copied!

data = pd.read_excel("./Data/imdb_top250.xlsx",index_col=0)
data.head()
data = pd.read_excel("./Data/imdb_top250.xlsx",index_col=0)
data.head()

Out[4]:

	Title	Rank	Length	Rating	Number of Rating	Year	Category	Link	Budget	Worldwide Revenue	Cast	Director	Writer
0	The Shawshank Redemption	1	2h 22m	9.3	(2.8M)	1994	R	http://www.imdb.com/title/tt0111161/?ref_=chtt...	$25,000,000 (estimated)	$28,884,716	Tim Robbins; Morgan Freeman; Bob Gunton; Willi...	Frank Darabont	Stephen King; Frank Darabont
1	The Godfather	2	2h 55m	9.2	(2M)	1972	R	http://www.imdb.com/title/tt0068646/?ref_=chtt...	$6,000,000 (estimated)	$250,341,816	Marlon Brando; Al Pacino; James Caan; Diane Ke...	Francis Ford Coppola	Mario Puzo; Francis Ford Coppola
2	The Dark Knight	3	2h 32m	9.0	(2.8M)	2008	PG-13	http://www.imdb.com/title/tt0468569/?ref_=chtt...	$185,000,000 (estimated)	$1,029,266,147	Christian Bale; Heath Ledger; Aaron Eckhart; M...	Christopher Nolan	Jonathan Nolan; Christopher Nolan; David S. Goyer
3	The Godfather Part II	4	3h 22m	9.0	(1.3M)	1974	R	http://www.imdb.com/title/tt0071562/?ref_=chtt...	$13,000,000 (estimated)	$47,961,919	Al Pacino; Robert De Niro; Robert Duvall; Dian...	Francis Ford Coppola	Francis Ford Coppola; Mario Puzo
4	12 Angry Men	5	1h 36m	9.0	(850K)	1957	Approved	http://www.imdb.com/title/tt0050083/?ref_=chtt...	$350,000 (estimated)	$955	Henry Fonda; Lee J. Cobb; Martin Balsam; John ...	Sidney Lumet	Reginald Rose

Step 2:Database Schema Design¶

Given this data, we can design a relational database schema. A possible design could involve the following tables:

Movies: Contains details about the movie (Title, Rank, Length, Rating, Year, Category, Budget, Revenue).
People: A table for individuals, including actors, directors, and writers.
MoviePeople: A relationship table linking movies to people with their roles (Actor, Director, Writer).

Movies¶

------
- MovieID (Primary Key)
- Title
- Rank
- Length
- Rating
- Year
- Category
- Budget
- Revenue
- IMDbLink

People¶

------
- PersonID (Primary Key)
- Name

MoviePeople¶

-----------
- MovieID (Foreign Key)
- PersonID (Foreign Key)
- Role (e.g., Actor, Director, Writer)

Step 3: Creating Tables in PostgreSQL¶

SQL Commands for Table Creation:

In [5]:

Copied!





import pandas as pd

identity_df = pd.read_csv("./identity.csv")
# identity_df
dbname = identity_df["dbname"].values[0]
user = identity_df["user"].values[0]
password = identity_df["password"].values[0]
host = identity_df["host"].values[0]
import pandas as pd

identity_df = pd.read_csv("./identity.csv")
# identity_df
dbname = identity_df["dbname"].values[0]
user = identity_df["user"].values[0]
password = identity_df["password"].values[0]
host = identity_df["host"].values[0]

In [6]:

Copied!





import psycopg2

def create_tables(dbname=dbname, user=user, password=password, host=host):
    commands = (
        """
        CREATE TABLE IF NOT EXISTS Movies (
            MovieID SERIAL PRIMARY KEY,
            Title VARCHAR(255),
            Rank INT,
            Length VARCHAR(50),
            Rating FLOAT,
            Year INT,
            Category VARCHAR(50),
            Budget VARCHAR(255),
            Revenue VARCHAR(255),
            IMDbLink VARCHAR(255)
        );
        """,
        """ 
        CREATE TABLE IF NOT EXISTS People (
            PersonID SERIAL PRIMARY KEY,
            Name VARCHAR(255) UNIQUE
        );
        """,
        """
        CREATE TABLE IF NOT EXISTS MoviePeople (
            MovieID INT REFERENCES Movies(MovieID),
            PersonID INT REFERENCES People(PersonID),
            Role VARCHAR(50),
            PRIMARY KEY (MovieID, PersonID, Role)
        );
        """
    )
    conn = None
    try:
        # connect to the PostgreSQL server
        conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
        cur = conn.cursor()
        # create table one by one
        for command in commands:
            cur.execute(command)
        # close communication with the PostgreSQL database server
        cur.close()
        # commit the changes
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

create_tables()
import psycopg2

def create_tables(dbname=dbname, user=user, password=password, host=host):
    commands = (
        """
        CREATE TABLE IF NOT EXISTS Movies (
            MovieID SERIAL PRIMARY KEY,
            Title VARCHAR(255),
            Rank INT,
            Length VARCHAR(50),
            Rating FLOAT,
            Year INT,
            Category VARCHAR(50),
            Budget VARCHAR(255),
            Revenue VARCHAR(255),
            IMDbLink VARCHAR(255)
        );
        """,
        """ 
        CREATE TABLE IF NOT EXISTS People (
            PersonID SERIAL PRIMARY KEY,
            Name VARCHAR(255) UNIQUE
        );
        """,
        """
        CREATE TABLE IF NOT EXISTS MoviePeople (
            MovieID INT REFERENCES Movies(MovieID),
            PersonID INT REFERENCES People(PersonID),
            Role VARCHAR(50),
            PRIMARY KEY (MovieID, PersonID, Role)
        );
        """
    )
    conn = None
    try:
        # connect to the PostgreSQL server
        conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
        cur = conn.cursor()
        # create table one by one
        for command in commands:
            cur.execute(command)
        # close communication with the PostgreSQL database server
        cur.close()
        # commit the changes
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

create_tables()

Importing Data from Excel to PostgreSQL¶

This is a simplified version, focusing on the Movies table. The complete implementation would also parse and insert data for People and MoviePeople.

In [7]:

Copied!

data.head()
data.head()

Out[7]:

	Title	Rank	Length	Rating	Number of Rating	Year	Category	Link	Budget	Worldwide Revenue	Cast	Director	Writer
0	The Shawshank Redemption	1	2h 22m	9.3	(2.8M)	1994	R	http://www.imdb.com/title/tt0111161/?ref_=chtt...	$25,000,000 (estimated)	$28,884,716	Tim Robbins; Morgan Freeman; Bob Gunton; Willi...	Frank Darabont	Stephen King; Frank Darabont
1	The Godfather	2	2h 55m	9.2	(2M)	1972	R	http://www.imdb.com/title/tt0068646/?ref_=chtt...	$6,000,000 (estimated)	$250,341,816	Marlon Brando; Al Pacino; James Caan; Diane Ke...	Francis Ford Coppola	Mario Puzo; Francis Ford Coppola
2	The Dark Knight	3	2h 32m	9.0	(2.8M)	2008	PG-13	http://www.imdb.com/title/tt0468569/?ref_=chtt...	$185,000,000 (estimated)	$1,029,266,147	Christian Bale; Heath Ledger; Aaron Eckhart; M...	Christopher Nolan	Jonathan Nolan; Christopher Nolan; David S. Goyer
3	The Godfather Part II	4	3h 22m	9.0	(1.3M)	1974	R	http://www.imdb.com/title/tt0071562/?ref_=chtt...	$13,000,000 (estimated)	$47,961,919	Al Pacino; Robert De Niro; Robert Duvall; Dian...	Francis Ford Coppola	Francis Ford Coppola; Mario Puzo
4	12 Angry Men	5	1h 36m	9.0	(850K)	1957	Approved	http://www.imdb.com/title/tt0050083/?ref_=chtt...	$350,000 (estimated)	$955	Henry Fonda; Lee J. Cobb; Martin Balsam; John ...	Sidney Lumet	Reginald Rose

In [8]:

Copied!

data.shape
data.shape

Out[8]:

(56, 13)

fetchone: fetches the next row of a query result set, returning a single sequence, or None when no more data is available.

In [9]:

Copied!





def insert_movie(title, rank, length, rating, year, category, budget, revenue, link, dbname=dbname, user=user, password=password, host=host):
    """ Insert a new movie into the Movies table """
    sql = """INSERT INTO Movies(Title, Rank, Length, Rating, Year, Category, Budget, Revenue, IMDbLink)
             VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s) RETURNING MovieID;"""
    conn = None
    movie_id = None
    try:
        conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
        cur = conn.cursor()
        cur.execute(sql, (title, rank, length, rating, year, category, budget, revenue, link))
        movie_id = cur.fetchone()[0] # gets the ID of the newly inserted movie record 
        conn.commit()
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
    return movie_id
def insert_movie(title, rank, length, rating, year, category, budget, revenue, link, dbname=dbname, user=user, password=password, host=host):
    """ Insert a new movie into the Movies table """
    sql = """INSERT INTO Movies(Title, Rank, Length, Rating, Year, Category, Budget, Revenue, IMDbLink)
             VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s) RETURNING MovieID;"""
    conn = None
    movie_id = None
    try:
        conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
        cur = conn.cursor()
        cur.execute(sql, (title, rank, length, rating, year, category, budget, revenue, link))
        movie_id = cur.fetchone()[0] # gets the ID of the newly inserted movie record 
        conn.commit()
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
    return movie_id

In [10]:

Copied!





# Insert data into the database
for index, row in data.iterrows():
    print(f"Insert {index}: {row['Title']}")
    insert_movie(row['Title'], row['Rank'], row['Length'], row['Rating'], row['Year'],
                 row['Category'], row['Budget'], row['Worldwide Revenue'], row['Link'])
# Insert data into the database
for index, row in data.iterrows():
    print(f"Insert {index}: {row['Title']}")
    insert_movie(row['Title'], row['Rank'], row['Length'], row['Rating'], row['Year'],
                 row['Category'], row['Budget'], row['Worldwide Revenue'], row['Link'])

Insert 0:  The Shawshank Redemption
Insert 1:  The Godfather
Insert 2:  The Dark Knight
Insert 3:  The Godfather Part II
Insert 4:  12 Angry Men
Insert 5:  Schindler's List
Insert 6:  The Lord of the Rings: The Return of the King
Insert 7:  Pulp Fiction
Insert 8:  The Lord of the Rings: The Fellowship of the Ring
Insert 9:  The Good, the Bad and the Ugly
Insert 10:  Forrest Gump
Insert 11:  Fight Club
Insert 12:  The Lord of the Rings: The Two Towers
Insert 13:  Inception
Insert 14:  Star Wars: Episode V - The Empire Strikes Back
Insert 15:  The Matrix
Insert 16:  Goodfellas
Insert 17:  One Flew Over the Cuckoo's Nest
Insert 18:  Se7en
Insert 19:  It's a Wonderful Life
Insert 20:  Interstellar
Insert 21:  Seven Samurai
Insert 22:  The Silence of the Lambs
Insert 23:  Saving Private Ryan
Insert 24:  City of God
Insert 25:  Life Is Beautiful
Insert 26:  The Green Mile
Insert 27:  Star Wars: Episode IV - A New Hope
Insert 28:  Terminator 2: Judgment Day
Insert 29:  Spider-Man: Across the Spider-Verse
Insert 30:  Back to the Future
Insert 31:  Spirited Away
Insert 32:  The Pianist
Insert 33:  Parasite
Insert 34:  Psycho
Insert 35:  Gladiator
Insert 36:  The Lion King
Insert 37:  Léon: The Professional
Insert 38:  The Departed
Insert 39:  American History X
Insert 40:  Whiplash
Insert 41:  The Prestige
Insert 42:  Grave of the Fireflies
Insert 43:  Harakiri
Insert 44:  The Usual Suspects
Insert 45:  Casablanca
Insert 46:  The Intouchables
Insert 47:  Modern Times
Insert 48:  Cinema Paradiso
Insert 49:  Rear Window
Insert 50:  Once Upon a Time in the West
Insert 51:  Alien
Insert 52:  City Lights
Insert 53:  Apocalypse Now
Insert 54:  Django Unchained
Insert 55:  Memento

Querying Data¶

In [11]:

Copied!





def query_movies_on_year(year, dbname=dbname, user=user, password=password, host=host):
    """ Query movies based on the release year """
    conn = None
    try:
        conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
        cur = conn.cursor()
        cur.execute("SELECT Title, Rating, Year FROM Movies WHERE Year = %s", (year,))
        print(f"Movies released in {year}:")
        for row in cur.fetchall():
            print(f"Title: {row[0]}, Rating: {row[1]}, Year: {row[2]}")
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

# Example usage: Querying movies released in 1994
query_movies_on_year(1994)
def query_movies_on_year(year, dbname=dbname, user=user, password=password, host=host):
    """ Query movies based on the release year """
    conn = None
    try:
        conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
        cur = conn.cursor()
        cur.execute("SELECT Title, Rating, Year FROM Movies WHERE Year = %s", (year,))
        print(f"Movies released in {year}:")
        for row in cur.fetchall():
            print(f"Title: {row[0]}, Rating: {row[1]}, Year: {row[2]}")
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

# Example usage: Querying movies released in 1994
query_movies_on_year(1994)

Movies released in 1994:
Title:  The Shawshank Redemption, Rating: 9.3, Year: 1994
Title:  Pulp Fiction, Rating: 8.9, Year: 1994
Title:  Forrest Gump, Rating: 8.8, Year: 1994
Title:  The Lion King, Rating: 8.5, Year: 1994
Title:  Léon: The Professional, Rating: 8.5, Year: 1994