Using PostgreSQL in Python¶
- A database is an organized collection of structured data, typically stored electronically in a computer system.
- Despite the recent development of new types of databases (designed to address the increasing volume and variability of data), a considerable amount of data around the world is still stored in what are known as relational databases.
- Relational databases store data as collections of predefined tables with rows and columns that are connected through one or more relationships.
- The standard way to create and manage databases is
SQL
(Structured Query Language).SQL
is the building block for some of the most popular relational databases on the market, such asPostgreSQL
,Microsoft SQL Server
,MySQL
, andSQLite
Reading Material¶
Understanding PostgreSQL¶
- PostgreSQL is a lightweight, free, and open-source relational database.
- Check PostgreSQL under
Database
notes for more information
Using PostgreSQL in python¶
Prerequisites¶
- Installation of PostgreSQL.
- Installation of Python.
- Setting up a PostgreSQL database and user.
- Installation of
psycopg2
library (pip install psycopg2
)
Part 1: Connecting to PostgreSQL in Python¶
- Importing the Library:
import psycopg2
- Establishing a Connection:
conn = psycopg2.connect( dbname="your_dbname", user="your_username", password="your_password", host="your_host" )
- Creating a Cursor Object:
cur = conn.cursor()
In [1]:
Copied!
import psycopg2
import psycopg2
In [2]:
Copied!
try:
conn = psycopg2.connect(
dbname="project_movie",
user="postgres",
password="****",
host="localhost"
)
cur = conn.cursor()
except Exception as e:
print(e)
try:
conn = psycopg2.connect(
dbname="project_movie",
user="postgres",
password="****",
host="localhost"
)
cur = conn.cursor()
except Exception as e:
print(e)
connection to server at "localhost" (::1), port 5432 failed: FATAL: password authentication failed for user "postgres"
Part2: Basic Data Manipulation¶
- Creating a Table:
cur.execute("CREATE TABLE IF NOT EXISTS test (id SERIAL PRIMARY KEY, data VARCHAR);")
- Inserting Data:
cur.execute("INSERT INTO test (data) VALUES (%s)", ("Some data",))
- Querying Data:
cur.execute("SELECT * FROM test;") rows = cur.fetchall() for row in rows: print(row)
- Updating Data:
cur.execute("UPDATE test SET data = %s WHERE id = %s", ("New data", 1))
- Deleting Data:
cur.execute("DELETE FROM test WHERE id = %s", (1,))
- Committing Changes and Closing the Connection:
conn.commit() cur.close() conn.close()
IMDB Top250 Movies Project¶
The Excel file contains the following columns, which represent key attributes of the top 250 movies:
Title
: The name of the movie.Rank
: The rank of the movie in the top 250 list.Length
: The length of the movie.Rating
: The IMDb rating of the movie.Number
of Rating: The number of ratings the movie has received.Year
: The release year of the movie.Category
: The category or rating of the movie (e.g., R, PG-13).Link
: A link to the IMDb page of the movie.Budget
: The estimated budget of the movie.Worldwide
Revenue: The worldwide revenue of the movie.Cast
: The main cast of the movie, listed as a string with names separated by semicolons.Director
: The director(s) of the movie.Writer
: The writer(s) of the movie.
In [3]:
Copied!
import pandas as pd
import pandas as pd
In [4]:
Copied!
data = pd.read_excel("./Data/imdb_top250.xlsx",index_col=0)
data.head()
data = pd.read_excel("./Data/imdb_top250.xlsx",index_col=0)
data.head()
Out[4]:
Title | Rank | Length | Rating | Number of Rating | Year | Category | Link | Budget | Worldwide Revenue | Cast | Director | Writer | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shawshank Redemption | 1 | 2h 22m | 9.3 | (2.8M) | 1994 | R | http://www.imdb.com/title/tt0111161/?ref_=chtt... | $25,000,000 (estimated) | $28,884,716 | Tim Robbins; Morgan Freeman; Bob Gunton; Willi... | Frank Darabont | Stephen King; Frank Darabont |
1 | The Godfather | 2 | 2h 55m | 9.2 | (2M) | 1972 | R | http://www.imdb.com/title/tt0068646/?ref_=chtt... | $6,000,000 (estimated) | $250,341,816 | Marlon Brando; Al Pacino; James Caan; Diane Ke... | Francis Ford Coppola | Mario Puzo; Francis Ford Coppola |
2 | The Dark Knight | 3 | 2h 32m | 9.0 | (2.8M) | 2008 | PG-13 | http://www.imdb.com/title/tt0468569/?ref_=chtt... | $185,000,000 (estimated) | $1,029,266,147 | Christian Bale; Heath Ledger; Aaron Eckhart; M... | Christopher Nolan | Jonathan Nolan; Christopher Nolan; David S. Goyer |
3 | The Godfather Part II | 4 | 3h 22m | 9.0 | (1.3M) | 1974 | R | http://www.imdb.com/title/tt0071562/?ref_=chtt... | $13,000,000 (estimated) | $47,961,919 | Al Pacino; Robert De Niro; Robert Duvall; Dian... | Francis Ford Coppola | Francis Ford Coppola; Mario Puzo |
4 | 12 Angry Men | 5 | 1h 36m | 9.0 | (850K) | 1957 | Approved | http://www.imdb.com/title/tt0050083/?ref_=chtt... | $350,000 (estimated) | $955 | Henry Fonda; Lee J. Cobb; Martin Balsam; John ... | Sidney Lumet | Reginald Rose |
Step 2:Database Schema Design¶
Given this data, we can design a relational database schema. A possible design could involve the following tables:
- Movies: Contains details about the movie (Title, Rank, Length, Rating, Year, Category, Budget, Revenue).
- People: A table for individuals, including actors, directors, and writers.
- MoviePeople: A relationship table linking movies to people with their roles (Actor, Director, Writer).
Movies¶
------
- MovieID (Primary Key)
- Title
- Rank
- Length
- Rating
- Year
- Category
- Budget
- Revenue
- IMDbLink
People¶
------
- PersonID (Primary Key)
- Name
MoviePeople¶
-----------
- MovieID (Foreign Key)
- PersonID (Foreign Key)
- Role (e.g., Actor, Director, Writer)
Step 3: Creating Tables in PostgreSQL¶
- SQL Commands for Table Creation:
In [5]:
Copied!
import pandas as pd
identity_df = pd.read_csv("./identity.csv")
# identity_df
dbname = identity_df["dbname"].values[0]
user = identity_df["user"].values[0]
password = identity_df["password"].values[0]
host = identity_df["host"].values[0]
import pandas as pd
identity_df = pd.read_csv("./identity.csv")
# identity_df
dbname = identity_df["dbname"].values[0]
user = identity_df["user"].values[0]
password = identity_df["password"].values[0]
host = identity_df["host"].values[0]
In [6]:
Copied!
import psycopg2
def create_tables(dbname=dbname, user=user, password=password, host=host):
commands = (
"""
CREATE TABLE IF NOT EXISTS Movies (
MovieID SERIAL PRIMARY KEY,
Title VARCHAR(255),
Rank INT,
Length VARCHAR(50),
Rating FLOAT,
Year INT,
Category VARCHAR(50),
Budget VARCHAR(255),
Revenue VARCHAR(255),
IMDbLink VARCHAR(255)
);
""",
"""
CREATE TABLE IF NOT EXISTS People (
PersonID SERIAL PRIMARY KEY,
Name VARCHAR(255) UNIQUE
);
""",
"""
CREATE TABLE IF NOT EXISTS MoviePeople (
MovieID INT REFERENCES Movies(MovieID),
PersonID INT REFERENCES People(PersonID),
Role VARCHAR(50),
PRIMARY KEY (MovieID, PersonID, Role)
);
"""
)
conn = None
try:
# connect to the PostgreSQL server
conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
cur = conn.cursor()
# create table one by one
for command in commands:
cur.execute(command)
# close communication with the PostgreSQL database server
cur.close()
# commit the changes
conn.commit()
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
if conn is not None:
conn.close()
create_tables()
import psycopg2
def create_tables(dbname=dbname, user=user, password=password, host=host):
commands = (
"""
CREATE TABLE IF NOT EXISTS Movies (
MovieID SERIAL PRIMARY KEY,
Title VARCHAR(255),
Rank INT,
Length VARCHAR(50),
Rating FLOAT,
Year INT,
Category VARCHAR(50),
Budget VARCHAR(255),
Revenue VARCHAR(255),
IMDbLink VARCHAR(255)
);
""",
"""
CREATE TABLE IF NOT EXISTS People (
PersonID SERIAL PRIMARY KEY,
Name VARCHAR(255) UNIQUE
);
""",
"""
CREATE TABLE IF NOT EXISTS MoviePeople (
MovieID INT REFERENCES Movies(MovieID),
PersonID INT REFERENCES People(PersonID),
Role VARCHAR(50),
PRIMARY KEY (MovieID, PersonID, Role)
);
"""
)
conn = None
try:
# connect to the PostgreSQL server
conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
cur = conn.cursor()
# create table one by one
for command in commands:
cur.execute(command)
# close communication with the PostgreSQL database server
cur.close()
# commit the changes
conn.commit()
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
if conn is not None:
conn.close()
create_tables()
Importing Data from Excel to PostgreSQL¶
- This is a simplified version, focusing on the
Movies
table. The complete implementation would also parse and insert data forPeople
andMoviePeople
.
In [7]:
Copied!
data.head()
data.head()
Out[7]:
Title | Rank | Length | Rating | Number of Rating | Year | Category | Link | Budget | Worldwide Revenue | Cast | Director | Writer | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Shawshank Redemption | 1 | 2h 22m | 9.3 | (2.8M) | 1994 | R | http://www.imdb.com/title/tt0111161/?ref_=chtt... | $25,000,000 (estimated) | $28,884,716 | Tim Robbins; Morgan Freeman; Bob Gunton; Willi... | Frank Darabont | Stephen King; Frank Darabont |
1 | The Godfather | 2 | 2h 55m | 9.2 | (2M) | 1972 | R | http://www.imdb.com/title/tt0068646/?ref_=chtt... | $6,000,000 (estimated) | $250,341,816 | Marlon Brando; Al Pacino; James Caan; Diane Ke... | Francis Ford Coppola | Mario Puzo; Francis Ford Coppola |
2 | The Dark Knight | 3 | 2h 32m | 9.0 | (2.8M) | 2008 | PG-13 | http://www.imdb.com/title/tt0468569/?ref_=chtt... | $185,000,000 (estimated) | $1,029,266,147 | Christian Bale; Heath Ledger; Aaron Eckhart; M... | Christopher Nolan | Jonathan Nolan; Christopher Nolan; David S. Goyer |
3 | The Godfather Part II | 4 | 3h 22m | 9.0 | (1.3M) | 1974 | R | http://www.imdb.com/title/tt0071562/?ref_=chtt... | $13,000,000 (estimated) | $47,961,919 | Al Pacino; Robert De Niro; Robert Duvall; Dian... | Francis Ford Coppola | Francis Ford Coppola; Mario Puzo |
4 | 12 Angry Men | 5 | 1h 36m | 9.0 | (850K) | 1957 | Approved | http://www.imdb.com/title/tt0050083/?ref_=chtt... | $350,000 (estimated) | $955 | Henry Fonda; Lee J. Cobb; Martin Balsam; John ... | Sidney Lumet | Reginald Rose |
In [8]:
Copied!
data.shape
data.shape
Out[8]:
(56, 13)
fetchone
: fetches the next row of a query result set, returning a single sequence, or None
when no more data is available.
In [9]:
Copied!
def insert_movie(title, rank, length, rating, year, category, budget, revenue, link, dbname=dbname, user=user, password=password, host=host):
""" Insert a new movie into the Movies table """
sql = """INSERT INTO Movies(Title, Rank, Length, Rating, Year, Category, Budget, Revenue, IMDbLink)
VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s) RETURNING MovieID;"""
conn = None
movie_id = None
try:
conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
cur = conn.cursor()
cur.execute(sql, (title, rank, length, rating, year, category, budget, revenue, link))
movie_id = cur.fetchone()[0] # gets the ID of the newly inserted movie record
conn.commit()
cur.close()
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
if conn is not None:
conn.close()
return movie_id
def insert_movie(title, rank, length, rating, year, category, budget, revenue, link, dbname=dbname, user=user, password=password, host=host):
""" Insert a new movie into the Movies table """
sql = """INSERT INTO Movies(Title, Rank, Length, Rating, Year, Category, Budget, Revenue, IMDbLink)
VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s) RETURNING MovieID;"""
conn = None
movie_id = None
try:
conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
cur = conn.cursor()
cur.execute(sql, (title, rank, length, rating, year, category, budget, revenue, link))
movie_id = cur.fetchone()[0] # gets the ID of the newly inserted movie record
conn.commit()
cur.close()
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
if conn is not None:
conn.close()
return movie_id
In [10]:
Copied!
# Insert data into the database
for index, row in data.iterrows():
print(f"Insert {index}: {row['Title']}")
insert_movie(row['Title'], row['Rank'], row['Length'], row['Rating'], row['Year'],
row['Category'], row['Budget'], row['Worldwide Revenue'], row['Link'])
# Insert data into the database
for index, row in data.iterrows():
print(f"Insert {index}: {row['Title']}")
insert_movie(row['Title'], row['Rank'], row['Length'], row['Rating'], row['Year'],
row['Category'], row['Budget'], row['Worldwide Revenue'], row['Link'])
Insert 0: The Shawshank Redemption Insert 1: The Godfather Insert 2: The Dark Knight Insert 3: The Godfather Part II Insert 4: 12 Angry Men Insert 5: Schindler's List Insert 6: The Lord of the Rings: The Return of the King Insert 7: Pulp Fiction Insert 8: The Lord of the Rings: The Fellowship of the Ring Insert 9: The Good, the Bad and the Ugly Insert 10: Forrest Gump Insert 11: Fight Club Insert 12: The Lord of the Rings: The Two Towers Insert 13: Inception Insert 14: Star Wars: Episode V - The Empire Strikes Back Insert 15: The Matrix Insert 16: Goodfellas Insert 17: One Flew Over the Cuckoo's Nest Insert 18: Se7en Insert 19: It's a Wonderful Life Insert 20: Interstellar Insert 21: Seven Samurai Insert 22: The Silence of the Lambs Insert 23: Saving Private Ryan Insert 24: City of God Insert 25: Life Is Beautiful Insert 26: The Green Mile Insert 27: Star Wars: Episode IV - A New Hope Insert 28: Terminator 2: Judgment Day Insert 29: Spider-Man: Across the Spider-Verse Insert 30: Back to the Future Insert 31: Spirited Away Insert 32: The Pianist Insert 33: Parasite Insert 34: Psycho Insert 35: Gladiator Insert 36: The Lion King Insert 37: Léon: The Professional Insert 38: The Departed Insert 39: American History X Insert 40: Whiplash Insert 41: The Prestige Insert 42: Grave of the Fireflies Insert 43: Harakiri Insert 44: The Usual Suspects Insert 45: Casablanca Insert 46: The Intouchables Insert 47: Modern Times Insert 48: Cinema Paradiso Insert 49: Rear Window Insert 50: Once Upon a Time in the West Insert 51: Alien Insert 52: City Lights Insert 53: Apocalypse Now Insert 54: Django Unchained Insert 55: Memento
Querying Data¶
In [11]:
Copied!
def query_movies_on_year(year, dbname=dbname, user=user, password=password, host=host):
""" Query movies based on the release year """
conn = None
try:
conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
cur = conn.cursor()
cur.execute("SELECT Title, Rating, Year FROM Movies WHERE Year = %s", (year,))
print(f"Movies released in {year}:")
for row in cur.fetchall():
print(f"Title: {row[0]}, Rating: {row[1]}, Year: {row[2]}")
cur.close()
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
if conn is not None:
conn.close()
# Example usage: Querying movies released in 1994
query_movies_on_year(1994)
def query_movies_on_year(year, dbname=dbname, user=user, password=password, host=host):
""" Query movies based on the release year """
conn = None
try:
conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host)
cur = conn.cursor()
cur.execute("SELECT Title, Rating, Year FROM Movies WHERE Year = %s", (year,))
print(f"Movies released in {year}:")
for row in cur.fetchall():
print(f"Title: {row[0]}, Rating: {row[1]}, Year: {row[2]}")
cur.close()
except (Exception, psycopg2.DatabaseError) as error:
print(error)
finally:
if conn is not None:
conn.close()
# Example usage: Querying movies released in 1994
query_movies_on_year(1994)
Movies released in 1994: Title: The Shawshank Redemption, Rating: 9.3, Year: 1994 Title: Pulp Fiction, Rating: 8.9, Year: 1994 Title: Forrest Gump, Rating: 8.8, Year: 1994 Title: The Lion King, Rating: 8.5, Year: 1994 Title: Léon: The Professional, Rating: 8.5, Year: 1994