Welcome!
- In previous weeks, we introduced you to Python, a high-level programming language that utilized the same building blocks we learned in C. However, we introduced this new language not for the purpose of learning “just another language.” Instead, we do so because some tools are better for some jobs and not so great for others!
- This week, we will be continuing with more syntax related to Python.
- Further, we will be integrating this knowledge with data.
- Finally, we will be discussing SQL or Structured Query Language, a domain-specific way by which we can interact with and modify data.
- Overall, one of the goals of this course is to learn to program generally – not simply how to program in the languages described in this course.
Flat-File Database
- As you have likely seen before, data can often be described in patterns of columns and rows.
- Spreadsheets like those created in Microsoft Excel and Google Sheets can be outputted to a
csvor comma-separated values file. - If you look at a
csvfile, you’ll notice that the file is flat in that all of our data is stored in a single table represented by a text file. We call this form of data a flat-file database. - All data is stored row by row. Each column is separated by a comma or another value.
- Python comes with native support for
csvfiles. - First, download favorites.csv and upload it to your file explorer inside cs50.dev. Second, examining this data, notice that the first row is special in that it defines each column. Then, each record is stored row by row.
- In your terminal window, type
code favorites.pyand write code as follows:
Notice that the csv library is imported. Further, we created a reader that will hold the result of csv.reader(file). The csv.reader function reads each row from the file, and in our code, we store the results in reader. print(row[1]), therefore, will print the language from the favorites.csv file. You can download this code here.
- You can improve your code as follows:
Notice that favorite is stored and then printed. Also, notice that we use the next function to skip to the next line of our reader. You can download this code here.
- One of the disadvantages of the above approach is that we are trusting that
row[1]is always the favorite. However, what would happen if the columns had been moved around? - We can fix this potential issue. Python also allows you to index by the keys of a list. Modify your code as follows:
Notice that this example directly utilizes the language key in the print statement. favorite is assigned the value of row["language"]. You can download this code here.
- This could be further simplified to:
You can download this code here.
- To count the number of favorite languages expressed in the
csvfile, we can do the following:
Notice that each language is counted using if statements. Further, notice the double equal == signs in those if statements. You can download this code here.
- Python allows us to use a dictionary to count the
countsof each language. Consider the following improvement upon our code:
Notice that the value in counts with the key favorite is incremented when it exists already. If it does not exist, we define counts[favorite] and set it to 1. Further, the formatted string has been improved to present the counts[favorite]. You can download this code here.
- We can also utilize
tryandexceptto account for potential exceptions:
Notice how the if and else have been replaced with try and except. You can download this code here.
- Python also allows sorting
counts. Improve your code as follows:
Notice the sorted(counts) at the bottom of the code. You can download this code here.
- If you look at the parameters for the
sortedfunction in the Python documentation, you will find it has many built-in parameters. You can leverage some of these built-in parameters as follows:
Notice the arguments passed to sorted. The key argument allows you to tell Python the method you wish to use to sort items. In this case, counts.get is used to sort by the values. reverse=True tells sorted to sort from largest to smallest. You can download this code here.
- You can learn more about sorted in the Python Documentation.
Relational Databases
- Google, X, and Meta all use relational databases to store their information at scale.
- Relational databases store data in rows and columns in structures called tables.
- SQL allows for four types of commands:
- These four operations are affectionately called CRUD.
- We can create a database with the SQL syntax
CREATE TABLE table (column type, ...);. But where do you run this command? sqlite3is a type of SQL database that has the core features required for this course.- We can create a SQL database at the terminal by typing
sqlite3 favorites.db. Upon being prompted, we will agree that we want to createfavorites.dbby pressingy. - You will notice a different prompt as we are now using a program called
sqlite. - We can put
sqliteintocsvmode by typing.mode csv. Then, we can import our data from ourcsvfile by typing.import favorites.csv favorites. It seems that nothing has happened! - We can type
.schemato see the structure of the database. - You can read items from a table using the syntax
SELECT columns FROM table. - For example, you can type
SELECT * FROM favorites;which will print every row infavorites. - You can get a subset of the data using the command
SELECT language FROM favorites;. - SQL supports many commands to access data, including:
- For example, you can type
SELECT COUNT(*) FROM favorites;. Further, you can typeSELECT DISTINCT language FROM favorites;to get a list of the individual languages within the database. You could even typeSELECT COUNT(DISTINCT language) FROM favorites;to get a count of those. - SQL offers additional commands we can utilize in our queries:
Notice that we use -- to write a comment in SQL.
SELECT
- For example, we can execute
SELECT COUNT(*) FROM favorites WHERE language = 'C';. A count is presented. - Further, we could type
SELECT COUNT(*) FROM favorites WHERE language = 'C' AND problem = 'Hello, World';. Notice how theANDis utilized to narrow our results. - Similarly, we could execute
SELECT language, COUNT(*) FROM favorites GROUP BY language;. This would offer a temporary table that would show the language and count. - We could improve this by typing
SELECT language, COUNT(*) FROM favorites GROUP BY language ORDER BY COUNT(*);. This will order the resulting table by thecount. - Likewise, we could execute
SELECT COUNT(*) FROM favorites WHERE language = 'C' AND (problem = 'Hello, World' OR problem = 'Hello, It''s Me');. Do notice that there are two''marks to allow the use of single quotes in a way that does not confuse SQL. - Further, we could execute
SELECT COUNT(*) FROM favorites WHERE language = 'C' AND problem LIKE 'Hello, %';to find any problems that start withHello,(including a space). - We can order the output as follows:
SELECT language, COUNT(*) FROM favorites GROUP BY language ORDER BY COUNT(*) DESC;. - We can even create aliases, like variables in our queries:
SELECT language, COUNT(*) AS n FROM favorites GROUP BY language ORDER BY n DESC;. - Finally, we can limit our output to 1 or more values:
SELECT language, COUNT(*) AS n FROM favorites GROUP BY language ORDER BY n DESC LIMIT 1;. - Notice, by convention, SQL keywords are often typed in caps.
INSERT
- We can also
INSERTinto a SQL database utilizing the formINSERT INTO table (column...) VALUES(value, ...);. - We can execute
INSERT INTO favorites (language, problem) VALUES ('SQL', 'Fiftyville');. - You can verify the addition of this favorite by executing
SELECT * FROM favorites;.
DELETE
DELETEallows you to delete parts of your data. For example, you couldDELETE FROM favorites WHERE Timestamp IS NULL;. This deletes any record where theTimestampisNULL.
UPDATE
- We can also utilize the
UPDATEcommand to update your data. - For example, you can execute
UPDATE favorites SET language = 'SQL', problem = 'Fiftyville';. This will result in updating all the rows. - Notice that these queries have immense power. Accordingly, in the real-world setting, you should consider who has permissions to execute certain commands and if you have backups available!
IMDb
-
We can imagine a database that we might want to create to catalog various TV shows. We could create a spreadsheet with columns like
title,star,star,star,star, and more stars. A problem with this approach is that it has a lot of wasted space. Some shows may have one star. Others may have dozens. -
We could separate our database into multiple sheets. We could have a
showssheet, astarssheet, and apeoplesheet. On thepeoplesheet, each person could have a uniqueid. On theshowssheet, each show could have a uniqueidtoo. On a third sheet calledstarswe could relate people to shows by having ashow_idandperson_id. While this is an improvement, this is not an ideal database. -
IMDb offers a database of people, shows, writers, stars, genres, and ratings. Each of these tables is related to one another as follows:
six boxes that represent various sql tables arrows are drawn to each showing their many relationships with one another -
After downloading
shows.db, you can executesqlite3 shows.dbin your terminal window. -
Let’s zero in on the relationship between two tables within the database called
showsandratings. The relationship between these two tables can be illustrated as follows:two boxes one called shows and the other called ratings -
To illustrate the relationship between these tables, we could execute the following command:
SELECT * FROM ratings LIMIT 10;. Examining the output, we could executeSELECT * FROM shows LIMIT 10;. -
Examining
showsandratings, we can see these have a one-to-one relationship: One show has one rating. -
To understand the database, upon executing
.schemayou will find not only each of the tables but the individual fields inside each of these tables. -
More specifically, you could execute
.schema showsto understand the fields insideshows. You can also execute.schema ratingsto see the fields insideratings. -
A reference to the show’s id exists in all tables. In the
showstable, it is simply calledid. This common field among all the tables is called a key. Primary keys are used to identify a unique record in a table. Foreign keys are used to build relationships between tables by pointing to the primary key in another table. You can see in the schema ofratingsthatshow_idis a foreign key that referencesidinshows. -
By storing data in a relational database, as above, data can be more efficiently stored.
-
In sqlite, we have five data types, including:
-
Additionally, columns can be set to add special constraints:
-
We can further play with this data to understand these relationships. Execute
SELECT * FROM ratings;. There are a lot of ratings! -
We can further limit this data down by executing
SELECT show_id FROM ratings WHERE rating >= 6.0 LIMIT 10;. From this query, you can see that there are 10 shows presented. However, we don’t know what show eachshow_idrepresents. -
You can discover what shows these are by executing
SELECT * FROM shows WHERE id = 626124; -
We can refine our query to be more efficient by executing:
Notice that this query nests together two queries. An inner query is used by an outer query.
JOINs
- We are pulling data from
showsandratings. Notice how bothshowsandratingshave anidin common. - How could we combine tables temporarily? Tables could be joined together using the
JOINcommand. - Execute the following command:
Notice this results in a wider table than we have previously seen.
- Where the previous queries have illustrated the one-to-one relationship between these keys, let’s examine some one-to-many relationships. Focusing on the
genrestable, execute the following:
Notice how this provides us a sense of the raw data. You might notice that one show has three values. This is a one-to-many relationship.
- We can learn more about the
genrestable by typing.schema genres. - Execute the following command to learn more about the various comedies in the database:
Notice how this produces a list of comedies, including Catweazle.
- We can learn more about Catweazle by joining various tables:
Notice that this results in a temporary table. It is fine to have a duplicate table. Further, notice that Catweazle (one title) is assigned many genres, including adventure, comedy, and family.
- In contrast to one-to-one and one-to-many relationships, there are many-to-many relationships. For example, many people could appear in many shows!
- We can learn more about the show The Office and the actors in that show by executing the following command:
Notice that this results in a table that includes the names of various stars through nested queries.
- We can find all the shows in which Steve Carell starred:
This results in a list of titles of shows wherein Steve Carell starred.
- This could be expressed as a
JOINas:
- This could also be expressed in this way:
- The wildcard
%operator can be used to find all people whose names start withSteve Cby employing the following syntax:SELECT * FROM people WHERE name LIKE 'Steve C%';.
Indexes
-
While relational databases have the ability to be faster and more robust than utilizing a
CSVfile, data can be optimized within a table using indexes. -
Indexes can be utilized to speed up our queries.
-
We can track the speed of our queries by executing
.timer oninsqlite3. -
To understand how indexes can speed up our queries, run the following:
SELECT * FROM shows WHERE title = 'The Office';Notice the time that displays after the query executes. -
Then, we can create an index with the syntax
CREATE INDEX title_index ON shows (title);. This tellssqlite3to create an index and perform some special under-the-hood optimization relating to this columntitle. -
This will create a data structure called a B Tree, a data structure that looks similar to a binary tree. However, unlike a binary tree, there can be more than two child nodes.
one node at the top from which come four children and below that there are three children coming from one of the nodes and two from another two from another and three from another -
Further, we can create indexes as follows:
- Run the query and you will notice that the query runs much more quickly!
- Unfortunately, indexing all columns would result in utilizing more storage space. Therefore, there is a tradeoff for enhanced speed.
Using SQL in Python
-
To assist in working with SQL in this course, the CS50 Library can be utilized as follows in your code:
-
Similar to previous uses of the CS50 Library, this library will assist with the complicated steps of utilizing SQL within your Python code.
-
You can read more about the CS50 Library’s SQL functionality in the documentation.
-
Using our new knowledge, we can now leverage Python alongside SQL.
-
Modify your code for
favorites.pyas follows:
Notice that db = SQL("sqlite:///favorites.db") provides Python the location of the database file. Then, the line that begins with rows executes SQL commands utilizing db.execute. Indeed, this command passes the syntax within the quotation marks to the db.execute function. We can issue any SQL command using this syntax. Further, notice that rows is returned as a list of dictionaries. In this case, there is only one result, one row, returned to the rows list as a dictionary. You can download this code here.
Race Conditions
- Utilization of SQL can sometimes result in some problems.
- You can imagine a case where multiple users could be accessing the same database and executing commands at the same time.
- This could result in glitches where code is interrupted by other people’s actions. This could result in a loss of data.
- Built-in SQL features such as
BEGIN TRANSACTION,COMMIT, andROLLBACKhelp avoid some of these race condition problems.
SQL Injection Attacks
-
Now, still considering the code above, you might be wondering what the
?question marks do above. One of the problems that can arise in real-world applications of SQL is what is called an injection attack. An injection attack is where a malicious actor could input malicious SQL code. -
For example, consider a login screen as follows:
login screen with username and password fields -
Without the proper protections in our own code, a bad actor could run malicious code. Consider the following:
Notice that because the ? is in place, validation can be run on username and password before they are blindly accepted by the query.
- Never blindly trust the user’s input.
- Utilizing the CS50 Library, the library will sanitize and remove any potentially malicious characters.
Summing Up
In this lesson, you learned more syntax related to Python. Further, you learned how to integrate this knowledge with data in the form of flat-file and relational databases. Finally, you learned about SQL. Specifically, we discussed…
- Flat-file databases
- Relational databases
- SQL commands such as
SELECT,CREATE,INSERT,DELETE, andUPDATE. - Primary and foreign keys
JOINs- Indexes
- Using SQL in Python
- Race conditions
- SQL injection attacks
See you next time!