What is SQL?

SQL stands for Structured Query Language.

It is heavily used by data scientists to query a Database which is nothing but a collection of Tables with records. A record in a database table is composed of fields or columns that contain information about that one particular entry in the table.

How to Publish your own Data?

Datasette is a great tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.

Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world.

More about installing datasette on your own server in some other post.

Can I use an already published Data Source?

There is this great website FiveThirtyEight. It is a website that focuses on opinion poll analysis, politics, economics, and sports blogging. They have shared the data and code behind some of their articles and graphics under CC Attribution 4.0 License.

For this post, I will be using the dataset biopics.csv listed on their website. It is a table containing the following fields:


So let's get started with Thinking in SQL...

Step 1: The SQL Query Editor

Click on this link to go to the SQL query editor for the biopics dataset

select * from [biopics/biopics]

The above SQL statement simply retrieves all fields (*) from the table named [biopics/biopics].

The result of this query would be something like this:


Step 2: Advanced Queries with SQL

Click on this link to retrieve the count of all biopics from the biopics table where type_of_subject doesn't contain Other and group the results by type_of_subject.

select type_of_subject, count(title) as titles from [biopics/biopics]
where type_of_subject <> 'Other'
group by type_of_subject
order by titles desc

The above SQL query allows us to perform the following things in one statement:

  1. Select particular columns form the table
  2. Use the aggregate function count() to get a count of all the titles from the final result and alias it as titles
  3. Filter the table based on a condition - in this case type_of_subject is not equal to (<>) other
  4. Group the result based on a particular field - in this case type_of_subject
  5. Order the resulting data based on a particular column - in this case the count(title) aliased as titles

The result of this query would be something like this:


How do I decide on a particular Movie?

Okay, so I had a data source. I did some basic querying on that data source to get a feel of the data.

I see that there are 76 biopic titles in the data source in which the type of subject was a Musician.

But I want to dig further into these 76 biopics to get to one particular movie. So let's do some more filtering.

Click on this link to query the biopics.csv data with some more filters:

select title, site, country, subject_race, type_of_subject, box_office, year_release 
from [biopics/biopics]
where subject_race = 'White' 
and type_of_subject = 'Musician'
and year_release >2000
and country = 'UK'
and box_office <> '-'
order by box_office desc

In the above query, I have selected only relevant columns from the data source with which I am concerned.

I think the where conditions are self explanatory. In SQL we can combine multiple conditions with the and operator. The and operator displays a record if all the conditions separated by and are true.

I've also ordered the final result based on the descending order of the box office collection of that biopic.

Finally, I am left with two biopics:


Okay, so I visit the site field which is a link to the IMDB trailer of the biopics.

Well, don't know why but I am going to watch The Pianist now.

Hope this post gave you some understanding of how we can use SQL to perform basic queries on data and use where conditions and aggregate functions with where and group by clauses to filter and summarize the data.

In later posts, I'll try to include some more advanced SQL functions.

Subscribe here to stay informed!