Read an SQLite File

Let's find out how to use SQL with pandas DataFrame.

We'll cover the following

Try it yourself

Try executing the code below to see the result.

Press + to interact
import sqlite3
import pandas as pd
conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE visits (day DATE, hits INTEGER);
INSERT INTO visits VALUES
('2020-07-01', 300),
('2020-07-02', 500),
('2020-07-03', 900);
''')
df = pd.read_sql('SELECT * FROM visits', conn)
print('time span:', df['day'].max() - df['day'].min())

Explanation

SQLite3 is a great single-file database that is used many times to transfer data. It is widely used and heavily tested and can handle vast amounts of data up to 140 terabytes.

However, SQL does require some training.

In the teaser code, we create a hits table that has the following two columns:

  • day column with SQL DATE type.
  • hits column with SQL INTEGER type.

The mapping from SQL types to pandas types is defined in the SQL driver that is used to access the database. SQLite3 is a bit different from other databases in that it only has numbers and strings as types. However, it does support declaring a column with a DATE, TIME, or TIMESTAMP type. We can see that for ourselves when we look at the .dtypes below:

In [1]: df.dtypes
Out[1]:
day object
hits int64
dtype: object

The day column has an object dtype, which in most cases, means it’s an str. When we do df['day'].max() - df['day'].min(), we’re subtracting two strings, which is not a legal operation in Python.

We can convert data in a column to a pandas Timestamp either by using the to_datetime function or by passing the column names to convert in the parse_dates parameter of read_sql. However, we somehow need to know what columns are time related.

The better option is to use the detect_types parameter in sqlite3.connect. When we pass PARSE_DECLTYPES to sqlite3.connect, it’ll convert DATE, TIME, and TIMESTAMP columns to Python’s datetime types. The parameter read_sql will convert these pandas.Timestamp columns.

Solution

Press + to interact
import sqlite3
import pandas as pd
conn = sqlite3.connect(
':memory:',
detect_types=sqlite3.PARSE_DECLTYPES,
)
conn.executescript('''
CREATE TABLE visits (day DATE, hits INTEGER);
INSERT INTO visits VALUES
('2020-07-01', 300),
('2020-07-02', 500),
('2020-07-03', 900);
''')
df = pd.read_sql('SELECT * FROM visits', conn)
print('time span:', df['day'].max() - df['day'].min())

Get hands-on with 1300+ tech skills courses.