Call Functions on Pandas DataFrames Values
Let's find out how a pandas DataFrames works along with Python functions.
Try it yourself
Try executing the code below to see the result.
```python
import pandas as pd

cities = pd.DataFrame([
    ('Vienna', 'Austria', 1_899_055),
    ('Sofia', 'Bulgaria', 1_238_438),
    ('Tekirdağ', 'Turkey', 1_055_412),
], columns=['City', 'Country', 'Population'])

def population_of(city):
    return cities[cities['City'] == city]['Population']

city = 'Tekirdağ'
print(population_of(city))
```
Explanation
The output tells us that Tekirdağ couldn't be found in the cities DataFrame. But it's clearly there!
Let’s investigate the code below:
In [1]: city
Out[1]: 'Tekirdağ'
In [2]: city2 = cities.loc[2]['City']
In [3]: city2
Out[3]: 'Tekirdağ'
In [4]: city2 == city
Out[4]: False
In [5]: len(city)
Out[5]: 9
In [6]: len(city2)
Out[6]: 8
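We can reproduce the mismatch outside the DataFrame by building the two forms explicitly. A minimal sketch (the escape sequences below are an assumption about how the two strings were typed; `\u0306` is the combining breve and `\u011f` is the precomposed ğ):

```python
import unicodedata

# Decomposed form: plain 'g' followed by a combining breve (U+0306)
city = 'Tekirdag\u0306'
# Precomposed form: a single code point 'ğ' (U+011F)
city2 = 'Tekirda\u011f'

print(len(city), len(city2))   # 9 8
print(city == city2)           # False

# Inspect the last two code points of the decomposed form
for ch in city[-2:]:
    print(hex(ord(ch)), unicodedata.name(ch))
```

Both strings render identically on screen, which is exactly why the bug is so hard to spot by eye.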
Note: Unicode is a universal character standard that assigns a unique code point to every character and symbol of any language.
In the beginning, computers were primarily developed in English-speaking countries, namely the UK and the US. When early developers wanted to encode text in ways that computers could understand, they came up with the following scheme:
Use a byte (8 bits) to represent a character. For example, a is 97 (01100001), b is 98, and so on. One byte is enough for the English alphabet, which contains twenty-six lowercase letters, twenty-six uppercase letters, and ten digits. There is even some space left for other special characters (for example, 9 for the "Tab" character). This scheme is known as ASCII encoding.
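This one-byte mapping can be checked directly with Python's built-in `ord` and `chr`:

```python
# Each ASCII character maps to a single number that fits in one byte
print(ord('a'))   # 97
print(ord('b'))   # 98
print(chr(97))    # a

# Control characters live in the low range, e.g. Tab is 9
print(ord('\t'))  # 9

# Encoding an ASCII character really does produce one byte
print('a'.encode('ascii'))  # b'a'
```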
After a while, countries outside of the UK and the US needed support for their own languages. ASCII wasn't good enough because a single byte can't hold all the numbers needed to represent larger alphabets. This led to several different encoding schemes, and eventually to Unicode, whose most common encoding is UTF-8.
Some Unicode code points are combining characters. In city, we have the character g at position 7, followed by a combining character that adds a breve to the preceding character. This is why the length of city is 9. city2 from the cities DataFrame instead has the single precomposed character ğ at position 7, which is why its length is 8. These two representations correspond to different Unicode normalization forms. We can use the unicodedata module to normalize strings to the same form.
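A minimal sketch of such normalization: converting both representations to NFC (the composed form) makes them compare equal. The hardcoded strings here stand in for the two forms discussed above:

```python
import unicodedata

decomposed = 'Tekirdag\u0306'  # 'g' + combining breve, length 9
composed = 'Tekirda\u011f'     # precomposed 'ğ', length 8

# NFC composes character sequences into single code points where possible
a = unicodedata.normalize('NFC', decomposed)
b = unicodedata.normalize('NFC', composed)

print(a == b)   # True
print(len(a))   # 8
```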
There might be times when we want to do case-insensitive searches for cities. In those cases, when Unicode is involved, the str.lower and str.upper methods won't adequately do the job, so we use the str.casefold method instead.
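The German sharp s is the classic example of where casefold and lower diverge (this example is an illustration, not part of the teaser's data):

```python
s = 'Straße'

# lower() leaves 'ß' unchanged
print(s.lower())     # straße

# casefold() applies the more aggressive Unicode case folding: 'ß' -> 'ss'
print(s.casefold())  # strasse

# This makes casefold() the right tool for case-insensitive comparison
print('Straße'.casefold() == 'STRASSE'.casefold())  # True
```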
Here’s a solution to this teaser incorporating all of these methods:
Solution
```python
import unicodedata
import pandas as pd

cities = pd.DataFrame([
    ('Vienna', 'Austria', 1_899_055),
    ('Sofia', 'Bulgaria', 1_238_438),
    ('Tekirdağ', 'Turkey', 1_055_412),
], columns=['City', 'Country', 'Population'])

def population_of(city):
    city = normalize(city)
    return cities[cities['city_norm'] == city]['Population']

def normalize(name):
    return unicodedata.normalize('NFKC', name).casefold()

cities['city_norm'] = cities['City'].apply(normalize)

city = 'Tekirdağ'
print(population_of(city))
```