How to deal with international data formats in Python
A frequent hassle when working with data from international sources is that languages and cultures differ in how they write decimal and thousands separators, the order of year, month and day in dates, and so on. Many countries order dates from the smallest unit (day) to the largest (year), while some, like the U.S., do the weird thing of starting in the middle (month), going down to the small unit (day), and then jumping to the largest one (year).
If you look at decimal separators, it seems that just about half the world uses dots and the other half uses commas. The thousands separator is usually the other mark. That is, in countries that use the dot as the decimal separator, the comma typically separates thousands, and vice versa.
Parsing numbers in Python, the naïve way
What people often do when interpreting those numbers in Python is simply to use the replace method of the str class.
In [1]: number = '12,75'
In [2]: parsed = float(number.replace(',', '.'))
In [3]: parsed
Out[3]: 12.75
But what about when you also have to take different thousands separators into account? What if there is a currency symbol before or after the number? Things then get complicated to the point of making the code less readable.
In [1]: price = 'R$ 1.999,99'
In [2]: parsed = float(
...: price
...: .replace('R$', '')
...: .strip()
...: .replace('.', '')
...: .replace(',', '.')
...: )
In [3]: parsed
Out[3]: 1999.99
Such code is also brittle when the input varies: it only works for the exact format it was written for.
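To make the brittleness concrete, here is a minimal sketch that wraps the chained replace calls from above in a function (the name naive_parse is made up for this illustration) and feeds it the same amount written with U.S. separators. Nothing fails; the result is simply wrong.

```python
def naive_parse(price: str) -> float:
    """Parse a Brazilian-style price by chaining str.replace calls (hypothetical helper)."""
    return float(
        price
        .replace('R$', '')
        .strip()
        .replace('.', '')   # drop what we assume is the thousands separator
        .replace(',', '.')  # turn what we assume is the decimal comma into a dot
    )


brazilian = naive_parse('R$ 1.999,99')  # the format the code was written for
american = naive_parse('1,999.99')      # same amount, U.S. separators

print(brazilian)  # 1999.99
print(american)   # 1.99999, silently wrong and no exception raised
```

The second call strips the decimal dot and then promotes the thousands comma to a decimal point, so the amount shrinks by a factor of a thousand without any warning.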
Parsing dates in Python
Sometimes you will come across dates in data sources that are formatted in different orders, sometimes even with month names or abbreviations, like so: 4-Feb-2021. Python comes with a datetime module in the standard library, which is quite useful for dealing with dates in just about any format you can find: use the strptime method of the datetime class. Check the table of format codes and build your format mask. Here are the format codes relevant for our example:
Directive | Meaning | Example |
---|---|---|
%d | Day of the month as a zero-padded decimal number. | 01, 02, …, 31 |
%b | Month as locale’s abbreviated name. | Jan, Feb, …, Dec (en_US); Jan, Feb, …, Dez (de_DE) |
%Y | Year with century as a decimal number. | 0001, 0002, …, 2013, 2014, …, 9998, 9999 |
Note that the documentation says that the zero-padding in %d is optional when using strptime.
Then do
In [1]: from datetime import datetime
In [2]: written_date = '4-Feb-2021'
In [3]: parsed = datetime.strptime(written_date, '%d-%b-%Y').date()
In [4]: parsed
Out[4]: datetime.date(2021, 2, 4)
That approach also works when the date components come in a different order. Just reorder the format mask and use strptime.
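As a small sketch of what reordering the mask looks like, the snippet below parses the same calendar day written in three different orders. The sample strings and variable names are made up for illustration; the format codes are the standard ones from the datetime documentation.

```python
from datetime import date, datetime

# The same calendar day written in three different component orders.
samples = [
    ('04/02/2021', '%d/%m/%Y'),  # day first
    ('02/04/2021', '%m/%d/%Y'),  # month first (U.S. style)
    ('2021-02-04', '%Y-%m-%d'),  # year first (ISO 8601 style)
]

parsed = [datetime.strptime(text, mask).date() for text, mask in samples]

# Zero-padding is optional when parsing, so '4/2/2021' works with the first mask too.
unpadded = datetime.strptime('4/2/2021', '%d/%m/%Y').date()

print(parsed)
print(unpadded)
```

Note that '04/02/2021' and '02/04/2021' are both valid under either of the first two masks, so the mask has to come from knowledge about the data source; strptime cannot guess it for you.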
But what if the date you need to parse has components in different languages? Do you use dictionaries with the names of the months in said languages?
french_months = {
'janvier': 1,
'février': 2,
# oh, God, please, no, stop! 😖 Arrête tout de suite ! 🤢
}
Please don’t do that.
The actual, proper solution
All of this makes us wish there were a standard way in Python to deal with all those number and date format differences. And there is! Meet the locale module. It is also part of the Python standard library, but it is so frequently ignored in code I have seen in data science and data engineering circles that I cringe when I see things like some of the deliberately bad code examples above.
Configuring locales
Before you can use a locale in Python, however, it has to be installed on your system. The reason is that Python's locale module relies on the locale support of the underlying operating system (the POSIX locale database on Linux). The way to install locales varies depending on your operating system and version, but if you’re using Debian or a Debian-based system like Ubuntu, the following should work.
- If you haven’t already, install the locales package on your system.
  sudo apt-get install locales
- Locate and edit the /etc/locale.gen file on your system. That will probably require administrator privileges. Find the locales you want to use, uncomment those lines and save. Alternatively, you can use sed to edit the lines in place, for instance, to enable the French and German locale settings:
  sed -i 's/^# de_DE.UTF-8 UTF-8$/de_DE.UTF-8 UTF-8/g' /etc/locale.gen
  sed -i 's/^# fr_FR.UTF-8 UTF-8$/fr_FR.UTF-8 UTF-8/g' /etc/locale.gen
- Run locale-gen to generate the locales you have chosen. For example, if you wish to have U.S. English, German and French enabled, you would do
  locale-gen en_US.UTF-8 de_DE.UTF-8 fr_FR.UTF-8
- Finally, update the locales. Here you don’t need to list them all, just the default one.
  update-locale LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
You only need to repeat this configuration when you want to add or remove a locale.
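Once the system side is done, you can check from Python which locales are actually usable. The following is only a sketch: available_locales is a hypothetical helper name, and the candidate list is just the set of locales used in this article.

```python
import locale


def available_locales(candidates):
    """Return the names from `candidates` that setlocale() accepts on this system."""
    saved = locale.setlocale(locale.LC_ALL)  # one-argument form only queries
    usable = []
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_ALL, name)
            except locale.Error:
                continue  # not generated or installed on this machine
            usable.append(name)
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # put things back as they were
    return usable


wanted = ['C', 'en_US.UTF-8', 'de_DE.UTF-8', 'fr_FR.UTF-8']
usable = available_locales(wanted)
print(usable)
```

Any name missing from the printed list still needs to be enabled in /etc/locale.gen and generated with locale-gen.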
Using the locale module
Now that the system has the locales we want to use, we need to set them in Python before we process those numbers.
In [1]: import locale
...: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
Out[1]: 'de_DE.UTF-8'
If the locale is properly configured, you should see no errors here. If you see an “Error: unsupported locale setting” message (a locale.Error exception), go back, reconfigure the locales and make sure that locale is enabled.
Now you can both parse and format numbers and dates in the proper way for that locale, and switch to another locale whenever you need to.
In [1]: import locale
...: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
Out[1]: 'de_DE.UTF-8'
In [2]: locale.currency(0.5)
Out[2]: '0,50 €'
In [3]: locale.currency(1000.5, grouping=True) # thousands separator
Out[3]: '1.000,50 €'
In [4]: print('eine halbe Einheit: ' + locale.format_string('%.2f', 0.5))
eine halbe Einheit: 0,50
In [5]: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
Out[5]: 'en_US.UTF-8'
In [6]: print('half a unit: ' + locale.format_string('%.2f', 0.5))
half a unit: 0.50
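Because setlocale changes the setting for the whole process, it can be handy to switch locales only for a limited block of code and then restore the previous value. Below is one possible sketch of that idea using a context manager. The name temporary_locale is invented for this example, and the demo uses the C locale only because it is available everywhere.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name, category=locale.LC_ALL):
    """Switch `category` to `name` inside a with-block, then restore the old value."""
    saved = locale.setlocale(category)  # one-argument form only queries
    try:
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, saved)


before = locale.setlocale(locale.LC_NUMERIC)
# 'C' always exists; swap in 'de_DE.UTF-8' once that locale is generated.
with temporary_locale('C', locale.LC_NUMERIC):
    inside = locale.format_string('%.2f', 0.5)
after = locale.setlocale(locale.LC_NUMERIC)

print(inside)           # 0.50 under the C locale
print(before == after)  # True: the previous setting is back
```

Keep in mind that the Python documentation warns that setlocale is not thread-safe on most systems, so this pattern is best kept to single-threaded scripts and notebooks.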
Parsing numbers in Python, the proper way
For interpreting numbers in a locale-aware way, the locale module provides the atoi (for integers) and atof (for floats) functions.
In [1]: import locale
...: locale.setlocale(locale.LC_ALL, 'pt_BR.UTF-8')
Out[1]: 'pt_BR.UTF-8'
In [2]: number = '12,75'
In [3]: parsed_number = locale.atof(number)
In [4]: parsed_number
Out[4]: 12.75
In [5]: price = 'R$ 1.999,99'
In [6]: parsed_price = locale.atof(price.split()[-1])
In [7]: parsed_price
Out[7]: 1999.99
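If you are curious about which separators atof and atoi will expect, locale.localeconv() returns the conventions of the current locale as a dictionary. The sketch below inspects them under the C locale, chosen only because it exists on every system; with pt_BR.UTF-8 installed you would expect a comma and a dot instead.

```python
import locale

# 'C' is used here because it exists everywhere; try 'pt_BR.UTF-8' once it is installed.
saved = locale.setlocale(locale.LC_NUMERIC)
locale.setlocale(locale.LC_NUMERIC, 'C')
try:
    conventions = locale.localeconv()
    decimal_point = conventions['decimal_point']
    thousands_sep = conventions['thousands_sep']
    value = locale.atof('1999.99')  # parsed with the separators above
    count = locale.atoi('42')
finally:
    locale.setlocale(locale.LC_NUMERIC, saved)  # restore the previous setting

print(repr(decimal_point), repr(thousands_sep))  # '.' '' under the C locale
print(value, count)
```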
The atof function is also very useful for converting data in Pandas, the ubiquitous data science library in Python.
In [1]: import pandas as pd
In [2]: import locale
...: locale.setlocale(locale.LC_ALL, 'pt_BR.UTF-8')
Out[2]: 'pt_BR.UTF-8'
In [3]: df = pd.DataFrame(
...: [
...: ['banana', 'R$ 3,99'],
...: ['maçã', 'R$ 4,49'],
...: ['pêssego', 'R$ 8,90']
...: ],
...: columns=['fruta', 'preço'],
...: )
In [4]: df
Out[4]:
fruta preço
0 banana R$ 3,99
1 maçã R$ 4,49
2 pêssego R$ 8,90
In [5]: df['price'] = df['preço'].apply(lambda v: locale.atof(v.split()[-1]))
In [6]: df
Out[6]:
fruta preço price
0 banana R$ 3,99 3.99
1 maçã R$ 4,49 4.49
2 pêssego R$ 8,90 8.90
Note that atof can even convert strings containing numbers in scientific notation to floats:
In [7]: locale.atof('1E+10')
Out[7]: 10000000000.0
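Real columns often contain entries that are not numbers at all, and atof raises ValueError for those. One way to handle that, sketched here with a hypothetical safe_atof helper, is to return None for anything that cannot be parsed. The sample values assume the C locale, which is set explicitly so the example behaves the same everywhere.

```python
import locale

# Set the C locale explicitly so the dot is the decimal separator in this demo.
locale.setlocale(locale.LC_NUMERIC, 'C')


def safe_atof(text):
    """Return locale.atof(text), or None when `text` is not a parseable number."""
    if not isinstance(text, str):
        return None  # e.g. missing values
    try:
        return locale.atof(text.strip())
    except ValueError:
        return None


raw = ['3.99', ' 4.49 ', 'n/a', None, '1E+10']
cleaned = [safe_atof(v) for v in raw]
print(cleaned)  # [3.99, 4.49, None, None, 10000000000.0]
```

A function like this can be passed to the apply method of a Pandas column in place of the lambda shown earlier, once the currency symbol has been split off.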
Parsing dates
The strptime and strftime functions of the datetime module are locale-aware: month and weekday names follow the current locale. So, parsing dates with a different order of numbers or with month names in another language is easy.
In [1]: import locale
...: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
Out[1]: 'de_DE.UTF-8'
In [2]: from datetime import datetime
In [3]: written_date = '1-Mai-2021'
In [4]: parsed = datetime.strptime(written_date, '%d-%b-%Y').date()
In [5]: parsed
Out[5]: datetime.date(2021, 5, 1)
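Abbreviated month names are not always what you would guess, so it can help to list what %b expects in a given locale before writing the format mask. This is a minimal sketch; month_abbreviations is a made-up helper, and it relies on calendar.month_abbr from the standard library, which follows the current LC_TIME setting.

```python
import calendar
import locale


def month_abbreviations(locale_name):
    """List the twelve %b month abbreviations for `locale_name`."""
    saved = locale.setlocale(locale.LC_TIME)
    locale.setlocale(locale.LC_TIME, locale_name)
    try:
        # calendar.month_abbr is locale-aware; index 0 is an empty placeholder.
        return [calendar.month_abbr[i] for i in range(1, 13)]
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore the previous setting


# 'C' gives the English names; pass 'de_DE.UTF-8' once that locale is generated.
names = month_abbreviations('C')
print(names)
```

If the abbreviations in your data do not match the printed list, strptime with %b will raise ValueError, which is a sign that the data was written with a different locale or convention.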
Conclusion
Knowing how to use the locale module is an essential skill for many data engineering and data science tasks that involve international data. It pays off to get some practice with it, as your data processing will be more robust and you will avoid common mistakes when cleaning data.
Note: edited on 2022-02-14 to include examples using currency with thousands separators and converting numbers in scientific notation. Thanks for your suggestions, Washington and Salomão!