Data scientist toolbox
Reading data from excel
Excel format is popular in analysis and data science. Pandas reads the excel data well, but sometimes you would like to specify data types. How can you do that?
Let’s assume your data looks like this
As you can see, pandas read all data as int, despite the first column should be text, the second short and the third digital 10.2. How we can force reading data in such a format.
Pandas data types
We can use command dtype with a dictionary for all or selected columns. But what are the types you can use in data science? Pandas datatypes correspond to Numpy detailed datatypes, but for most cases it is enough to use following datatypes: object, int, float, datetime, bool. As mentioned, you can use Numpy ones: object (str, np.string_, np.unicode_), for int (np.int_, np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64), and for float (np.float16, … 32, … 64),
pd.read_excel('file_name.xlsx', dtype={'Customer': str, 'Zip code': np.int32, 'Credit limit': float})
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer 4 non-null object
1 Zip code 4 non-null int32
2 Credit limit 4 non-null float64
dtypes: float64(1), int32(1), object(1)
memory usage: 208.0+ bytes
We no only get data in format we want, but also optimased memory usage.
Currency display formatting
The question remains what data type for currency you should use? It is still float for internal representation, and this format displays.
Type “object” means “string”. You need to distinguish between the underlying data (e.g., the integer 1234) and its (string) representation e.g., 1,234. Pandas allows you to define custom formatters on a per-column basis. You should store integer data as integer data and define a custom formatter for it.
pd.options.display.float_format = '{:6.2f}'.format
pd.options.display.float_format = '{:,}'.format
pandas.io.formats.format.IntArrayFormatter
After that you got results
Customer Zip code Credit limit
0 00123 20345 120,000.00
1 00362 21890 10,000.00
2 00467 34020 20,000.00
3 00234 34300 30,000.00
There are 5 basic numerical types representing booleans (bool), integers (int), unsigned integers (uint) floating point (float) and complex. Those with numbers in their name indicate the bitsize of the type (i.e. how many bits are needed to represent a single value in memory). Some types, such as int
and intp
, have differing bitsizes, dependent on the platforms (e.g. 32-bit vs. 64-bit machines). This should be considered when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.
And that is it. Happy coding.