Data scientist toolbox

Maki Pa — Fri, 06 Jan 2023 20:27:48 +0000

Reading data from excel

Excel format is popular in analysis and data science. Pandas reads the excel data well, but sometimes you would like to specify data types. How can you do that?

Excel data science

Let’s assume your data looks like this

As you can see, pandas read all data as int, despite the first column should be text, the second short and the third digital 10.2. How we can force reading data in such a format.

Pandas dataset

Pandas data types

We can use command dtype with a dictionary for all or selected columns. But what are the types you can use in data science? Pandas datatypes correspond to Numpy detailed datatypes, but for most cases it is enough to use following datatypes: object, int, float, datetime, bool. As mentioned, you can use Numpy ones: object (str, np.string_, np.unicode_), for int (np.int_, np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64), and for float (np.float16, … 32, … 64),

pd.read_excel('file_name.xlsx', dtype={'Customer': str, 'Zip code': np.int32, 'Credit limit': float})

Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer 4 non-null object
1 Zip code 4 non-null int32
2 Credit limit 4 non-null float64
dtypes: float64(1), int32(1), object(1)
memory usage: 208.0+ bytes

We no only get data in format we want, but also optimased memory usage.

Currency display formatting

The question remains what data type for currency you should use? It is still float for internal representation, and this format displays.

Type “object” means “string”. You need to distinguish between the underlying data (e.g., the integer 1234) and its (string) representation e.g., 1,234. Pandas allows you to define custom formatters on a per-column basis. You should store integer data as integer data and define a custom formatter for it.

pd.options.display.float_format = '{:6.2f}'.format
pd.options.display.float_format = '{:,}'.format
pandas.io.formats.format.IntArrayFormatter

After that you got results

Customer Zip code Credit limit
0 00123     20345    120,000.00
1 00362     21890     10,000.00
2 00467     34020     20,000.00
3 00234     34300     30,000.00

There are 5 basic numerical types representing booleans (bool), integers (int), unsigned integers (uint) floating point (float) and complex. Those with numbers in their name indicate the bitsize of the type (i.e. how many bits are needed to represent a single value in memory). Some types, such as int and intp, have differing bitsizes, dependent on the platforms (e.g. 32-bit vs. 64-bit machines). This should be considered when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.

And that is it. Happy coding.

The post Data scientist toolbox appeared first on Customer Experience Management.

pandas – Customer Experience Management

Data scientist toolbox

Reading data from excel

Pandas data types

Currency display formatting