I’ve been learning how to do perform data science and analysis using various programming tools, such as Python, NumPY, Pandas, and others. Though through most of my career, I’ve mainly used Excel/Sheets to do most of my analysis, there are some limitations that maybe learning these tools can provide other advantages.
Because of its programmatic nature, it’s so much easier to manipulate data and perform calculations at scale. Excel/Sheets is primarily interface driven. So if you’re trying to operate on thousands of records, it won’t be efficient. On the other hand, I notice that I like to visualize data, maybe because that’s all I’ve ever known. I like to look at my data set to get a sense of patterns, data types, and outliers before I perform any action. Using the programmatic method, it’s still possible to do so, but just takes a different approach on “getting a basic sense of your data”.
So for the rest of this post, I am going to go through an exercise using programmatic languages to analyze data. As a Product Manager, I think it’s value-add to learn through this way of analyzing data. It’s basically the beginning step towards using ML techniques for analysis.
So for the dataset, I’m going to use a data set from Kaggle.
The Dataset
This problem statement is taken directly from the Kaggle dataset:
A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
Which variables are significant in predicting the price of a car
How well those variables describe the price of a car
Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the America market.
And here’s the business goal:
We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
My goal for this exercise is to find the answers to the following questions:
- What are the most common types of car in the dataset?
- What factors determine the price of the car?
The sample dataset is for learning purposes only. It does not represent real data. After downloading it, I can quickly get a sense of the first 10 records with a simple df.head(10) function. I won’t go over the basics and syntax of python, but in this instance “df” represents the variable of my dataset. It’s shorthand for “dataframe”. And the “head(10)” function means to return the first 10 records of the dataframe.
As you can see from the return data, there’s 10 records of cars with 26 dimensions. It doesn’t show all of the 26 dimensions so I need to figure out a way to find out descriptions of all of the fields and what they mean. Luckily, the dataset provided another file as a data dictionary for that.
Now that I have this, I want a bit more info about the data types of these fields and also the size of the dataframe. I’ll use df.info() and df.shape functions.
As you can see, at the top of that screen shot, (205, 26) means there are 205 records with 26 fields/dimensions in the dataframe. And looking at the datatypes, there’s nothing that stands out as an error, and there doesn’t seem to be any missing data in the “Non-Null Count” column. The dataset looks pretty clean.
The next thing I want to do is pretty cool in my opinion. I always do this in Excel to get an even better sense of the numerical data I’m working with. I’m talking about the describe() function.
This function allows me to see some descriptive statistics on all numerical columns. It helps me see at a glance quick patterns, tendencies, frequencies, and distributions of the dimensions. This might help me later decide to pick certain dimensions to drill down and visualize. This also includes some statistics on categorical dimensions, like CarName and fueltype.
Looking at this data, I want to first answer what are the most common types of cars in this data set. I want to first take a look at the categorial dimensions:
- Carname
- fueltype
- aspiration
- doornumber
- carbody
- drivewheel
- enginelocation
- enginetype
- cylindernumber
- fuelsystem
Count means nothing new, just counts how many records there are. Unique shows how many unique values that exist for each dimension. Top shows the most common value of each dimension, and Freq shows how many times that Top value appears in that dimension.
To me, the dimensions worth taking a deeper dive are:
- carbody
- drivewheel
- enginetype
- cylindernumber
- fuelsystem
These dimensions show more than 2 unique values, and their frequencies also aren’t too low or too near the 205 max. Therefore, I think these dimensions may show an interesting distribution of values.
As for the other dimensions, we can easily deduce what kind of cars are most common for that specific dimension. For example, most of the data set has more gas fuel types and standard aspirations. It takes a bit of car knowledge to know what that means, but basically this means the cars are either gas powered or diesel gas powered. A standard aspiration means the car engine does NOT have a turbo. So therefore, most cars in the data set are non-turbo gas-powered vehicles.
Looking at this chart, the most common types of cars are the sedan and hatchback types, most being front wheel driven. You could have deduced this using the categorical dimension stats. However, you wouldn’t have known that hatchbacks are a relatively close 2nd frequency in this data set nor would you know the distribution of other variables within those body types.
Here is another visual analysis of the other categorical dimensions. An overwhelming majority of vehicles have an engine type = “ohc”, which means overhead cam. I won’t go into details as to what that means because at this point these dimensions are extremely technical. As a simple summary, majority of the cars in this data set have an overhead camshaft with a four cylinder engine. This tells me that most of these cars are economical cars, given the low cost of four cylinder ohc engines.
To simply test that theory, I want to see what’s the distribution of car prices for the whole data set.
And based on car prices, it does seem like the majority of cars are priced under 10,000, indicating they’re on the cheaper end of the spectrum.
Moving on ahead, I want to now analyze the rest of the numerical data sets. There is a lot, so I don’t want to spend time graphing each dimension and comparing its price. So first, I want to produce a correlation heatmap analysis to quickly glance at the data set.
To answer my question #2, which factors determine the price of the car, the correlation function really helps out with a starting point to answer this question. As you can see, the enginesize, curbweight, horsepower, carwidth, and carlength are the independent variables that correlated the most to the price of the car.
Here are some of my thoughts on these findings:
- Although the most common cars are economic cars, price correlates significantly with the power and size output of the car and not its efficiency. In fact, price is inversely related to MPG. An explanation that would make sense is that the more features a car has, the more weight it has. Therefore, in order to move that weight, the larger the power output would be needed. Because of these features, the more costly the car will be. It also doesn’t cost more to make a lesser power output car.
- Many of the independent variables have high correlations with each other, which makes sense. The higher the horsepower of a car, the larger and heavier it tends to be.
- There doesn’t seem to be an independent variable that is over 90% correlated with price. Therefore, I predict an ML algorithm, such as a multiple linear regression, won’t yield entirely accurate results but will still predict a reasonable “ball park” result.
- This heatmap does not include the categorical dimensions, such as engine types. Using ML techniques, we can include those cat-dimensions into our prediction model.
This concludes a basic analysis of the car price data set. On the next part, I will use a machine learning algorithm to create a model to predict car price. Then I will examine the performance of that model.