Asking Good Questions

Asking good questions is important when analyzing data. That’s not easy when you have no specific goal in mind, such as using statistics from the last sale to improve your profit in the next one.

Environment: ASUS ROG G55VW, Xubuntu 18.04, Anaconda, Python 3.7. Using Jupyter-notebook.

Our teacher had a CSV file to play with. http://taanila.fi/employee.csv

It has a lot of information, but I’m interested in the people who travel for business frequently. How old are they, do men or women travel more, what is their marital status and job satisfaction, and so on.

First: how many people travel frequently?

277 employees out of 1470 (18.84%) travel frequently in this company.
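A minimal sketch of how a count like this can be done in pandas. The column name `BusinessTravel` and the value `Travel_Frequently` are assumptions about the CSV, and the tiny DataFrame is made-up sample data, not the real file.

```python
import pandas as pd

# Made-up sample data; the real file has 1470 rows.
# Column name and values are assumptions about the CSV.
df = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently", "Travel_Rarely", "Non-Travel",
                       "Travel_Frequently", "Travel_Rarely"],
})

# Boolean mask: True on rows where the employee travels frequently.
is_frequent = df["BusinessTravel"] == "Travel_Frequently"
frequent = df[is_frequent]

count = len(frequent)
share = 100 * count / len(df)
print(f"{count} of {len(df)} employees ({share:.2f}%) travel frequently")
```

With the real data the same two numbers would be 277 and 1470.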

What’s the percentage of females and males?

As we can see from this filtering, travelling is fairly evenly split between men and women: 117 out of 277 frequent travellers (about 42%) are female.
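Something like the following gives both the counts and the percentages in one go. This is only a sketch: `Gender` and `BusinessTravel` are assumed column names, and the rows are invented for illustration.

```python
import pandas as pd

# Made-up sample data; column names are assumptions about the CSV.
df = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently"] * 5 + ["Travel_Rarely"] * 3,
    "Gender": ["Female", "Male", "Male", "Female", "Male",
               "Female", "Male", "Male"],
})

frequent = df[df["BusinessTravel"] == "Travel_Frequently"]

# Absolute counts per gender among frequent travellers.
gender_counts = frequent["Gender"].value_counts()

# normalize=True returns fractions, so multiply by 100 for percentages.
gender_share = frequent["Gender"].value_counts(normalize=True) * 100

print(gender_counts)
print(gender_share.round(1))
```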

I’d assume that travellers would be single or divorced most of the time. Let’s see.


I tried to squeeze an “or” condition in to shorten the line, but I don’t know how to do it properly.
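For reference, pandas combines boolean masks with `|` (or) and `&` (and) rather than the Python keywords `or` and `and`, and each comparison needs its own parentheses. A small sketch with assumed column names (`BusinessTravel`, `MaritalStatus`) and invented rows:

```python
import pandas as pd

# Made-up sample data; column names are assumptions about the CSV.
df = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently", "Travel_Frequently",
                       "Travel_Frequently", "Travel_Rarely"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
})

frequent = df["BusinessTravel"] == "Travel_Frequently"

# "or" between two conditions: use | and wrap each comparison in parentheses.
single_or_divorced = (df["MaritalStatus"] == "Single") | (df["MaritalStatus"] == "Divorced")
result = df[frequent & single_or_divorced]

# Shorter alternative: isin() checks membership in a list of values.
result_isin = df[frequent & df["MaritalStatus"].isin(["Single", "Divorced"])]

print(result)
```

Both versions select the same rows; `isin()` tends to read better once there are more than two values to match.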


Trying a little differently

Success, finally. The Excel sheet is now filtered to show only single or divorced people who travel frequently. So 159 out of 277 frequently travelling people are single or divorced. What is the difference between single and divorced people? Aren’t they both single? Why would you make a category for people whose relationship didn’t work out? Let’s get rid of that.

I added a line that replaces every ‘Divorced’ with ‘Single’, so we don’t make weird categorizations, and a new column named ‘mstatus’ to get the correct values when printing. Pandas complains “A value is trying to be set on a copy of a slice from a DataFrame.”, but it still gave me the result I wanted.
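That warning usually shows up when a new column is assigned on a filtered slice of another DataFrame. One common way to avoid it is to take an explicit `.copy()` of the filtered rows first. A sketch, where `MaritalStatus` is an assumed column name, the rows are invented, and `mstatus` is the new column mentioned above:

```python
import pandas as pd

# Made-up sample data; MaritalStatus is an assumed column name.
df = pd.DataFrame({
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
    "Age": [25, 41, 38, 52],
})

mask = df["MaritalStatus"].isin(["Single", "Divorced"])

# .copy() makes an independent DataFrame, so adding a column to it
# no longer triggers the "copy of a slice" warning.
single = df[mask].copy()

# Store the merged category in a new column and leave the original untouched.
single["mstatus"] = single["MaritalStatus"].replace("Divorced", "Single")

print(single)
```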

Okay, so now we have a list of single people who travel a lot for this company. Maybe we should arrange them into groups by age and make some graphics about that.

With these lines of code we arranged every employee of the company into age groups in a new column. The groups go 15 to 20, 21 to 30, etc. Now we should be able to use it with the earlier filters.

I don’t know why it’s giving me ages between 10-20 when it should give me 15-20. I re-ran both cells and tried putting the bins variable into the last cell, but it had no effect. Then I restarted the kernel and the last cell blew up, so there must be something wrong with it: “KeyError: "['ageFilter'] not in index"”.
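For comparison, here is a minimal sketch of age grouping with `pd.cut` and explicit labels. The bin edges and label strings are my assumptions based on the groups described above, `Age` is an assumed column name, and the ages are invented. A KeyError like the one above typically means the column is being selected before the cell that creates it has run, which can happen after a kernel restart.

```python
import pandas as pd

# Made-up ages; Age is an assumed column name.
df = pd.DataFrame({"Age": [18, 20, 21, 35, 60]})

# pd.cut intervals are right-inclusive by default: (14, 20], (20, 30], ...
# so an edge of 14 puts ages 15 to 20 in the first group.
bins = [14, 20, 30, 40, 50, 60]
labels = ["15-20", "21-30", "31-40", "41-50", "51-60"]

# Create the column first; selecting it before this line has run raises KeyError.
df["ageFilter"] = pd.cut(df["Age"], bins=bins, labels=labels)

print(df[["Age", "ageFilter"]])
```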


I just changed the dataframe’s name to the previously used ‘single’. Next, let’s do some graphics.

At the moment job satisfaction is only rated from 1 to 5, with 1 being the lowest. I want to build a chart that tells how happy employees are with words, not numbers. I also want to separate men and women to see if there are any differences.
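One way to turn the numbers into words is to map them through a dictionary and then cross-tabulate against gender. This is a sketch only: the wording of the labels is hypothetical, `JobSatisfaction` and `Gender` are assumed column names, and the rows are invented.

```python
import pandas as pd

# Made-up sample data; column names are assumptions about the CSV.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
    "JobSatisfaction": [1, 5, 3, 5, 2],
})

# Hypothetical wording for the 1-5 scale (1 = lowest).
words = {
    1: "Very dissatisfied",
    2: "Dissatisfied",
    3: "Neutral",
    4: "Satisfied",
    5: "Very satisfied",
}
df["satisfaction"] = df["JobSatisfaction"].map(words)

# Rows: gender, columns: satisfaction in words, cells: number of employees.
table = pd.crosstab(df["Gender"], df["satisfaction"])

print(table)
```

Calling `table.plot.bar()` on a table like this would draw one group of bars per gender.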

This is the company’s “approval rating” among employees. If I were the president of this company, I would actually be pretty concerned. If more than one in three employees is not happy with their job, it must affect the quality of work.


Added new column for percentage.
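A percentage column like this can be computed by dividing each count by the total. A small sketch with invented ratings, where `JobSatisfaction` is an assumed column name and `count` and `percent` are hypothetical names for the new columns:

```python
import pandas as pd

# Made-up ratings; JobSatisfaction is an assumed column name.
df = pd.DataFrame({"JobSatisfaction": [1, 5, 3, 5, 2, 3, 3, 4]})

# Count employees per rating, ordered from 1 to 5.
summary = df["JobSatisfaction"].value_counts().sort_index().to_frame("count")

# Each group's share of all employees, in percent.
summary["percent"] = (100 * summary["count"] / summary["count"].sum()).round(1)

print(summary)
```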

Alright, a pretty straightforward chart of the earlier dataframe. It shows how employees feel about their job and how big each group is. Maybe some numbers would give a little insight to someone not familiar with this company.

Okay, so I struggled for way too many hours trying to get the bar values printed next to the bars. BUT NO! It’s possible, but I just don’t get it. I had every value ready to go, and I really wanted those values printed in the chart. I took a shortcut, and it feels bad.

I had issues iterating over those values with a for-in loop. All the examples I found used static values. Here is the same chart with static values.
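For the record, a loop over the bars and their values can place each label without hard-coding anything. This is a sketch under assumptions: the counts and category names are invented, and I am assuming a horizontal bar chart drawn with matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts per satisfaction group.
counts = pd.Series([10, 25, 40], index=["Unhappy", "Neutral", "Happy"])

fig, ax = plt.subplots()
bars = ax.barh(counts.index, counts.values)

# Pair each bar with its value and write the value just past the bar's end.
for bar, value in zip(bars, counts.values):
    ax.text(
        bar.get_width() + 0.5,               # x: slightly right of the bar
        bar.get_y() + bar.get_height() / 2,  # y: vertical middle of the bar
        str(value),
        va="center",
    )

ax.set_xlabel("Employees")
```

Because the loop reads the positions from the bar objects themselves, the labels follow the data when the counts change.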

GITHUB https://github.com/matilinux/DataAnalytics

To be Continued
