I have been studying data science for the last few months. I informally started last year when I signed up for the Machine Learning course on Coursera, taught by Andrew Ng.
My first impression was that it reminded me of Matlab. It had a lot of math and linear algebra. I figured that it was going to take a lot of motivation to finish this class.
When I went to engineering school I spent a lot of time learning advanced math, electronics, computer design, Matlab, machine code, Java and some C.
I figured that this Machine Learning class would take a lot of effort and motivation, so I put it on hold.
Then I read these books:
- Predictive Analytics by Eric Siegel
- Data Analytics by Anil Maheshwari
It wasn't until January 2017 that I really got motivated to learn faster. I read that a group of developers was downloading massive data sets in a race to save climate data.
I signed up for "Executive Data Science" on Coursera and took these classes:
- A crash course in data science
- Building a data science team
- Managing data analysis
The first 2 courses were fairly simple. The last one, "Managing data analysis", took a bit more than the average amount of grey matter.
It took me about 2 months to complete this class because I wanted the concepts to really sink in.
Managing Data Analysis follows this iteration:
- Set expectations
- Collect data
- Revise expectations
And it is divided into these sections:
- State and refine the question
- Explore the data
- Build formal models
- Interpreting results
- Communicating results
During "Explore the data" there was an example that required R. I knew what R was but I had never used it. It took me about 1-2 weeks to get into the R mindset and really understand the concepts that were not explained in the example.
names(ozone) <- make.names(names(ozone))
This looks like a very simple line. But the example didn't explain anything about it.
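For anyone who hits the same wall: in base R, make.names() turns arbitrary strings into syntactically valid R names, for example by replacing spaces with dots. The sketch below uses a tiny made-up data frame (the column names and values are hypothetical stand-ins, not the real ozone data set) to show what that one line does and why it is convenient.

```r
# Hypothetical stand-in for the ozone data. Raw CSV headers often
# contain spaces, which are not valid in plain R variable names.
ozone <- data.frame(c(25.79, 25.79), c(0.031, 0.027), c("2016-05-01", "2016-05-02"))
names(ozone) <- c("Latitude", "Sample Measurement", "Date Local")

# Before cleaning, a column with a space needs backticks:
#   ozone$`Sample Measurement`

# names(ozone) reads the current column names, make.names() makes them
# syntactically valid, and `names(ozone) <-` writes them back.
names(ozone) <- make.names(names(ozone))
print(names(ozone))

# After cleaning, the column can be used with the plain $ syntax.
mean_measurement <- mean(ozone$Sample.Measurement)
print(mean_measurement)

# Names that start with a digit also get an "X" prefix.
make.names("1st Max Value")  # "X1st.Max.Value"
```

So the line is a small clean-up step: it does not touch the data, only the column labels, so that later code can refer to columns without quoting.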
Over the years of managing engineering projects, I learned that copy/paste is your worst enemy.
I always go by "never copy and paste what you don't understand".
I wrote a story about the best time to run according to science.
I used a data set about ozone levels in the US from 2016. I did some exploratory data analysis and created plots for a few cities. This was a great learning experience.
During "Build Formal Models" they introduced a lot of new concepts, including associational analysis and prediction analysis.
I will add more details about this course in another post.
Halfway through this course I started reading this book:
- The Data Science Handbook by Carl Shan and friends
It has a very interesting story by Clare Corthell, where she explains how she came up with her own "open source" MS in Data Science.
For the last few months I have been thinking about an MS in Data Science.
I have 2 engineering degrees from FIU and I wasn't looking to go back to FIU. If I went to get another degree it would be from a top engineering school.
I have been thinking about the Masters in Predictive Analytics from Northwestern.
But I have also been thinking about what the goal of getting this degree would be. I know that the future of everything will be based on data science, but I am not exactly sure what role I want to play in this future.
I have been growing technical teams for the last year. I am currently focused on growing data science teams, so I decided to follow the Open Source MS in Data Science for now and apply this knowledge to my projects.
There is also a lot of knowledge I have been postponing for...perhaps not lack of motivation but just...life. There are always a million excuses for not doing something. I figured that life is short and, instead of finding excuses, perhaps it is better to just learn new things.
My MS in Data Science curriculum
A few weeks ago I learned about the Data Science Venn Diagram.
When I looked at it I figured that I have been following this for the last few years.
A good data scientist should have these skills:
- Coding skills
- Math and statistics skills
- Business domain skills
Actually, data scientists are not the only ones who should have these skills. Most people who want to build the future of everything should have them too.
A doctor will make diagnoses based on data. To really trust the data, she (or he...I will use she from now on) would need to learn how the recommendation was created. She would need to study math, statistics and programming to really dig deep and understand how it works.
A marketer will create a marketing campaign based on data. To trust the recommendations based on the data, she has to follow the same path.
The same goes for other industries.
This comparison is a little bit extreme. There are creators and there are users. Just because you trust in buying online doesn't mean you have to know how it works.
But the future will be based on recommendations and trusting these recommendations will take time until we are "recommendation users".
Coding skills
Early in my career I learned C and Java. Then I got involved in Ruby.
For data science you must kick ass in R and Python.
I have been learning R and I know some Python, but I decided to follow a structured path, because it's easy to develop bad technique.
Math and Statistics skills
This will be the hard part. Programming is not hard to learn. But math is complicated. I studied a lot of it in engineering college and it took me a long time to understand many math subjects.
Business domain skills
I think I am strong here. I don't have domain knowledge in every subject, but I have many years of experience in business development using the lean startup method, and I know how to identify problems and solutions.
I signed up for these Coursera paths:
Python for Everybody from University of Michigan
This specialization has 5 courses:
- Intro to Python
- Python data structures
- Using Python to access web data
- Using databases with Python
- Retrieving, processing and visualizing data
Data Science Specialization from Johns Hopkins University
This specialization has a lot of courses:
- The data scientist's toolbox
- R programming
- Getting and cleaning data
- Exploratory data analysis
- Reproducible research
- Statistical inference
- Regression models
- Practical machine learning
- Developing data products
- Data science capstone
Later on I might take these:
- Data Visualization with Tableau from UC Davis
- MEAN stack
In 6 months I might sign up for:
- Udacity Data Analyst Nanodegree
I am currently taking these 2:
- Intro to Python
- R programming
I will keep this site updated with lessons learned in my quest to conquer data science.