Getting Started with Data Science and AI

by Didi Ooi S. v1.0 October-22-2017
Repo Content: Resources (all hyperlinked for your convenience!), background, and a (very) basic introduction.

1. My Long-Short Summary

Hello! Many of you, like me, are probably NOT coming from a computer science or applied statistics background. One of the most effective ways to pick up Data Science (especially Machine Learning and Deep Learning) is to:

  1. PRACTICE, PRACTICE, PRACTICE and LEARN at the same time
  2. Post your repositories (projects) on GitHub & Kaggle
  3. Start participating in local Hackathons and/or Kaggle Competitions to try to solve problems with real data.

The list above is not necessarily in order. Half of the people I talked to are more comfortable with 1, 2 and then 3, which is akin to a traditional education (the bottom-up approach). Some people are a bit more adventurous, starting with 3, then 2, then 1 (like myself)! I decided to visit my first hackathon, the Houston Hackathon, this year to find out what it was all about - only to get pulled into one project, and then another one! If you like learning on the go, love solving problems and don't mind being thrown into the unknown, then this is the way to go.

PRACTICE AND LEARN (AT THE SAME TIME)

The reason I say practice and learn 'at the same time' is that:

  • PRACTICING lets you apply the TOOLS you need to solve your data science problem using existing repositories, without wasting time starting from the ground up. I believe practicing this way, top-down, is much faster than the traditional learning we have all been exposed to (especially me!)
  • LEARNING is crucial to understanding WHY you applied the tools that you did, i.e. why you chose those algorithms and what background math shaped them. You're not a Data Scientist if you're just the end-user of tools!

PARTICIPATE IN INTEREST GROUPS

NEXT STEP, if you live in a medium or big city - get out of your comfort zone and meet other people in the AI/DS field in your area. Chances are your city already has plenty of Meetups or Eventbrite events where you can participate or sit in. 98% of these FREE events/meetups cater to like-minded people with varying degrees of experience! Houston's Machine Learning and Energy Data Science Meetups have between 3 and 10 NEW people at each meeting. Up to half of them have no AI/DS background or degree, BUT they have subject matter expertise and want to use this technology to solve problems in their field.

If there isn't one, then be the one to organize the first in your area!

Since finding out about Meetup in the summer of 2017 (yes, I am a late bloomer), I have been going regularly to the Houston Data Science, Houston Machine Learning, Houston Energy Data Science and Houston Data Visualization groups, and I have definitely learned a lot by talking to people, asking questions, or sitting in on lectures.

(Be warned that it can be intimidating at first. Understand and expect that you will leave your first few Meetups with 0.1 to 5% comprehension of the whole learning experience, and that is COMPLETELY NORMAL. Read Carol Dweck's Mindset to get a sense of what I mean. It is really important to keep an open mind and be very eager to learn, and you will soon realize how that percentage of comprehension grows over time. It is easier and faster than just learning the theoretical background from books and courses but, most importantly, it will reinforce your theoretical learning.)

(At the more recent Agile* Geophysics Hackathon in Houston!)

UNDERSTAND HOW IT IS BEING APPLIED

Last but not least, learn how data science and its architectures are being applied in the REAL WORLD. Start by figuring out which industries you are interested in, then research how they are (or are not) integrating advanced analytics and emerging technologies with traditional methods in their existing or future products, workflow pipelines, supply chains etc. Be the big-picture problem solver. Then VOILA! Your journey begins!

For the Geologists and Geophysicists, I recommend keeping up to date with these two journals: Computers & Geosciences and The Leading Edge. For open-source tools, please visit our open-source collaborative effort, Open Geoscience!

2. Background for the Newbies

People often mix up the terms Machine Learning (ML) and Deep Learning (DL), and how they relate to Artificial Intelligence (AI).

Here is one diagram I made (inspired by Nvidia): (In short, Deep Learning is a subset of Machine Learning, which is a subset of AI)

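There is also another way to view this: by how the algorithm learns. Machine Learning is commonly divided into Supervised Learning (learning from labelled examples), Unsupervised Learning (finding structure in unlabelled data) and Reinforcement Learning (learning from reward signals). Deep Learning is a family of models that can be used in all three.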

BEHIND THE HYPE: TERMINOLOGIES EXPLAINED

  • Data Science: Extraction of knowledge and information from data, using integrated ideas from Mathematics, Statistics, Machine Learning, Computer Science, and Subject Matter Expertise (SME).
  • Big Data: Data, often unstructured, arriving from multiple sources at an alarming Velocity, Volume and Variety, in formats from which meaningful value and information have not (yet) been extracted.
  • Machine Learning (ML): A field of computer science in which algorithms have the ability to learn from data without being explicitly programmed.
  • Statistical Learning: A branch of applied statistics that emerged in response to ML, emphasizing statistical models and the assessment of uncertainty.
  • Deep Learning (DL): A computational method for implementing machine learning using artificial neural networks, which build multiple layers of abstraction to solve complex semantic problems.
  • Reinforcement Learning (RL): An extremely promising area using the trial-and-error paradigm, where the (computing) Agent learns and corrects its Actions based on Reward signals and State.
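
To make the Agent / Action / Reward / State vocabulary a bit more tangible, here is a minimal sketch of trial-and-error learning using tabular Q-learning. Everything in it is an assumption made up for illustration: the five-cell corridor, the reward of 1 at the right-hand end, and the values of `ALPHA`, `GAMMA` and `EPSILON`. It is a toy, not how any particular library implements RL.

```python
import random

N_STATES = 5            # hypothetical corridor: cells 0..4, reaching cell 4 ends an episode
ACTIONS = (-1, +1)      # action index 0 = step left, 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate (illustrative)


def step(state, action):
    """Environment: move, stay inside the corridor, reward 1 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # q[state][action] is the Agent's current estimate of long-term reward
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Explore at random sometimes (or when undecided), otherwise act greedily
            if rng.random() < EPSILON or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward = step(state, ACTIONS[a])
            # Correct the estimate using the Reward and the best value of the next State
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q_table = train()
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(policy)
```

The Agent is never told that "right" is the correct direction; it only sees State, picks an Action and receives a Reward, and the preference for stepping right emerges from the updates.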

(The goal is to be a UNICORN, or at least a strong one third and half of the other two thirds, if that makes sense?)

All of the terminologies are very similar but have DIFFERENT EMPHASES.

Notice the importance of subject matter expertise in the equation. Don't ditch your science degree/masters to jump straight onto the Data Science and AI bandwagon - the skill sets from your courses are still valuable. They make you the subject matter expert, so think about how you can use AI in your industry or field of research.

3. Practice

  • Kaggle: Run through the tutorials and start by solving the Titanic problem with Machine Learning!
  • GitHub: Fork existing, popular repositories and play with them!
  • It will also be extremely USEFUL to have these 30 essential data science, ML and DL CHEAT SHEETS next to you at all times, posted on your corkboard at work, at home and by your bedside.

4. Resources

MACHINE LEARNING

  • Andrew Ng's Stanford Machine Learning course (available on Coursera) is a great place to start if you already have a decent science and math background.
  • For the theoretical background behind Statistical Learning, the branch of statistics that developed alongside Machine Learning, your best bet is An Introduction to Statistical Learning. The book is free to access online!
  • If you want the classic guide to ML and need a refresher on the math, definitely go to Chris Bishop's Pattern Recognition and Machine Learning.
  • If you're feeling extra adventurous and would love to go deeper into the theoretical and mathematical background, try Hastie's The Elements of Statistical Learning.

DEEP LEARNING

REINFORCEMENT LEARNING

MATH & STATS (OFTEN ENCOUNTERED)

I am not a big fan of this, but since it is a frequently asked question, I will put it here. My advice is to learn the math on demand, as you need it; starting with the math up front will quickly diminish your interest in pursuing ML/DS. Tbh, Andrew Ng's ML course is very forgiving and refreshes the math and stats for you!

  1. Probability and Statistics
  2. Linear Algebra
  3. Multivariate Calculus, especially derivatives and integrals
  4. Optimization
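
As a small taste of how items 3 and 4 meet in practice, here is a minimal sketch of gradient descent, the kind of optimization loop that sits underneath many ML algorithms. The function `f`, the starting point, the learning rate and the step count are all made up for illustration.

```python
def f(x):
    """A simple bowl-shaped function with its minimum at x = 3."""
    return (x - 3.0) ** 2


def df(x):
    """Derivative of f, worked out by hand: d/dx (x - 3)^2 = 2 (x - 3)."""
    return 2.0 * (x - 3.0)


def gradient_descent(gradient, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to walk downhill."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


x_min = gradient_descent(df, start=0.0)
print(f"x = {x_min:.6f}, f(x) = {f(x_min):.2e}")
```

The derivative tells the loop which direction is downhill, and the optimization part is simply repeating that step until the value stops improving.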

5. Programming Language

With so many languages out there, and people preaching their own, it is easy to get overwhelmed. My advice is to remember that what you are trying to master is not the language itself but the logic and syntax of programming. For complete newbies, I definitely recommend Python (as of 2017), because it has the greatest community support and gives you easy access to Machine Learning libraries and frameworks.

Python is the fastest growing language because of how dynamic and readable it is, so I'd suggest getting started with its basics. If you're a complete beginner, like me, start with Al Sweigart's no-fuss examples from Automate the Boring Stuff with Python.

Still not convinced that Python is beating R, Matlab etc.? Read 'Python overtakes R, becomes the leader in Data Science, Machine Learning platforms'

Important Python Tools, Libraries
[I will elaborate more on this in the future!]

6. Open Source Machine Learning libraries

  1. Scikit-Learn: for the Pythonista
  2. Tensorflow: Google Brain's open source software library for Machine Learning
  3. Theano: another Python library, for defining and evaluating mathematical expressions on multi-dimensional arrays, with a NumPy-like interface
  4. Keras: capable of running on top of Tensorflow, Microsoft Cognitive Toolkit (CNTK) or Theano
  5. ...and more, but get to know the first two first, maybe by experimenting with examples from Aurélien Géron's book!
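
To give a feel for the first library on the list, here is a minimal Scikit-Learn sketch: load a small dataset that ships with the library, hold part of it back, fit a model, and score it on the held-back part. The choice of the iris dataset, logistic regression, the 25% test split and `random_state=0` are assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small flower-measurement dataset bundled with scikit-learn (no download needed)
X, y = load_iris(return_X_y=True)

# Hold back 25% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier; max_iter is raised so the solver converges cleanly
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Most Scikit-Learn estimators follow this same `fit` / `predict` / `score` pattern, so swapping in a different algorithm usually means changing only the line that creates `model`.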

7. News and Forums for Data Science and AI

  1. KDNuggets
  2. Twitter: follow the right people (most of the people I follow on Twitter are at the forefront of the Machine Learning and Deep Learning realm)
  3. Quora on Machine Learning: for pretty intelligent discussion, simply follow the top/most viewed writers, like Andrew Ng
  4. Medium: short reads on all sorts of topics, including ML, DL and robotics (make sure to personalize your feed first)
  5. Reddit: for hype and updates on r/MachineLearning
  6. StackExchange: to ask for help with any data science or programming problem

8. Data Visualization

Now that you have the tools and resources, remember that data visualization is an important front-end component of Data Science. EFFECTIVE COMMUNICATION of data is crucial to all the work you have put your blood, sweat and tears into, especially when you are sharing the results with your boss, stakeholders and/or clients. The lack thereof is what gave rise to the other buzzword - Business Intelligence, which includes tools like Microsoft Power BI, TIBCO Spotfire and Tableau (which are basically Excel on steroids). Inspired by the keynote at Microsoft's Data Summit 2017 by Alberto Cairo (a modern data viz guru in the tradition of Edward Tufte), here is a short read summarising it: 6 Fundamentals of Data Visualization.

Key Takeaway

The key takeaways that I have learned:

  1. You do not need to know advanced coding to get into Data Science or Machine Learning; learn it top-down.
  2. Data Science and its tools are NOT magic! You should remain skeptical and vigilant. Good data and proper internal validation are required.
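
A minimal sketch of why the second takeaway matters, in plain Python. The data here is deliberately made up: two random features and a coin-flip label, so there is nothing real to learn. The 1-nearest-neighbour "model" and all names below are hypothetical, chosen only to illustrate the point.

```python
import random

rng = random.Random(42)


def make_rows(n):
    """Made-up data: two random features and a coin-flip label (pure noise)."""
    return [((rng.random(), rng.random()), rng.randrange(2)) for _ in range(n)]


def predict_1nn(train_rows, point):
    """Return the label of the closest training point (squared distance)."""
    nearest = min(
        train_rows,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(train_rows, eval_rows):
    """Fraction of eval_rows whose label the 1-NN model gets right."""
    hits = sum(predict_1nn(train_rows, x) == y for x, y in eval_rows)
    return hits / len(eval_rows)


train_rows = make_rows(200)
held_out_rows = make_rows(200)

train_acc = accuracy(train_rows, train_rows)        # scored on data the model has memorised
held_out_acc = accuracy(train_rows, held_out_rows)  # scored on data it has never seen

print(f"Accuracy on training data: {train_acc:.2f}")
print(f"Accuracy on held-out data: {held_out_acc:.2f}")
```

On its own training data the model looks perfect, because each point is its own nearest neighbour. On held-out data it can only guess, since the labels are noise. A score measured on the training data alone would have hidden that completely.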

Questions?

Yes, this is what I decided to do on a Sunday morning, after friends kept asking me over the last few weeks how they could get started in the AI field, so forgive any grammatical errors. Do let me know if you have any questions, at didi.ooi@bristol.ac.uk, or message me on LinkedIn.
