Skip to content

quintenrosseel/databricks-ml-associate

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Exam Information

See more info here.

Objectives

  • Describe the learning context, format, and structure behind the exam.
  • Describe the topics covered in the exam.
  • Recognize the different types of questions provided on the exam.
  • Identify resources to learn the material covered in the exam.

Target Audience

  • Data scientist
  • Beginner-level, practitioner certification
  • Assess candidates at a level equivalent to six months of experience with machine learning with Databricks

Expectations

  • Use Databricks Machine Learning and its capabilities within machine learning workflows
  • Implement correct decisions in machine learning workflows
  • Implement machine learning solutions at scale using Spark ML and other tools
  • Understand advanced scaling characteristics of classical machine learning models

Not expected

  • Advanced ML Operations: Webhooks, Automation, Deployment, Monitoring, CLI/REST APIs
  • Advanced ML Workflows: Target Encoding, Embeddings, Deep Learning, Natural Language Processing

Notes

  • Through Kryterion
  • Automatically graded

Basic Details

  • Time allotted to complete exam = 1.5 hours (90 minutes)
  • Passing scores = At least 70% on the overall exam
  • Exam fee = $200
  • Retake policy = As many times as you want, whenever you want
  • Number of Ouestions = 45
  • More info. on the Databricks Academy FAQ: http://files.training.databricks.com/Ims/docebo/databricks-academy-fag.odf Select all of the following resources that will be available during the exam.
  1. Databricks documentation
  2. Paper and pencil
  3. A single, index card of pre-written notes
  4. A running Spark session
  5. None of the above

Certification topics

Use Databricks Machine Learning and its capabilities within machine learning workflows (29%), including:

  • Databricks Machine Learning (clusters, Repos, Jobs)
    • Clusters
      • Types of clusters
      • Single-node use cases
      • Standard cluster use cases
    • Repos
      • Connect a repo from an external git provider
      • Commit changes
      • Create a new branch
      • Pull changes from an external Git provider
    • Jobs
      • Orchestrate multi-task ML workflows
  • Databricks Runtime for Machine Learning (basics, libraries)
    • Basics
      • Create a cluster with MLR
      • Identify differences between the standard DBR and MLR
    • Library Usage
      • Install a Python library in notebook scope
      • Install a Python library in cluster scope
  • AutoML (classification, regression, forecasting)
    • Classification
      • Common steps in the workflow
      • How to locate source code
      • Evaluation metrics
      • Data exploration
    • Regression
      • Common steps in the workflow
      • How to locate source code
      • Evaluation metrics
      • Data exploration
    • Forecasting
      • Common steps in the workflow
      • How to locate source code
  • Feature Store (basics)
    • Basics
      • Benefits
      • Create a feature store table
      • Write data to a feature store table
    • Pipelines
      • Train a model with features from a feature store table
      • Score a model using features from a feature store table
  • MLflow (Tracking, Model Registry)
    • Experiment tracking
      • Querying past runs
      • Logging runs
      • UI information
    • Model registry
      • Register a model
      • Transition a model across stages

Note, there are self assessment questions to check your understanding.

ML Workflows (29%) Implement correct decisions in machine learning workflows, including:

  • Exploratory data analysis (summary statistics, outlier removal)
    • Summary stats
      • Compute summary statistics for a Spark DataFrame
      • DataFrame.summary)
      • dbutils.data.summarize()
    • Outlier removal
      • Removing outlier features
      • Filtering records in a Spark DataFrame
  • Feature engineering (missing value imputation, one-hot-encoding)
    • Missing Values
      • Binary indicator features
      • Identifying the optimal replacement value
      • Imputer with Spark ML
    • One-hot-encoding
      • Complications with certain algorithms
      • OHE with Spark ML
  • Tuning (hyperparameter basics, hyperparameter parallelization)
    • Hyperparameter basics
      • Grid Search vs. Random Search
      • Tree of Parzen Estimators
      • Scikit-learn
    • Hyperparamter parallelization
      • Hyperopt applications
      • Relationship between selection algorithm and parallelization
  • Evaluation and selection (cross-validation, evaluation metrics)
    • Cross-validation
      • Number of trials
      • Train-validation split vs. cross-validation
      • Using Spark ML to accomplish the above
    • Evaluation metrics
      • Recall
      • Precision
      • F1
      • Log-scale interpretation

Note, there are self assessment questions to check your understanding.

Usage of SparkML (33%) Implement machine learning solutions at scale using Spark ML and other tools, including:

  • Distributed ML Concepts
    • Difficulties
      • Data location and shuffling
      • Data fitting on each core for parallellization
    • Spark ML
      • No UDF requirement - when to use UDF; when to use Spark ML. Relationship with other libraries.
  • Spark ML Modeling APIs (data splitting, training, evaluation, estimators vs. transformers, pipelines)
    • Prep
      • Splitting data
      • Reproducible splits
    • Modeling
      • Fitting
      • Feature vector columns
      • Evaluators
      • Estimators vs. transformers
    • Pipelines
      • Relationship with Cross validation (e.g. inside or outside pipeline)
      • Relationship with training and test data
  • Hyperopt
    • Basics
      • Bayesian hyperparameter abilities
      • Parallelization abilities
    • Applications
      • SparkTrials sv Trials
      • Relationship between number of evaluations and level of parallelization
  • Pandas API on Spark
    • Concepts
      • InternalFrame
      • Metadata storage
    • Benefits
      • Easy refactoring for scale
      • Pandas API
    • Usage
      • Importing
      • Converting between DF types
  • Pandas UDFs and Pandas Function APIs
    • Conversion - Apache arrow (why is this efficient?) - Vectorization
    • Pandas UDFs - why would you use them?
      • Iterator UDF benefits
      • For scaled prediction
    • Pandas Function APIs
      • Group-specific training
      • Group-specific inference

Note, there are self assessment questions to check your understanding.

**Scaling ML models (9%) - Understand advanced scaling characteristics of classical machine learning models, including **:

  • Distributed Linear Regression
    • Identify what type of solver is used for big data and linear regression
    • Identify the family of techniques used to distribute linear regression
  • Distributed Decision Trees
    • Describe the binning strategy used by Spark ML for distributed decision trees
    • Describe the purpose of the maxBins parameter
  • Ensembling Methods (bagging, boosting)
    • Basics
      • Combining models
      • Implications of multi-model solutions
    • Types
      • Bagging
      • Boosting
      • Stacking

Note, there are self assessment questions to check your understanding.

Practice & preparation

  • Practice exam is coming soon
  • Some practice questions are shown at the end of the course.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages