-
Notifications
You must be signed in to change notification settings - Fork 31
The markovchain package
This project aims to extend the current functionality and capabilities of the R package ‘markovchain’ in order to provide statisticians a more functional tool to perform analysis of stochastic projects related to Markov chains (MCs). Optimization (in terms of coding performance and algorithms) of existing package is also included in the project’s scope.
Current Status
The package currently provides tools and functions for discrete time, continuous time, higher order and multivariate Markov chains. Various chain parameter estimation methods are available to construct Markov chain models on provided datasets. Several statistical tests can be performed on models to ascertain the Markov property, order, etc. Probabilistic analyses such as stationary distribution calculations, investigation of state properties, etc. can be performed. For more details see vignette.
Other R packages providing functions and methods to handle discrete time Markov chains (DTMCs) are (to my knowledge): DTMCPack and clickstream. Affine statistical models for MCs are: msm and mstate (for survival analysis on multistate models), HMM and deepmixS4 for Hidden Markov Models and mcmc for Monte Carlo Markov Chains. By the way, markovchain R package is by far the most used R package to model DTMCs owing to its intuitive S4 structure, the availability of methods to perform probabilistic analysis (structural analysis of DTMCs) and statistical inference (estimation and simulations). The markovchain package is becoming increasingly popular in the R statistical community. An initial project on optimization was funded during Google Summer of Code, 2015. The selected student rewrote a huge part of the code in Rcpp and improved greatly the existing statistical software. Further functionalities were also implemented during GSOC 2016 edition. MCs being a fundamental tool in the statistician's toolbox, it is hoped that the package will greatly facilitate the analyses and research done by researchers in industry and academia.
Broadly, the student will have tasks in the following categories
- Coding new functions
- Improving and fine-tuning existing ones (using Rcpp/C++ when possible)
- Maintaining and creation of documentation (.Rnd and vignettes)
Some of the functionality that can be added to the package during the coding phase include
CTMCs
- Probabilities of states at any given time t. Given a CTMC object, we would like to evaluate P(t).
- Functions/Methods to get the Generator Matrix and Transition Diagram Plotting
- Function for expected hitting times (time taken to go to state j from state i)
- Imprecise CTMCs (ICTMCs) - (https://arxiv.org/pdf/1611.05796.pdf, https://arxiv.org/pdf/1702.07150.pdf) (Robust generalization of CTMCs with relaxed assumptions, with computational tractability) Implementation of basic infrastructure and methods - algorithms for computing lower expectations of functions that depend on a state at any number of finite points
Higher order multivariate MCs
- Simulation - generate random sequence from chain object and initial conditions
Misc/General
- Additions to the graphics aspect of the package - currently the plot method of the package only depicts the underlying transition probability diagram. This can be extended to visualize communicating classes as well - essentially, this plotting feature would enable the user to view the transition probability diagram for communication classes. This is a feature that will be useful in large and complicated Markov chains.
- Markov chain statistics - (http://www2.math.uu.se/~takis/L/McRw/mcrw.pdf) currently computation of only the first passage time has been implemented. The following random variable pdfs can be implemented. The pdfs for each of these can be obtained by solving a set of equations with similar forms but varying initial conditions for a ‘minimal’ solution. (i) Extending the first passage time pdf computation for a set of states A and the expected first passage time (ii) Given two disjoint sets A, B, the pdf which takes an initial state i and tells you the probability that A is hit before B
- Stability tests for an MC - (https://arxiv.org/pdf/1608.03257.pdf) The stability of a Markov chain is arguably among its most important properties. For example, in queueing applications it offers the guarantee that service has been sufficiently provisioned to cope with the load imposed on the network in the long run. The simulated annealing based approach presented in the paper can possibly be added as a feature.
- Computation of rewards - presently the package does not offer such a feature. Each state is associated with a reward. To be implemented - a module that gives the expected reward before a set of states A is hit. Also, for Markov chains possessing a positive recurrent state, given a bounded reward function, the time-average reward function can be easily computed.
- Joint Distributions of the Numbers of Visits for Finite-State MCs - this function would return a joint pdf of the number of visits to the various states of the DTMC during the first N steps or before the Nth visit. Any new features proposed by the student.
Sai Bhargav Yalamanchi (primary), Giorgio A. Spedicato (backup)
The markovchain package is used by many R users around the world, so it is necessary that existing functions backward compatibility to be preserved and changes to be well documented. The student should ideally know R and C++, as well as be willing to learn RCpp and write R packages (see Hadley Wickham free online books, e.g. Advanced R and R Packages ). Also it is fundamental that any C++/RCpp functions to be tested for memory access.
Knowledge of GitHub is also assumed. Knowledge of unit testing (R package testthat) and oxygen (R package roxygen2) is also assumed. Regarding the academic background, he should have sound knowledge of inferential statistics and probability. A review of Markov Chain theory is fundamental and a strong prerequisite, see for example Dobrow, Introduction to Stochastic Processes in R Book Website.
When submitting the application the chosen candidate should well demonstrate:
- Details of previous coding experiences and academic background explaining why the mentors should accept his application
- A project idea along with details of the coding project (timeline, literature review, coding effort etc)
- Pass the following tests, after having forked the GitHub project site (https://github.com/spedygiorgio/markovchain), where the current development version of the project is hosted. (i) The functions markovchainFit and markovchainListFit attempt to fit Markov chain models to the provided data of state transitions. In practice, it is possible that this data has missing entries. A scheme to handle missing data for both methods must be proposed and implemented, as per Issue 106. Each missing entry would be represented by the keyword NA in R. Suggest the solution through a pull request. (ii) Suggest (by pull request) a solution for the Issue 115. (iii) Propose a relevant improvement to the package capabilities/functionalities that you would implement during the coding period.
The implementation can be pushed to the student’s fork and the modified package can be emailed to saibhargav.yalamanchi@gmail.com, CC'd to spedygiorgio@gmail.com in addition to proposed pull requests. The student proposal document may be mailed to the above mail addresses.