Integrated oversampling
Sequence classification problems are ubiquitous and arise whenever the data exhibit spatio-temporal structure. Examples include traffic prediction, earthquake prediction and even predicting the outcome of auction systems such as those in financial markets. Recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, are well suited to these types of problems. Often, however, the classes in the sequence are strongly imbalanced, and the challenge is how to sample the training set while preserving the temporal structure. Integrated oversampling provides a solution to this problem.
- Cao H., Li X.-L., Woon D. Y.-K. and Ng S.-K. (2013) Integrated Oversampling for Imbalanced Time Series Classification. IEEE Transactions on Knowledge and Data Engineering, vol 25 (12).
- Liang G., Zhang C. (2012) A Comparative Study of Sampling Methods and Algorithms for Imbalanced Time Series Classification. In: Thielscher M., Zhang D. (eds) AI 2012: Advances in Artificial Intelligence. AI 2012. Lecture Notes in Computer Science, vol 7691. Springer, Berlin, Heidelberg
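The sampling challenge described above can be made concrete with a small R sketch. It is not the integrated oversampling method from the references. It is a much simpler stand-in that oversamples whole fixed-width windows with replacement, so each training sample keeps its internal time order. The function names `make_windows` and `oversample_windows`, the window width and the synthetic series are all assumptions made for illustration.

```r
# Sketch only: random oversampling of whole windows, not the method of Cao et al.
# make_windows() and oversample_windows() are hypothetical helper names.

make_windows <- function(x, y, width) {
  # Each window holds `width` consecutive observations and takes the label
  # of its last time step.
  ends <- seq(width, length(x))
  windows <- do.call(rbind, lapply(ends, function(e) x[(e - width + 1):e]))
  list(windows = windows, labels = y[ends])
}

oversample_windows <- function(windows, labels) {
  # Draw extra window indices with replacement until every class is as large
  # as the largest class. Windows are copied whole, never split or shuffled
  # internally.
  idx_by_class <- split(seq_along(labels), labels)
  target <- max(lengths(idx_by_class))
  keep <- unlist(lapply(idx_by_class, function(idx) {
    extra <- target - length(idx)
    c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
  }), use.names = FALSE)
  keep <- sort(keep)  # keep the windows in chronological order
  list(windows = windows[keep, , drop = FALSE], labels = labels[keep])
}

set.seed(1)
n <- 200
x <- cumsum(rnorm(n))                    # synthetic series, illustration only
y <- as.integer(seq_len(n) %% 20 == 0)   # rare event at every 20th time step
w <- make_windows(x, y, width = 10)
balanced <- oversample_windows(w$windows, w$labels)

table(w$labels)         # class counts before oversampling
table(balanced$labels)  # class counts after oversampling
```

Copying minority windows verbatim balances the class counts but adds no new information, which is one reason the referenced papers study more careful sampling schemes.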
The goal of this project is to implement, assess and refine the method of integrated oversampling. The technique will be demonstrated with LSTMs applied to various imbalanced time series data sets, including traffic prediction and high-frequency trading. The following project milestones shall be achieved:
- Identification, implementation and evaluation of a critical subset of sampling techniques for time series and sequence classification. Structuring of library and function interfaces to support xts objects.
- Testing statistical properties of the sampling and impact on various example data. Full error handling implementation. Accelerating sections of code and offloading to C++.
- Preparation of documentation. Package unit testing across different platforms. Submission of package to CRAN.
An integrated oversampling package will support the application of LSTMs and other RNNs to real world time series problems plagued by class imbalance.
Please contact Matthew Dixon or Diego Klabjan if you are a student interested in this project.
Applicants must be able to show that they have:
- The ability to quickly identify and clearly communicate technical problems, orally and in writing, using R Markdown and LaTeX.
- Mathematically oriented software engineering experience, preferably in industry.
- The ability to work to deadlines in a collaborative project with mentors and potentially other students.
- A solid background in statistics and computation, including time series analysis, data structures, algorithms and text mining.
- Experience in applying machine learning and forecasting methods in R.
- Experience with Rcpp and statistical computing in C++.
- The ability to develop software on Windows and remote Linux platforms using SSH and GitHub.
Following the Google R style guide
https://google.github.io/styleguide/Rguide.xml
interested students should write an R script that does the following:
- Easy: demonstrates the application of a hashmap to efficiently represent the sequence {(x_1,y_1),…(x_n,y_n)}, where the {y_i} are assumed to be non-unique integers.
- Medium: constructs a data frame of a subset of x and y by under-sampling so that every distinct value of y occurs the same number of times. This is commonly referred to as under-sampling the 'majority class(es)'.
- Hard: constructs a data frame of a subset of x and y by over-sampling so that every distinct value of y occurs the same number of times. This is commonly referred to as over-sampling the 'minority class(es)'. Estimate the bias in x introduced by over-sampling and propose an over-sampling method that minimizes the bias.
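As an illustration of the kind of script these tasks call for, here is one possible shape. It is a sketch, not a reference solution. It assumes an R environment is an acceptable hashmap, uses plain random sampling, and introduces the hypothetical helper names `build_index` and `resample_balanced`. The bias check at the end is deliberately crude, and the sketch does not attempt the bias-minimizing method asked for in the hard task.

```r
# Sketch only: build_index() and resample_balanced() are hypothetical names.

build_index <- function(y) {
  # An R environment is a hash table. Key: the label as a string.
  # Value: the positions i at which that label occurs.
  index <- new.env(hash = TRUE)
  for (i in seq_along(y)) {
    key <- as.character(y[i])
    index[[key]] <- c(index[[key]], i)
  }
  index
}

resample_balanced <- function(x, y, method = c("under", "over")) {
  method <- match.arg(method)
  index <- build_index(y)
  positions <- mget(ls(index), envir = index)
  sizes <- lengths(positions)
  # Under-sampling shrinks every class to the smallest class size;
  # over-sampling grows every class to the largest class size.
  target <- if (method == "under") min(sizes) else max(sizes)
  keep <- unlist(lapply(positions, function(pos) {
    if (method == "under") {
      pos[sample.int(length(pos), target)]
    } else {
      extra <- target - length(pos)
      c(pos, pos[sample.int(length(pos), extra, replace = TRUE)])
    }
  }), use.names = FALSE)
  keep <- sort(keep)  # preserve the original ordering of the pairs
  data.frame(x = x[keep], y = y[keep])
}

set.seed(42)
# Synthetic, imbalanced labels: 240, 45 and 15 occurrences of 0, 1 and 2.
y <- sample(rep(c(0L, 1L, 2L), times = c(240, 45, 15)))
x <- rnorm(length(y), mean = y)

under <- resample_balanced(x, y, "under")
over  <- resample_balanced(x, y, "over")

# Crude bias check: shift in the per-class mean of x after over-sampling.
bias <- tapply(over$x, over$y, mean) - tapply(x, y, mean)
print(bias)
```

Sampling positions through the index, rather than filtering the full vector once per class, keeps each lookup proportional to the class size. The majority class is copied unchanged by over-sampling, so its entry in `bias` is zero and any shift shows up only in the resampled minority classes.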
Email your answers to the mentors named above.