Commit f498269

Updated README
1 parent ba2fc79 commit f498269

2 files changed: 44 additions and 1 deletion

README.org

Lines changed: 42 additions & 1 deletion
@@ -2,6 +2,48 @@
 
 My MSc on Data Science final project. This is a library for Massive Data Streaming analysis using Apache Flink
 
+* Implemented algorithms
+** Feature Selection
+*** Fast Correlation-Based Filter (FCBF)
+FCBF is a multivariate feature selection method that takes into account both the class relevance and the dependency between each pair of features. Based on information theory, FCBF uses symmetrical uncertainty to measure the dependencies between features and their relevance to the class. Starting with the full feature set, FCBF heuristically applies a backward selection technique with a sequential search strategy to remove irrelevant and redundant features. The algorithm stops when there are no features left to eliminate.
+
+H.-L. Nguyen, Y.-K. Woon, W.-K. Ng, L. Wan, Heterogeneous ensemble for feature drifts in data streams, in: Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume Part II, PAKDD’12, 2012, pp. 1–12.
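As an aside to the FCBF description above, the symmetrical uncertainty it relies on can be sketched for discrete variables as SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)). This is a minimal illustration on plain Scala collections, not the code from this repository; the object and method names are assumptions made for the example.

```scala
// Hypothetical helper, not part of this library.
object SymmetricalUncertainty {
  private val Log2 = math.log(2.0)

  // Shannon entropy (in bits) of a sequence of discrete values.
  def entropy[A](xs: Seq[A]): Double = {
    val n = xs.size.toDouble
    xs.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * math.log(p) / Log2
    }.sum
  }

  // SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), a value in [0, 1].
  // Returns 0 when both variables are constant (zero total entropy).
  def su[A, B](x: Seq[A], y: Seq[B]): Double = {
    require(x.size == y.size, "x and y must have the same length")
    val hx = entropy(x)
    val hy = entropy(y)
    val hxy = entropy(x.zip(y)) // joint entropy H(X, Y)
    val denom = hx + hy
    if (denom == 0.0) 0.0 else 2.0 * (hx + hy - hxy) / denom
  }
}
```

A feature whose SU with the class is close to 0 carries little information about it; a pair of features whose mutual SU is close to 1 is largely redundant.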
+*** Online Feature Selection (OFS)
+OFS is an ε-greedy online feature selection method based on the weights generated by an online classifier (neural networks); it makes a trade-off between exploration and exploitation of features.
+
+J. Wang, P. Zhao, S. Hoi, R. Jin, Online feature selection and its applications, IEEE Transactions on Knowledge and Data Engineering 26 (3) (2014) 698–710.
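The exploration/exploitation trade-off described above can be illustrated with a simplified ε-greedy selection step: with probability ε a random subset of features is tried, otherwise the features with the largest absolute classifier weights are kept. This is only a sketch under those assumptions; the names `EpsilonGreedySelection`, `topB` and `select` are hypothetical and do not come from `OFSGDTransformer` or from the cited paper.

```scala
import scala.util.Random

// Hypothetical helper, not part of this library.
object EpsilonGreedySelection {
  // Exploitation: indices of the b weights with the largest absolute value.
  def topB(weights: Array[Double], b: Int): Set[Int] =
    weights.indices.sortBy(i => -math.abs(weights(i))).take(b).toSet

  // With probability epsilon pick b features uniformly at random (exploration),
  // otherwise keep the b features the classifier currently weights most.
  def select(weights: Array[Double], b: Int, epsilon: Double, rng: Random): Set[Int] =
    if (rng.nextDouble() < epsilon) rng.shuffle(weights.indices.toList).take(b).toSet
    else topB(weights, b)
}
```

Setting `epsilon` to 0 makes the selection purely greedy, while 1 makes it purely random.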
+*** Katakis' FS
+This FS scheme consists of two steps: a) an incremental feature ranking method, and b) an incremental learning algorithm that can consider a subset of the features during prediction (Naive Bayes).
+
+I. Katakis, G. Tsoumakas, I. Vlahavas, Advances in Informatics: 10th Panhellenic Conference on Informatics, PCI 2005, Springer Berlin Heidelberg, 2005, Ch. On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams, pp. 338–348.
+** Discretization
+*** Incremental Discretization Algorithm (IDA)
+Incremental Discretization Algorithm (IDA) approximates quantile-based discretization on the entire data stream encountered to date by maintaining a random sample of the data which is used to calculate the cut points. IDA uses the reservoir sampling algorithm to maintain a sample drawn uniformly at random from the entire stream up until the current time.
+
+G. I. Webb. 2014. Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data. In Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM '14). IEEE Computer Society, Washington, DC, USA, 1031-1036.
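The idea behind IDA, reservoir sampling followed by quantile cut points computed on the sample, can be sketched as follows. This is a simplified illustration and not the implementation in this repository: it re-sorts the sample on every query instead of keeping per-interval structures, and the class name `ReservoirDiscretizer` and its parameters are assumptions made for the example.

```scala
import scala.util.Random

// Hypothetical class, not part of this library.
final class ReservoirDiscretizer(capacity: Int, bins: Int, rng: Random) {
  require(capacity > 0 && bins > 1)
  private val sample = new Array[Double](capacity)
  private var seen = 0L

  // Reservoir sampling (Algorithm R): after n values, each value seen so far
  // is in the sample with probability capacity / n.
  def update(x: Double): Unit = {
    seen += 1
    if (seen <= capacity) sample((seen - 1).toInt) = x
    else {
      val j = (rng.nextDouble() * seen).toLong // uniform in [0, seen)
      if (j < capacity) sample(j.toInt) = x
    }
  }

  // The bins - 1 equal-frequency cut points estimated from the current sample.
  def cutPoints: Vector[Double] = {
    val n = math.min(seen, capacity.toLong).toInt
    val sorted = sample.take(n).sorted
    if (n == 0) Vector.empty
    else (1 until bins).map(k => sorted(math.min(n - 1, k * n / bins))).toVector
  }

  // Interval index of x: the number of cut points less than or equal to x.
  def discretize(x: Double): Int = cutPoints.count(_ <= x)
}
```

Because the sample is uniform over everything seen so far, the cut points track the quantiles of the whole stream rather than of a recent window.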
+*** Partition Incremental Discretization algorithm (PiD)
+PiD performs incremental discretization. The basic idea is to perform the task in two layers. The first layer receives the sequence of input data and keeps some statistics on the data using many more intervals than required. Based on the statistics stored by the first layer, the second layer creates the final discretization. The proposed architecture processes streaming examples in a single scan, in constant time and space even for infinite sequences of examples.
+
+J. Gama, C. Pinto, Discretization from data streams: Applications to histograms and data mining, in: Proceedings of the 2006 ACM Symposium on Applied Computing, SAC ’06, 2006, pp. 662–667.
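The two-layer structure described above can be sketched in a few lines. The first layer keeps counters over many equal-width intervals and is updated in constant time per example; the second layer merges those counters into the requested number of roughly equal-frequency intervals. This sketch assumes the value range is known in advance and leaves out any refinement of the first layer, so it illustrates the idea rather than the PiD implementation in this repository; `TwoLayerDiscretizer` and its parameters are hypothetical names.

```scala
// Hypothetical class, not part of this library.
final class TwoLayerDiscretizer(min: Double, max: Double, fineBins: Int) {
  require(max > min && fineBins > 0)
  private val width = (max - min) / fineBins
  private val counts = new Array[Long](fineBins)
  private var total = 0L

  // Layer 1: increment one counter per example; out-of-range values are clamped.
  def update(x: Double): Unit = {
    val i = math.max(0, math.min(fineBins - 1, ((x - min) / width).toInt))
    counts(i) += 1
    total += 1
  }

  // Layer 2: walk the fine intervals and emit a cut point each time the
  // cumulative count crosses the next k / bins fraction of the total.
  def cutPoints(bins: Int): Vector[Double] = {
    val cuts = Vector.newBuilder[Double]
    var acc = 0L
    var next = 1
    for (i <- 0 until fineBins) {
      acc += counts(i)
      while (total > 0 && next < bins && acc * bins >= next.toLong * total) {
        cuts += min + (i + 1) * width
        next += 1
      }
    }
    cuts.result().distinct
  }
}
```

Memory use depends only on `fineBins`, not on the number of examples seen, which is what allows a single pass over an unbounded stream.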
+*** Local Online Fusion Discretizer (LOFD)
+LOFD is an online, self-adaptive discretizer for
+streaming classification. It smoothly adapts its interval limits,
+reducing the negative impact of shifts, and analyzes interval
+labeling and interaction problems in data streaming. The
+discretizer-learner interaction is addressed by providing two similar solutions.
+The algorithm generates an online and self-adaptive discretization
+solution for streaming classification which aims at reducing the
+negative impact of fluctuations in evolving intervals.
+
+S. Ramírez-Gallego, S. García, F. Herrera, Online entropy-based
+discretization for data streaming classification, Future Generation
+Computer Systems, Volume 86, 2018, Pages 59-70, ISSN 0167-739X,
+https://doi.org/10.1016/j.future.2018.03.008.
+(http://www.sciencedirect.com/science/article/pii/S0167739X17325815)
+Keywords: Data stream; Concept drift; Data preprocessing; Data
+reduction; Discretization; Online learning
+
+
 * References
 - [[https://github.com/sramirez/MOAReduction][MOAReduction]] by [[https://github.com/sramirez/][@sramirez]]
 - Some data structures like =IntervalHeap= have been adapted from [[https://github.com/allenbh/gkutil_java/blob/master/src/gkimfl/util/IntervalHeap.java][allenbh/gkutil_java]], by [[https://github.com/allenbh/][@allenbh]].
@@ -15,7 +57,6 @@ This is a list of all resources that helped me to build this library:
 - [[https://github.com/tmadl/sklearn-expertsys/blob/master/Discretization/MDLP.py][tmadl/sklearn-expertsys: Discretization MDLP]]
 - [[https://github.com/shiralkarprashant/FCBF][FCBF python implementation]]
 
-
 * Used DataSets
 - [[https://archive.ics.uci.edu/ml/datasets/Iris/][Iris]]
 - [[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#svmguide3][SvmGuide3]]

dpasf/src/main/scala/com/elbauldelprogramador/featureselection/OFSGDTransformer.scala

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,7 @@
 
 package com.elbauldelprogramador.featureselection
 
+import com.elbauldelprogramador.utils.FlinkUtils
 import breeze.linalg.norm
 import org.apache.flink.api.scala._
 import org.apache.flink.ml.common.{ LabeledVector, Parameter, ParameterMap, WeightVector }
@@ -121,6 +122,7 @@ object OFSGDTransformer {
 
 // TODO: Better way to compute dimensions
 val dimensionsDS = input.map(_.vector.size).reduce((_, b) => b).collect.head
+val dimensionsDS = FlinkUtils.numAttrs(input)
 
 val values = Array.fill(dimensionsDS)(0.0)
 var weights = WeightVector(DenseVector(values), .0).weights.asBreeze
