The task is query classification, or query intent detection.
For details about the competition, see http://cikm2014.fudan.edu.cn/index.php/Index/index and http://openresearch.baidu.com/topic/71.jspx
- Multi-class multi-label
- Short text
- Click and session
- Unlabelled data
- Unbalanced data
- Structured labels
- N-gram, word position, aggregated query as a sample
- In-session queries and labels, keyword and entity detection
- Semi-supervised learning
- Sampling, post-processing
- query words (1-gram, 2-gram, word position)
- clicked title words (1-gram, 2-gram)
- words of top 30 titles in query's same sessions
- words of top 3 labels in query's same sessions
- labels in query's same sessions
- query length
- query frequency
- average length of clicked titles
- average search times in query's same sessions
- average click times in query's same sessions
- average duplicated clicks in query's same sessions
- GBM: XGBoost with softmax objective
- SVC: Liblinear
- Multi-class LR: Sklearn.MultiTaskLasso
- Random Forest: Sklearn.RandomForestClassifier
- Labelled LDA: modified PLDA
- Markov Chain: query-query similarity by text and session co-occurrence
- weighted averaging
- linear model
- cascading: feed other models' predictions to XGBoost
- Calibration: same label distribution as training set
- Threshold: same average labels as training set
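The weighted-averaging and threshold items above can be illustrated with a small sketch. This is not the code used in this repository: the function names are hypothetical, and it assumes each model outputs a probability matrix of shape (n_samples, n_labels), that one global cutoff is used, and that every query keeps at least its top-scoring label.

```python
import numpy as np


def weighted_average(prob_list, weights):
    """Blend per-model probability matrices of shape (n_samples, n_labels)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalise so weights sum to 1
    stacked = np.stack(prob_list, axis=0)          # (n_models, n_samples, n_labels)
    return np.tensordot(weights, stacked, axes=1)  # (n_samples, n_labels)


def threshold_for_average_labels(probs, target_avg_labels):
    """Pick a global cutoff so that the mean number of predicted labels
    per sample is close to target_avg_labels (e.g. the training-set mean)."""
    n_samples = probs.shape[0]
    k = int(round(target_avg_labels * n_samples))  # total labels to emit
    k = min(max(k, 1), probs.size)
    ranked = np.sort(probs.ravel())[::-1]          # scores in descending order
    return ranked[k - 1]                           # k-th largest score


def predict_labels(probs, threshold):
    """Keep scores at or above the cutoff; always keep the top label (assumption)."""
    pred = probs >= threshold
    pred[np.arange(probs.shape[0]), probs.argmax(axis=1)] = True
    return pred
```

With ties in the scores, or when the top-label rule adds labels, the resulting average can be slightly above the target, so the cutoff is best checked on the validation split.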
Dependencies:
- XGBoost for GBM: https://github.com/tqchen/xgboost
- Liblinear for LR and SVC: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Assumptions:
- XGBoost's path is ../../tools/xgboost3/
- Liblinear's path is ../../tools/liblinear/
- raw training data is in ../raw_data
- three folders are needed for temporary data: ../trans_data, ../dataset, ../submit
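Before running anything, the layout above can be checked with a short script. This is only a sketch: check_layout is a hypothetical helper that is not part of this repository, and it assumes the relative paths are resolved from the code directory (V2).

```python
from pathlib import Path

# Paths copied from the assumptions above; resolved relative to the code directory.
REQUIRED = ["../../tools/xgboost3", "../../tools/liblinear", "../raw_data"]
TEMP_DIRS = ["../trans_data", "../dataset", "../submit"]


def check_layout(base="."):
    """Create the temporary folders and return the required paths that are missing."""
    base = Path(base)
    missing = [p for p in REQUIRED if not (base / p).exists()]
    for d in TEMP_DIRS:
        # exist_ok=True makes the call safe to repeat
        (base / d).mkdir(parents=True, exist_ok=True)
    return missing


if __name__ == "__main__":
    for path in check_layout():
        print("missing:", path)
```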
Run:
cd V2
sh -x run_all.sh
Steps:
- split train.txt into dog/valid sets (for offline tuning): split_train.py
- merge information for each query: trans_train.py
- generate features: prepare_feature.py
- train and predict with XGBoost: run_xgboost3_dog.sh
- train and predict with Liblinear: run_liblinear_dog.sh
- ensemble: run_ensemble.sh
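
As a rough illustration of the feature-generation step, the query-word features named earlier (1-gram, 2-gram, word position) might be built along these lines. This is a sketch, not the contents of prepare_feature.py: the function name, the feature-name prefixes, and the choice to encode position as "index from the start plus last word" are all assumptions, and queries are assumed to be whitespace-separated tokens.

```python
def query_word_features(query, max_position=3):
    """Return 1-gram, 2-gram and position-tagged tokens for one query."""
    words = query.split()
    feats = []
    # 1-grams: every word on its own
    feats += ["w1=" + w for w in words]
    # 2-grams: adjacent word pairs
    feats += ["w2=" + a + "_" + b for a, b in zip(words, words[1:])]
    # word position: tag the first few words with their index ...
    for i, w in enumerate(words[:max_position]):
        feats.append("pos%d=%s" % (i, w))
    # ... and mark the final word separately
    if words:
        feats.append("last=" + words[-1])
    return feats
```

The resulting string tokens can then be mapped to integer feature ids and written in the sparse format that XGBoost and Liblinear read.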