9 Jan 2011

New features for 2011

Here is a list of features I would like to add to DauroLab in 2011. It's ambitious, I know, but I hope to finish them all by the end of the year. I will label each feature "Fi".

  • New classifiers:
    • Add an online classifier (in particular, I'm interested in a Bayesian logistic regression model; see the paper by Genkin et al. in JMLR) (F1).
    • Add a single-layer perceptron, with a choice of loss function (squared, hinge, logistic, ...) and of regularization procedure (F2). This is easy.
    • Add a tree classification algorithm (C4.5 or ID3) (F3).
    • Add a random forest procedure (F4). Both F3 and F4 are easy once an adequate tree representation, one that does not use much memory, has been chosen.
  • Feature selection:
    • Add a way to compute feature selection on datasets (this is already done, but not committed to the working copy) (F5).
  • Vector input and classfile input:
    • Add a way to input vector files and classfiles in different formats using user-programmed factories (F6). (This needs to be planned carefully; it is not straightforward.)
  • New programs to be added (which should not be a big deal, given how much is already done):
    • TrainAndTest (or any other name): evaluation via cross-validation, leave-one-out, or a single train-test split on a fixed set of vectors, for the cases in which there is only a training set (F7). Not very difficult.
    • ThresholdEstimation (with PCut, SCut and SCutFbr 0 or 1): should not be a big deal once the previous program is done. The estimated thresholds could be written to a file and used for classification afterwards (slightly modifying ExecuteParametricClassifier), but this requires a careful review of the literature, since I have not used threshold estimation until now (F8).
  • "Internal things" to be changed:
    • Add support for a logging method (log4j?). (F9)
    • Fix the handling of idf and normalization, which is a bit messy. When classifying large-scale data, mainly textual documents, vectors are sometimes multiplied by the idf vector (weighting each coordinate by its importance), and often (L2) normalized. In some cases, a vector file already contains idf-multiplied and/or normalized vectors (F10). Two things need to be done:
      • Include a way to denote when a classifier needs idf and/or normalization
      • Add a global option in the learning/classification programs to specify manually whether idf and/or normalization have already been applied. If nothing is specified, raw frequency vectors should be assumed, and therefore the procedure needed by that classifier will be applied (SVM: idf + normalization, Naive Bayes: no idf + no normalization).
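
To make the F10 idea concrete, here is a minimal sketch in Java of how the two pieces could fit together. Everything in it is hypothetical: the names IdfNormalizationSketch, Needs and prepare, the two "already" flags, and the use of dense arrays are assumptions for illustration only, not DauroLab code. It only shows the decision logic: the classifier declares what it needs, the flags describe what the vector file already contains, and the default (both flags false) treats the input as raw frequency vectors.

```java
import java.util.Arrays;

/** Hypothetical sketch of F10; none of these names exist in DauroLab. */
final class IdfNormalizationSketch {

    /** What each classifier declares it needs (the two examples given above). */
    enum Needs {
        SVM(true, true),            // idf + L2 normalization
        NAIVE_BAYES(false, false);  // raw frequencies, nothing applied

        final boolean idf;
        final boolean l2;

        Needs(boolean idf, boolean l2) {
            this.idf = idf;
            this.l2 = l2;
        }
    }

    /**
     * Returns a prepared copy of x for the given classifier.
     * alreadyIdf and alreadyNormalized play the role of the manual global
     * option: they describe the input file. Both false means raw frequencies.
     * Note: this sketch cannot undo idf or normalization already present in
     * the input, so a classifier that wants raw counts gets the vector as is.
     */
    static double[] prepare(double[] x, double[] idf, Needs needs,
                            boolean alreadyIdf, boolean alreadyNormalized) {
        double[] out = Arrays.copyOf(x, x.length);

        // Multiply by idf only if the classifier wants it and the file lacks it.
        boolean applyIdf = needs.idf && !alreadyIdf;
        if (applyIdf) {
            for (int i = 0; i < out.length; i++) {
                out[i] *= idf[i];
            }
        }

        // Applying idf changes the norm, so normalize again in that case too.
        if (needs.l2 && (!alreadyNormalized || applyIdf)) {
            double sumOfSquares = 0.0;
            for (double value : out) {
                sumOfSquares += value * value;
            }
            double norm = Math.sqrt(sumOfSquares);
            if (norm > 0.0) {
                for (int i = 0; i < out.length; i++) {
                    out[i] /= norm;
                }
            }
        }
        return out;
    }
}
```

For example, calling prepare on a raw frequency vector with Needs.SVM and both flags false multiplies by idf and then L2-normalizes, while Needs.NAIVE_BAYES returns an unchanged copy. A real implementation would presumably work on sparse vectors and read the flags from the command line.
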
This makes a total of 10 changes (some of them easier than others). If I do one per month, leaving one month for holidays and one extra month as overflow, this does not seem to be a very ambitious plan. On the other hand, I consider that getting around 70%-80% of those things done would be a good mark, taking into account that this is not my main work (DauroLab is not the tool I am currently using in my research).

We will leave for 2012 other learning paradigms I'm interested in, like active learning or semi-supervised learning. Also, integration with new libraries is not planned for the moment (although a good [sparse] matrix library and a Bayesian networks library would be interesting). Maybe I should code those libraries myself...

We will see at the end of the year.