9 Jan 2011

New features for 2011

Here is a list of features I would like to add to DauroLab in 2011. It's ambitious, I know, but I hope to get them all done by the end of the year. I will name each feature with an "Fi" label.

  • New classifiers:
    • Add an online classifier (in particular, I am interested in a Bayesian logistic regression model; see the paper by Genkin et al. in JMLR) (F1).
    • Add a single-layer perceptron, with the possibility of choosing a loss function (square, hinge, logistic, ...) and a regularization procedure (F2). This one is easy (a minimal sketch of the idea appears after this list).
    • Add a decision tree classification algorithm (C4.5 or ID3) (F3).
    • Add a random forest procedure (F4). Both F3 and F4 are easy, provided an adequate tree representation is chosen that does not use too much memory.
  • Feature selection:
    • Add a way to compute feature selection on datasets (this is already done in my working copy, but not yet committed) (F5).
  • Vector input and classfile input:
    • Add a way to input vector files and classfiles in different formats using user-programmed factories (F6). This needs to be planned carefully (it is not straightforward).
  • New programs to be added (which should not be a big deal given the amount of things already done):
    • TrainAndTest (or any other name): evaluation via cross-validation, leave-one-out, or a single train-test split on a fixed set of vectors, for the cases in which there is only a training set (F7). Not very difficult.
    • ThresholdEstimation (with PCut, SCut and SCutFbr 0 or 1): should not be a big deal with the previous program done. The estimated thresholds could be written to a file and used for classification afterwards (slightly modifying ExecuteParametricClassifier), but this requires a careful review of the literature (I have not used threshold estimation until now) (F8).
  • "Internal things" to be changed:
    • Add support for a logging framework (log4j?) (F9).
    • Fix the idf and normalization handling: this is a bit messy. When classifying large-scale data, mainly textual documents, vectors are sometimes multiplied by the idf vector (weighting how important each coordinate is), and are often (l2-)normalized. In some cases, a vector file already contains idf-multiplied and/or normalized vectors (F10). Two things need to be done (see the sketch after this list):
      • Include a way to denote when a classifier needs idf and/or normalization.
      • Add a global option in the learning/classification programs to manually specify whether idf and/or normalization has already been applied. If nothing is specified, frequency vectors should be assumed, and the procedure needed by that classifier will be applied (SVM: idf + normalization; Naive Bayes: no idf, no normalization).
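
As a rough illustration of F2 (not actual DauroLab code; the LossFunction and Perceptron names below are hypothetical), a single-layer perceptron trained by stochastic gradient descent could let the caller plug in the loss function and the L2 regularization strength:

    // Hypothetical sketch for F2: names and signatures are illustrative only.
    interface LossFunction {
        // Derivative of the loss with respect to the margin z = y * <w, x>.
        double derivative(double margin);
    }

    class HingeLoss implements LossFunction {
        public double derivative(double margin) {
            return margin < 1.0 ? -1.0 : 0.0;
        }
    }

    class Perceptron {
        private final double[] w;           // weight vector
        private final LossFunction loss;    // pluggable loss (square, hinge, logistic, ...)
        private final double learningRate;
        private final double lambda;        // L2 regularization strength

        Perceptron(int dimension, LossFunction loss, double learningRate, double lambda) {
            this.w = new double[dimension];
            this.loss = loss;
            this.learningRate = learningRate;
            this.lambda = lambda;
        }

        // One stochastic gradient step on a single example (x, y), with y in {-1, +1}.
        void update(double[] x, int y) {
            double g = loss.derivative(y * score(x));
            for (int i = 0; i < w.length; i++) {
                w[i] -= learningRate * (g * y * x[i] + lambda * w[i]);
            }
        }

        double score(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) {
                dot += w[i] * x[i];
            }
            return dot;
        }
    }

Square or logistic losses would just be further implementations of derivative(), and setting lambda to zero disables regularization.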
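
For F10, a minimal sketch of the preprocessing itself (again with hypothetical names, and dense arrays used only for brevity): multiply a raw frequency vector by the precomputed idf vector and then l2-normalize it, which is what, for instance, the SVM classifiers would require by default:

    // Hypothetical sketch for F10: idf weighting followed by l2 normalization.
    final class VectorPreprocessing {

        // idf[i] is assumed to be precomputed from the training collection,
        // e.g. log(N / df_i) for term i.
        static double[] applyIdf(double[] frequencies, double[] idf) {
            double[] weighted = new double[frequencies.length];
            for (int i = 0; i < frequencies.length; i++) {
                weighted[i] = frequencies[i] * idf[i];
            }
            return weighted;
        }

        // Divide every coordinate by the Euclidean norm (the zero vector is left untouched).
        static double[] l2Normalize(double[] v) {
            double sumOfSquares = 0.0;
            for (double x : v) {
                sumOfSquares += x * x;
            }
            if (sumOfSquares == 0.0) {
                return v.clone();
            }
            double norm = Math.sqrt(sumOfSquares);
            double[] normalized = new double[v.length];
            for (int i = 0; i < v.length; i++) {
                normalized[i] = v[i] / norm;
            }
            return normalized;
        }
    }

With something like this in place, each classifier could declare which of the two steps it needs, and the learning/classification programs would apply them only when the input vectors are plain frequency vectors.
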
This makes a total of 10 changes (some of them easier, some of them not). If I do one per month, leaving one month for holidays and one extra month as overflow, the plan does not seem overly ambitious. On the other hand, I will consider getting around 70%-80% of these things done a good result, taking into account that this is not my main work (DauroLab is not the tool I am currently using in my research).

We will leave for 2012 other learning paradigms I'm interested in, like active learning or semi-supervised learning. Also, integration with new libraries is not planned for the moment (although a good [sparse] matrix library and a Bayesian networks library would be interesting). Maybe I should code those libraries myself...

We will see at the end of the year.

6 May 2010

Keys in the design of DauroLab

Okay, I am not a software architect (in fact, that is not my job), but I tried to keep several guidelines in mind while designing DauroLab. They are, basically, the following:

  • It should be designed to handle larger amounts of data than other toolboxes such as, for example, Weka. Weka is a superb machine learning package, but it assumes datasets fit entirely in memory (which is not necessarily true). Thus, disk usage is preferred over memory consumption, which makes DauroLab more oriented towards large-scale machine learning tasks (although, in this sense, it is not yet as efficient as I would like).
  • Because of that need for memory, it has been implemented in a language (Java) which is efficient enough without using too much memory. Another option would have been to implement the core functions in C/C++ and write a wrapper in Python, Perl or Ruby. But, honestly, that great idea did not occur to me when I started working on it...
  • The fewer dependencies DauroLab has, the better. However, to avoid reinventing the wheel, I have used several free-software packages for tasks where the implementation would have been genuinely hard (at the moment, the SVM classifiers and the stemming algorithms).
  • Of course, all the software is now free software. This is not a real "guideline", but it has been going round in my head during these last years. Now it has become a reality.
  • DauroLab should be able to learn a classifier first, so that it can be used afterwards. This implies that classifiers should be storable in secondary memory. In this toolbox, after learning a classifier you get a single file containing all the learnt information, which can therefore be reused later (a minimal sketch of the idea is given below).
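
As a minimal sketch of that single-file idea (the ClassifierStorage class below is hypothetical and simply uses plain Java object serialization; DauroLab's actual storage format may differ), a learnt classifier implementing java.io.Serializable could be saved and loaded like this:

    // Hypothetical sketch: storing a learnt classifier in a single file
    // via standard Java object serialization.
    import java.io.*;

    final class ClassifierStorage {

        // Write the learnt classifier (all its parameters) to a single file.
        static void save(Serializable classifier, File file) throws IOException {
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
            try {
                out.writeObject(classifier);
            } finally {
                out.close();
            }
        }

        // Read the classifier back so it can be used for classification later on.
        static Object load(File file) throws IOException, ClassNotFoundException {
            ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
            try {
                return in.readObject();
            } finally {
                in.close();
            }
        }
    }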

5 May 2010

Artwork and name



The artwork of DauroLab presents a cartoon-style bridge over a small river in what seems to be an old city. In fact, the city shown is the old part of Granada (my city): the Carrera del Darro, a street near the Alhambra and one of the most typical postcard views of Granada. The river in the picture is the Darro, whose name comes from the Latin word Dauro (a reference to the gold found on its banks).

The name DauroLab is a combination of Dauro (the name of that river) and Lab (laboratory). The fact that the river contained gold in ancient times makes the name nicer, because one can interpret the package as being built to extract gold from its data.

3 May 2010

About DauroLab

DauroLab is a toolbox for large-scale machine learning, mainly supervised classification, originally designed for document categorization purposes. It was written by Alfonso E. Romero while working on his PhD thesis (from April 2006 to April 2010), and he keeps developing it, adding new features.

DauroLab is written in Java, under the GNU GPL license (version 3). It has few dependencies, all of them released under an OSI-approved license (in other words, libre or open-source licenses).