6 May 2010

Keys in the design of DauroLab

Okay, I am not a software architect (in fact, it is not my job), but I tried to follow several guidelines while designing DauroLab. They are, basically, the following:

  • It should handle larger amounts of data than other toolboxes such as Weka. Weka is a superb machine learning package, but it assumes that datasets fit entirely in memory (which is not necessarily true). Thus, disk usage is preferred over memory consumption, and the toolbox is therefore more oriented to large-scale machine learning tasks (although it is not yet as efficient in this sense as I would like).
  • Because of that memory concern, it has been implemented in a language (Java) that is efficient enough without using too much memory. Another option would have been to implement the core functions in C/C++ and write a wrapper in Python, Perl or Ruby. But honestly, I did not have that great idea when I started working on it...
  • The fewer dependencies DauroLab has, the better. However, to avoid reinventing the wheel, I have used free-software packages for tasks that are genuinely hard to implement (at the moment, the SVM classifiers and the stemming algorithms).
  • Of course, all the software is now free software. This is not a real "guideline", but it has been going around in my head during these last years. Now it has become a reality.
  • DauroLab should be able to learn a classifier first and use it afterwards. This implies that classifiers must be storable in secondary memory. In this toolbox, learning a classifier produces a single file that contains all the learnt information, so it can be used later.
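
To make the first and last guidelines a bit more concrete, here is a minimal Java sketch of the general idea: read the training data from disk one line at a time instead of loading it all into memory, then store the learnt classifier in a single file and load it back later. This is not DauroLab's actual code or API. The class names, the toy word-counting classifier, the "label, tab, text" input format and the use of standard Java serialization are all assumptions made only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from DauroLab.
final class ClassifierStoreSketch {

    /** Toy classifier: counts how often each word was seen under each label. */
    static final class WordCountClassifier implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

        void learn(String label, String text) {
            Map<String, Integer> perLabel = counts.computeIfAbsent(label, k -> new HashMap<>());
            for (String word : text.toLowerCase().split("\\s+")) {
                perLabel.merge(word, 1, Integer::sum);
            }
        }

        /** Returns the label whose training words overlap most with the text. */
        String classify(String text) {
            String best = null;
            int bestScore = -1;
            for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                int score = 0;
                for (String word : text.toLowerCase().split("\\s+")) {
                    score += e.getValue().getOrDefault(word, 0);
                }
                if (score > bestScore) {
                    best = e.getKey();
                    bestScore = score;
                }
            }
            return best;
        }
    }

    /** Streams "label<TAB>text" lines from disk; only one line is held in memory at a time. */
    static WordCountClassifier train(Path data) throws IOException {
        WordCountClassifier c = new WordCountClassifier();
        try (BufferedReader reader = Files.newBufferedReader(data)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) continue; // skip lines without a label
                c.learn(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return c;
    }

    /** Writes the whole learnt model to a single file. */
    static void save(WordCountClassifier c, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(c);
        }
    }

    /** Reads a previously saved model back from its file. */
    static WordCountClassifier load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (WordCountClassifier) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path data = Files.createTempFile("train", ".tsv");
        Files.write(data, Arrays.asList("sports\tgoal match team goal", "politics\tvote election party"));
        Path model = Files.createTempFile("classifier", ".model");

        save(train(data), model);                  // learn once, store on disk
        WordCountClassifier restored = load(model); // reuse later without retraining
        System.out.println(restored.classify("the team scored a goal"));

        Files.delete(data);
        Files.delete(model);
    }
}
```

Running `main` should print `sports`. Java serialization is used here only because it needs no extra dependencies. A real toolbox might well prefer its own file format, for instance to keep model files readable across versions.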

5 May 2010

Artwork and name



The artwork of DauroLab presents a cartoonized bridge over a small river in what seems to be an old city. In fact, the city shown is the old part of Granada (my city): the Carrera del Darro, a street near the Alhambra and one of the most typical postcards of Granada. The river in the picture is the Darro, whose name comes from the Latin word Dauro (a reference to the gold found on its banks).

The name DauroLab is a combination of Dauro (the name of that river) and Lab (laboratory). The fact that the river contained gold in ancient times makes the name nicer, because one may interpret it as a package built to extract gold from data.

3 May 2010

About DauroLab

DauroLab is a toolbox for large-scale machine learning, mainly supervised classification, originally designed for document categorization purposes. It was written by Alfonso E. Romero while working on his PhD thesis (from April 2006 to April 2010), and he keeps developing it and adding new features.

DauroLab is written in Java and released under the GNU GPL license (version 3). It has few dependencies, all of them released under an OSI-approved license (in other words, libre or open source licenses).