6 May 2010

Keys in the design of DauroLab

Okay, I am not a software architect (in fact, it is not my job), but I tried to keep several guidelines in mind while designing DauroLab. They are, basically, the following:

  • It should be designed to handle larger amounts of data than other toolboxes such as, for example, Weka. Weka is a superb machine learning package, but it assumes that datasets fit entirely in memory (which is not necessarily true). Thus, DauroLab prefers using disk over consuming memory, and is therefore more oriented towards large-scale machine learning tasks (although, in this sense, it is not as efficient as I would like).
  • Because of that "need of memory", it has been implemented in a language (Java) that is efficient enough without using too much memory. Another option would have been to implement the core functions in C/C++ and write a wrapper in Python, Perl or Ruby. But honestly, that idea had not occurred to me when I started working on it...
  • The fewer dependencies DauroLab has, the better. However, to avoid reinventing the wheel, I have used several free-software packages for tasks whose implementation was genuinely hard (at the moment, the SVM classifiers and the stemming algorithms).
  • Of course, all the software is now free software. This is not a real "guideline", but it has been going round in my head during these last years, and now it has become a reality.
  • DauroLab should first learn a classifier, so that it can be used afterwards. This implies that classifiers must be storable in secondary memory. In this toolbox, after learning a classifier you get a single file containing all the learnt information, which can then be loaded and used later.
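The first guideline, preferring disk over memory, can be illustrated with a minimal Java sketch: instead of loading a dataset into memory, process it one line at a time with a buffered reader. This is not DauroLab's actual code, and the CSV-like format (last field is the class label) is just an assumption for the example.

```java
import java.io.*;
import java.util.*;

public class StreamingLabelCount {
    // Count class labels in a single pass, without holding the dataset in
    // memory. Assumes a CSV-like input whose last field is the class label;
    // DauroLab's real on-disk format may well differ.
    public static Map<String, Integer> countLabels(Reader source) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                // Only the label is kept per line; the features are discarded,
                // so memory use stays constant regardless of dataset size.
                String label = line.substring(line.lastIndexOf(',') + 1).trim();
                counts.merge(label, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String data = "1.0,2.0,spam\n0.5,1.5,ham\n0.7,0.1,spam\n";
        Map<String, Integer> c = countLabels(new StringReader(data));
        System.out.println(c.get("spam") + " " + c.get("ham")); // prints "2 1"
    }
}
```

The same reader-based loop works over a `FileReader`, so a dataset far larger than the heap can still be scanned.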
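The last guideline, storing a learnt classifier in a single file, can be sketched with standard Java object serialization. The `Model` class and its linear `score` function are hypothetical stand-ins; DauroLab's actual persistence mechanism is not shown here.

```java
import java.io.*;

// Minimal sketch: save a learnt model to one file and load it back later,
// using java.io.Serializable. Not DauroLab's real code.
public class ModelStore {
    public static class Model implements Serializable {
        private static final long serialVersionUID = 1L;
        public final double[] weights; // the "learnt information"

        public Model(double[] weights) { this.weights = weights; }

        // Toy linear scoring function, just to show the model is usable
        // again after loading.
        public double score(double[] x) {
            double s = 0.0;
            for (int i = 0; i < weights.length; i++) s += weights[i] * x[i];
            return s;
        }
    }

    // Write the whole model into a single file.
    public static void save(Model m, File f) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(m);
        }
    }

    // Read the model back from that file.
    public static Model load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return (Model) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("model", ".bin");
        save(new Model(new double[]{0.5, -1.0}), f);
        Model m = load(f);
        System.out.println(m.score(new double[]{2.0, 1.0})); // prints "0.0"
        f.delete();
    }
}
```

Because everything the classifier learnt lives in one serialized object, shipping the single file is enough to reuse it elsewhere.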
