Database Collections

Collection: Problem

A problem contains the following elements:

  • uuid: a unique identifier of the problem. db.UUIDField().
  • workflow: UUID of the associated problem workflow stored on Storage. db.UUIDField(). A problem workflow mainly defines the data targets and the performance metric used to evaluate machine learning models. An example of a problem workflow is given for sleep stage classification here.
  • timestamp_upload: timestamp of the problem creation. db.DateTimeField().
  • test_dataset: list of UUIDs of test data, which are not accessible, except by Compute to compute the performance of submitted algorithms. db.ListField(db.UUIDField()).
  • size_train_dataset: size of the mini-batches used for each training task. db.IntegerField().
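
As an illustration, here is a minimal Flask-MongoEngine sketch of this collection, assuming the db handle comes from the Orchestrator's Flask application; the unique and default options are assumptions, not part of the documented schema:

    import datetime

    from flask_mongoengine import MongoEngine

    db = MongoEngine()  # assumed to be initialised against the Orchestrator's Flask app

    class Problem(db.Document):
        # Field names and types mirror the list above; options are illustrative only.
        uuid = db.UUIDField(unique=True)
        workflow = db.UUIDField()
        timestamp_upload = db.DateTimeField(default=datetime.datetime.utcnow)
        test_dataset = db.ListField(db.UUIDField())
        size_train_dataset = db.IntegerField()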

Collection: Learnuplet

A learnuplet defines a learning task. It is constructed by the Orchestrator in two cases:

  • when new data is uploaded
  • when a new algorithm is uploaded

The learnuplet is then used by Compute to do the training.

A learnuplet is made of the following elements:

  • uuid: a unique identifier of the task. db.UUIDField().
  • problem: the UUID of the problem associated with the learning task. db.UUIDField().
  • workflow: the UUID of the problem workflow associated with the learning task. db.UUIDField().
  • train_data: list of train data UUIDs, on which the learning will be done. db.ListField(db.UUIDField()).
  • test_data: list of test data UUIDs, on which the performance of the algorithm is computed. db.ListField(db.UUIDField()).
  • algo: UUID of submitted algorithm. db.UUIDField().
  • model_start: UUID of model to be trained. If rank=0, this UUID is the same as algo. db.UUIDField().
  • model_end: UUID of the model obtained after training of model_start. db.UUIDField().
  • rank: rank of the task, which defines the order in which learnuplets must be trained. For more details, see Details on the construction of a learnuplet at algorithm upload and Details on the construction of a learnuplet at data upload below.
  • worker: UUID of worker which is in charge of the training task defined by this learnuplet. db.UUIDField().
  • status: status of the training task. It can be waiting if the task is waiting on a model training of lower rank, todo if the training job can start, pending if a worker is currently consuming the task, done if training has completed successfully, or failed if training was unsuccessful. db.StringField(max_length=8).
  • perf: performance on test data. db.FloatField().
  • test_perf: dictionary of performances on test data: each element is the performance on one test data file (the keys being the corresponding data uuids). db.ListField(db.FloatField()).
  • train_perf: dictionary of performances on train data: each element is the performance on one train data file (the keys being the corresponding data uuids). db.ListField(db.FloatField()).
  • training_creation: timestamp of the learnuplet creation. db.DateTimeField().
  • training_done: timestamp of the feedback from Compute (when the status is updated to done or failed). db.DateTimeField().
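
Following the same assumed Flask-MongoEngine setup as the Problem sketch above, the learnuplet could be modelled as follows; the type of rank and the choices constraint on status are assumptions inferred from the descriptions above:

    class Learnuplet(db.Document):
        # Sketch only; field names and types mirror the list above.
        uuid = db.UUIDField(unique=True)
        problem = db.UUIDField()
        workflow = db.UUIDField()
        train_data = db.ListField(db.UUIDField())
        test_data = db.ListField(db.UUIDField())
        algo = db.UUIDField()
        model_start = db.UUIDField()
        model_end = db.UUIDField()
        rank = db.IntegerField()  # assumption: type not stated in the field list
        worker = db.UUIDField()
        # assumption: restricting status to the five values described above
        status = db.StringField(max_length=8,
                                choices=("waiting", "todo", "pending", "done", "failed"))
        perf = db.FloatField()
        test_perf = db.ListField(db.FloatField())
        train_perf = db.ListField(db.FloatField())
        training_creation = db.DateTimeField()
        training_done = db.DateTimeField()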

Details on the construction of a learnuplet at algorithm upload

When a new algorithm is uploaded, its training is specified in learnuplets by the Orchestrator.

For now, they are constructed following these steps:

  1. selection of the associated active data: for now, all data corresponding to the same problem that have targets. This might change later to lower computational costs.
  2. for each mini-batch containing size_train_dataset data (a parameter fixed for the problem), creation of a learnuplet. Each learnuplet contains the UUID of the model from which to start the training in model_start and the UUID under which to save the model after training in model_end. The first learnuplet has rank=0, status=todo and a specified model_start; the others have incremental values of rank, status=waiting and nothing in model_start (filled in later). The model from which to start the learning is not defined for a learnuplet with rank=i at creation, but only when the performance of the learnuplet with rank=i-1 is registered on the Orchestrator. At that moment, the Orchestrator looks for the model_end of the learnuplet with the best performance and chooses it as the model_start for the learnuplet of rank=i.
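
The following sketch illustrates these two steps, assuming the Learnuplet document sketched above; get_active_data and the overall function are hypothetical illustrations, not actual Orchestrator code:

    import uuid as uuidlib

    def create_learnuplets_for_algo(algo_uuid, problem):
        # Step 1: hypothetical helper returning the UUIDs of active data with targets.
        active_data = get_active_data(problem)
        # Step 2: one learnuplet per mini-batch of size_train_dataset data.
        batch_size = problem.size_train_dataset
        batches = [active_data[i:i + batch_size]
                   for i in range(0, len(active_data), batch_size)]
        for rank, batch in enumerate(batches):
            Learnuplet(
                uuid=uuidlib.uuid4(),
                problem=problem.uuid,
                workflow=problem.workflow,
                algo=algo_uuid,
                train_data=batch,
                test_data=problem.test_dataset,
                rank=rank,
                # Only rank 0 starts from the raw algo; model_start of the others
                # is filled in once the previous rank's performance is registered.
                model_start=algo_uuid if rank == 0 else None,
                model_end=uuidlib.uuid4(),  # UUID under which the trained model is saved
                status="todo" if rank == 0 else "waiting",
            ).save()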

Details on the construction of a learnuplet at data upload

When new data is uploaded, the relevant models are updated.

For now, the construction of corresponding learnuplets is made as follows:

  1. selection of the relevant models, called active models: for now, all models corresponding to the same problem. This might change later to lower computational costs.
  2. for each algorithm:
  • 2.1 find the model which has the best performance (which is not necessarily the one with the highest rank).
  • 2.2 for each mini-batch containing size_train_dataset data (a parameter fixed for the problem), creation of a learnuplet starting from the model found in 2.1.
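
A hedged sketch of this procedure, assuming the Learnuplet document sketched earlier and an Algo document with uuid and problem fields (see the Algo collection below); the function itself and the rank assigned to the new learnuplets are assumptions:

    import uuid as uuidlib

    def create_learnuplets_for_data(new_data_uuids, problem):
        batch_size = problem.size_train_dataset
        for algo in Algo.objects(problem=problem.uuid):
            # 2.1: best-performing trained model, not necessarily the highest rank.
            best = (Learnuplet.objects(algo=algo.uuid, status="done")
                    .order_by("-perf").first())
            if best is None:
                continue  # no trained model yet for this algo
            # 2.2: one learnuplet per mini-batch, starting from the best model.
            for offset, start in enumerate(range(0, len(new_data_uuids), batch_size)):
                Learnuplet(
                    uuid=uuidlib.uuid4(),
                    problem=problem.uuid,
                    workflow=problem.workflow,
                    algo=algo.uuid,
                    train_data=new_data_uuids[start:start + batch_size],
                    test_data=problem.test_dataset,
                    rank=best.rank + 1 + offset,  # assumption: ranks continue after the best model
                    model_start=best.model_end,
                    model_end=uuidlib.uuid4(),
                    status="todo",
                ).save()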

Collection: Algo

An algo represents an untrained machine learning model for a given problem, submitted via Analytics, stored in Storage, and registered in the Orchestrator database. An algo has the following fields:

  • uuid: a unique identifier of the algo. db.UUIDField().
  • problem: UUID of the associated problem. db.UUIDField().
  • name: name of the algo. db.StringField().
  • timestamp_upload: timestamp of registration on Orchestrator. db.DateTimeField().

For details about how to register an algo, see the endpoints documentation.

Note: For now, there is no field to indicate who submitted the algo, since this is out of scope for phase 1.1. For phase 1.2, a Poster collection might be introduced (with uuid and token fields), and its uuid might be added to the Algo collection.

Collection: Data

A data is submitted via the Viewer, stored in Storage, and registered in the Orchestrator database. It has the following fields:

  • uuid: a unique identifier of the data. db.UUIDField().
  • problems: list of UUIDs of associated problems (a data can be associated with several problems). db.ListField(db.UUIDField()).
  • timestamp_upload: timestamp of registration on Orchestrator. db.DateTimeField().
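
A matching sketch under the same assumptions:

    class Data(db.Document):
        # Sketch only; a data entry can reference several problems.
        uuid = db.UUIDField(unique=True)
        problems = db.ListField(db.UUIDField())
        timestamp_upload = db.DateTimeField()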

Note: For now, there is no field to indicate who submitted the data, since this is out of scope for phase 1.1. For phase 1.2, a Poster collection might be introduced (with uuid and token fields), and its uuid might be added to the Data collection.

For details about how to register a data, see the endpoints documentation.

Collection: Preduplet

A preduplet is created in the Orchestrator when a prediction is requested. It has the following fields:

  • uuid: a unique identifier of the preduplet. db.UUIDField().
  • problem: UUID of the associated problem. db.UUIDField().
  • workflow: UUID on Storage of the workflow associated with the problem. db.UUIDField().
  • data: list of UUIDs on Storage of the data from which to compute the prediction. db.ListField(db.UUIDField()).
  • prediction_storage_uuid: list of UUIDs of the associated prediction files on Storage. db.ListField(db.UUIDField()).
  • model: UUID on Storage of the model used to compute the prediction. db.UUIDField().
  • worker: UUID of the worker on which the computation is made. db.UUIDField().
  • status: status of the prediction task. db.StringField(max_length=8).
  • timestamp_request: timestamp of the prediction request. db.DateTimeField().
  • timestamp_done: timestamp of the feedback from Compute when the prediction is done. db.DateTimeField().
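
Under the same assumed Flask-MongoEngine setup, a sketch of the preduplet could look as follows; the allowed status values are not spelled out above, so the field is left unconstrained here:

    class Preduplet(db.Document):
        # Sketch only; mirrors the field list above.
        uuid = db.UUIDField(unique=True)
        problem = db.UUIDField()
        workflow = db.UUIDField()
        data = db.ListField(db.UUIDField())
        prediction_storage_uuid = db.ListField(db.UUIDField())
        model = db.UUIDField()
        worker = db.UUIDField()
        status = db.StringField(max_length=8)
        timestamp_request = db.DateTimeField()
        timestamp_done = db.DateTimeField()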

For details about how to request a prediction, see the endpoints documentation.