LiveML¶
Overview¶
LiveML is an automated machine learning framework that trains Gradient Boosted Trees with hyper-parameters tuned using Bayesian Search, offloading compute to AWS services. Gradient Boosted Trees are ensembles of decision trees that collectively "vote" to predict outcomes. The trees in a model are constructed sequentially through machine learning so as to optimize a defined objective (Area Under the Precision-Recall Curve), with each new tree improving upon the errors of the trees before it. These models are considered state-of-the-art.
Gradient Boosted Trees have a number of hyper-parameters, and the values of these hyper-parameters have a dramatic impact on the performance of any given model. Finding "good" hyper-parameter values, ones that yield high model performance, is a computational task more than an analytic one because the search space is very large and grows exponentially with the number of hyper-parameters. The computational process of finding "good" values is called hyper-parameter tuning. We use Bayesian Search as our hyper-parameter tuning process because it allows for a faster and more efficient search through a very large search space.
Pre-requisites¶
Files¶
Every project requires, at minimum, four files.
- A features file
- An outcomes file
- A scale file
- A categorical file
These files are required. Typically they are .csv files but they can also be .xlsx files. Other, optional files are:
- An out-of-sample file
- A weights file
- A survey file
To do: give more details about the nature of these files and what they look like
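Until those details are documented, here is a minimal local sanity check you can run before uploading -- a sketch only, assuming pandas is installed and that the features and outcomes files are ordinary CSVs whose identifier columns match the iv_idvar and dv_idvar values in the .env file shown below.
# Quick local sanity check (a sketch; not part of LiveML)
import pandas as pd
features = pd.read_csv("features.csv")
outcomes = pd.read_csv("outcomes.csv")
assert "LRW_ID" in features.columns          # joining identifier for the features file
assert "Respondent_ID" in outcomes.columns   # joining identifier for the outcomes file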
The project below assumes the following values are stored in a file called .env
environment = dev
api_name = liveml
profile = livescoring
aws_profile = livescoring
project_id = P555555
iv_filename = features.csv
dv_filename = outcomes.csv
cat_vars_filename = categoricals.csv
scale_vars_filename = scales.csv
iv_idvar = LRW_ID
dv_idvar = Respondent_ID
scorers = ["f1_macro"]
email = aalvarezsuarez@lrwonline.com
iterations = 1
repeats = 1
global_filter_variable = ""
Database connectivity¶
In order to use this module, you will need to be able to access the database. (Local databases are not currently supported; support will be added eventually.) Reach out to Hrag Balian to gain access.
AWS access¶
This module is tightly integrated with AWS -- leveraging S3, ECS and Lambda heavily as micro-services. As a result, in order to use this module, you will need to have access to the AWS account that it is connected to. Reach out to Hrag Balian to gain access. Once you have access, make sure to create a profile for this account in your AWS configs.
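As an optional sanity check that the profile works, you can open a session against it. The sketch below assumes boto3 is installed and that your profile is named livescoring (the aws_profile value in the .env); it simply lists a few S3 buckets and will error out if the credentials are missing.
# Optional check that the AWS profile is configured (sketch; assumes boto3)
import boto3
session = boto3.Session(profile_name="livescoring")  # aws_profile from .env
s3 = session.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]][:3])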
Environment¶
One of the integrations with AWS is a Lambda function that requires the environment variable DROP_FILES_PREFIX to be set.
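If you manage the Lambda deployment yourself, one way to set the variable is through the Lambda API. The sketch below uses boto3; both the function name and the prefix value are placeholders, so confirm the real ones with the account owner before running it.
# Hypothetical example of setting the Lambda environment variable (placeholders only)
import boto3
lam = boto3.Session(profile_name="livescoring").client("lambda")
lam.update_function_configuration(
    FunctionName="liveml-drop-files",                            # placeholder function name
    Environment={"Variables": {"DROP_FILES_PREFIX": "drops/"}},  # placeholder prefix value
)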
Start a project¶
Initialize¶
from dotenv import dotenv_values
from adele.modeling.liveml.project_client import ProjectClient

# Load the project settings from the .env file described above
env = dotenv_values("examples/liveml/.env")

project_client = ProjectClient(
    env.get('project_id'),
    env.get('environment'),
    local=True)
Add a description¶
Until you add a description, your project is not going to be activated. So let's do that by using add.
project_client.add('What a cool project')
Okay, now our project is activated. You can verify that with info.
project_client.info()
You should see something like the following response
('SUCCESS',
[{'project_id': 'P555555',
'description': 'What a cool project',
'match_thresh': '0.4',
'missingness': '0.1',
'features_id': None,
'outcomes_id': None,
'created_at': '2022-02-09T20:42:57.137738+00:00'}])
which indicates that your project exists in the database records. If instead you see this response
('SUCCESS', [])
when using info, then your project did not initialize successfully.
If your description is wrong for whatever reason, or you want to update it just for fun, you can, at any time. Just use the update() method on the project_client. (We'll come back to the other pieces of information retrieved from info later.)
project_client.update(dict(description = 'Not a cool project'))
Next, we want to add data files to the project. You can confirm that your project doesn't have any files yet.
project_client.filenames
which should return an empty list []. Remember, we need to add, at minimum, the four files noted above.
Upload data¶
To upload data, we reference FileClient to initialize a file_client.
Initialize your file client¶
from adele.modeling.liveml.file_client import FileClient
from dotenv import dotenv_values
import requests
env = dotenv_values("examples/liveml/.env")
file_client = FileClient(
    env.get("environment"),
    local=True)
Features file¶
First, we'll upload a features file. In our case, our features are in a .csv file called "features.csv".
file_client.project_id = env.get("project_id")
file_client.filename = env.get("iv_filename")
file_client.create_upload_url()
fileref = f"examples/liveml/data/valid/{env.get('iv_filename')}"
with open(fileref, 'rb') as data:
    requests.put(file_client.upload_url, data=data)
Once you've uploaded your file, you can verify that it has landed and been associated with your project using project_client. Give it a second or two: a Lambda function does the work of picking up your file and moving it, and larger files will take longer.
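If you'd rather script the wait than eyeball it, a small polling helper along these lines works. This is a sketch, not part of LiveML; it assumes that reading project_client.filenames re-queries the project each time it is accessed.
import time

# Hypothetical helper: poll project_client.filenames until the upload appears
def wait_for_file(project_client, filename, timeout=60, interval=2):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if any(f["filename"] == filename for f in project_client.filenames):
            return True
        time.sleep(interval)
    return False

wait_for_file(project_client, env.get("iv_filename"))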
We want to tag this file as a "features" file, which will be important for a set of validations later.
file_client.tag = 'features'
For any features file you will also need to specify the joining identifier. This is a column header in the features file that will be used to join files.
file_client.idvar = env.get("iv_idvar")
Note that there should be consistency between files around the joining identifier. Specifically, all features files should have the same joining identifier, and all outcomes files should have the same joining identifier -- these get set as global attributes. The features and outcomes files do not need to share the same joining identifier.
Now, looking at the filenames attribute on the project_client,
project_client.filenames
should yield the following response:
[{'project_id': 'P555555',
'filename': 'features.csv',
'tag': 'features',
'intra_validation': None,
'error_list': None,
'file_size': '580879',
'last_modified': '2022-02-10 18:58:08+00',
'created_at': '2022-02-10T18:58:11.785165+00:00'}]
Outcomes file¶
You can reuse the existing file_client to upload other files. Next we'll upload our outcomes file. Just change the filename attribute, then go through the same upload steps, now tagging your file as an "outcomes" file and setting your idvar.
file_client.filename = env.get("dv_filename")
file_client.create_upload_url()

fileref = f"examples/liveml/data/valid/{env.get('dv_filename')}"
with open(fileref, 'rb') as data:
    requests.put(file_client.upload_url, data=data)

file_client.tag = 'outcomes'
file_client.idvar = env.get("dv_idvar")
Verify that your file has been uploaded by referring to project_client.filenames. The response should look something like this.
[{'project_id': 'P555555',
'filename': 'features.csv',
'tag': 'features',
'intra_validation': None,
'error_list': None,
'file_size': '580879',
'last_modified': '2022-02-10 18:58:08+00',
'created_at': '2022-02-10T18:58:11.785165+00:00'},
{'project_id': 'P555555',
'filename': 'outcomes.csv',
'tag': 'outcomes',
'intra_validation': None,
'error_list': None,
'file_size': None,
'last_modified': None,
'created_at': '2022-02-10T20:06:11.058237+00:00'}]
Categorical and scale file¶
In the same way, we'll upload our categorical and scale files. Let's start first with categorical...
file_client.filename = env.get("cat_vars_filename")
file_client.create_upload_url()
fileref = f"examples/liveml/data/valid/{env.get('cat_vars_filename')}"
with open(fileref, 'rb') as data:
    requests.put(file_client.upload_url, data=data)
file_client.tag = 'categorical'
... and now let's upload the scale file.
file_client.filename = env.get("scale_vars_filename")
file_client.create_upload_url()
fileref = f"examples/liveml/data/valid/{env.get('scale_vars_filename')}"
with open(fileref, 'rb') as data:
    requests.put(file_client.upload_url, data=data)
file_client.tag = 'scale'
Note that with the categorical and scale files we don't need to set a joining identifier, but we do need to add a tag to the file.
Let's confirm once again that our files are uploaded, using project_client.filenames.
[{'project_id': 'P555555',
'filename': 'features.csv',
'tag': 'features',
'intra_validation': None,
'error_list': None,
'file_size': '580879',
'last_modified': '2022-02-10 18:58:08+00',
'created_at': '2022-02-10T18:58:11.785165+00:00'},
{'project_id': 'P555555',
'filename': 'outcomes.csv',
'tag': 'outcomes',
'intra_validation': None,
'error_list': None,
'file_size': None,
'last_modified': None,
'created_at': '2022-02-10T20:06:11.058237+00:00'},
{'project_id': 'P555555',
'filename': 'scales.csv',
'tag': 'scale',
'intra_validation': None,
'error_list': None,
'file_size': None,
'last_modified': None,
'created_at': '2022-02-10T20:51:36.277828+00:00'},
{'project_id': 'P555555',
'filename': 'categoricals.csv',
'tag': 'categorical',
'intra_validation': None,
'error_list': None,
'file_size': None,
'last_modified': None,
'created_at': '2022-02-10T20:51:44.957992+00:00'}]
We can confirm that all four of our files have been received successfully.
Validation¶
LiveML has robust file validation methods, since the files need to be in a specific format. So far we've only uploaded files to the project; we haven't actually done any validation on those files. The next step is to proceed with those validations.
The following validations need to happen:
- the features file and the outcomes file are joinable on the specified joining identifier.
- the column headers in the scale and categorical files have corresponding data in the features file.
- the variables identified as scale in the scale file are numeric in the features file, and the variables identified as categorical in the categorical file are nominal in the features file.
- the outcomes file has an identifier... (To do: complete with the other validations that happen)
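A few of these checks can be roughly approximated locally before submitting the real validation -- a sketch only, assuming pandas and the file layout described above (the scale and categorical files listing variable names as column headers); the authoritative validation is the one LiveML runs below.
# Rough local approximation of a few of the checks above (sketch only)
import pandas as pd

features = pd.read_csv("features.csv")
outcomes = pd.read_csv("outcomes.csv")
scales = pd.read_csv("scales.csv")
cats = pd.read_csv("categoricals.csv")

# 1. features and outcomes are joinable on their identifiers
joined = features.merge(outcomes, left_on="LRW_ID", right_on="Respondent_ID")
print(f"{len(joined)} joinable rows")

# 2. scale/categorical column headers have corresponding columns in the features file
missing = set(scales.columns).union(cats.columns) - set(features.columns)
print("missing from features:", missing)

# 3. scale variables are numeric in the features file
non_numeric = [c for c in scales.columns
               if c in features.columns and not pd.api.types.is_numeric_dtype(features[c])]
print("non-numeric scale variables:", non_numeric)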
Notice how some file validations require the presence of other files. We could validate files one at a time, and the file_client has a corresponding validation method to do so. But that is tedious and will likely result in some files not being fully validated. Instead, we can use the validation functionality on the project_client to validate everything at once.
Validation token¶
To do this, first we need a validation token. Because validating all files at once is potentially a time consuming process, the validation token can be used to poll the validation for progress.
project_client.validation_token()
token = project_client.token
Note that you can't skip generating a validation token. To execute the validation, call validate_all with the two joining identifiers as parameters:
Execute validation¶
project_client.validate_all(env.get('iv_idvar'), env.get('dv_idvar'))
You don't need to use the token to execute the validation; it is stored as an attribute on the project_client. Instead, because the validation will likely take a few seconds to run, you can use the token to poll the status of the validation. (The token will be especially useful when executing the validation on the LiveML API; running this locally will lock up your machine until the validation completes. When executed on the LiveML API, the validation runs as a background task and can be polled asynchronously.)
Poll validation¶
Assuming that you didn't lock up your machine when you triggered the validation (or you're running from a separate Python instance), you can call get_validation_status to get the latest status of the validation, using the validation token:
project_client.get_validation_status(token)
which after some time will yield the following response:
{'token': 'd2cb2ccc-50d3-4368-a75d-3c78842e3fbe',
'status': 'COMPLETED',
'result': 'SUCCESS',
'problem_list': ''}
The problem_list entry of the response contains details of any issues observed during the validation; an empty value is a good sign. Any validation that would potentially cause issues with model training will return a FAILED result.
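If you're scripting this step, a simple polling loop over get_validation_status looks like the sketch below (the loop itself is not part of LiveML).
import time

# Poll until the validation finishes, then fail loudly on a FAILED result
while True:
    status = project_client.get_validation_status(token)
    if status["status"] == "COMPLETED":
        break
    time.sleep(5)

if status["result"] == "FAILED":
    raise RuntimeError(f"Validation failed: {status['problem_list']}")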
Check individual file validation¶
You can also now check the individual files for their validation status:
[{'project_id': 'P555555',
'filename': 'features.csv',
'tag': 'features',
'intra_validation': 'PASS',
'error_list': 'warning: the following 229 Features columns have a number of missing rows over the missingness threshold: <LIST OF VARIABLES REDACTED FOR DOCUMENTATION>',
'file_size': '580879',
'last_modified': '2022-02-10 18:58:08+00',
'created_at': '2022-02-10T18:58:11.785165+00:00'},
{'project_id': 'P555555',
'filename': 'outcomes.csv',
'tag': 'outcomes',
'intra_validation': 'PASS',
'error_list': 'warning: no user-defined weights detected - will use default weights, which is to try to equalize classes',
'file_size': None,
'last_modified': None,
'created_at': '2022-02-10T20:06:11.058237+00:00'},
{'project_id': 'P555555',
'filename': 'scales.csv',
'tag': 'scale',
'intra_validation': 'PASS',
'error_list': '',
'file_size': None,
'last_modified': None,
'created_at': '2022-02-10T20:51:36.277828+00:00'},
{'project_id': 'P555555',
'filename': 'categoricals.csv',
'tag': 'categorical',
'intra_validation': 'PASS',
'error_list': '',
'file_size': None,
'last_modified': None,
'created_at': '2022-02-10T20:51:44.957992+00:00'}]
Notice the error list for the features file (the variable list is redacted for this documentation). There are some warnings, but none that cause the validation to fail. (To do: consider moving warnings into a separate warning list.)
Model Training¶
If your four core files have been validated successfully, you are ready to submit models for training. Model training happens in jobs, which are batches of individual models that get trained in parallel using AWS resources. The scope of a single model within a job is a specific outcome with a specific feature set using a specific hyperparameter tuning scorer. Many models can be trained simultaneously when submitted as part of a job. To submit a job, a user just needs to select the outcome (or outcomes) they want to train and the feature set (i.e. features file) to train on. Optionally, they can select specific hyperparameter tuning scorers to use, or simply train at least one model using each scorer. More on this below.
Hyperparameter tuning scorers¶
There are 33 possible tuning scorers available. The complete list is below.
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score',
'balanced_accuracy', 'completeness_score', 'explained_variance', 'f1_macro',
'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score',
'jaccard_macro', 'jaccard_micro', 'jaccard_weighted', 'max_error',
'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_gamma_deviance',
'neg_mean_squared_log_error',
'neg_root_mean_squared_error', 'normalized_mutual_info_score',
'precision_macro', 'precision_micro', 'precision_weighted',
'r2', 'recall_macro', 'recall_micro', 'recall_weighted',
'roc_auc_ovo', 'roc_auc_ovo_weighted', 'roc_auc_ovr', 'roc_auc_ovr_weighted', 'v_measure_score']
There is no a priori way of knowing which scorer is going to produce the best-performing model. As such, there is no reason not to try all of them when submitting jobs.
Training Jobs¶
The scope of a training job is a set of outcome(s) with a set of scorers on a specific feature set. For example, if we have one feature set, two outcomes, and 10 scorers that we want to submit as part of a training job, our job will consist of 2 x 10 = 20 models, each of which will train in parallel. If we used every scorer, this would yield 2 x 33 = 66 models. (There is a concept around iterations that I have not yet mentioned; we will come back to it later.) Since training happens in parallel, there are no additional timing implications to training more vs. fewer models as part of a job, so the recommendation is to try all scorers unless you have a strong reason not to.
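As a quick illustration of that arithmetic (plain Python, no LiveML calls; the outcome count and scorer subset are made up for the example):
n_outcomes = 2
chosen_scorers = ["f1_macro", "f1_weighted", "roc_auc_ovr"]  # any subset of the 33 scorers above
all_scorers = 33

print(n_outcomes * len(chosen_scorers))  # 6 models for this subset
print(n_outcomes * all_scorers)          # 66 models if every scorer is used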
Initialize your training client¶
To submit a training job, you need to initialize your training client.
from adele.modeling.liveml.job_client import TrainingJobClient
training_client = TrainingJobClient("dev", local=True)