
LiveML

LiveML Core

adele.modeling.liveml2.core.LiveMLS3Boto

LiveML Boto3 connections (can only be used in datalake account)

adele.modeling.liveml2.core.LiveMLCore (LiveMLS3Boto)

Core LiveML class for all of the other clients to inherit from

Parameters:

Name Type Description Default
environment str

Distinguishes between different deployment environments. Currently only dev (development) available. Defaults to None.

None
local bool

Whether this client is being deployed on a local machine (which requires a bastion connection to the RDS instance) or not.

False

App

adele.modeling.liveml2.classes.app.core.App (LiveMLCore)

Core App-level class that deals with anything at the application level. This includes listing all available projects and managing the benchmark of models.

Parameters:

Name Type Description Default
environment str

Distinguishes between different deployment environments. Currently only dev (development) available. Defaults to None.

required
local bool

Whether this client is being deployed on a local machine (which requires a bastion connection to the RDS instance) or not.

required

Examples:

Example 1, get the list of all available projects and their info

>>> client = AppClient("dev")
>>> result = client.list_projects()

Example 2, get the list of projects and their info given a list of project ids

>>> client = AppClient("dev")
>>> result = client.list_projects(['P123456','P555555'])

adele.modeling.liveml2.classes.app.benchmarks.AppBenchmarks (App)

Benchmark class

Examples:

>>> client = AppBenchmarks(
        "dev",
        local=True)
>>> client.retrieve_benchmark()
>>> client.get_scorers()
retrieve_benchmark(self, criteria='neutral')

Retrieves all models that have been marked as part of the benchmark, returning the information of interest reformatted to fit our explorations. Models get marked as 'benchmark' when running get_best_model within a project, to avoid storing those that perform worse than no-skill classifiers.

Parameters:

Name Type Description Default
criteria str

Scoring criteria, can prioritize precision, recall or none. Defaults to 'neutral'.

'neutral'

Returns:

Type Description
DataFrame

dataframe containing all properly-formatted models.

calculate_benchmark_model(self, model, criteria='neutral', by_metric=True)

Retrieves the benchmarked models, ranks them and calculates evaluation metrics (avg_pct) for the given model uuid.

Parameters:

Name Type Description Default
model str

model uuid of the model we are interested in.

required
criteria str

Scoring criteria, can prioritize precision, recall or none. Defaults to 'neutral'.

'neutral'
by_metric bool

Defaults to True.

True

Returns:

Type Description
dict

dictionary with all the relevant evaluation metrics, including our own avg_pct.
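The avg_pct metric referenced throughout this benchmark client appears to be an average-percentile score of a model against the pool of benchmarked models. As an illustration only, a minimal sketch of what an average-percentile calculation could look like (the helper names percentile_rank and avg_pct here are hypothetical, not the library's API):

```python
def percentile_rank(value, pool):
    """Percentile of `value` within `pool`: the share of benchmark
    values it matches or beats, on a 0-100 scale."""
    if not pool:
        return 0.0
    return 100.0 * sum(1 for v in pool if v <= value) / len(pool)

def avg_pct(model_metrics, benchmark_metrics):
    """Average the model's percentile rank over each evaluation metric.

    model_metrics: {metric_name: value for the model of interest}
    benchmark_metrics: {metric_name: [values from benchmarked models]}
    """
    ranks = [percentile_rank(model_metrics[m], benchmark_metrics[m])
             for m in model_metrics]
    return sum(ranks) / len(ranks)
```

A model scoring at the 75th percentile on precision and the 50th on recall would get an avg_pct of 62.5 under this sketch.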

calculate_benchmark_scorers(self, criteria='neutral')

Retrieves the benchmarked models, ranks them and calculates the evaluation metric (avg_pct) descriptors for each stored scorer.

Parameters:

Name Type Description Default
criteria str

Scoring criteria, can prioritize precision, recall or none. Defaults to 'neutral'.

'neutral'

Returns:

Type Description
List

List of dictionaries with the scorers names and information.

get_scorers(self)

Returns all the available scorers and their info.

Returns:

Type Description
dict

Dict with scorers as key and their available info in the items.

adele.modeling.liveml2.classes.app.projects.AppProjects (App)

Project level class at the app level

Examples:

>>> client = AppProjects(
        "dev",
        local=True)
>>> client.get_projects()
get_projects(self, filter=None)

Returns all the available projects and their info, or only the ones specified in the filter.

Parameters:

Name Type Description Default
filter list

List of project_ids whose information we want to retrieve. Defaults to None.

None

Returns:

Type Description
list

List of projects. Each project is a dictionary with all the relevant fields: 'project_id','description',etc.

File

adele.modeling.liveml2.classes.file.core.File (LiveMLCore)

Client within the LiveML modeling framework that deals with the upload, deletion and validation of files within a given project. For a file to be validated, it must be tagged with one of the available file types or tags. File existence checks and reading the content of the file are also within the scope of the client.

Parameters:

Name Type Description Default
environment str

Distinguishes between different deployment environments. Currently only dev (development) available. Defaults to None.

None

Examples:

Example 1, upload file or get content if it exists.

>>> client = FileClient("dev")
>>> client.project_id = "P123456"
>>> client.filename='FEATURES.csv'
>>> if client.check_existence():
        print(client.dataframe)
    else:
        client.create_upload_url()
        print(client.upload_url)

Example 2, validate file.

>>> client = FileClient("dev")
>>> client.project_id = "P123456"
>>> client.filename ='FEATURES.csv'
>>> client.tag = 'features'
>>> client.validate()

Example 3, delete file.

>>> client = FileClient("dev")
>>> client.project_id = "P123456"
>>> client.filename ='FEATURES.csv'
>>> client.delete()
get_validated_pairs(self)

Given the filename of the client, get the validation status of that file and the results of its validation against other files in the same project. Validation statuses include STARTED and COMPLETED, while validation results include SUCCESS, FAIL, and a null value if it has not yet been run.

Exceptions:

Type Description
FileNotInDB

Raised if the client filename cannot be found in file_info

Returns:

Type Description
dict

complete_set is a boolean that is True if the target file and the other files it is validated against consist of at least one categorical, scale, features, and outcomes file. These files must all pass intravalidation and intervalidation against the target file in order to count towards a complete set.

missing_validations contains a list of file tags still required to make a complete set of training
files using the target file.

filelist is a list of all intervalidations run against the target file, their intervalidation 
status with the target file (verdict), their intravalidation status, their tag, and a list of 
errors found when performing intervalidation against the target file.

Example output:

{'complete_set': 'False',
 'missing_validations': ['categorical', 'scale'],
 'filelist': [{'filename': 'outcomes4.csv', 'tag': 'outcomes', 'inter_validation': 'PASS', 'inter_errors': None, 'intra_validation': 'PASS', 'intra_errors': 'warning: no user-defined weights detected - will use default weights, which is to try to equalize classes'}],
 'target_file': {'filename': 'FEATURES.csv', 'intra_validation': 'PASS', 'intra_val_errors': None},
 'final_val_status': 'PASS'}
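The complete-set logic described above can be sketched with plain set arithmetic. This is an illustrative reconstruction rather than the client's actual implementation; the required tags are those named in the description:

```python
REQUIRED_TAGS = {"categorical", "scale", "features", "outcomes"}

def summarize_validations(target_tag, filelist):
    """Given the target file's tag and the files it was inter-validated
    against, determine whether they form a complete training set and
    which tags are still missing. Only files that passed both intra-
    and inter-validation count towards the set."""
    passed = {f["tag"] for f in filelist
              if f["inter_validation"] == "PASS"
              and f["intra_validation"] == "PASS"}
    passed.add(target_tag)  # the target file contributes its own tag
    missing = sorted(REQUIRED_TAGS - passed)
    return {"complete_set": not missing, "missing_validations": missing}
```

With only a validated outcomes file paired against a features target, this reports ['categorical', 'scale'] as missing, matching the example output above.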

check_existence(self)

Checks for the presence of the file in the DB, which implies its s3 existence.

Exceptions:

Type Description
ValidationError

When an input fails its pydantic validation.

Returns:

Type Description
bool

True if it exists and False otherwise.

create_upload_url(self)

In order to upload a file, this method generates an AWS presigned url, first checking whether the file already exists and, if it does, whether the parameter upload_override has been set to True.

Exceptions:

Type Description
PresignedUrlError

The AWS client had an error creating the url.

FileExistNoOverride

File exists and override was not set to True.

ValidationError

When an input fails its pydantic validation.

create_download_url(self)

In order to download a file, this method generates an AWS presigned url, first checking that the file exists.

Exceptions:

Type Description
PresignedUrlError

The AWS client had an error creating the url.

FileExistNoOverride

File exists and override was not set to True.

ValidationError

When an input fails its pydantic validation.

preview(self, num_rows=30)

Returns a preview of the top rows of the file as a dataframe.

Parameters:

Name Type Description Default
num_rows int

Number of rows at the top of the dataframe to return. Defaults to 30.

30

Exceptions:

Type Description
FileDoesNotExist

Exception if the file does not exist in the S3 path

Returns:

Type Description
pd.DataFrame

A dataframe containing the top num_rows rows of the file.

delete(self)

Deletes the given file from the s3 bucket after checking for its existence.

Exceptions:

Type Description
FileDoesNotExist

File does not exist in the db and therefore not in s3.

ValidationError

When an input fails its pydantic validation.

adele.modeling.liveml2.classes.file.engineering.FileFeatureEngineering (File)

feature_engineering(self, new_file_name, remappings)

This endpoint creates new features files in the project directory with the new columns appended, which will then be visible through the API. The user must specify either a categorical or a numeric variable, providing a dictionary for mapping values or bins for discretizing values, respectively. The new features file is saved as a csv in S3.

Parameters:

Name Type Description Default
new_file_name str

the name of the new feature file

required
remappings list

A list of dictionaries indicating how to perform the remapping. example: [{ 'old_feature_name': 'Cluster', 'new_feature_name': 'ClusterNew', 'type': 'categorical', 'value_map': { 'TO': ['t','o'], 'PAQ': ['p','a','q'] } }, { 'old_feature_name': 'v6MonthTotalUnits', 'new_feature_name': 'v6MonthTotalUnitsNew', 'type': 'numeric', 'value_bins': [0, 30, 100, 1000, 10000] }]

required

Exceptions:

Type Description
NoFilesFound

Raised if the features file cannot be found in S3

FeatureTypeError

Raised if the indicated variable type does not match the data type in the file.

DiscretizationError

Error raised if there are issues discretizing a numeric variable

Returns:

Type Description
str

Confirmation message containing the filename of the new features file.
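The two remapping modes accepted in the remappings parameter can be sketched as below. This is a minimal illustration assuming half-open bins labeled 'lo-hi'; the helper names are hypothetical, and the real method operates on the features csv stored in S3:

```python
import bisect

def remap_categorical(values, value_map):
    """Apply a value_map like {'TO': ['t', 'o']} by inverting it into
    a per-value lookup; unmapped values pass through unchanged."""
    lookup = {old: new for new, olds in value_map.items() for old in olds}
    return [lookup.get(v, v) for v in values]

def discretize_numeric(values, bins):
    """Assign each value the label of the bin it falls into, e.g.
    value_bins [0, 30, 100] yield labels '0-30' and '30-100'."""
    labels = [f"{lo}-{hi}" for lo, hi in zip(bins, bins[1:])]
    out = []
    for v in values:
        i = bisect.bisect_right(bins, v) - 1
        i = min(max(i, 0), len(labels) - 1)  # clamp edge values into the outer bins
        out.append(labels[i])
    return out
```

Under the example remappings above, 't' and 'o' collapse into 'TO', while a count of 50 lands in the '30-100' bin.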

adele.modeling.liveml2.classes.file.validate.FileValidate (File)

validate(self, rerun=True)

For a file that has been previously tagged, it performs a series of validations.

Exceptions:

Type Description
TagNotAvailable

Tag needs to be set before running.

IdVarNotAvailable

Id variable needs to be set before running.

Job

adele.modeling.liveml2.classes.job.core.Job (LiveMLCore)

Client within the LiveML modeling framework that deals with the creation and polling of the ECS tasks that train and evaluate xgboost models for a given project and data, together with other configuration parameters.

Each task produces a model. Each set of tasks that are created from within a given project, with a given set of data, one or more scorers and one or more sets of dependent variables (each called a solution) is called a job.

Parameters:

Name Type Description Default
environment str

Distinguishes between different deployment environments. Currently only dev (development) available. Defaults to None.

None

Examples:

Example 1, submit a job.

>>> client = Job("dev")
    client.project_id = "P123456"
    client.job = my_job
    client.submit(payload)

Example 2, poll a job.

>>> client = Job("dev")
    client.project_id = "P123456"
    client.job_id = "e7818082-b400-46cb-87f9-1761ed1149f0"
    client.poll()
    results = client.polled_tasks
create(self)

One of the main methods of the job client, generates the job identifier as well as the model identifiers that correspond to each of the possible combinations of solutions (dv) and scorers. Possible solutions and modelUIDs are stored as a TaskCreationPayload object. Needs to be called before submit.

Exceptions:

Type Description
ValidationError

When an input fails its pydantic validation.

JobOrProjectIdMissing

The necessary project and job identifiers are missing.
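The job/task relationship described above (one task per combination of solution and scorer, grouped under one job identifier) can be sketched as follows. The payload shape is illustrative, not the actual TaskCreationPayload schema:

```python
import itertools
import uuid

def build_task_payload(solutions, scorers):
    """One task (and one model uuid) per (solution, scorer)
    combination, all grouped under a single job identifier."""
    job_id = str(uuid.uuid4())
    tasks = [
        {"job_id": job_id, "solution": sol, "scorer": sc,
         "model_uid": str(uuid.uuid4())}
        for sol, sc in itertools.product(solutions, scorers)
    ]
    return {"job_id": job_id, "tasks": tasks}
```

Two solutions and two scorers thus yield a job of four tasks, each producing its own model.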

submit(self)

One of the main methods of the job client, creates the necessary ECS tasks for a given job (project and set of files), one task per combination of scorer and solution and stores them in the task_status table of the database. To be used as BackgroundTask in the LiveML API, always after calling the create method.

Exceptions:

Type Description
ValidationError

When an input fails its pydantic validation.

SubmitPayloadMissing

The necessary TaskCreationPayload object which is set in create is missing.

poll(self)

One of the main methods of the job client, returns status information of the task(s) associated with a given job identifier. It leverages both the information in the database (which the ECS task updates when it starts RUNNING, gets COMPLETE or has an ERROR) and the information stored in AWS, in case of AWS infrastructure errors or errors the ECS task did not detect. The client stores in polled_tasks the list of all available tasks within the given job id together with their status.

Exceptions:

Type Description
JobNotInDatabase

The given job identifier is not present in the database.

ValidationError

When an input fails its pydantic validation.

adele.modeling.liveml2.classes.job.summary.JobSummary (Job)

all_jobs property readonly

This property getter reads the list of jobs for the project from the db.

Returns:

Type Description
pd.DataFrame

dataframe with all the updated jobs info for the project

get_summary_and_update(self)

This summary method returns the list of existing jobs for a project together with other project-level job-related metrics. To avoid race conditions when updating on the ECS training tasks, which are likely due to parallelization, it calculates all metrics for existing jobs that have not yet been marked as Finished. The only way to mark a job as Finished is for this same method to have found that all of its tasks are completed.

Exceptions:

Type Description
NoJobsAvailableForProject

Project has no jobs

Returns:

Type Description
dict

Includes a list of jobs with their metrics and some project-level job summary metrics.

Model

adele.modeling.liveml2.classes.model.core.Model (LiveMLCore)

Core Model-level class that deals with anything at the model level. This includes retrieving model information and performance metrics.

Parameters:

Name Type Description Default
model_uid str

model uuid provided by the model client

required

info(self)

Returns performance metrics, scorecard, P/R, etc.

Returns:

Type Description
dict

Data from model_results corresponding to model_id.

adele.modeling.liveml2.classes.model.package.ModelPackage (Model)

Subclass of Model that deals with creating a deployment package for a given model. Deployment packages are given as a download from a presigned url and include a report together with the model and different ways of deploying it: API in docker locally, API in docker in AWS, local python code. It uses jinja2 to fill custom templates with each model's information.

create_deployment_package(self)

Method for the creation of a deployment package.

Exceptions:

Type Description
PresignedUrlError

Error creating a presigned url.

Returns:

Type Description
str

Presigned url.

adele.modeling.liveml2.classes.model.predictions.ModelPredictions (Model)

Subclass of Model that deals with creating model predictions

prepare_predictions(self, targetfile)

Computes predictions, which are then saved as a csv, for a given compatible file. Predictions contain columns for ids, class probabilities, and softmax classification.

Parameters:

Name Type Description Default
targetfile str

The name of the file to use as inputs.

required

Exceptions:

Type Description
IVsFileWrongFormat

Format of the target file is not a proper IVs format

Returns:

Type Description
dict

background_task_payload to be passed to calculate_predictions_background

get_training_predictions(self)

Method for returning the training predictions file.

Exceptions:

Type Description
PresignedUrlError

Any kind of error related to generating a presigned url.

Returns:

Type Description
dict

A dictionary containing the name of the training predictions file as well as a presigned URL for retrieving the file

calculate_out_of_sample_predictions(self, ivs, respids)

Background task function to be run in route logic.

Parameters:

Name Type Description Default
ivs pd.DataFrame

Independent variables.

required
respids pd.Series

Respondent ids.

required

Examples:

>>> client = ModelClient(PostgresConnection, "R111111", model_id="1b6ace9d-2986-435e-9bff-8a78e3fb52d8")
>>> background_payload = client.prepare_predictions("target_file.csv")
>>> calculate_predictions(**background_payload)
get_out_of_sample_predictions(self, target_file=None)

Method for returning the out-of-sample predictions file.

Parameters:

Name Type Description Default
target_file str

Filename of out of sample features. Defaults to None.

None

Exceptions:

Type Description
PresignedUrlError

Any kind of error related to generating a presigned url.

Returns:

Type Description
dict

A dictionary containing the name of the out of sample predictions file as well as a presigned URL for retrieving the file

profile_predicted(self)

Profile a specified predicted outcome using features information, split by scales and categoricals.

!!!NOTE: This needs to be updated to profile any prediction, not just the training predictions

Exceptions:

Type Description
FileNotFound

Error if the client cannot find a particular file

Returns:

Type Description
dict

A profile table containing the profiling variables in the rows and the classes in the columns. This dataframe is converted to a dictionary for the purpose of returning the values through the API. The means/percentages are in each row.

Project

adele.modeling.liveml2.classes.project.models.ProjectModels (Project)

list_models(self, column_list=['uuid', 'created_at', 'email'])

Lists all of the model_ids for the project

run_no_skill_models(self, solution, proportions)

For the current project and given solution, generates two no-skill models: a random classifier and a majority-class classifier. The same evaluation metrics that are calculated for xgboost models are run on them, using the holdout dataset.

Parameters:

Name Type Description Default
solution str

name of the solution (outcome classes or target column in dvs)

required
proportions pd.DataFrame

dataframe with the proportions per class for given solution

required

Returns:

Type Description
list

list of models which are defined as a list containing model id, solution and performance results.
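The two no-skill baselines described above can be sketched on plain label lists; this is an illustration only, leaving out the library's holdout handling and evaluation metrics:

```python
import random
from collections import Counter

def no_skill_predictions(train_labels, n_holdout, seed=0):
    """Two no-skill baselines: a random classifier drawing classes
    uniformly from those seen in training, and a majority-class
    classifier that always predicts the most frequent training class."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    majority = Counter(train_labels).most_common(1)[0][0]
    return {
        "random": [rng.choice(classes) for _ in range(n_holdout)],
        "majority": [majority] * n_holdout,
    }
```

Any model whose metrics fail to beat both baselines would not be worth benchmarking, which is the rationale given in save_benchmark below.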

check_class_imbalance(self, solution)

For the current project and given solution, it checks whether the training data outcomes are balanced (imbalance is defined as any class having a proportion of less than 1/(3*nclasses)).

Parameters:

Name Type Description Default
solution str

name of the solution (outcome classes or target column in dvs)

required

Returns:

Type Description
bool

True if the training data is imbalanced, False otherwise.

pd.DataFrame

Proportions for each class.
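The 1/(3*nclasses) rule above can be sketched directly. This standalone version takes a list of outcome labels rather than reading the project's training data:

```python
from collections import Counter

def check_class_imbalance(outcomes):
    """Flag imbalance if any class holds less than 1/(3*nclasses) of
    the records; also return the per-class proportions."""
    counts = Counter(outcomes)
    total = len(outcomes)
    proportions = {cls: n / total for cls, n in counts.items()}
    threshold = 1 / (3 * len(counts))
    imbalanced = any(p < threshold for p in proportions.values())
    return imbalanced, proportions
```

With two classes the threshold is 1/6, so a 11-to-1 split is flagged while a 5-to-5 split is not.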

save_benchmark(self, df)

Given a DataFrame of models that includes no-skill models and skilled ones, this method compares each model with its corresponding (same-solution) no-skill models. If its avg_pct metric is better than the no-skill equivalents, it gets saved in the benchmark (the Benchmark column in the database is set to 'Yes', or 'No' on failure).

Parameters:

Name Type Description Default
df pd.DataFrame

DataFrame of models that includes no-skill models and skilled ones

required
get_best_model(self, criteria, solution=None)

Detects the model with the best performance for a project, given a precision vs recall criteria, based on average percentile. It also saves in the benchmark those models considered better than no-skill models.

Parameters:

Name Type Description Default
criteria str

scoring criteria to be used for selecting the best model

required
solution str

filters best models by solution if specified

None

Exceptions:

Type Description
SolutionNotAvailable

Given solution must be in the outcomes_info table

Returns:

Type Description
dict

dictionary containing the best model for that criteria and solution together with their performance metrics

adele.modeling.liveml2.classes.project.flags.ProjectFlags (Project)

toggle_star(self, filename)

Stars or unstars a file in the database

Parameters:

Name Type Description Default
filename str

target file for toggling star

required

Exceptions:

Type Description
UnstarrableFile

Raised if the target file is not a scale, categorical, features, or outcomes file

ExistingStarredFile

Raised if another file of the same tag and project_id has already been starred

Returns:

Type Description
str

Success message stating if the file has been starred or unstarred

set_winning_model(self, model_id, overwrite=False)

Set a model as the winner for a given project_id.

Parameters:

Name Type Description Default
model_id str

The model_id to set as the winning model

required
overwrite bool

Whether to overwrite the winning_model if one has already been chosen. Defaults to False.

False

Exceptions:

Type Description
ExistingWinningModel

Error if a winning model has been chosen for the project_id and the overwrite argument is set to False.

Examples:

>>> client = ProjectClient('P123456', 'dev', local=True)
>>> client.set_winning_model('019912be-d4bd-43aa-8cd2-23706eddc405', overwrite=True)

adele.modeling.liveml2.classes.project.stats.ProjectStats (Project)

class_counts(self, header, resolve_multiple_files=True)

Reads the class counts for a selected outcome within any file.

Parameters:

Name Type Description Default
header str

Column header to count classes

required
resolve_multiple_files bool

Whether to produce a result when the same header is found in multiple files; if so, the first file found is used.

True

Exceptions:

Type Description
NoHeaderFound

Error if the specified header cannot be found in the outcomes_info given the project_id

Returns:

Type Description
list

An array of dictionaries giving the class name and class count. As an example:

[{'class_name': 1, 'count': 202}, {'class_name': 2, 'count': 261}, {'class_name': 3, 'count': 193}, {'class_name': 4, 'count': 328}]
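The counting itself can be sketched with collections.Counter, assuming the file has already been read into a list of row dictionaries (the real method resolves the file from S3 first):

```python
from collections import Counter

def class_counts(rows, header):
    """Count class occurrences for one column, returned in the same
    shape as the API response shown above."""
    counts = Counter(row[header] for row in rows)
    return [{"class_name": cls, "count": n}
            for cls, n in sorted(counts.items())]
```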

cross_table(self, header_1, header_2, resolve_multiple_files=True)

This method creates a cross table given 2 headers for a project.

Parameters:

Name Type Description Default
header_1 str

The name of the first header

required
header_2 str

The name of the second header

required

Exceptions:

Type Description
NoFilesFound

Raised if there are no S3 files for either of the indicated headers

CrossTableError

General error for anything that comes up during cross table calculations

Returns:

Type Description
dict

Dictionary containing fields for the cross table, row percents, and column percents.
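The three parts of the return value (cross table, row percents, column percents) can be sketched as below, assuming the two columns have been read into a list of value pairs; the real method reads the headers from the project's S3 files:

```python
from collections import Counter

def cross_table(pairs):
    """Build a contingency table from (header_1, header_2) value
    pairs, plus row and column percentages."""
    counts = Counter(pairs)
    rows = sorted({a for a, _ in counts})
    cols = sorted({b for _, b in counts})
    table = {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}
    # Row percents: each cell divided by its row total.
    row_pct = {r: {c: table[r][c] / sum(table[r].values()) for c in cols}
               for r in rows}
    # Column percents: each cell divided by its column total.
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    col_pct = {r: {c: table[r][c] / col_totals[c] for c in cols}
               for r in rows}
    return {"cross_table": table,
            "row_percents": row_pct,
            "column_percents": col_pct}
```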

calculate_bias(self, training_filename, outof_filename, scale_filename=None, categ_filename=None)

This method calculates sample bias between the training data and the out-of-sample data to ascertain data quality. We use the classic t-test of means to compute the p-value under the null hypothesis for scale variables, and chi-square for categoricals. The null hypothesis for the t-test is that the means are equal, while for the chi-square it is that the frequencies are equal.

Parameters:

Name Type Description Default
training_filename str

The training data filename to compare (features file)

required
outof_filename str

The out-of-sample data filename to compare

required
scale_filename str

Chosen scale file. If None obtained automatically.

None
categ_filename str

Chosen categories file. If None obtained automatically.

None

Exceptions:

Type Description
NotBiasFileTypes

File tags are not correct

Returns:

Type Description
List

List of analyzed variables (dictionaries) for scales.

List

List of analyzed variables (dictionaries) for categories.
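For the categorical side, the chi-square statistic under the equal-frequencies null hypothesis can be sketched as below. This computes only the statistic, assuming both samples are non-empty; the actual method presumably uses a statistics library to convert it into a p-value:

```python
from collections import Counter

def chi_square_statistic(train_values, oos_values):
    """Pearson chi-square statistic comparing category frequencies in
    the training vs out-of-sample data. Expected counts come from the
    pooled frequency of each category."""
    categories = sorted(set(train_values) | set(oos_values))
    t_counts, o_counts = Counter(train_values), Counter(oos_values)
    n_t, n_o = len(train_values), len(oos_values)
    n = n_t + n_o
    stat = 0.0
    for cat in categories:
        total = t_counts[cat] + o_counts[cat]
        for observed, size in ((t_counts[cat], n_t), (o_counts[cat], n_o)):
            expected = total * size / n
            stat += (observed - expected) ** 2 / expected
    return stat
```

Identical category distributions yield a statistic of zero; the more the frequencies diverge, the larger the statistic and the smaller the resulting p-value.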

adele.modeling.liveml2.classes.project.validation.ProjectValidation (Project)

validation_token(self)

Create a uuid token to be used for validation. This can be retrieved from the client to be returned to an end-user to poll for the validation.

validate_all(self, dv_idvar=None, iv_idvar=None)

Runs intra and inter validation on all files associated with a project_id. Writes validation status and problem list into validation_info. Token must be generated and passed to this function.

Parameters:

Name Type Description Default
dv_idvar str

Column name for the dependent variable. Optional (default None)

None
iv_idvar str

Column name for the independent variable. Optional (default None)

None

Examples:

>>> client = ProjectClient('P111111', 'dev', 'liveml', local=True)
>>> iv_idvar = 'LRW_ID'
>>> dv_idvar= 'Respondent_ID'
>>> client.validate_all(dv_idvar, iv_idvar)
get_validation_status(self, token=None)

Retrieves the validation status from validation_info of a given validation task for the project. Return includes the pass/fail result, along with the list of problems.

Parameters:

Name Type Description Default
token str

Token for the validation attempt. The value in self.token takes precedence.

None

Returns:

Type Description
dict

Contains token, the progress status of the validation, the validation result, and a list of problems

Validation

adele.modeling.liveml2.classes.validation.core.Validator

Validator is the parent class that implements the generic methods for intravalidation (validation of the given file as its given type without any exterior context) and intervalidation (validation of the given file against other existing files of different tags with which certain requirements must be met).

intra_validate(self, func)

Generic method for intravalidation; checks for a prior existing validation in the db and, if one exists, requires the rerun parameter to be set to True before performing the validation again.

Parameters:

Name Type Description Default
func function

Function implemented in the children validation classes that includes the different intra validation rules that the file has to follow and returns the errors.

required

Returns:

Type Description
list(str)

List of database-related errors that may have stopped the validation from being adequately executed.

inter_validate(self, func, targettag)

Generic method for intervalidation implemented by the children Validators. Given a file and the type of file to be validated against, it gets the pairs of files to be validated, checks for a prior existing validation in the db and, if one exists, requires the rerun parameter to be set to True before performing the validation for that pair. This method might be called more than once within the children Validators if more than one kind of file is susceptible to paired validation with the given file.

Parameters:

Name Type Description Default
func function

Function implemented in the children validation classes that includes the different inter validation rules.

required
targettag str

The types of file the given file is going to be validated against.

required

Returns:

Type Description
list(str)

List of database-related errors that may have stopped the validation from being adequately executed.

adele.modeling.liveml2.classes.validation.features.FeaturesValidator (Validator)

Interfile checks would be a “joinability” check against outcomes, and “exists” checks against scales and categoricals. However, we will consider ‘features’ as the base type and those checks will only be performed against it. The intrafile check verifies minimum missingness for every column.

adele.modeling.liveml2.classes.validation.outcomes.OutcomesValidator (Validator)

Intrafile check is that every class for every outcome has a minimum base size. Joinability checks for existing matching ids in both files, using a matching threshold (project parameter). It also looks to guarantee a minimum amount of data points (project parameter).

update_outcomes_table(self, rerun=True)

Inserts headers into outcomes_info table according to their project_id and filename.

adele.modeling.liveml2.classes.validation.outofsample.OutofValidator (Validator)

Intrafile checks are the same as for features files. The interfile check is a check against scales and categoricals, in the same manner as for features files.

adele.modeling.liveml2.classes.validation.categoricals.CategValidator (Validator)

Interfile check is an “exists” check against features file. Intrafile check is that it’s a single column with regex on Vvar in the column.

adele.modeling.liveml2.classes.validation.scales.ScalesValidator (Validator)

Interfile check is an “exists” check against features file and that selected variables can be typed as int or float by python. Intrafile check is that it’s a single column with regex on Vvar in the column.

adele.modeling.liveml2.classes.validation.weights.WeightsValidator (Validator)

Intravalidation checks for no missing values, exactly two columns, the presence of id_var, and that all weight values are numbers. Intervalidation with the outcomes file checks that every record within the outcomes file has a corresponding weight. *Note: we assume the first column is the id and the second is the weights; do we have a fixed variable name?
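The intravalidation rules above can be sketched as a standalone check, assuming rows follow the assumed (id, weight) structure:

```python
def validate_weights(rows):
    """Intravalidation sketch for a weights file: exactly two columns
    (id first, weight second), no missing values, and every weight
    must parse as a number. Returns a list of error strings."""
    errors = []
    for i, row in enumerate(rows):
        if len(row) != 2:
            errors.append(f"row {i}: expected 2 columns, got {len(row)}")
            continue
        resp_id, weight = row
        if resp_id in (None, "") or weight in (None, ""):
            errors.append(f"row {i}: missing value")
            continue
        try:
            float(weight)
        except (TypeError, ValueError):
            errors.append(f"row {i}: weight {weight!r} is not a number")
    return errors
```

An empty error list means the file passes intravalidation; the intervalidation against outcomes (one weight per outcomes record) is a separate join-style check not shown here.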


Last update: 2022-06-07