Data Lakehouse 2

Lakehouse

adele.data.lakehouse2.core.Lakehouse

Core class for working with lakehouse data. This class is meant to be extended by S3 and REST specific classes. Validations on required parameters (environment, project_id, requester_app) happen here and do not need to happen in the classes that inherit from this base class.

Parameters:

Name Type Description Default
environment str

Valid values are "dev" and "prod"

required

Attributes:

Name Type Description
project_id str

Valid projects start with R and have at least six subsequent characters.

requester_app str

Valid requester_app values are no app, madlab, pepsi, jupyter, enctest, spotlight, datamanager. Defines the app name of the incoming data request.

Examples:

>>> client = Lakehouse('dev')
>>> setattr(client, "project_id", "R211379")
>>> setattr(client, "requester_app", "enctest")
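
The validation rules described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name validation_errors and the regular expression are assumptions, and the real class may restrict the characters after "R" more tightly than "any six or more".

```python
import re

# Allowed values taken from the parameter and attribute descriptions above.
VALID_ENVIRONMENTS = {"dev", "prod"}
VALID_REQUESTER_APPS = {
    "no app", "madlab", "pepsi", "jupyter", "enctest", "spotlight", "datamanager",
}
# Assumption: "R" followed by at least six characters of any kind.
PROJECT_ID_PATTERN = re.compile(r"R.{6,}")


def validation_errors(environment, project_id, requester_app):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if environment not in VALID_ENVIRONMENTS:
        errors.append(f"environment must be one of {sorted(VALID_ENVIRONMENTS)}")
    if not isinstance(project_id, str) or not PROJECT_ID_PATTERN.fullmatch(project_id):
        errors.append("project_id must start with 'R' followed by at least six characters")
    if requester_app not in VALID_REQUESTER_APPS:
        errors.append(f"requester_app must be one of {sorted(VALID_REQUESTER_APPS)}")
    return errors


if __name__ == "__main__":
    print(validation_errors("dev", "R211379", "enctest"))
    print(validation_errors("staging", "R21", "unknown"))
```

Collecting every problem in a list, rather than raising on the first one, lets a caller report all invalid parameters at once.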

LakehouseS3

adele.data.lakehouse2.s3.core.LakehouseS3Buckets (Lakehouse)

Lakehouse bucket connections (can only be used in datalake account)

Parameters:

Name Type Description Default
environment str

Valid values are "dev" and "prod"

required

adele.data.lakehouse2.s3.core.LakehouseS3 (LakehouseS3Boto, LakehouseS3Buckets)

Core Lakehouse S3 class, to be used only from the datalake account. This class handles setting the desired version of an encoding and listing the versions available for it. If project_id changes at any point, the versions update to reflect the new project_id.

Parameters:

Name Type Description Default
environment str

Valid values are "dev" and "prod"

required

desired_version property writable

Overrides desired_version with the correct named value if it is not already set. All encoding methods use desired_version in some capacity. The property also checks whether the project the version was set for still matches the current project_id; if the project has changed, the version resets to the latest.
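
The reset behaviour can be pictured with a small stand-in class. This is only a sketch under assumptions: VersionedClient, its constructor argument, and the "latest is the largest version name" rule are hypothetical and are not part of adele; the real property may decide "latest" differently.

```python
class VersionedClient:
    """Hypothetical stand-in that mimics a desired_version tied to project_id."""

    def __init__(self, versions_by_project):
        # Mapping of project_id -> list of version names (illustrative data only).
        self._versions_by_project = versions_by_project
        self.project_id = None
        self._desired_version = None
        self._version_project = None  # project the stored version belongs to

    def _latest(self):
        versions = self._versions_by_project.get(self.project_id, [])
        # Assumption: version names begin with a timestamp, so max() is the newest.
        return max(versions) if versions else None

    @property
    def desired_version(self):
        stale = self._version_project != self.project_id
        if self._desired_version is None or stale:
            # Nothing set yet, or the project changed: fall back to the latest.
            self._desired_version = self._latest()
            self._version_project = self.project_id
        return self._desired_version

    @desired_version.setter
    def desired_version(self, value):
        self._desired_version = value
        self._version_project = self.project_id


if __name__ == "__main__":
    client = VersionedClient({
        "R211379": ["20211025112117154663_v8", "20220101000000000000_v8"],
        "R220178": ["20220223025456347540_v8"],
    })
    client.project_id = "R211379"
    client.desired_version = "20211025112117154663_v8"
    client.project_id = "R220178"
    print(client.desired_version)
```

Storing the project alongside the version is what lets the getter notice that project_id has changed since the version was chosen.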

project_versions(self)

Returns a dictionary of the unique versions available for the specified project, or an empty list if none are available.

Returns:

Type Description
Versions

The versions found for a given project, which includes information on version name, version time, version size, and whether there is both a data file and dictionary file associated with that version (when data is missing it is not a usable version).
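
Because a version without a data file is not usable, a caller may want to filter the returned versions before picking one. The sketch below assumes a dictionary keyed by version name whose values carry "time", "size", "has_data" and "has_dictionary" fields; these field names are hypothetical, since the actual layout of Versions is not shown here.

```python
def usable_versions(versions):
    """Keep only versions that have both a data file and a dictionary file."""
    return {
        name: info
        for name, info in versions.items()
        if info.get("has_data") and info.get("has_dictionary")
    }


def latest_usable_version(versions):
    """Return the name of the newest usable version, or None if there is none."""
    usable = usable_versions(versions)
    if not usable:
        return None
    return max(usable, key=lambda name: usable[name]["time"])


if __name__ == "__main__":
    # Illustrative data only; not real project versions.
    example = {
        "20220101_v8": {"time": "2022-01-01", "size": 10, "has_data": True, "has_dictionary": True},
        "20220201_v8": {"time": "2022-02-01", "size": 12, "has_data": False, "has_dictionary": True},
    }
    print(latest_usable_version(example))
```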

LakehouseS3Trigger

adele.data.lakehouse2.s3.trigger.LakehouseS3Trigger (LakehouseS3)

trigger_r_on_demand(self)

Triggers the R project encoding process.

Returns:

Type Description
trigger status

A tuple containing a token if the encoding was successfully triggered and a failure message if not.

status_r_on_demand(self, token=None)

Retrieves the status of an on-demand R project encoding. Each encoding has a specific token whose location in S3 reveals the status of the encoding.

Parameters:

Name Type Description Default
token str

Optional token that overrides the token currently set.

None

Exceptions:

Type Description
ValidationError

Pydantic validation error on token

TokenMissing

Missing token
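
Since an on-demand encoding is triggered first and checked afterwards, callers will usually poll the status. The helper below is a generic sketch: poll_until is a hypothetical name, and because the status values returned by status_r_on_demand are not documented here, the caller supplies both the status check and the "finished" test.

```python
import time


def poll_until(check_status, is_finished, attempts=10, delay_seconds=30.0):
    """Call check_status() until is_finished(status) is true or attempts run out."""
    status = None
    for attempt in range(attempts):
        status = check_status()
        if is_finished(status):
            return status
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before the next check
    raise TimeoutError(
        f"encoding not finished after {attempts} checks; last status: {status!r}"
    )


if __name__ == "__main__":
    # Stand-in for a real status call; the status strings are made up.
    fake_statuses = iter(["pending", "pending", "done"])
    result = poll_until(
        lambda: next(fake_statuses),
        lambda status: status == "done",
        attempts=5,
        delay_seconds=0,
    )
    print(result)
```

With a real client, check_status could be a lambda wrapping s3.status_r_on_demand(token), with is_finished written against whatever status values that call actually returns.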

LakehouseS3Text

adele.data.lakehouse2.s3.text.LakehouseS3Text (LakehouseS3)

Class to retrieve 'text' data from Athena related to a project. Text data is not 'encoded' so is handled separately.

get_text(self)

Retrieves any text variables (i.e., open-ended questions) for a given project. This is essentially a wrapper around the other hidden methods of this class.

LakehouseS3DataDictionary

adele.data.lakehouse2.s3.data_dictionary.core.LakehouseS3DataDictionary (LakehouseS3)

Core Data / Dictionary class for retrieving encoded data and dictionaries. This class handles reading the data and dictionary files from S3 for a given version.

LakehouseS3Compare

adele.data.lakehouse2.s3.data_dictionary.compare.LakehouseS3Compare (LakehouseS3DataDictionary)

Class that is responsible for comparing an encoding version to the latest encoding version within the same project.

Examples:

>>> s3 = LakehouseS3Compare('dev')
>>> setattr(s3, 'project_id', 'R211779')
>>> setattr(s3, 'requester_app', 'enctest')
>>> setattr(s3, 'desired_version', '20220302074945644340_v8')
>>> s3.compare_versions()

compare_versions(self)

Compares the desired version with the latest version of the given project.

comparer(reference_dictionary, comparator_dictionary, compare_column='friendly_name') staticmethod

Returns a dictionary denoting differences between dictionaries from the version requested and the latest version. If differences were detected, the dictionary returns a summary of additions, deletions, and changes between the versions while providing row level details of each aggregate.

Parameters:

Name Type Description Default
reference_dictionary EncodedDictionary

Dictionary used to compare

required
comparator_dictionary EncodedDictionary

Dictionary used to compare against

required
compare_column str

Basis for comparison

'friendly_name'

Returns:

Type Description
message (str)

A statement denoting whether differences were detected.

latest_version (dict)

File name and size of the latest version.

result (dict)

Row-level detail of additions, deletions, and changes in records.
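
The kind of comparison comparer performs can be sketched on plain lists of rows. This is an illustration under assumptions, not the library's code: diff_dictionaries is a hypothetical name, each dictionary is represented as a list of dicts rather than an EncodedDictionary, and "additions" is taken to mean rows present only in the comparator.

```python
def diff_dictionaries(reference_rows, comparator_rows, compare_column="friendly_name"):
    """Summarise additions, deletions and changes between two lists of rows."""
    reference = {row[compare_column]: row for row in reference_rows}
    comparator = {row[compare_column]: row for row in comparator_rows}

    # Rows only in the comparator, rows only in the reference, rows that differ.
    additions = [comparator[key] for key in comparator if key not in reference]
    deletions = [reference[key] for key in reference if key not in comparator]
    changes = [
        {"key": key, "reference": reference[key], "comparator": comparator[key]}
        for key in reference
        if key in comparator and reference[key] != comparator[key]
    ]

    detected = bool(additions or deletions or changes)
    return {
        "message": "Differences detected." if detected else "No differences detected.",
        "summary": {
            "additions": len(additions),
            "deletions": len(deletions),
            "changes": len(changes),
        },
        "result": {"additions": additions, "deletions": deletions, "changes": changes},
    }


if __name__ == "__main__":
    requested = [
        {"friendly_name": "age", "label": "Age"},
        {"friendly_name": "sex", "label": "Sex"},
    ]
    latest = [
        {"friendly_name": "age", "label": "Age group"},
        {"friendly_name": "region", "label": "Region"},
    ]
    print(diff_dictionaries(requested, latest)["summary"])
```

Keying both sides on compare_column is what makes the comparison row-level: two rows are treated as the same record only when they share that column's value.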

LakehouseS3Encoded

adele.data.lakehouse2.s3.data_dictionary.encoded.LakehouseS3Encoded (LakehouseS3DataDictionary)

Class that handles the delivery of encoded data in various formats.

Examples:

>>> import json
>>> import pandas as pd
>>> from adele.data.lakehouse2.s3.data_dictionary.encoded import LakehouseS3Encoded
>>> s3 = LakehouseS3Encoded('dev')

Example 1.

>>> setattr(s3, "project_id", 'R220178')
>>> setattr(s3, "requester_app", env.get("requester_app"))
>>> setattr(s3, "extension", '.zsav')
>>> encoded = s3.get_encoded()

Example 2.

>>> setattr(s3, "download", True)
>>> encoded = s3.get_encoded()
>>> data = pd.DataFrame(json.loads(encoded["data"]))
>>> dictionary = pd.DataFrame(json.loads(encoded["dictionary"]))

Change the version by setting the desired_version attribute:

>>> setattr(s3, "desired_version", "20220223025456347540_v8")
>>> encoded = s3.get_encoded()

get_encoded(self)

Retrieves encoded data for a given project version.

Exceptions:

Type Description
ValidationError

Pydantic validation error on EncodedRequest

ProjectIdMissing

Missing Project Id

LakehouseS3Instructions

adele.data.lakehouse2.s3.data_dictionary.instructions.LakehouseS3Instructions (LakehouseS3DataDictionary)

Class to create "instructed" (i.e. modified) versions of the files.

Creates the instructions.json, data_instructed.csv and dictionary_instructed.csv files. Once these are created, apply_instructions = True can be used to read and return the instructed data.

Modification instructions have the following format, provided as keys to a dictionary:

overwrite: bool = False
hide_variables: Optional[List[StrictStr]] = []
hide_categories: Optional[List[Dict[StrictStr, List[StrictStr]]]] = []
variable_label_mapping: Optional[List[Dict[StrictStr, StrictStr]]] = []
category_mapping: Optional[List[Dict[StrictStr, List[Dict[StrictStr, StrictStr]]]]] = []
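
To make the nesting of these keys concrete, the sketch below builds an example payload and checks its shape with the standard library only. It is an illustration, not the Instructions model itself: check_instructions is a hypothetical helper, the variable names and labels are made up, and the meaning attached to each mapping (variable to categories, old value to new label) is an assumption.

```python
def is_str(value):
    return isinstance(value, str)


def is_list_of(value, check_item):
    return isinstance(value, list) and all(check_item(item) for item in value)


def is_str_map(value, check_value):
    return isinstance(value, dict) and all(
        isinstance(key, str) and check_value(item) for key, item in value.items()
    )


def check_instructions(payload):
    """Return the keys whose values do not match the documented type shape."""
    checks = {
        "overwrite": lambda v: isinstance(v, bool),
        "hide_variables": lambda v: is_list_of(v, is_str),
        "hide_categories": lambda v: is_list_of(
            v, lambda d: is_str_map(d, lambda x: is_list_of(x, is_str))
        ),
        "variable_label_mapping": lambda v: is_list_of(
            v, lambda d: is_str_map(d, is_str)
        ),
        "category_mapping": lambda v: is_list_of(
            v,
            lambda d: is_str_map(
                d, lambda x: is_list_of(x, lambda m: is_str_map(m, is_str))
            ),
        ),
    }
    return [key for key, check in checks.items() if key in payload and not check(payload[key])]


# Illustrative payload; variable names, categories and labels are invented.
example_payload = {
    "overwrite": True,
    "hide_variables": ["S31_001", "S31_002"],
    "hide_categories": [{"S31_003": ["98", "99"]}],
    "variable_label_mapping": [{"S31_004": "Preferred brand"}],
    "category_mapping": [{"S31_004": [{"1": "Brand A"}, {"2": "Brand B"}]}],
}

if __name__ == "__main__":
    print(check_instructions(example_payload))
```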

Examples:

>>> from adele.data.lakehouse2.s3.data_dictionary.instructions import LakehouseS3Instructions
>>> from adele.data.lakehouse2.schemas.Instructions import Instructions
>>> s3 = LakehouseS3Instructions('dev')
>>> instructions = Instructions()
>>> instructions.dict().items()
>>> instructions.overwrite = True
>>> instructions.hide_variables = ["S31_001","S31_002"]
>>> s3.create_instructions(instructions)

create_instructions(self, instructions)

Validates and creates the project instructions file.

Parameters:

Name Type Description Default
instructions Instructions

Must be a validated Instructions Model Object

required

Exceptions:

Type Description
ValidationError

Pydantic validation error on Instructions

ValueError

Raised if the overwrite boolean in Instructions is invalid

LakehouseS3Similarity

adele.data.lakehouse2.s3.data_dictionary.similarity.LakehouseS3Similarity (LakehouseS3DataDictionary)

Class that contains the logic to run a similarity check on the data in a given version of a given project. Settings such as the z threshold and the distance function are optional inputs. It uses the webauthor API to obtain the list of questions in the project.

Examples:

>>> s3 = LakehouseS3Similarity("dev")
>>> setattr(s3,"project_id","R211379")
>>> s3.desired_version='20211025112117154663_v8'
>>> result = s3.similarity()

similarity(self)

Analyzes the quality of the respondents in the encoded data by creating groups of similar response patterns. It first generates a similarity matrix with cosine or Euclidean distances, then forms similarity groups from the connected components of a similarity graph in which all connections with similarity beyond a given number of standard deviations from the mean are removed. Results (a .csv report and an associated plot, compressed in a zip file) are returned as a presigned URL.

Exceptions:

Type Description
DataOrDictionaryMissing

Version within given project must have both data and dictionary files available.

ValidationError

Pydantic validation of the parameters.

Returns:

Type Description
str

Presigned url that contains the results zip.
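
The grouping step can be sketched with the standard library. This is a simplified illustration, not the method's implementation: similarity_groups and its inputs are hypothetical, only cosine distance is shown, and it assumes one plausible reading of the threshold, namely that an edge survives only when a pair's distance is more than z standard deviations below the mean distance.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(a, b):
    """1 - cosine similarity; treated as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def similarity_groups(responses, z_threshold=1.0):
    """Group respondents whose answer vectors are unusually close to each other."""
    ids = list(responses)
    pairs = list(combinations(ids, 2))
    if not pairs:
        return []
    distances = {
        (a, b): cosine_distance(responses[a], responses[b]) for a, b in pairs
    }
    values = list(distances.values())
    # Assumption: keep only edges well below the mean distance.
    cutoff = mean(values) - z_threshold * pstdev(values)

    neighbours = {respondent: set() for respondent in ids}
    for (a, b), distance in distances.items():
        if distance < cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components by depth-first search; singletons are not reported.
    seen, groups = set(), []
    for start in ids:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) > 1:
            groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    # Illustrative answer vectors; respondents "a" and "b" answered identically.
    answers = {"a": [1, 0, 0], "b": [1, 0, 0], "c": [0, 1, 0], "d": [0, 0, 1]}
    print(similarity_groups(answers))
```

Using connected components means that similarity is transitive within a group: if A resembles B and B resembles C, all three land in the same group even when A and C are not directly linked.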


Last update: 2022-04-21