Data Lakehouse 2
Lakehouse
adele.data.lakehouse2.core.Lakehouse
Core class for working with lakehouse data, meant to be extended by the S3-specific and REST-specific classes. The required parameters (environment, project_id, requester_app) are validated here, so classes that inherit from this base class do not need to repeat those checks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
environment | str | Valid values are "dev" and "prod" | required |
Attributes:
Name | Type | Description |
---|---|---|
project_id | str | Valid projects start with R and have at least six subsequent characters. |
requester_app | str | Valid requester_app values are |
Examples:
>>> from adele.data.lakehouse2.core import Lakehouse
>>> client = Lakehouse('dev')
>>> setattr(client, "project_id", "R211379")
>>> setattr(client, "requester_app", "enctest")
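The validation rules stated for environment and project_id can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: the function names are hypothetical, and "characters" is read here as letters or digits (the examples in this page only show digits).

```python
import re

VALID_ENVIRONMENTS = {"dev", "prod"}

# Assumption: the six or more characters after "R" are letters or digits.
PROJECT_ID_PATTERN = re.compile(r"R[A-Za-z0-9]{6,}")


def validate_environment(environment: str) -> str:
    """Return the environment unchanged if it is "dev" or "prod"."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"environment must be one of {sorted(VALID_ENVIRONMENTS)}")
    return environment


def validate_project_id(project_id: str) -> str:
    """Return the project id unchanged if it matches the assumed pattern."""
    if not PROJECT_ID_PATTERN.fullmatch(project_id):
        raise ValueError("project_id must start with R followed by at least six characters")
    return project_id
```

Keeping checks like these in the base class is what lets the S3 and REST subclasses skip them.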
LakehouseS3
adele.data.lakehouse2.s3.core.LakehouseS3Buckets (Lakehouse)
Lakehouse bucket connections (can only be used in datalake account)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
environment | str | Valid values are "dev" and "prod" | required |
adele.data.lakehouse2.s3.core.LakehouseS3 (LakehouseS3Boto, LakehouseS3Buckets)
Core Lakehouse S3 class, to be used only from the datalake account. This class handles setting the desired version of an encoding and listing the available versions of the encodings. If project_id changes at any point, the versions are updated to reflect the new project_id.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
environment | str | Valid values are "dev" and "prod" | required |
desired_version (property, writable)
Override desired_version with the correct named value if not already set. All encoding methods use desired_version in some capacity. The property also checks whether the project the version was set for still matches project_id; if it does not, the version is reset to the latest.
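One way to picture this reset behaviour is a property that remembers which project its value was set for. The sketch below is a hypothetical stand-in, not the library's implementation: `VersionTracker`, `latest_version`, and the project and version names are invented for illustration, and picking the latest version with `max` assumes version names sort chronologically.

```python
class VersionTracker:
    """Hypothetical stand-in for the desired_version bookkeeping."""

    def __init__(self, versions_by_project):
        self._versions_by_project = versions_by_project
        self.project_id = None
        self._desired_version = None
        self._version_project = None  # project the stored version belongs to

    def latest_version(self):
        # Assumption: version names start with a timestamp, so max() is the newest.
        versions = self._versions_by_project.get(self.project_id, [])
        return max(versions) if versions else None

    @property
    def desired_version(self):
        # Fall back to the latest version when nothing is set yet
        # or when project_id changed since the version was stored.
        if self._desired_version is None or self._version_project != self.project_id:
            self._desired_version = self.latest_version()
            self._version_project = self.project_id
        return self._desired_version

    @desired_version.setter
    def desired_version(self, value):
        self._desired_version = value
        self._version_project = self.project_id


# Hypothetical projects and version names, for illustration only.
tracker = VersionTracker({
    "P1": ["20220101000000000000_v1", "20220201000000000000_v1"],
    "P2": ["20220301000000000000_v1"],
})
tracker.project_id = "P1"
tracker.desired_version = "20220101000000000000_v1"  # pin an older version
tracker.project_id = "P2"  # switching projects discards the pinned version
```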
project_versions(self)
Returns a dictionary of the unique versions available for the specified project, or an empty list if none are available.
Returns:
Type | Description |
---|---|
Versions | The versions found for a given project, which includes information on version name, version time, version size, and whether there is both a data file and dictionary file associated with that version (when data is missing it is not a usable version). |
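The usability rule in the description above can be illustrated with a small record type. The field names and the `VersionRecord` class are hypothetical, chosen only to mirror the listed pieces of information, and this sketch treats a version as usable only when both files are present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical shape for one entry of the returned versions."""

    name: str
    time: str
    size: int
    has_data: bool
    has_dictionary: bool

    @property
    def usable(self) -> bool:
        # Assumption: both the data file and the dictionary file are needed.
        return self.has_data and self.has_dictionary


def usable_version_names(records):
    """Return the names of the records that have both files."""
    return [record.name for record in records if record.usable]
```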
LakehouseS3Trigger
adele.data.lakehouse2.s3.trigger.LakehouseS3Trigger (LakehouseS3)
trigger_r_on_demand(self)
Triggers the R project encoding process.
Returns:
Type | Description |
---|---|
trigger status | A tuple containing a token if the encoding was successfully triggered and a failure message if not. |
status_r_on_demand(self, token=None)
Retrieves the status of an on_demand_r_project_encoding. Each encoding has a specific token whose location in S3 reveals the status of the encoding itself.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | Optional token that overrides the currently set token | None |
Exceptions:
Type | Description |
---|---|
ValidationError | Pydantic validation error on token |
TokenMissing | Missing token |
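The token handling described for status_r_on_demand (an explicit token overrides the one already set, and a missing token is an error) might look roughly like the following. `resolve_token` and this local `TokenMissing` class are hypothetical stand-ins written for illustration; the real exception lives in the library.

```python
from typing import Optional


class TokenMissing(Exception):
    """Hypothetical stand-in for the TokenMissing exception named above."""


def resolve_token(stored_token: Optional[str] = None, token: Optional[str] = None) -> str:
    """Prefer the explicit token, fall back to the stored one, fail if neither exists."""
    chosen = token if token is not None else stored_token
    if not chosen:
        raise TokenMissing("No token was passed and none is set")
    return chosen
```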
LakehouseS3Text
adele.data.lakehouse2.s3.text.LakehouseS3Text (LakehouseS3)
Class to retrieve 'text' data from Athena related to a project. Text data is not 'encoded' so is handled separately.
get_text(self)
Retrieves any text variables (i.e., open-ended questions) for a given project. This is essentially a wrapper around the other hidden methods of this class.
LakehouseS3DataDictionary
adele.data.lakehouse2.s3.data_dictionary.core.LakehouseS3DataDictionary (LakehouseS3)
Core Data / Dictionary class for retrieving encoded data and its dictionary. This class handles reading the data and dictionary files from S3 for a given version.
LakehouseS3Compare
adele.data.lakehouse2.s3.data_dictionary.compare.LakehouseS3Compare (LakehouseS3DataDictionary)
Class that is responsible for comparing an encoding version to the latest encoding version within the same project.
Examples:
>>> from adele.data.lakehouse2.s3.data_dictionary.compare import LakehouseS3Compare
>>> s3 = LakehouseS3Compare('dev')
>>> setattr(s3, 'project_id', 'R211779')
>>> setattr(s3, 'requester_app', 'enctest')
>>> setattr(s3, 'desired_version', '20220302074945644340_v8')
>>> s3.compare_versions()
compare_versions(self)
Compares the desired version with the latest version of the given project.
comparer(reference_dictionary, comparator_dictionary, compare_column='friendly_name') (staticmethod)
Returns a dictionary denoting differences between dictionaries from the version requested and the latest version. If differences were detected, the dictionary returns a summary of additions, deletions, and changes between the versions while providing row level details of each aggregate.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reference_dictionary | EncodedDictionary | Dictionary used to compare | required |
comparator_dictionary | EncodedDictionary | Dictionary used to compare against | required |
compare_column | str | Basis for comparison | 'friendly_name' |
Returns:
Type | Description |
---|---|
message (str) | A statement denoting if differences were detected. |
latest_version (dict) | File name and size of latest version. |
result (dict) | Row-level detail of additions, deletions, and changes in records. |
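To make the additions / deletions / changes idea concrete, here is a minimal sketch that compares two lists of row dictionaries keyed on a compare column. It does not use the library's EncodedDictionary type; `compare_records` is a hypothetical helper, and treating rows that exist only in the comparator as "additions" is an assumption about direction.

```python
def compare_records(reference, comparator, compare_column="friendly_name"):
    """Summarise row-level differences between two lists of row dicts."""
    ref = {row[compare_column]: row for row in reference}
    comp = {row[compare_column]: row for row in comparator}

    # Assumption: "additions" are rows present only in the comparator.
    additions = [comp[key] for key in comp if key not in ref]
    deletions = [ref[key] for key in ref if key not in comp]
    changes = [
        {"key": key, "reference": ref[key], "comparator": comp[key]}
        for key in ref
        if key in comp and ref[key] != comp[key]
    ]

    differences = bool(additions or deletions or changes)
    return {
        "message": "Differences detected" if differences else "No differences detected",
        "summary": {
            "additions": len(additions),
            "deletions": len(deletions),
            "changes": len(changes),
        },
        "result": {"additions": additions, "deletions": deletions, "changes": changes},
    }
```

Keying both sides on the compare column first keeps each lookup constant-time, so the comparison stays linear in the number of rows.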
LakehouseS3Encoded
adele.data.lakehouse2.s3.data_dictionary.encoded.LakehouseS3Encoded (LakehouseS3DataDictionary)
Class that handles the delivery of encoded data in various formats.
Examples:
>>> import json
>>> import pandas as pd
>>> from adele.data.lakehouse2.s3.data_dictionary.encoded import LakehouseS3Encoded
>>> s3 = LakehouseS3Encoded('dev')
Example 1.
>>> setattr(s3, "project_id", 'R220178')
>>> # env is assumed to be a mapping of settings defined elsewhere
>>> setattr(s3, "requester_app", env.get("requester_app"))
>>> setattr(s3, "extension", '.zsav')
>>> encoded = s3.get_encoded()
Example 2.
>>> setattr(s3, "download", True)
>>> encoded = s3.get_encoded()
>>> data = pd.DataFrame(json.loads(encoded["data"]))
>>> dictionary = pd.DataFrame(json.loads(encoded["dictionary"]))
Change version by setting desired_version attribute
>>> setattr(s3, "desired_version", "20220223025456347540_v8")
>>> encoded = s3.get_encoded()
get_encoded(self)
Retrieves encoded data for a given project version.
Exceptions:
Type | Description |
---|---|
ValidationError | Pydantic validation error on EncodedRequest |
ProjectIdMissing | Missing Project Id |
LakehouseS3Instructions
adele.data.lakehouse2.s3.data_dictionary.instructions.LakehouseS3Instructions (LakehouseS3DataDictionary)
Class to create "instructed" (i.e. modified) versions of the files.
Creates the instructions.json, data_instructed.csv and dictionary_instructed.csv files. Once these are created, apply_instructions = True can be used to read and return the instructed data.
Modification instructions have the following format, provided as keys to a dictionary:
overwrite: bool = False
hide_variables: Optional[List[StrictStr]] = []
hide_categories: Optional[List[Dict[StrictStr, List[StrictStr]]]] = []
variable_label_mapping: Optional[List[Dict[StrictStr, StrictStr]]] = []
category_mapping: Optional[List[Dict[StrictStr, List[Dict[StrictStr, StrictStr]]]]] = []
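As an illustration of how the nested types above fit together, the following builds a plain dictionary with the same keys and serialises it to JSON. The key semantics are assumptions (variable names as dictionary keys, category codes and labels as values), and every value other than the two hide_variables entries is made up for the example.

```python
import json

instructions = {
    "overwrite": True,
    "hide_variables": ["S31_001", "S31_002"],
    # Assumed shape: variable name -> category codes to hide.
    "hide_categories": [{"S31_003": ["98", "99"]}],
    # Assumed shape: variable name -> replacement label.
    "variable_label_mapping": [{"S31_004": "Overall satisfaction"}],
    # Assumed shape: variable name -> list of {category code: replacement label}.
    "category_mapping": [{"S31_004": [{"1": "Satisfied"}, {"2": "Dissatisfied"}]}],
}

# Every key and value is a string, list, dict, or bool, so the structure
# round-trips through JSON unchanged.
payload = json.dumps(instructions, indent=2)
```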
Examples:
>>> from adele.data.lakehouse2.s3.data_dictionary.instructions import LakehouseS3Instructions
>>> from adele.data.lakehouse2.schemas.Instructions import Instructions
>>> s3 = LakehouseS3Instructions('dev')
>>> instructions = Instructions()
>>> instructions.dict().items()
>>> instructions.overwrite = True
>>> instructions.hide_variables = ["S31_001","S31_002"]
>>> s3.create_instructions(instructions)
create_instructions(self, instructions)
Validates and creates the project instructions file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instructions | Instructions | Must be a validated Instructions model object | required |
Exceptions:
Type | Description |
---|---|
ValidationError | Pydantic validation error on Instructions |
ValueError | Raised if the overwrite boolean in Instructions is invalid |
LakehouseS3Similarity
adele.data.lakehouse2.s3.data_dictionary.similarity.LakehouseS3Similarity (LakehouseS3DataDictionary)
Class that contains the logic to run a similarity check on the data of a given version within a given project. Metadata such as the z threshold and the distance function are optional inputs. It uses the webauthor API to obtain the list of questions in the project.
Examples:
>>> s3 = LakehouseS3Similarity("dev")
>>> setattr(s3,"project_id","R211379")
>>> s3.desired_version='20211025112117154663_v8'
>>> result = s3.similarity()
similarity(self)
Analyzes the quality of the respondents in the encoded data by creating groups of similar response patterns. It first generates a similarity matrix with cosine or Euclidean distances, then forms similarity groups from the connected components of a similarity graph in which all connections with similarity beyond a given number of standard deviations from the mean are removed. Results (a .csv report and an associated plot, compressed in a zip file) are returned as a presigned URL.
Exceptions:
Type | Description |
---|---|
DataOrDictionaryMissing | Version within given project must have both data and dictionary files available. |
ValidationError | Pydantic validation of the parameters. |
Returns:
Type | Description |
---|---|
str | Presigned URL that contains the results zip. |
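The grouping idea behind similarity() (pairwise similarity, a cutoff expressed in standard deviations from the mean, then connected components) can be sketched with the standard library alone. This is not the library's implementation: the function names are hypothetical, only cosine similarity is shown, and the direction of the cutoff is an assumption. Here an edge is kept only when its similarity exceeds the mean by more than `z_threshold` standard deviations.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_groups(responses, z_threshold=1.0):
    """Group respondent ids whose response vectors are unusually similar."""
    ids = list(responses)
    sims = {
        (a, b): cosine_similarity(responses[a], responses[b])
        for a, b in combinations(ids, 2)
    }
    if not sims:
        return [[i] for i in ids]

    # Assumed cutoff: mean plus z_threshold population standard deviations.
    values = list(sims.values())
    cutoff = mean(values) + z_threshold * pstdev(values)

    # Keep only the edges above the cutoff.
    neighbours = {i: set() for i in ids}
    for (a, b), sim in sims.items():
        if sim > cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components via an iterative depth-first search.
    seen, groups = set(), []
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups
```

In this sketch, groups with more than one member are the respondents whose answer patterns are unusually close to each other.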