Modules¶

fifeforspark.base_modelers module¶

fifeforspark.lgb_modelers module¶

fifeforspark.gbt_modelers module¶

fifeforspark.processors module¶

class fifeforspark.processors.DataProcessor(config={}, data=None)¶

Bases: object

Prepare data by identifying features as degenerate or categorical.

check_column_consistency(colname: str) → None¶

Assert column exists, has no missing values, and is not constant.

Parameters: colname – The name of the column to check
Returns: None

is_categorical(col: str) → bool¶

Determine if the given feature should be processed as categorical, as opposed to numeric.

Parameters: col – The column to check
Returns: Boolean value for whether the column is categorical

is_degenerate(colname: str) → bool¶

Determine if a feature is constant or has too many missing values.

Parameters: colname – The column/feature to check
Returns: Boolean value for whether the column is degenerate
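
The degeneracy test described above can be sketched in plain Python. This is only an illustration, not the fifeforspark implementation: the function name, the `max_null_share` threshold, and the choice to count a missing value as its own distinct value are all assumptions.

```python
from typing import Hashable, Optional, Sequence


def is_degenerate_sketch(
    values: Sequence[Optional[Hashable]],
    max_null_share: float = 0.999,  # hypothetical threshold, not taken from the library
) -> bool:
    """Return True if a column is mostly missing or holds fewer than two distinct values."""
    if not values:
        # An empty column carries no information, so treat it as degenerate.
        return True
    null_share = sum(value is None for value in values) / len(values)
    if null_share >= max_null_share:
        return True
    # Assumption: None counts as a distinct value when testing for constancy.
    return len(set(values)) < 2
```

With these assumptions, a column such as `[1, 1, 1]` is degenerate, while `[1, 2, None]` is not.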

class fifeforspark.processors.PanelDataProcessor(config: Union[None, dict] = {}, data: Union[None, pyspark.sql.DataFrame] = None, shuffle_parts=200)¶

Bases: fifeforspark.processors.DataProcessor

Ready panel data for modelling.

config¶

User-provided configuration parameters.

Type: dict

data¶

Processed panel data.

Type: pyspark.sql.DataFrame

raw_subset¶

An unprocessed sample from the final period of data. Useful for displaying meaningful values in SHAP plots.

Type: pd.core.frame.DataFrame

categorical_maps¶

For each categorical feature, a map from each unique value to a whole number.

Type: dict
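
As a rough picture of what one such map might look like, the sketch below assigns consecutive whole numbers to the distinct non-missing values of one feature. The function name and the sort-by-string ordering are assumptions made for the example; the library may order or encode values differently.

```python
from typing import Dict, Hashable, Iterable, Optional


def build_categorical_map(values: Iterable[Optional[Hashable]]) -> Dict[Hashable, int]:
    """Map each distinct non-missing value to a whole number, starting at 0."""
    # Sort by string form so the codes are reproducible across runs (an assumption).
    distinct = sorted({value for value in values if value is not None}, key=str)
    return {value: code for code, value in enumerate(distinct)}
```

A dictionary keyed by feature name, with one such map per categorical feature, would match the description above.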

numeric_ranges¶

For each numeric feature, the maximum and minimum values in the training set.

Type: pd.core.frame.DataFrame
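
The point of restricting the range to the training set can be shown with a small plain-Python sketch. The `_validation` flag name, the row-as-dict layout, and the function name are hypothetical and chosen only for the example.

```python
from typing import Dict, List


def numeric_range(rows: List[dict], feature: str, split_flag: str = "_validation") -> Dict[str, float]:
    """Return the minimum and maximum of a feature over training rows only."""
    # Skip validation rows and missing values so the range reflects training data alone.
    train_values = [
        row[feature]
        for row in rows
        if not row[split_flag] and row[feature] is not None
    ]
    return {"minimum": min(train_values), "maximum": max(train_values)}
```

Values seen only in validation rows do not widen the range under this sketch.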

build_processed_data()¶

Clean, augment, and store a panel dataset and related information.

Returns: Processed data

build_reserved_cols() → pyspark.sql.DataFrame¶

Add data split and outcome-related columns to the data.

Returns: Spark DataFrame with reserved columns added

check_panel_consistency() → None¶

Ensure observations have unique individual-period combinations.

Returns: None
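
The uniqueness condition can be illustrated without Spark. The sketch below is not the library's code; the function name and the row-as-dict layout are assumptions. It reports every individual-period pair that appears more than once.

```python
from collections import Counter
from typing import List, Tuple


def find_duplicate_keys(rows: List[dict], individual_key: str, period_key: str) -> List[Tuple]:
    """Return the (individual, period) pairs that occur in more than one row."""
    counts = Counter((row[individual_key], row[period_key]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)
```

A consistent panel yields an empty list; a consistency check could raise an error whenever the list is non-empty.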

flag_validation_individuals() → pyspark.sql.DataFrame¶

Flag observations from a random share of individuals.

Returns: Spark DataFrame with flagged validation individuals
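
Flagging by individual rather than by observation keeps all of a person's periods on the same side of the split. A minimal sketch of that idea follows, assuming rows stored as dicts, a hypothetical `_validation` output column, and a seeded random sample of identifiers; the library may select individuals differently.

```python
import random
from typing import List


def flag_validation_individuals_sketch(
    rows: List[dict],
    individual_key: str,
    validation_share: float = 0.25,  # hypothetical default
    seed: int = 9999,
) -> List[dict]:
    """Flag every observation belonging to a random share of individuals."""
    # Sort the identifiers so the seeded sample is reproducible.
    ids = sorted({row[individual_key] for row in rows})
    rng = random.Random(seed)
    n_flagged = round(len(ids) * validation_share)
    flagged = set(rng.sample(ids, n_flagged))
    # Return copies of the rows with the flag added; the input is left unchanged.
    return [{**row, "_validation": row[individual_key] in flagged} for row in rows]
```

Because the sample is drawn over identifiers, no individual ends up with some periods flagged and others not.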

process_all_columns() → pyspark.sql.DataFrame¶

Apply data cleaning functions to all data columns.

Returns: Spark DataFrame with processed columns

process_single_column(colname) → pyspark.sql.DataFrame¶

Apply data cleaning functions to a single data column.

Parameters: colname – The column to process
Returns: Spark DataFrame with the processed column
Return type: pyspark.sql.DataFrame

sort_panel_data() → pyspark.sql.DataFrame¶

Sort the data by individual, then by period.

Returns: Sorted panel data

fifeforspark.utils module¶

class fifeforspark.utils.FIFEArgParser¶

Bases: argparse.ArgumentParser

Argument parser for the FIFE command-line interface.

fifeforspark.utils.create_example_data(n_persons: int = 8192, n_periods: int = 20, seed_value: int = 9999) → pyspark.sql.DataFrame¶

Fabricate an unbalanced panel dataset suitable as FIFE input.

Parameters

n_persons – the number of people to be in the dataset
n_periods – the number of periods to be in the dataset
seed_value – seed for random value generation

Returns

Spark DataFrame with example data

fifeforspark.utils.create_example_data_spark(n_persons: int = 1000, n_periods: int = 20) → pyspark.sql.DataFrame¶

fifeforspark.utils.import_data_file(path: str = 'Input Data') → pyspark.sql.DataFrame¶

Read data into a distributed Spark DataFrame.