Modules

fifeforspark.base_modelers module

fifeforspark.lgb_modelers module

fifeforspark.gbt_modelers module

fifeforspark.processors module

class fifeforspark.processors.DataProcessor(config={}, data=None)

Bases: object

Prepare data by identifying features as degenerate or categorical.

check_column_consistency(colname: str) → None

Assert column exists, has no missing values, and is not constant.

Parameters

colname – The name of the column to check

Returns

None
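
The three conditions asserted by this method can be illustrated without Spark. The following is a minimal sketch on a plain Python dict of lists, not the library's implementation; the helper name `check_column` and the in-memory data layout are assumptions made for illustration.

```python
from typing import Any, Dict, List


def check_column(data: Dict[str, List[Any]], colname: str) -> None:
    """Hypothetical stand-in: assert the column exists, has no
    missing values, and is not constant."""
    assert colname in data, f"{colname} not in data"
    values = data[colname]
    assert all(v is not None for v in values), f"{colname} has missing values"
    assert len(set(values)) > 1, f"{colname} is constant"


# Toy data: two individuals observed over two periods.
panel = {"ID": [1, 1, 2, 2], "period": [0, 1, 0, 1]}
check_column(panel, "ID")  # passes silently, mirroring the None return
```

A failed condition raises AssertionError, which is one plausible way for a check that returns None to signal a problem.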

is_categorical(col: str) → bool

Determine if the given feature should be processed as categorical, as opposed to numeric.

Parameters

col – The column to check

Returns

Boolean value for whether the column is categorical

is_degenerate(colname: str) → bool

Determine if a feature is constant or has too many missing values.

Parameters

colname – The column/feature to check

Returns

Boolean value for whether the column is degenerate
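
As a rough illustration of what "degenerate" means here, the sketch below flags a column that is constant or whose share of missing values reaches a cutoff. It is written in plain Python rather than Spark. The function name `looks_degenerate`, the `max_null_share` parameter, and its default value are assumptions, not documented fifeforspark settings.

```python
from typing import Any, List


def looks_degenerate(values: List[Any], max_null_share: float = 0.999) -> bool:
    """Hypothetical check: True if the share of missing values reaches
    the cutoff or the column holds fewer than two distinct values."""
    if not values:
        return True
    null_share = sum(v is None for v in values) / len(values)
    if null_share >= max_null_share:
        return True
    # None counts as a value here, so a column mixing one value
    # with some missing entries is not treated as constant.
    return len(set(values)) < 2
```

Whether a missing entry should count as a distinct value is a design choice; the sketch counts it, so only the null-share cutoff catches mostly empty columns.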

class fifeforspark.processors.PanelDataProcessor(config: Union[None, dict] = {}, data: Union[None, pyspark.sql.DataFrame] = None, shuffle_parts=200)

Bases: fifeforspark.processors.DataProcessor

Ready panel data for modelling.

config

User-provided configuration parameters.

Type

dict

data

Processed panel data.

Type

pyspark.sql.DataFrame

raw_subset

An unprocessed sample from the final period of data. Useful for displaying meaningful values in SHAP plots.

Type

pd.core.frame.DataFrame

categorical_maps

Contains for each categorical feature a map from each unique value to a whole number.

Type

dict
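
To make the shape of this attribute concrete, here is a minimal sketch of building such a map for one feature in plain Python. The helper `build_categorical_map`, the feature name "state", and the choice to number values in order of first appearance are assumptions for illustration; the library may assign numbers differently.

```python
from typing import Any, Dict, Iterable


def build_categorical_map(values: Iterable[Any]) -> Dict[Any, int]:
    """Map each distinct value to a whole number, in order of first appearance."""
    mapping: Dict[Any, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# One entry per categorical feature, keyed by feature name.
categorical_maps = {"state": build_categorical_map(["VA", "MD", "VA", "DC"])}
# categorical_maps == {"state": {"VA": 0, "MD": 1, "DC": 2}}
```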

numeric_ranges

Contains for each numeric feature the maximum and minimum value in the training set.

Type

pd.core.frame.DataFrame
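
The idea behind this attribute can be sketched in plain Python as a per-feature (minimum, maximum) pair computed over training rows. The helper `compute_numeric_ranges`, the toy feature names, and the dict-of-tuples layout are assumptions; the actual attribute is stored as a DataFrame.

```python
from typing import Dict, List, Optional, Tuple


def compute_numeric_ranges(
    train: Dict[str, List[Optional[float]]]
) -> Dict[str, Tuple[float, float]]:
    """Return (minimum, maximum) per feature, ignoring missing entries.

    Assumes every feature has at least one observed value.
    """
    ranges = {}
    for name, values in train.items():
        observed = [v for v in values if v is not None]
        ranges[name] = (min(observed), max(observed))
    return ranges


train = {"age": [34, None, 51], "tenure": [2.5, 0.0, 7.0]}
numeric_ranges = compute_numeric_ranges(train)
# numeric_ranges == {"age": (34, 51), "tenure": (0.0, 7.0)}
```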

build_processed_data()

Clean, augment, and store a panel dataset and related information.

Returns

Processed data

build_reserved_cols() → pyspark.sql.DataFrame

Add data split and outcome-related columns to the data.

Returns

Spark DataFrame with reserved columns added

check_panel_consistency() → None

Ensure observations have unique individual-period combinations.

Returns

None
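
A minimal sketch of the uniqueness condition, using plain Python tuples of (individual, period) instead of a Spark DataFrame. The helper name `find_duplicate_keys` and the toy observations are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Tuple

Key = Tuple[Hashable, Hashable]


def find_duplicate_keys(keys: List[Key]) -> List[Key]:
    """Return each (individual, period) pair that occurs more than once."""
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


observations = [("a", 1), ("a", 2), ("b", 1), ("a", 2)]
duplicates = find_duplicate_keys(observations)
# duplicates == [("a", 2)]

# A consistent panel has no duplicated pairs.
assert not find_duplicate_keys([("a", 1), ("a", 2), ("b", 1)])
```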

flag_validation_individuals() → pyspark.sql.DataFrame

Flag observations from a random share of individuals.

Returns

Spark DataFrame with flagged validation individuals
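
The key property is that the flag is drawn per individual, so all of an individual's observations land on the same side of the split. A plain-Python sketch under that reading follows; the helper `choose_validation_ids`, the `share` parameter, and the seed default are assumptions rather than the library's configuration names.

```python
import random
from typing import List, Set


def choose_validation_ids(ids: List[int], share: float, seed: int = 9999) -> Set[int]:
    """Pick a random share of the distinct individuals, reproducibly."""
    unique_ids = sorted(set(ids))  # sort so the draw does not depend on input order
    rng = random.Random(seed)
    n_validation = round(share * len(unique_ids))
    return set(rng.sample(unique_ids, n_validation))


ids = [1, 1, 2, 2, 3, 3, 4, 4]  # individual ID of each observation
validation_ids = choose_validation_ids(ids, share=0.25)
# Every observation inherits the flag of its individual.
flags = [i in validation_ids for i in ids]
```

With four individuals and a share of 0.25, one individual is selected and both of that individual's observations are flagged.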

process_all_columns() → pyspark.sql.DataFrame

Apply data cleaning functions to all data columns.

Returns

Spark DataFrame with processed columns

process_single_column(colname) → pyspark.sql.DataFrame

Apply data cleaning functions to a single data column.

Parameters

colname – The column to process

Returns

Spark DataFrame with the processed column

Return type

pyspark.sql.DataFrame

sort_panel_data() → pyspark.sql.DataFrame

Sort the data by individual, then by period.

Returns

Sorted panel data

fifeforspark.utils module

class fifeforspark.utils.FIFEArgParser

Bases: argparse.ArgumentParser

Argument parser for the FIFE command-line interface.

fifeforspark.utils.create_example_data(n_persons: int = 8192, n_periods: int = 20, seed_value: int = 9999) → pyspark.sql.DataFrame

Fabricate an unbalanced panel dataset suitable as FIFE input.

Parameters
  • n_persons – the number of people to be in the dataset

  • n_periods – the number of periods to be in the dataset

  • seed_value – seed for random value generation

Returns

Spark dataframe with example data
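
"Unbalanced" means individuals are not all observed for the same set of periods. The sketch below fabricates such a panel in plain Python to show the shape of the data; it is not the function's actual generator. The helper name, the smaller defaults, and the single random `value` column are assumptions, with only the parameter names borrowed from the signature above.

```python
import random
from typing import List, Tuple


def make_unbalanced_panel(
    n_persons: int = 8, n_periods: int = 5, seed_value: int = 9999
) -> List[Tuple[int, int, float]]:
    """Return (person, period, value) rows; each person is observed
    over a random contiguous run of periods."""
    rng = random.Random(seed_value)
    rows = []
    for person in range(n_persons):
        start = rng.randrange(n_periods)          # first observed period
        stop = rng.randint(start, n_periods - 1)  # last observed period, inclusive
        for period in range(start, stop + 1):
            rows.append((person, period, rng.random()))
    return rows


rows = make_unbalanced_panel()
```

Each person contributes between one and `n_periods` rows, and each (person, period) pair appears at most once.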

fifeforspark.utils.create_example_data_spark(n_persons: int = 1000, n_periods: int = 20) → pyspark.sql.DataFrame

fifeforspark.utils.import_data_file(path: str = 'Input Data') → pyspark.sql.DataFrame

Read data into a distributed Spark DataFrame.