Modules¶
fifeforspark.base_modelers module¶
fifeforspark.lgb_modelers module¶
fifeforspark.gbt_modelers module¶
fifeforspark.processors module¶
-
class
fifeforspark.processors.DataProcessor(config={}, data=None)¶ Bases: object
Prepare data by identifying features as degenerate or categorical.
-
check_column_consistency(colname: str) → None¶ Assert column exists, has no missing values, and is not constant.
- Parameters
colname – The name of the column to check
- Returns
None
-
is_categorical(col: str) → bool¶ Determine if the given feature should be processed as categorical, as opposed to numeric.
- Parameters
col – The column to check
- Returns
Boolean value for whether the column is categorical
-
is_degenerate(colname: str) → bool¶ Determine if a feature is constant or has too many missing values.
- Parameters
colname – The column/feature to check
- Returns
Boolean value for whether the column is degenerate
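The degeneracy test described above can be illustrated in plain Python. This is only a sketch of the idea, not the library's Spark implementation: the function name, the `max_null_share` parameter, its default of 0.999, and the choice to count a missing value as its own distinct value are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def is_degenerate_sketch(
    values: Sequence[Optional[float]],
    max_null_share: float = 0.999,  # hypothetical threshold, not a documented default
) -> bool:
    """Return True if a column is empty, mostly missing, or constant."""
    if not values:
        return True
    # Share of missing entries in the column.
    n_missing = sum(1 for v in values if v is None)
    if n_missing / len(values) >= max_null_share:
        return True
    # Assumption: None counts as one distinct value, so a column holding
    # a single value plus some missing entries is not treated as constant.
    return len(set(values)) < 2
```

For example, `is_degenerate_sketch([1, 1, 1])` is true because the column is constant, while `is_degenerate_sketch([1, 2, None])` is false.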
-
-
class
fifeforspark.processors.PanelDataProcessor(config: Union[None, dict] = {}, data: Union[None, pyspark.sql.DataFrame] = None, shuffle_parts=200)¶ Bases: fifeforspark.processors.DataProcessor
Ready panel data for modelling.
-
config¶ User-provided configuration parameters.
- Type
dict
-
data¶ Processed panel data.
- Type
pyspark.sql.DataFrame
-
raw_subset¶ An unprocessed sample from the final period of data. Useful for displaying meaningful values in SHAP plots.
- Type
pd.core.frame.DataFrame
-
categorical_maps¶ Contains for each categorical feature a map from each unique value to a whole number.
- Type
dict
-
numeric_ranges¶ Contains for each numeric feature the maximum and minimum value in the training set.
- Type
pd.core.frame.DataFrame
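To make the categorical_maps and numeric_ranges attributes concrete, here is a plain-Python sketch of what such structures might hold. The function names are hypothetical, and assigning codes in sorted order while skipping missing values is an assumption; the library may order or encode values differently.

```python
def build_categorical_map(values):
    """Map each unique non-missing value to a whole number."""
    # Assumption: codes follow sorted order and None gets no code.
    uniques = sorted({v for v in values if v is not None})
    return {value: code for code, value in enumerate(uniques)}


def numeric_range(values):
    """Return the (minimum, maximum) of the non-missing values."""
    present = [v for v in values if v is not None]
    return min(present), max(present)
```

With this sketch, `build_categorical_map(["b", "a", "b", None])` yields a dictionary with one whole-number code per distinct label, and `numeric_range` gives the bounds that would be recorded for a numeric feature.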
-
build_processed_data()¶ Clean, augment, and store a panel dataset and related information.
- Returns
Processed data
-
build_reserved_cols() → pyspark.sql.DataFrame¶ Add data split and outcome-related columns to the data.
- Returns
Spark DataFrame with reserved columns added
-
check_panel_consistency() → None¶ Ensure observations have unique individual-period combinations.
- Returns
None
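The uniqueness condition that check_panel_consistency enforces can be sketched in plain Python as a search for repeated individual-period pairs. The function name and the `id` and `period` key names are hypothetical; the actual method works on a Spark DataFrame and takes its column names from the configuration.

```python
from collections import Counter


def find_duplicate_keys(rows, individual_key="id", period_key="period"):
    """Return the sorted (individual, period) pairs that occur more than once."""
    counts = Counter((row[individual_key], row[period_key]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)
```

A panel is consistent in this sense when the returned list is empty.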
-
flag_validation_individuals() → pyspark.sql.DataFrame¶ Flag observations from a random share of individuals.
- Returns
Spark DataFrame with flagged validation individuals
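Flagging by individual rather than by observation keeps every row for a given individual on the same side of the split. The plain-Python sketch below shows that idea; the function name, the `share` and `seed` parameters, and the `validation` column name are assumptions for illustration, not the library's interface.

```python
import random


def flag_validation_individuals_sketch(rows, share=0.25, seed=0, individual_key="id"):
    """Return copies of rows with a boolean 'validation' flag per individual."""
    # Sort the unique identifiers so the draw is reproducible for a given seed.
    ids = sorted({row[individual_key] for row in rows})
    rng = random.Random(seed)
    flagged = set(rng.sample(ids, round(share * len(ids))))
    # Every observation inherits the flag of its individual.
    return [{**row, "validation": row[individual_key] in flagged} for row in rows]
```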
-
process_all_columns() → pyspark.sql.DataFrame¶ Apply data cleaning functions to all data columns.
- Returns
Spark DataFrame with processed columns
-
process_single_column(colname) → pyspark.sql.DataFrame¶ Apply data cleaning functions to a singular data column.
- Parameters
colname – The column to process
- Returns
Spark DataFrame with the processed column
- Return type
pyspark.sql.DataFrame
-
sort_panel_data() → pyspark.sql.DataFrame¶ Sort the data by individual, then by period.
- Returns
Sorted panel data
-
fifeforspark.utils module¶
-
class
fifeforspark.utils.FIFEArgParser¶ Bases: argparse.ArgumentParser
Argument parser for the FIFE command-line interface.
-
fifeforspark.utils.create_example_data(n_persons: int = 8192, n_periods: int = 20, seed_value: int = 9999) → pyspark.sql.DataFrame¶ Fabricate an unbalanced panel dataset suitable as FIFE input.
- Parameters
n_persons – the number of people to be in the dataset
n_periods – the number of periods to be in the dataset
seed_value – seed for random value generation
- Returns
Spark DataFrame with example data
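An unbalanced panel is one in which individuals are observed for differing numbers of periods. The plain-Python sketch below fabricates such a panel as a list of dictionaries. It is not how create_example_data works internally; the function name, the column names `individual`, `period`, and `feature_1`, and the rule of giving each person one contiguous spell are all assumptions.

```python
import random


def make_unbalanced_panel(n_persons=4, n_periods=5, seed_value=9999):
    """Fabricate one row per observed individual-period pair."""
    rng = random.Random(seed_value)
    rows = []
    for person in range(n_persons):
        # Each person is observed over a random contiguous spell,
        # so spell lengths generally differ across people.
        first = rng.randrange(n_periods)
        last = rng.randrange(first, n_periods)
        for period in range(first, last + 1):
            rows.append(
                {"individual": person, "period": period, "feature_1": rng.random()}
            )
    return rows
```

Calling the function twice with the same `seed_value` returns identical rows, which mirrors the purpose of the seed argument documented above.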
-
fifeforspark.utils.create_example_data_spark(n_persons: int = 1000, n_periods: int = 20) → pyspark.sql.DataFrame¶
-
fifeforspark.utils.import_data_file(path: str = 'Input Data') → pyspark.sql.DataFrame¶ Read data into a distributed Spark DataFrame.