
Performs person-level propensity score matching to identify comparable control units for treatment effect estimation. The function first aggregates event-level data to person-level characteristics, matches on persons, and then returns all events for the matched individuals. This matters for employment data, where the units to match are people (not individual employment spells) and matching should rest on their stable characteristics.
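
The three steps described above can be illustrated with a minimal sketch. It is not the function's internal code: the toy table, the `spell_id` column, and the hard-coded `matched_ids` are assumptions made for illustration, and the aggregation shown corresponds to what `person_aggregation = "first"` suggests.

```r
library(data.table)

# Toy event-level data: one row per employment spell (hypothetical values)
events <- data.table(
  cf         = c("A", "A", "B", "B", "B", "C"),
  spell_id   = 1:6,
  age        = c(30, 31, 45, 45, 46, 28),
  sector     = c("retail", "retail", "manuf", "manuf", "retail", "retail"),
  is_treated = c(1, 1, 0, 0, 0, 0)
)

# Step 1: collapse to one row per person by taking each person's first record
persons <- events[, .SD[1], by = cf, .SDcols = c("age", "sector", "is_treated")]

# Step 2: matching would run on `persons`; a hard-coded result stands in for it
matched_ids <- c("A", "B")

# Step 3: return every event that belongs to a matched person
matched_events <- events[cf %in% matched_ids]
```

Matching on `persons` rather than on `events` keeps people with many spells from carrying more weight in the propensity model than people with few.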

Usage

propensity_score_matching(
  data,
  treatment_var = "is_treated",
  person_id_var = "cf",
  matching_variables,
  exact_match_vars = NULL,
  person_aggregation = "first",
  method = "nearest",
  ratio = 1,
  caliper = NULL,
  replace = FALSE,
  estimand = "ATT",
  link = "logit",
  distance_metric = "glm",
  missing_data_action = "complete_cases",
  imputation_method = "median",
  min_complete_cases = 0.5,
  max_missing_proportion = 0.3,
  factor_level_threshold = 5,
  verbose = TRUE
)

Arguments

data

A data.table containing treatment and control observations (can be event-level)

treatment_var

Character. Name of treatment indicator variable. Default: "is_treated"

person_id_var

Character. Name of person identifier variable (e.g., "cf"). Default: "cf"

matching_variables

Character vector. Variables to include in propensity score model

exact_match_vars

Character vector. Variables requiring exact matches. Default: NULL

person_aggregation

Character. How to aggregate person characteristics: "first", "last", "mode", "mean". Default: "first"

method

Character. Matching method: "nearest", "optimal", "full", "genetic". Default: "nearest"

ratio

Numeric. Number of control units to match to each treated unit. Default: 1

caliper

Numeric. Maximum allowable distance for matches. Default: NULL (no caliper)

replace

Logical. Allow replacement in matching? Default: FALSE

estimand

Character. Target estimand: "ATT", "ATE", "ATC". Default: "ATT"

link

Character. Link function for propensity model: "logit", "probit". Default: "logit"

distance_metric

Character. Distance metric: "glm", "gam", "gbm", "randomforest". Default: "glm"

missing_data_action

Character. How to handle missing data: "complete_cases" (default), "impute", "exclude_vars"

imputation_method

Character. For numeric variables: "median" (default), "mean". For categorical: "mode"

min_complete_cases

Numeric. Minimum proportion of complete cases required to proceed (0-1). Default: 0.5

max_missing_proportion

Numeric. Maximum proportion of missing values allowed per variable (0-1). Default: 0.3

factor_level_threshold

Numeric. Minimum observations per factor level. Default: 5

verbose

Logical. Print detailed missing data diagnostics? Default: TRUE
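
The two missing-data thresholds can be read as a per-variable check followed by a per-row check. The sketch below shows one plausible interpretation in base R; the toy data are hypothetical, and the order in which the function actually applies these checks is an assumption, not something this page documents.

```r
# Toy person-level covariates with some missing values (hypothetical data)
persons <- data.frame(
  age       = c(30, NA, 45, 28, 52, 39),
  education = c("hs", "ba", NA, "hs", "ba", "hs"),
  wage      = c(NA, NA, NA, 1500, 1800, 1600)
)

max_missing_proportion <- 0.3
min_complete_cases     <- 0.5

# Share of missing values in each variable
missing_share <- colMeans(is.na(persons))

# Variables whose missingness exceeds the per-variable limit
too_sparse <- names(missing_share)[missing_share > max_missing_proportion]

# Share of rows that are complete on the remaining variables
kept           <- setdiff(names(persons), too_sparse)
complete_share <- mean(complete.cases(persons[, kept, drop = FALSE]))

# Assumed rule: proceed only if enough complete cases remain
enough_data <- complete_share >= min_complete_cases
```

In this toy data `wage` is missing for half of the persons, so it exceeds the 0.3 limit, while four of the six persons are complete on `age` and `education`.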

Value

A list containing:

matched_data

Data.table with ALL events for matched individuals (both treated and control). Includes event_time column with values: "pre" (pre-treatment), "post" (post-treatment), "control" (control group), or NA

matched_persons

Data.table with person-level characteristics used for matching

match_matrix

Matrix showing which persons were matched

propensity_scores

Propensity scores for all persons

balance_before

Balance statistics before matching (person-level)

balance_after

Balance statistics after matching (person-level)

match_summary

Summary of matching procedure

common_support

Information about common support region

data_quality_report

Report on missing data handling and data quality issues

aggregation_report

Report on person-level aggregation process
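
Because matched_data keeps every event for the matched persons, a natural first check is to count events by treatment group and period. The sketch below uses a toy table shaped like the documented matched_data; the values are hypothetical, and only the `cf`, `is_treated`, and `event_time` columns are taken from this page.

```r
library(data.table)

# Toy stand-in for ps_match$matched_data (hypothetical values)
matched_data <- data.table(
  cf         = c("A", "A", "A", "B", "B"),
  is_treated = c(1, 1, 1, 0, 0),
  event_time = c("pre", "post", "post", "control", "control")
)

# Number of events in each treatment group and period
event_counts <- matched_data[, .N, by = .(is_treated, event_time)]

# Number of distinct persons in each treatment group
person_counts <- matched_data[, .(n_persons = uniqueN(cf)), by = is_treated]
```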

Examples

if (FALSE) { # \dontrun{
# Process employment data first
employment_data <- vecshift(raw_employment_data)

# Add treatment indicator (e.g., policy intervention)
employment_data[, is_treated := some_treatment_condition]

# Basic person-level propensity score matching
ps_match <- propensity_score_matching(
  data = employment_data,
  person_id_var = "cf",
  matching_variables = c("age", "education", "sector", "region"),
  exact_match_vars = c("gender"),
  person_aggregation = "first",
  method = "nearest",
  ratio = 2,
  missing_data_action = "complete_cases",
  min_complete_cases = 0.7
)

# Result contains all events for matched persons
matched_employment_data <- ps_match$matched_data
person_characteristics <- ps_match$matched_persons

# Advanced matching with imputation for missing values
ps_match_imputed <- propensity_score_matching(
  data = employment_data,
  person_id_var = "cf",
  matching_variables = c("age", "education", "prior_employment", "wage"),
  person_aggregation = "mode",  # Use mode for categorical variables
  method = "optimal",
  caliper = 0.1,
  distance_metric = "gbm",
  missing_data_action = "impute",
  imputation_method = "median"
)
} # }