Mark Records for Employer Consolidation (Lightweight)
Source: R/consolidation_metrics.R
Lightweight Performance-Optimized Function
Marks records that would be consolidated by employer without performing the consolidation itself. The function adds a single column to the original dataset indicating which records would be consolidated together. It applies the same logic as consolidate_employment() but skips the expensive consolidation operations, which makes it suitable for performance-critical applications.
Records are marked for consolidation when they meet all of the following criteria:
Same person (cf)
Same employer (employer_var)
Gap between contracts <= min_lag days
Usage
mark_employer_consolidation(
data,
employer_var,
min_lag = 30,
consolidation_col = "consolidation_group"
)
Value
The original data.table with one additional column marking consolidation groups. Records with the same consolidation_group value would be consolidated together. The column is added using data.table reference semantics, so the input is modified in place rather than copied.
Details
Algorithm:
Sort records by person (cf), employer, and start date (inizio)
For each person-employer combination, check gaps between consecutive contracts
If gap <= min_lag, mark as same consolidation group
If gap > min_lag, start new consolidation group
Create unique consolidation group IDs across all persons
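The steps above can be sketched in a few lines of data.table code. This is a minimal illustration, not the package's actual implementation: the helper name sketch_mark_groups() is hypothetical, the end-date column name fine is an assumption (the documentation only names cf and inizio), and the gap is assumed to be the number of days between the previous contract's end and the next contract's start.

```r
library(data.table)

# Hypothetical helper illustrating the algorithm; NOT the package's code.
# Assumes columns `cf`, `inizio` (start date) and `fine` (end date, assumed name).
sketch_mark_groups <- function(data, employer_var, min_lag = 30,
                               consolidation_col = "consolidation_group") {
  # Step 1: sort by person, employer, and start date (reorders by reference)
  setorderv(data, c("cf", employer_var, "inizio"))

  # Steps 2-4: within each person-employer pair, flag rows that open a new group
  data[, new_group_flag := {
    prev_end <- shift(fine)                # end date of the previous contract
    gap <- as.numeric(inizio - prev_end)   # gap in days (assumed definition)
    is.na(gap) | gap > min_lag             # first contract, or gap too large
  }, by = c("cf", employer_var)]

  # Step 5: a running count of group starts yields IDs unique across all persons
  data[, (consolidation_col) := cumsum(new_group_flag)]
  data[, new_group_flag := NULL]           # drop the temporary helper column
  invisible(data)
}

# Toy data: person A has a 10-day gap and then a 62-day gap with employer X
dt <- data.table(
  cf     = c("A", "A", "A", "B"),
  datore = c("X", "X", "X", "X"),
  inizio = as.Date(c("2020-01-01", "2020-02-10", "2020-06-01", "2020-01-01")),
  fine   = as.Date(c("2020-01-31", "2020-03-31", "2020-06-30", "2020-12-31"))
)
sketch_mark_groups(dt, employer_var = "datore", min_lag = 30)
```

With min_lag = 30, the first two contracts of person A share a group, the third starts a new one, and person B gets a separate group. The sketch compares each contract only with the immediately preceding one, so overlapping contracts may need extra handling.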
Performance Features:
Uses data.table reference semantics (:=) to avoid copying
Single pass through data with vectorized operations
Handles 14M+ records efficiently
Memory-efficient group ID generation
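A practical consequence of reference semantics is that a function using := changes the caller's table even when the result is not assigned. The short sketch below illustrates this with a hypothetical helper, add_flag(), and shows data.table::copy() as a way to keep an untouched version of the data.

```r
library(data.table)

# Hypothetical helper that adds a column by reference, the same mechanism
# the documentation describes for the consolidation column.
add_flag <- function(d) {
  d[, flag := TRUE]
  invisible(d)
}

dt <- data.table(id = 1:3)
add_flag(dt)                       # no assignment needed
has_flag <- "flag" %in% names(dt)  # the caller's table now has the new column

# Take an explicit copy when the original must stay unchanged
snapshot <- copy(dt)
snapshot[, extra := 0L]
original_untouched <- !("extra" %in% names(dt))
```

Calling copy() before marking trades some memory for safety, so it is probably best reserved for cases where the unmarked table is still needed.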
Use Cases:
Performance-critical applications needing consolidation info without actual consolidation
Quality checks and data exploration before running expensive consolidation
Custom consolidation workflows that need group identification first
Memory-constrained environments where full consolidation is prohibitive
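As an illustration of a custom workflow built on the marker column, the sketch below collapses each group into a single spell. The table is toy data with the consolidation_group column already filled in, and the end-date column name fine is an assumption not stated in this documentation.

```r
library(data.table)

# Toy table as it might look after marking; `fine` is an assumed column name.
marked <- data.table(
  cf     = c("A", "A", "A", "B"),
  datore = c("X", "X", "X", "X"),
  inizio = as.Date(c("2020-01-01", "2020-02-10", "2020-06-01", "2020-01-01")),
  fine   = as.Date(c("2020-01-31", "2020-03-31", "2020-06-30", "2020-12-31")),
  consolidation_group = c(1L, 1L, 2L, 3L)
)

# One row per group: earliest start, latest end, number of merged contracts
spells <- marked[, .(
  inizio      = min(inizio),
  fine        = max(fine),
  n_contracts = .N
), by = .(consolidation_group, cf, datore)]
```

Because the grouping is already encoded in one column, any aggregation (durations, counts, first and last contract attributes) can be expressed as an ordinary by-group summary.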
Examples
if (FALSE) { # \dontrun{
# Load sample data
dt <- readRDS("data/sample.rds")
# Mark records for consolidation (modifies dt by reference)
mark_employer_consolidation(
data = dt,
employer_var = "datore",
min_lag = 8,
consolidation_col = "consol_group"
)
# Now dt has a new column 'consol_group'
# Records with the same consol_group value would consolidate together
dt[, .N, by = consol_group] # Count records per consolidation group
# Identify groups containing more than one record (these would be consolidated)
consolidation_candidates <- dt[, .N, by = consol_group][N > 1]
print(paste("Consolidation groups:", nrow(consolidation_candidates)))
# Use custom column name
mark_employer_consolidation(dt, "employer_id", min_lag = 15, consolidation_col = "merge_groups")
} # }