Understanding Date Logic in vecshift • vecshift

library(vecshift)
library(data.table)

Introduction

The vecshift package processes employment records with temporal data, transforming them into continuous time segments with employment status classifications. At the heart of this transformation lies precise date logic that ensures accurate calculation of employment and unemployment periods.

This vignette explains the updated date handling rules in the new version of vecshift, which has deprecated the FINE+1 logic in favor of a simpler, more intuitive approach.

Core Date Logic Principles

1. Inclusive Contract Periods

Employment contracts in vecshift are treated as inclusive date ranges:

Contract duration: From INIZIO to FINE (both days included)
Person works: ON both the start date and end date

# Example: January contract
contract_start <- as.Date("2023-01-01")
contract_end <- as.Date("2023-01-31")

# Calculate duration
duration <- as.numeric(contract_end - contract_start + 1)
cat("Contract duration:", duration, "days\n")
#> Contract duration: 31 days
cat("Person works from:", format(contract_start, "%B %d"), 
    "to", format(contract_end, "%B %d"), "(both days inclusive)\n")
#> Person works from: January 01 to January 31 (both days inclusive)

2. The Date Logic

Approach: The vecshift implementation creates end events at FINE and adjusts unemployment periods afterward.

The logic: - End events are created at the contract end date (FINE) - Unemployment segments (arco=0) are identified after event processing - Unemployment dates are adjusted: inizio+1 and fine-1 - This maintains temporal accuracy with simpler logic

# Example: Contract ends January 31st
contract_end <- as.Date("2023-01-31")
# The end event is at FINE
event_date <- contract_end
# Unemployment adjustment happens during processing
unemployment_start <- contract_end + 1  # Still starts Feb 1 after adjustment

cat("Last day of work:", format(contract_end, "%B %d, %Y"), "\n")
#> Last day of work: January 31, 2023
cat("End event created at:", format(event_date, "%B %d, %Y"), "\n")
#> End event created at: January 31, 2023
cat("First day unemployed (after adjustment):", format(unemployment_start, "%B %d, %Y"), "\n")
#> First day unemployed (after adjustment): February 01, 2023

3. Event-Based Transformation

vecshift converts each employment contract into exactly two events:

Start Event: Date = INIZIO, Value = +1
End Event: Date = FINE (not FINE+1 in new version), Value = -1

# Create sample employment data
employment_data <- data.table(
  id = 1:2,
  cf = c("PERSON001", "PERSON001"),
  inizio = as.Date(c("2023-01-01", "2023-04-01")),
  fine = as.Date(c("2023-03-31", "2023-06-30")),
  prior = c(1, 0)
)

print("Original employment data:")
#> [1] "Original employment data:"
print(employment_data)
#>       id        cf     inizio       fine prior
#>    <int>    <char>     <Date>     <Date> <num>
#> 1:     1 PERSON001 2023-01-01 2023-03-31     1
#> 2:     2 PERSON001 2023-04-01 2023-06-30     0

# The new vecshift creates events at FINE (not FINE+1)
# End events now occur ON the contract end date
cat("\nIn the new logic:\n")
#> 
#> In the new logic:
cat("- End events are created at FINE\n")
#> - End events are created at FINE
cat("- Unemployment periods are adjusted afterward (inizio+1, fine-1)\n")
#> - Unemployment periods are adjusted afterward (inizio+1, fine-1)

Practical Examples

Scenario 1: Consecutive Contracts (No Gap)

consecutive_data <- data.table(
  id = 1:2,
  cf = rep("PERSON001", 2),
  inizio = as.Date(c("2023-01-01", "2023-04-01")),
  fine = as.Date(c("2023-03-31", "2023-06-30")),
  prior = c(1, 0)
)

print("Consecutive contracts:")
#> [1] "Consecutive contracts:"
print(consecutive_data)
#>       id        cf     inizio       fine prior
#>    <int>    <char>     <Date>     <Date> <num>
#> 1:     1 PERSON001 2023-01-01 2023-03-31     1
#> 2:     2 PERSON001 2023-04-01 2023-06-30     0

# Check if there's unemployment between contracts
first_end <- consecutive_data$fine[1]
second_start <- consecutive_data$inizio[2]
gap_duration <- as.numeric(second_start - first_end - 1)

cat("\nFirst contract ends:", format(first_end, "%B %d"), "\n")
#> 
#> First contract ends: March 31
cat("Second contract starts:", format(second_start, "%B %d"), "\n")
#> Second contract starts: April 01
cat("Unemployment duration:", gap_duration, "days\n")
#> Unemployment duration: 0 days

if (gap_duration == 0) {
  cat("No unemployment gap - contracts are consecutive!\n")
}
#> No unemployment gap - contracts are consecutive!

Scenario 2: Gap Between Contracts

gap_data <- data.table(
  id = 1:2,
  cf = rep("PERSON001", 2),
  inizio = as.Date(c("2023-01-01", "2023-03-15")),
  fine = as.Date(c("2023-02-28", "2023-05-31")),
  prior = c(1, 0)
)

print("Contracts with gap:")
#> [1] "Contracts with gap:"
print(gap_data)
#>       id        cf     inizio       fine prior
#>    <int>    <char>     <Date>     <Date> <num>
#> 1:     1 PERSON001 2023-01-01 2023-02-28     1
#> 2:     2 PERSON001 2023-03-15 2023-05-31     0

# Calculate unemployment period
first_end <- gap_data$fine[1]
second_start <- gap_data$inizio[2]
gap_duration <- as.numeric(second_start - first_end - 1)

unemployment_start <- first_end + 1
unemployment_end <- second_start - 1

cat("\nFirst contract ends:", format(first_end, "%B %d"), "\n")
#> 
#> First contract ends: February 28
cat("Unemployment period:", format(unemployment_start, "%B %d"), "to", 
    format(unemployment_end, "%B %d"), "\n")
#> Unemployment period: March 01 to March 14
cat("Second contract starts:", format(second_start, "%B %d"), "\n")
#> Second contract starts: March 15
cat("Total unemployment days:", gap_duration, "\n")
#> Total unemployment days: 14

Scenario 3: Overlapping Contracts

overlap_data <- data.table(
  id = 1:2,
  cf = rep("PERSON001", 2),
  inizio = as.Date(c("2023-01-01", "2023-03-01")),
  fine = as.Date(c("2023-06-30", "2023-04-30")),
  prior = c(1, 0)
)

print("Overlapping contracts:")
#> [1] "Overlapping contracts:"
print(overlap_data)
#>       id        cf     inizio       fine prior
#>    <int>    <char>     <Date>     <Date> <num>
#> 1:     1 PERSON001 2023-01-01 2023-06-30     1
#> 2:     2 PERSON001 2023-03-01 2023-04-30     0

# The vecshift function would generate events at these dates
cat("\nEvents would be created at:\n")
#> 
#> Events would be created at:
cat("- Contract 1 start:", format(overlap_data$inizio[1]), "(+1)\n")
#> - Contract 1 start: 2023-01-01 (+1)
cat("- Contract 1 end:", format(overlap_data$fine[1]), "(-1)\n")
#> - Contract 1 end: 2023-06-30 (-1)
cat("- Contract 2 start:", format(overlap_data$inizio[2]), "(+1)\n")
#> - Contract 2 start: 2023-03-01 (+1)
cat("- Contract 2 end:", format(overlap_data$fine[2]), "(-1)\n")
#> - Contract 2 end: 2023-04-30 (-1)

# The cumulative sum would show overlapping employment
cat("\nCumulative employment levels (arco):\n")
#> 
#> Cumulative employment levels (arco):
cat("- Before March 1: arco = 1 (single employment)\n")
#> - Before March 1: arco = 1 (single employment)
cat("- March 1 - April 30: arco = 2 (overlapping employment)\n")
#> - March 1 - April 30: arco = 2 (overlapping employment)
cat("- May 1 - June 30: arco = 1 (single employment)\n")
#> - May 1 - June 30: arco = 1 (single employment)

# Identify overlap period
overlap_start <- overlap_data$inizio[2]
overlap_end <- min(overlap_data$fine)
cat("\nOverlapping period:", format(overlap_start, "%B %d"), "to", 
    format(overlap_end, "%B %d"), "\n")
#> 
#> Overlapping period: March 01 to April 30
cat("During this period: arco = 2 (multiple employment)\n")
#> During this period: arco = 2 (multiple employment)

Duration Calculations

Employment vs Unemployment Duration

Duration calculations differ based on employment status due to the event structure:

# Create data with both employment and unemployment periods
mixed_data <- data.table(
  id = 1:2,
  cf = rep("PERSON001", 2),
  inizio = as.Date(c("2023-01-01", "2023-04-01")),
  fine = as.Date(c("2023-02-28", "2023-06-30")),
  prior = c(1, 0)
)

# Calculate employment durations (inclusive)
emp_duration_1 <- as.numeric(mixed_data$fine[1] - mixed_data$inizio[1] + 1)
emp_duration_2 <- as.numeric(mixed_data$fine[2] - mixed_data$inizio[2] + 1)

# Calculate unemployment duration
unemp_duration <- as.numeric(mixed_data$inizio[2] - mixed_data$fine[1] - 1)

cat("Employment periods:\n")
#> Employment periods:
cat("  Contract 1:", emp_duration_1, "days (Jan 1 - Feb 28, inclusive)\n")
#>   Contract 1: 59 days (Jan 1 - Feb 28, inclusive)
cat("  Contract 2:", emp_duration_2, "days (Apr 1 - Jun 30, inclusive)\n")
#>   Contract 2: 91 days (Apr 1 - Jun 30, inclusive)
cat("\nUnemployment period:", unemp_duration, "days (Mar 1 - Mar 31)\n")
#> 
#> Unemployment period: 31 days (Mar 1 - Mar 31)

# Verify total coverage
total_days <- as.numeric(max(mixed_data$fine) - min(mixed_data$inizio) + 1)
accounted_days <- emp_duration_1 + emp_duration_2 + unemp_duration
cat("\nTotal period:", total_days, "days\n")
#> 
#> Total period: 181 days
cat("Accounted for:", accounted_days, "days\n")
#> Accounted for: 181 days
cat("Complete coverage:", total_days == accounted_days, "\n")
#> Complete coverage: TRUE

Data Quality and Validation

Common Date Issues

The date logic module includes comprehensive validation to detect common problems:

# Create data with various quality issues
problem_data <- data.table(
  id = 1:4,
  cf = rep("PERSON001", 4),
  inizio = as.Date(c("2023-01-01", "2023-03-01", "2023-05-01", "2023-07-15")),
  fine = as.Date(c("2023-02-28", "2023-02-15", "2023-05-01", "2023-07-10")),  # Issues!
  prior = c(1, 0, 1, 0)
)

print("Data with quality issues:")
#> [1] "Data with quality issues:"
print(problem_data)
#>       id        cf     inizio       fine prior
#>    <int>    <char>     <Date>     <Date> <num>
#> 1:     1 PERSON001 2023-01-01 2023-02-28     1
#> 2:     2 PERSON001 2023-03-01 2023-02-15     0
#> 3:     3 PERSON001 2023-05-01 2023-05-01     1
#> 4:     4 PERSON001 2023-07-15 2023-07-10     0

# Basic validation can be done with simple R operations
cat("\nValidation Results:\n")
#> 
#> Validation Results:
invalid_ranges <- sum(problem_data$fine < problem_data$inizio, na.rm = TRUE)
zero_duration <- sum(problem_data$fine == problem_data$inizio, na.rm = TRUE)
cat("Invalid date ranges:", invalid_ranges, "\n")
#> Invalid date ranges: 2
cat("Zero duration contracts:", zero_duration, "\n")
#> Zero duration contracts: 1

# Identify specific problems
invalid_rows <- which(problem_data$fine < problem_data$inizio)
zero_duration_rows <- which(problem_data$fine == problem_data$inizio)

if (length(invalid_rows) > 0) {
  cat("\nInvalid ranges (FINE < INIZIO) in rows:", invalid_rows, "\n")
}
#> 
#> Invalid ranges (FINE < INIZIO) in rows: 2 4
if (length(zero_duration_rows) > 0) {
  cat("Zero duration contracts in rows:", zero_duration_rows, "\n")
}
#> Zero duration contracts in rows: 3

Using vecshift for Analysis

# Analyze employment patterns using vecshift
clean_data <- data.table(
  id = 1:3,
  cf = rep("PERSON001", 3),
  inizio = as.Date(c("2023-01-01", "2023-04-01", "2023-07-01")),
  fine = as.Date(c("2023-03-31", "2023-05-31", "2023-09-30")),
  prior = c(1, 0, 1)
)

# Use vecshift to create temporal segments
result <- vecshift(clean_data)

print("Temporal segments with employment status:")
#> [1] "Temporal segments with employment status:"
print(result)
#>       id        cf prior  arco       fine     inizio over_id     durata
#>    <int>    <char> <num> <num>     <Date>     <Date>   <int> <difftime>
#> 1:     1 PERSON001     1     1 2023-03-31 2023-01-01       1    90 days
#> 2:     2 PERSON001     0     1 2023-05-31 2023-04-01       2    61 days
#> 3:     0 PERSON001     0     0 2023-06-30 2023-06-01       0    30 days
#> 4:     3 PERSON001     1     1 2023-09-30 2023-07-01       3    92 days

# Calculate employment statistics
total_duration <- as.numeric(sum(result$durata))
employment_duration <- as.numeric(sum(result$durata[result$arco > 0]))
employment_rate <- employment_duration / total_duration

cat("\nEmployment statistics:\n")
#> 
#> Employment statistics:
cat("Employment rate:", round(employment_rate, 3), "\n")
#> Employment rate: 0.89
cat("Total segments:", nrow(result), "\n")
#> Total segments: 4

Integration with vecshift Processing

Core Function with Status Classification

The main vecshift function now separates core temporal logic from status classification:

# Example data
employment_data <- data.table(
  id = 1:3,
  cf = c("P001", "P001", "P002"),
  inizio = as.Date(c("2023-01-01", "2023-06-01", "2023-02-01")),
  fine = as.Date(c("2023-05-31", "2023-12-31", "2023-11-30")),
  prior = c(1, 0, 1)
)

# Process with default status classification
result_with_status <- vecshift(employment_data)
print(result_with_status[1:5])
#>       id     cf prior  arco       fine     inizio over_id     durata
#>    <int> <char> <num> <num>     <Date>     <Date>   <int> <difftime>
#> 1:     1   P001     1     1 2023-05-31 2023-01-01       1   151 days
#> 2:     2   P001     0     1 2023-12-31 2023-06-01       2   214 days
#> 3:     3   P002     1     1 2023-11-30 2023-02-01       3   303 days
#> 4:    NA   <NA>    NA    NA       <NA>       <NA>      NA    NA days
#> 5:    NA   <NA>    NA    NA       <NA>       <NA>      NA    NA days

# Process without status classification (raw segments only)
result_raw <- vecshift(employment_data)
print(result_raw[1:5])
#>       id     cf prior  arco       fine     inizio over_id     durata
#>    <int> <char> <num> <num>     <Date>     <Date>   <int> <difftime>
#> 1:     1   P001     1     1 2023-05-31 2023-01-01       1   151 days
#> 2:     2   P001     0     1 2023-12-31 2023-06-01       2   214 days
#> 3:     3   P002     1     1 2023-11-30 2023-02-01       3   303 days
#> 4:    NA   <NA>    NA    NA       <NA>       <NA>      NA    NA days
#> 5:    NA   <NA>    NA    NA       <NA>       <NA>      NA    NA days

Custom Status Classification

Status attribution is now handled by the dedicated classify_employment_status function:

# Create custom classification rules
custom_rules <- create_custom_status_rules(
  unemployment_threshold = 30,  # Longer threshold
  custom_labels = list(
    unemployed_short = "job_seeking",
    unemployed_long = "long_term_unemployed"
  )
)

# Apply custom classification
result_custom <- vecshift_result <- vecshift(employment_data)
result_custom <- classify_employment_status(vecshift_result, rules = custom_rules)

This separation ensures: - Core date logic remains optimized and unchanged - Status classification can be customized independently - Clear separation of concerns for maintainability

Best Practices

1. Always Validate Input Data

# Before processing, always check data quality
has_invalid_dates <- any(employment_data$FINE < employment_data$INIZIO, na.rm = TRUE)

if (has_invalid_dates) {
  warning("Date issues detected - review data before processing")
}

2. Handle Different Date Formats

# vecshift expects Date objects - convert if needed
numeric_dates <- c(19358, 19387)  # Days since 1970-01-01
char_dates <- c("2023-01-01", "2023-01-30")

# Convert to Date objects
dates_from_numeric <- as.Date(numeric_dates, origin = "1970-01-01")
dates_from_char <- as.Date(char_dates)

print("Date conversions:")
#> [1] "Date conversions:"
print(list(from_numeric = dates_from_numeric, 
           from_character = dates_from_char))
#> $from_numeric
#> [1] "2023-01-01" "2023-01-30"
#> 
#> $from_character
#> [1] "2023-01-01" "2023-01-30"

3. Understand the Business Context

The vecshift date logic reflects real-world employment patterns:

Legal/Administrative: Employment contracts typically end “end of day” on FINE
Benefits: Unemployment benefits often start the day after employment ends
Taxation: Tax calculations need precise employment period boundaries
Analysis: Labor statistics require continuous temporal coverage

Summary

The vecshift date logic ensures:

Temporal Continuity: Every day is classified as either employed or unemployed
Precision: FINE+1 logic eliminates ambiguity in unemployment start dates
Flexibility: Handles consecutive, overlapping, and gap scenarios correctly
Validation: Comprehensive quality checks prevent common date errors
Performance: Optimized for large-scale employment datasets
Modularity: Status classification is separate from core temporal logic

Key architectural principles: - Core Logic: The main vecshift() function handles event-based temporal transformation - Status Attribution: The classify_employment_status() function applies employment labels - Customization: Status rules can be modified without touching core date logic - Performance: Core transformation remains optimized at ~1.46M records/second

Understanding this date logic is crucial for: - Correctly interpreting vecshift results - Debugging unexpected outputs - Extending the package functionality - Creating custom status classification rules - Integrating with other temporal analysis tools

The vecshift architecture ensures consistent date logic while allowing flexible customization of business rules and status classifications through the integrated status labeling system.