Quality Control

The QualityControl class provides comprehensive data quality assessment and outlier detection methods for aerosol instrument data.

Overview

Quality control is essential for ensuring data reliability in aerosol measurements. The QualityControl class offers various statistical methods to identify and handle outliers, invalid measurements, and data quality issues.

Key Features

  • QCFlagBuilder System - Declarative rule-based quality control
  • Multiple Outlier Detection Methods - N-sigma, IQR, time-aware rolling methods
  • Flexible Thresholds - Configurable parameters for different data types
  • Time-Aware Processing - Considers temporal patterns in data
  • Status Code Filtering - Handles instrument-specific error codes
  • Data Completeness Checks - Validates temporal coverage requirements

QCFlagBuilder System

All instruments use the declarative QCFlagBuilder system for quality control. This system provides:

  • Declarative Rules - Define QC rules as simple dataclass instances
  • Consistent Processing - All instruments use QC_Flag internally for quality control
  • Transparent Results - Failed rules are clearly listed in the flag
  • Clean Output - Final output has invalid data set to NaN, QC_Flag column removed

QCRule Dataclass

from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class QCRule:
    name: str                           # Rule identifier (e.g., "Invalid BC")
    condition: Callable[[pd.DataFrame], pd.Series]  # Returns True where data fails
    description: str                    # Human-readable description

QCFlagBuilder Class

class QCFlagBuilder:
    def __init__(self, rules: list[QCRule]):
        self.rules = rules

    def build(self, df: pd.DataFrame) -> pd.Series:
        """Build QC flag column from rules."""
        flags = pd.Series("Valid", index=df.index)

        for rule in self.rules:
            mask = rule.condition(df)
            for idx in df.index[mask]:
                current = flags.loc[idx]
                if current == "Valid":
                    flags.loc[idx] = rule.name
                else:
                    flags.loc[idx] = f"{current}, {rule.name}"

        return flags

Usage Example

from AeroViz.rawDataReader.core.qc import QCRule, QCFlagBuilder

# Define QC rules
rules = [
    QCRule(
        name="Invalid Range",
        condition=lambda df: (df['value'] < 0) | (df['value'] > 1000),
        description="Value outside valid range (0-1000)"
    ),
    QCRule(
        name="Missing Data",
        condition=lambda df: df['value'].isna(),
        description="Value is missing"
    ),
]

# Build flags
builder = QCFlagBuilder(rules)
df['QC_Flag'] = builder.build(df)

# Results:
# - "Valid" if all rules pass
# - "Invalid Range" if only range check fails
# - "Invalid Range, Missing Data" if both fail

Instrument QC Rules Summary

Instrument     QC Rules
AE33/AE43      Status Error, Invalid BC, Invalid AAE, Insufficient
BC1054         Status Error, Invalid BC, Invalid AAE, Insufficient
MA350          Status Error, Invalid BC, Invalid AAE, Insufficient
SMPS           Status Error, Insufficient, Low Total, High Bin, High Large Bin
APS            Status Error, Insufficient, Low Total, High Total
NEPH/Aurora    No Data, Invalid Scat Value, Invalid Scat Rel, Insufficient
TEOM           High Noise, Negative/Zero, NV > Total, Invalid Vol Frac, Std Outlier, Insufficient
BAM1020        Invalid Range, IQR Outlier
OCEC           Invalid Range, Below MDL, IQR Outlier, Missing OC
IGAC           Mass Closure, Missing Main, Below MDL, Ion Balance
EPA            Negative Value
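
The per-instrument rule sets above are built from the same QCRule/QCFlagBuilder primitives. As a hedged illustration, two AE33-style rules might be declared as follows; the column names ('Status', 'BC6') and the thresholds are assumptions for this example, not the library's actual definitions:

rules = [
    QCRule(
        name="Status Error",
        condition=lambda df: df['Status'] != 0,  # assumed: non-zero status means an instrument error
        description="Instrument reported a non-zero status code"
    ),
    QCRule(
        name="Invalid BC",
        condition=lambda df: (df['BC6'] <= 0) | (df['BC6'] > 20000),  # assumed plausible BC range (ng/m3)
        description="Black carbon concentration outside a plausible range"
    ),
]

df['QC_Flag'] = QCFlagBuilder(rules).build(df)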

Methods

Statistical Outlier Detection

N-Sigma Method

from AeroViz.rawDataReader.core.qc import QualityControl

qc = QualityControl()
cleaned_data = qc.n_sigma(df, std_range=3)
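
Conceptually, the n-sigma method masks values lying more than std_range standard deviations from each column's mean. A simplified sketch of the idea (not the library's exact implementation):

# Per-column mean/std; values beyond 3 sigma are masked to NaN
mean, std = df.mean(), df.std()
outside = (df - mean).abs() > 3 * std
sketch_cleaned = df.mask(outside)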

Interquartile Range (IQR) Method

# Basic IQR
cleaned_data = qc.iqr(df)

# With log transformation
cleaned_data = qc.iqr(df, log_dist=True)
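
Internally, an IQR filter derives per-column bounds from the quartiles. A rough sketch using the textbook 1.5 * IQR fence (the library's own multiplier may differ):

# Quartile-based bounds; anything outside the fences is masked to NaN
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outside = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
sketch_cleaned = df.mask(outside)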

Time-Aware Rolling IQR

# Rolling IQR with time awareness
cleaned_data = qc.time_aware_rolling_iqr(
    df,
    window_size='24h',
    iqr_factor=3.0,
    min_periods=5
)
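
The time-aware variant evaluates each point against quartiles computed over a trailing time window. A simplified single-column sketch using pandas rolling quantiles (assumes a DatetimeIndex and a 'value' column; the library's handling of the initial window may differ):

s = df['value']
roll = s.rolling('24h', min_periods=5)            # trailing 24-hour window
q1, q3 = roll.quantile(0.25), roll.quantile(0.75)
iqr = q3 - q1
outside = (s < q1 - 3.0 * iqr) | (s > q3 + 3.0 * iqr)
sketch_cleaned = s.mask(outside)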

Advanced Quality Control

Bidirectional Trend Analysis

# Detect outliers while considering data trends
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='30min',
    trend_factor=2.0
)

Status Code Filtering

# Filter based on instrument status codes
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16],
    special_codes=[384, 1024]
)
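
Instrument status registers are often bit-encoded, with each error code corresponding to one bit. The sketch below shows one way such a mask could be computed; the 'status' column name and the bitwise treatment are assumptions, not necessarily how filter_error_status works internally:

import numpy as np

status = df['status'].fillna(0).astype(int)        # assumed status-code column
error_bits = np.bitwise_or.reduce([1, 2, 4, 16])   # combine individual error bits
bit_error = (status & error_bits) > 0              # any error bit set
special_error = status.isin([384, 1024])           # exact matches for special codes
sketch_mask = bit_error | special_error            # True marks rows to discard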

Completeness Validation

# Check hourly data completeness
completeness_mask = qc.hourly_completeness_QC(
    df,
    freq='6min',
    threshold=0.75  # Require 75% data availability
)
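
The completeness check compares how many samples actually fall in each hour against the count expected from the sampling frequency. A simplified sketch (assumes a DatetimeIndex and a 'value' column; not the library's exact implementation):

import pandas as pd

expected = pd.Timedelta('1h') / pd.Timedelta('6min')    # 10 samples per hour at 6-minute resolution
counts = df['value'].resample('1h').count()             # actual samples per hour
insufficient_hours = counts < 0.75 * expected           # True where an hour falls below 75%
sketch_mask = insufficient_hours.reindex(df.index, method='ffill')  # map hourly result back to rows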

Usage Examples

Basic Outlier Removal

from AeroViz.rawDataReader.core.qc import QualityControl
import pandas as pd

# Load your data
df = pd.read_csv('instrument_data.csv', index_col=0, parse_dates=True)

# Initialize QC
qc = QualityControl()

# Apply basic outlier detection
cleaned_df = qc.n_sigma(df, std_range=3)

Advanced Time-Aware Processing

import numpy as np

# For time series with trends
cleaned_df = qc.time_aware_rolling_iqr(
    df,
    window_size='12h',
    iqr_factor=2.5,
    min_periods=10
)

# With bidirectional trend consideration
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='1h'
)

# Apply the mask
final_df = df.where(~outlier_mask, np.nan)

Instrument-Specific QC

import numpy as np

# For instruments with status codes (e.g., AE33)
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16, 32],  # Common error codes
    special_codes=[384, 1024, 2048]  # Specific error conditions
)

# Remove error data
qc_df = df.where(~error_mask, np.nan)

API Reference

AeroViz.rawDataReader.core.qc.QualityControl

A class providing various methods for data quality control and outlier detection

Functions

_ensure_dataframe staticmethod
_ensure_dataframe(df: DataFrame | Series) -> DataFrame

Ensure input data is in DataFrame format

_transform_if_log staticmethod
_transform_if_log(df: DataFrame, log_dist: bool) -> DataFrame

Transform data to log scale if required

n_sigma classmethod
n_sigma(df: DataFrame, std_range: int = 5) -> DataFrame

Detect outliers using n-sigma method

Parameters:

  • df (DataFrame, required): Input data
  • std_range (int, default 5): Number of standard deviations to use as threshold

Returns:

  • DataFrame: Cleaned DataFrame with outliers masked as NaN

iqr classmethod
iqr(df: DataFrame, log_dist: bool = False) -> DataFrame

Detect outliers using Interquartile Range (IQR) method

Parameters:

  • df (DataFrame, required): Input data
  • log_dist (bool, default False): Whether to apply log transformation to data

Returns:

  • DataFrame: Cleaned DataFrame with outliers masked as NaN

time_aware_rolling_iqr classmethod
time_aware_rolling_iqr(df: DataFrame, window_size: str = '24h', log_dist: bool = False, iqr_factor: float = 5, min_periods: int = 5) -> DataFrame

Detect outliers using rolling time-aware IQR method with handling for initial periods

Parameters:

  • df (DataFrame, required): Input data
  • window_size (str, default '24h'): Size of the rolling window
  • log_dist (bool, default False): Whether to apply log transformation to data
  • iqr_factor (float, default 5): The factor by which to multiply the IQR
  • min_periods (int, default 5): Minimum number of observations required in window

Returns:

  • DataFrame: Cleaned DataFrame with outliers masked as NaN

time_aware_std_QC
time_aware_std_QC(df: DataFrame, time_window: str = '6h', std_factor: float = 3.0, min_periods: int = 4) -> DataFrame

Time-aware outlier detection using rolling standard deviation

Parameters:

  • df (DataFrame, required): Input data
  • time_window (str, default '6h'): Rolling window size
  • std_factor (float, default 3.0): Standard deviation multiplier (e.g., 3 means 3σ)
  • min_periods (int, default 4): Minimum number of observations required in window

Returns:

  • DataFrame: Quality controlled DataFrame with outliers marked as NaN
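
A short usage example based on the signature above:

qc = QualityControl()
cleaned_df = qc.time_aware_std_QC(df, time_window='6h', std_factor=3.0, min_periods=4)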

bidirectional_trend_std_QC classmethod
bidirectional_trend_std_QC(df: DataFrame, window_size: str = '6h', std_factor: float = 3.0, trend_window: str = '30min', trend_factor: float = 2, min_periods: int = 4) -> Series

Perform quality control using standard deviation with awareness of both upward and downward trends.

This method identifies outliers considering both upward and downward trends in the data, applying more lenient criteria when consistent trends are detected.

Parameters:

  • df (DataFrame, required): Input data frame with time series (a QC_Flag column is optional)
  • window_size (str, default '6h'): Size of the rolling window for std calculation
  • std_factor (float, default 3.0): Base factor for standard deviation threshold
  • trend_window (str, default '30min'): Window for trend detection
  • trend_factor (float, default 2): Factor to increase std_factor when trends are detected
  • min_periods (int, default 4): Minimum number of observations in window

Returns:

  • Series: Boolean mask where True indicates outliers

filter_error_status staticmethod
filter_error_status(_df, error_codes, special_codes=None, return_mask=True)

Filter data based on error status codes.

Parameters:

  • _df (DataFrame, required): Input DataFrame
  • error_codes (list or array-like, required): Codes indicating errors
  • special_codes (list or array-like, default None): Special codes to handle differently
  • return_mask (bool, default True): If True, returns a boolean mask where True indicates errors; if False, returns the filtered DataFrame

Returns:

  • DataFrame or Series: If return_mask=True, a boolean Series with True for error points; if return_mask=False, the filtered DataFrame with error points masked

spike_detection classmethod
spike_detection(df: DataFrame, max_change_rate: float = 3.0, min_abs_change: float = None) -> Series

Vectorized spike detection using change rate analysis.

Detects sudden unreasonable value changes while allowing legitimate gradual changes during events (pollution episodes, etc.).

This method is much faster than rolling window methods because it uses pure numpy vectorized operations.

Parameters:

  • df (DataFrame, required): Input data frame with time series
  • max_change_rate (float, default 3.0): Maximum allowed ratio of the current change to the median absolute change; higher values are more permissive. A value of 3.0 means a change must be 3x larger than the median change to be flagged
  • min_abs_change (float, default None): Minimum absolute change required to be considered a spike; if None, uses 10% of the data's standard deviation

Returns:

  • Series: Boolean mask where True indicates detected spikes

Notes

The algorithm:

  1. Calculate the absolute difference between consecutive points
  2. Calculate the median absolute change (a robust baseline)
  3. Flag points where the change exceeds max_change_rate * median_change
  4. Also detect "reversals" (a spike up followed immediately by a drop back down)

This approach allows gradual changes during events while catching sudden spikes that are likely instrument errors.
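
A rough single-column sketch of those steps (the 'value' column name and the way steps 3 and 4 are combined are assumptions, not the library's exact code):

import numpy as np

s = df['value']
diff = s.diff()
abs_change = diff.abs()                      # step 1: change between consecutive points
median_change = abs_change.median()          # step 2: robust baseline change
min_abs = 0.1 * s.std()                      # assumed default floor (10% of the std)
big_jump = (abs_change > 3.0 * median_change) & (abs_change > min_abs)   # step 3
reversal = (np.sign(diff) * np.sign(diff.shift(-1))) < 0                 # step 4: up then immediately down
sketch_spikes = big_jump & reversal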

Examples:

>>> qc = QualityControl()
>>> spike_mask = qc.spike_detection(df, max_change_rate=3.0)

hourly_completeness_QC classmethod
hourly_completeness_QC(df: DataFrame, freq: str, threshold: float = 0.5) -> Series

Check if each hour has sufficient data points.

Parameters:

  • df (DataFrame, required): Input data frame with time series
  • freq (str, required): Data frequency (e.g., '6min')
  • threshold (float, default 0.5): Minimum required proportion of data points per hour (0-1)

Returns:

  • Series: Boolean mask where True indicates insufficient data
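
Putting several of the documented methods together, a hedged sketch of an end-to-end QC pass might look like this; the ordering and parameter choices are illustrative, and the returned masks are assumed to align with the DataFrame index:

import numpy as np
from AeroViz.rawDataReader.core.qc import QualityControl

qc = QualityControl()

# Drop rows flagged by instrument status codes
status_mask = qc.filter_error_status(df, error_codes=[1, 2, 4, 16])
df = df.where(~status_mask, np.nan)

# Remove statistical outliers, then sudden spikes
df = qc.time_aware_rolling_iqr(df, window_size='24h', iqr_factor=3.0, min_periods=5)
spike_mask = qc.spike_detection(df, max_change_rate=3.0)
df = df.where(~spike_mask, np.nan)

# Finally, blank out hours with too little remaining data
sparse_mask = qc.hourly_completeness_QC(df, freq='6min', threshold=0.75)
df = df.where(~sparse_mask, np.nan)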