Quality Control

The QualityControl class provides comprehensive data quality assessment and outlier detection methods for aerosol instrument data.

Overview

Quality control keeps unreliable readings out of aerosol data products. The QualityControl class offers statistical outlier detection (n-sigma, IQR, rolling time-aware variants, spike detection) alongside status-code filtering and hourly completeness checks.

Key Features

QCFlagBuilder System - Declarative rule-based quality control
Multiple Outlier Detection Methods - N-sigma, IQR, time-aware rolling methods
Flexible Thresholds - Configurable parameters for different data types
Time-Aware Processing - Considers temporal patterns in data
Status Code Filtering - Handles instrument-specific error codes
Data Completeness Checks - Validates temporal coverage requirements

QCFlagBuilder System

All instruments use the declarative QCFlagBuilder system for quality control. This system provides:

Declarative Rules - Define QC rules as simple dataclass instances
Consistent Processing - All instruments use QC_Flag internally for quality control
Transparent Results - Failed rules are clearly listed in the flag
Clean Output - Final output has invalid data set to NaN, QC_Flag column removed

QCRule Dataclass

from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class QCRule:
    name: str                           # Rule identifier (e.g., "Invalid BC")
    condition: Callable[[pd.DataFrame], pd.Series]  # Returns True where data fails
    description: str                    # Human-readable description
    severity: str = 'error'             # 'error' masks the row; 'warning' is advisory (see Rule severity)

QCFlagBuilder Class

class QCFlagBuilder:
    def __init__(self, rules: list[QCRule]):
        self.rules = rules

    def build(self, df: pd.DataFrame) -> pd.Series:
        """Build QC flag column from rules."""
        flags = pd.Series("Valid", index=df.index)

        for rule in self.rules:
            mask = rule.condition(df)
            for idx in df.index[mask]:
                current = flags.loc[idx]
                if current == "Valid":
                    flags.loc[idx] = rule.name
                else:
                    flags.loc[idx] = f"{current}, {rule.name}"

        return flags

Usage Example

from AeroViz.rawDataReader.core.qc import QCRule, QCFlagBuilder

# Define QC rules
rules = [
    QCRule(
        name="Invalid Range",
        condition=lambda df: (df['value'] < 0) | (df['value'] > 1000),
        description="Value outside valid range (0-1000)"
    ),
    QCRule(
        name="Missing Data",
        condition=lambda df: df['value'].isna(),
        description="Value is missing"
    ),
]

# Build flags
builder = QCFlagBuilder(rules)
df['QC_Flag'] = builder.build(df)

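# Results:
# - "Valid" if all rules pass
# - "Invalid Range" if only the range check fails
# - comma-joined names in rule order (e.g. "Rule A, Rule B") when several rules fire.
#   The two rules above cannot both fire on one row: comparisons with NaN are False.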
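The flag-building behaviour described above can be tried without AeroViz installed. The following is a minimal sketch under stated assumptions: `QCRule` here is a local stand-in for the dataclass shown earlier, `build_flags` is a hypothetical helper written for this example, and the third rule ("Negative") is added only so that one row collects two rule names. It also illustrates the "clean output" step, where flagged values become NaN and the working QC_Flag column is dropped.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np
import pandas as pd


@dataclass
class QCRule:
    """Stand-in mirroring the dataclass shown above (not imported from AeroViz)."""
    name: str
    condition: Callable[[pd.DataFrame], pd.Series]
    description: str


def build_flags(df: pd.DataFrame, rules: list) -> pd.Series:
    """Hypothetical helper: comma-join the names of every rule that fires per row."""
    fired = [[] for _ in range(len(df))]
    for rule in rules:
        # Work positionally so duplicate index labels cannot cause ambiguity
        mask = rule.condition(df).fillna(False).to_numpy(dtype=bool)
        for pos in np.flatnonzero(mask):
            fired[pos].append(rule.name)
    labels = [", ".join(names) if names else "Valid" for names in fired]
    return pd.Series(labels, index=df.index)


rules = [
    QCRule("Invalid Range", lambda d: (d["value"] < 0) | (d["value"] > 1000), "Outside 0-1000"),
    QCRule("Missing Data", lambda d: d["value"].isna(), "Value is missing"),
    QCRule("Negative", lambda d: d["value"] < 0, "Illustrative extra rule"),
]

df = pd.DataFrame({"value": [10.0, -5.0, np.nan, 1500.0]})
df["QC_Flag"] = build_flags(df, rules)

# "Clean output": blank every flagged value, then drop the working column
clean = df.copy()
clean.loc[clean["QC_Flag"] != "Valid", "value"] = np.nan
clean = clean.drop(columns="QC_Flag")

print(df["QC_Flag"].tolist())
# ['Valid', 'Invalid Range, Negative', 'Missing Data', 'Invalid Range']
```

Collecting names per row position and joining once at the end gives the same labels as the row-by-row loop in the class listing, while avoiding repeated `.loc` writes.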

Rule severity

A rule declares whether firing it invalidates the measurement:

from AeroViz.rawDataReader.core import QCRule

QCRule(
    name='Invalid BC',
    condition=lambda df: (df['BC6'] <= 0) | (df['BC6'] > 20000),
    description='BC outside 0-20000 ng/m³',
    # severity='error' is the default: the row is masked to NaN in the output
)

QCRule(
    name='Below MDL',
    condition=lambda df: df['Thermal_OC'] <= 0.3,
    description='At or below the method detection limit',
    severity='warning',   # advisory: recorded and logged, value kept
)

QCFlagBuilder.apply writes two columns: QC_Flag (every rule that fired) and QC_Invalid (True iff an 'error' rule fired). Only QC_Invalid drives masking, so an advisory flag never deletes data. get_summary reports each rule's severity plus two totals — Valid (passed everything) and Usable (nothing invalidating).

Callers can reclassify per run, without editing a reader:

# tighten an advisory flag back into an invalidating one
RawDataReader('SMPS', path, flag_severity={'Insufficient': 'error'})
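To make the split between flagging and invalidating concrete, here is a small sketch under stated assumptions. The `apply_rules` helper and the `(name, condition, severity)` tuples are hypothetical simplifications written for this example; they are not the AeroViz implementation of `QCFlagBuilder.apply`. The sketch writes QC_Flag and QC_Invalid, masks only on QC_Invalid, computes the Valid and Usable totals, and applies a per-run severity override in the spirit of `flag_severity`.

```python
import numpy as np
import pandas as pd

# (name, condition, severity) triples: a simplified stand-in for QCRule
rules = [
    ("Invalid BC", lambda d: (d["BC6"] <= 0) | (d["BC6"] > 20000), "error"),
    ("Below MDL", lambda d: d["Thermal_OC"] <= 0.3, "warning"),
]


def apply_rules(df, rules, flag_severity=None):
    """Hypothetical helper: return a copy of df with QC_Flag and QC_Invalid columns."""
    overrides = flag_severity or {}
    fired = [[] for _ in range(len(df))]
    invalid = np.zeros(len(df), dtype=bool)
    for name, condition, severity in rules:
        mask = condition(df).fillna(False).to_numpy(dtype=bool)
        for pos in np.flatnonzero(mask):
            fired[pos].append(name)
        # Only rules whose effective severity is 'error' invalidate a row
        if overrides.get(name, severity) == "error":
            invalid |= mask
    out = df.copy()
    out["QC_Flag"] = [", ".join(names) if names else "Valid" for names in fired]
    out["QC_Invalid"] = invalid
    return out


df = pd.DataFrame({"BC6": [500.0, -1.0, 800.0], "Thermal_OC": [2.0, 1.5, 0.1]})
flagged = apply_rules(df, rules)

valid_total = int((flagged["QC_Flag"] == "Valid").sum())   # passed every rule
usable_total = int((~flagged["QC_Invalid"]).sum())         # nothing invalidating fired

# Masking is driven by QC_Invalid alone, so the advisory row keeps its values
masked = df.copy()
masked.loc[flagged["QC_Invalid"].to_numpy(), :] = np.nan

# Per-run override: promote the advisory rule to an invalidating one
strict = apply_rules(df, rules, flag_severity={"Below MDL": "error"})
```

In this toy data one row is Valid and two are Usable: the "Below MDL" row is flagged but not masked until the override reclassifies it.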

Instrument QC Rules Summary

The authoritative per-instrument rule list — names, thresholds, severities and the status-code tables behind Status Error — lives in Instrument Formats & QC, which is generated from the readers rather than restated here.

Methods

Statistical Outlier Detection

N-Sigma Method

from AeroViz.rawDataReader.core.qc import QualityControl

qc = QualityControl()
cleaned_data = qc.n_sigma(df, std_range=3)
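For readers who want to see what an n-sigma filter does numerically, the sketch below implements the textbook version with pandas: values further than `std_range` standard deviations from the column mean become NaN. It is an illustration only; `n_sigma_sketch` is a hypothetical name, and the library's own implementation may differ in details such as how it treats existing NaNs.

```python
import pandas as pd


def n_sigma_sketch(df: pd.DataFrame, std_range: float = 3.0) -> pd.DataFrame:
    """Mask values outside mean +/- std_range * std, column by column."""
    mean, std = df.mean(), df.std()
    lower = mean - std_range * std
    upper = mean + std_range * std
    keep = df.ge(lower, axis="columns") & df.le(upper, axis="columns")
    return df.where(keep)  # positions where keep is False become NaN


df = pd.DataFrame({"pm": [10.0] * 19 + [1000.0]})

cleaned_3 = n_sigma_sketch(df, std_range=3)  # the 1000.0 reading is masked
cleaned_5 = n_sigma_sketch(df, std_range=5)  # nothing is masked
```

The second call shows a known limitation of the method: a single extreme value inflates the standard deviation it is tested against, so in a 20-point series it sits only about 4.2 sigma from the mean and survives a 5-sigma threshold.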

Interquartile Range (IQR) Method

# Basic IQR
cleaned_data = qc.iqr(df)

# With log transformation
cleaned_data = qc.iqr(df, log_dist=True)
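The sketch below shows one common form of IQR filtering, with an optional log transform for roughly log-normal quantities such as number concentrations. It assumes Tukey fences at 1.5 times the IQR; this page does not state which factor `qc.iqr` uses, so the `factor` argument and the name `iqr_sketch` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def iqr_sketch(df: pd.DataFrame, log_dist: bool = False, factor: float = 1.5) -> pd.DataFrame:
    """Mask values outside [Q1 - factor*IQR, Q3 + factor*IQR], column by column."""
    # In log mode the fences are computed on log10 values; non-positive
    # readings have no logarithm, so they become NaN and end up masked
    work = np.log10(df.where(df > 0)) if log_dist else df
    q1, q3 = work.quantile(0.25), work.quantile(0.75)
    spread = q3 - q1
    keep = (work.ge(q1 - factor * spread, axis="columns")
            & work.le(q3 + factor * spread, axis="columns"))
    return df.where(keep)  # the original (untransformed) values are returned


df = pd.DataFrame({"number_conc": [2, 3, 4, 5, 6, 7, 8, 9, 10, 1000.0]})

cleaned_linear = iqr_sketch(df)
cleaned_log = iqr_sketch(df, log_dist=True)
```

The transform only affects where the fences sit; the returned frame always holds the original units.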

Time-Aware Rolling IQR

# Rolling IQR with time awareness
cleaned_data = qc.time_aware_rolling_iqr(
    df,
    window_size='24h',
    iqr_factor=3.0,
    min_periods=5
)
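A time-aware rolling IQR can be approximated with pandas offset windows, which select observations by elapsed time rather than by row count, so gaps in the record do not stretch the window. The sketch below is written under assumptions: `rolling_iqr_sketch` is a hypothetical name, it uses a trailing window that includes the current point, and it simply leaves early points unflagged until `min_periods` observations are available, which may not match how the library handles initial periods.

```python
import pandas as pd


def rolling_iqr_sketch(s: pd.Series, window_size: str = "24h",
                       iqr_factor: float = 3.0, min_periods: int = 5) -> pd.Series:
    """Mask points outside rolling [Q1 - k*IQR, Q3 + k*IQR] fences."""
    roll = s.rolling(window_size, min_periods=min_periods)  # needs a sorted DatetimeIndex
    q1, q3 = roll.quantile(0.25), roll.quantile(0.75)
    spread = q3 - q1
    # Fences are NaN until min_periods is reached, so those comparisons are False
    outlier = (s < q1 - iqr_factor * spread) | (s > q3 + iqr_factor * spread)
    return s.mask(outlier)


index = pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1))
values = [10.0 + (i % 5) for i in range(48)]  # repeating 10..14 pattern
values[30] = 500.0                            # one injected outlier
series = pd.Series(values, index=index)

cleaned = rolling_iqr_sketch(series, window_size="24h", iqr_factor=3.0, min_periods=5)
```

Because quartiles are barely moved by one extreme value, the injected point is masked while the windows that still contain it keep their neighbours intact.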

Advanced Quality Control

Bidirectional Trend Analysis

# Detect outliers while considering data trends
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='30min',
    trend_factor=2.0
)

Status Code Filtering

# Filter based on instrument status codes
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16],
    special_codes=[384, 1024]
)

filter_error_status supports four status_type modes — bitwise (AE33, AE43, BC1054, MA350, TEOM), numeric (Aurora, NEPH), text (SMPS) and binary_string (APS). The ignored_values whitelist lets you suppress operator-known benign statuses without rewriting raw files; it is interpreted in the active mode (string tokens / numeric codes / integer bits / bit masks), and entries that don't fit a mode are skipped. This is surfaced on RawDataReader as the ignored_status_errors parameter.

# bitwise: drop one warning bit from the error definition
# (e.g. ignore the TEOM "Dryer A" status bit 0x20000000)
error_mask = qc.filter_error_status(
    df,
    error_codes=[1 << b for b in range(32)],
    status_column='status',
    status_type='bitwise',
    ignored_values=[536870912],
)

# text: whitelist comma-split tokens (SMPS)
error_mask = qc.filter_error_status(
    df,
    status_column='Instrument Errors',
    status_type='text',
    ok_value='',
    ignored_values=['Low aerosol flow', 'Neutralizer not active'],
)
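The whitelist semantics for the bitwise and text modes can be stated per row in a few lines of plain Python. This is a sketch of the rules as described in this section, not the library code: the function names are hypothetical, whitespace stripping around tokens is an assumption, and "Laser fault" is an invented status string used only to show a non-whitelisted token.

```python
def bitwise_error(status: int, error_codes, ignored_values=()) -> bool:
    """True if any error code that is NOT whitelisted matches the status word."""
    ignored = set(ignored_values)
    active_codes = [code for code in error_codes if code not in ignored]
    return any(status & code for code in active_codes)


def text_error(status: str, ok_value: str = "", ignored_values=()) -> bool:
    """True unless every comma-split token is ok_value or whitelisted."""
    allowed = {ok_value, *ignored_values}
    tokens = [token.strip() for token in status.split(",")]
    return any(token not in allowed for token in tokens)


ALL_BITS = [1 << b for b in range(32)]
DRYER_A = 0x20000000  # the TEOM status bit used in the example above (536870912)

# bitwise: the whitelisted bit alone is fine, any other set bit still flags the row
dryer_only_default = bitwise_error(DRYER_A, ALL_BITS)
dryer_only_ignored = bitwise_error(DRYER_A, ALL_BITS, ignored_values=[DRYER_A])
dryer_plus_other = bitwise_error(DRYER_A | 0x4, ALL_BITS, ignored_values=[DRYER_A])

# text: a combined status passes only when every token is whitelisted
whitelist = ["Low aerosol flow", "Neutralizer not active"]
both_known = text_error("Low aerosol flow,Neutralizer not active", "", whitelist)
one_unknown = text_error("Low aerosol flow,Laser fault", "", whitelist)
empty_status = text_error("", "", whitelist)
```

Both modes work at token level: whitelisting removes individual bits or strings from the error definition rather than exempting whole rows.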

Completeness Validation

# Check hourly data completeness
completeness_mask = qc.hourly_completeness_QC(
    df,
    freq='6min',
    threshold=0.75  # Require 75% data availability
)

Usage Examples

Basic Outlier Removal

from AeroViz.rawDataReader.core.qc import QualityControl
import pandas as pd

# Load your data
df = pd.read_csv('instrument_data.csv', index_col=0, parse_dates=True)

# Initialize QC
qc = QualityControl()

# Apply basic outlier detection
cleaned_df = qc.n_sigma(df, std_range=3)

Advanced Time-Aware Processing

# For time series with trends
cleaned_df = qc.time_aware_rolling_iqr(
    df,
    window_size='12h',
    iqr_factor=2.5,
    min_periods=10
)

# With bidirectional trend consideration
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='1h'
)

# Apply the mask (masked positions become NaN, the default fill of DataFrame.where)
final_df = df.where(~outlier_mask)

Instrument-Specific QC

# For instruments with status codes (e.g., AE33)
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16, 32],  # Common error codes
    special_codes=[384, 1024, 2048]  # Specific error conditions
)

# Remove error data (masked positions become NaN, the default fill of DataFrame.where)
qc_df = df.where(~error_mask)

API Reference

AeroViz.rawDataReader.core.qc.QualityControl

A class providing various methods for data quality control and outlier detection

Methods:

_ensure_dataframe `staticmethod`

_ensure_dataframe(df: DataFrame | Series) -> DataFrame

Ensure input data is in DataFrame format

_transform_if_log `staticmethod`

_transform_if_log(df: DataFrame, log_dist: bool) -> DataFrame

Transform data to log scale if required

n_sigma `classmethod`

n_sigma(df: DataFrame, std_range: int = 5) -> DataFrame

Detect outliers using n-sigma method

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data	required
`std_range`	`int`	Number of standard deviations to use as threshold	`5`

Returns:

Type	Description
`DataFrame`	Cleaned DataFrame with outliers masked as NaN

iqr `classmethod`

iqr(df: DataFrame, log_dist: bool = False) -> DataFrame

Detect outliers using Interquartile Range (IQR) method

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data	required
`log_dist`	`bool`	Whether to apply log transformation to data	`False`

Returns:

Type	Description
`DataFrame`	Cleaned DataFrame with outliers masked as NaN

time_aware_rolling_iqr `classmethod`

time_aware_rolling_iqr(df: DataFrame, window_size: str = '24h', log_dist: bool = False, iqr_factor: float = 5, min_periods: int = 5) -> DataFrame

Detect outliers using rolling time-aware IQR method with handling for initial periods

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data	required
`window_size`	`str`	Size of the rolling window	`'24h'`
`log_dist`	`bool`	Whether to apply log transformation to data	`False`
`iqr_factor`	`float`	The factor by which to multiply the IQR	`5`
`min_periods`	`int`	Minimum number of observations required in window	`5`

Returns:

Type	Description
`DataFrame`	Cleaned DataFrame with outliers masked as NaN

time_aware_std_QC

time_aware_std_QC(df: DataFrame, time_window: str = '6h', std_factor: float = 3.0, min_periods: int = 4) -> DataFrame

Time-aware outlier detection using rolling standard deviation

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data	required
`time_window`	`str`	Rolling window size	`'6h'`
`std_factor`	`float`	Standard deviation multiplier (e.g., 3 means 3σ)	`3.0`
`min_periods`	`int`	Minimum number of observations required in window	`4`

Returns:

Type	Description
`DataFrame`	Quality controlled DataFrame with outliers marked as NaN

bidirectional_trend_std_QC `classmethod`

bidirectional_trend_std_QC(df: DataFrame, window_size: str = '6h', std_factor: float = 3.0, trend_window: str = '30min', trend_factor: float = 2, min_periods: int = 4) -> Series

Perform quality control using standard deviation with awareness of both upward and downward trends.

This method identifies outliers considering both upward and downward trends in the data, applying more lenient criteria when consistent trends are detected.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data frame with time series (QC_Flag column is now optional)	required
`window_size`	`str`	Size of the rolling window for std calculation	`'6h'`
`std_factor`	`float`	Base factor for standard deviation threshold	`3.0`
`trend_window`	`str`	Window for trend detection	`'30min'`
`trend_factor`	`float`	Factor to increase std_factor when trends are detected	`2`
`min_periods`	`int`	Minimum number of observations in window	`4`

Returns:

Type	Description
`Series`	Boolean mask where True indicates outliers
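One plausible reading of the description above is sketched here: a point is an outlier when it deviates from the rolling mean by more than `std_factor` rolling standard deviations, and that threshold is multiplied by `trend_factor` wherever every recent step moved in the same direction. This is an interpretation written for illustration, not the library's algorithm; `trend_aware_sketch` is a hypothetical name, and both the choice of the rolling mean as the centre and the "all steps share a sign" trend test are assumptions.

```python
import numpy as np
import pandas as pd


def trend_aware_sketch(s: pd.Series, window_size: str = "6h", std_factor: float = 3.0,
                       trend_window: str = "30min", trend_factor: float = 2.0,
                       min_periods: int = 4) -> pd.Series:
    """Boolean mask, True where a point exceeds the (trend-relaxed) std threshold."""
    roll = s.rolling(window_size, min_periods=min_periods)
    center, spread = roll.mean(), roll.std()

    # A consistent trend: all steps inside trend_window are rising, or all falling
    step_sign = np.sign(s.diff())
    recent = step_sign.rolling(trend_window, min_periods=2)
    trending = (recent.min() > 0) | (recent.max() < 0)

    factor = np.where(trending, std_factor * trend_factor, std_factor)
    return (s - center).abs() > factor * spread


index = pd.date_range("2024-01-01", periods=40, freq=pd.Timedelta(minutes=10))
values = [10.0 if i % 2 == 0 else 10.2 for i in range(40)]  # small alternating noise
values[35] = 50.0                                           # isolated spike
series = pd.Series(values, index=index)

mask = trend_aware_sketch(series)
mask_no_relaxation = trend_aware_sketch(series, trend_factor=1.0)
```

With `trend_factor` at 1.0 the trend test has no effect, so for any `trend_factor` of at least 1 the relaxed mask can only flag a subset of what the unrelaxed mask flags.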

filter_error_status `staticmethod`

filter_error_status(_df, error_codes=None, special_codes=None, return_mask=True, status_column='Status', status_type='bitwise', ok_value=None, ignored_values=None)

Filter data based on error status codes.

Parameters:

Name	Type	Description	Default
`_df`	`DataFrame`	Input DataFrame	required
`error_codes`	`list or array - like`	Codes indicating errors (for 'bitwise' type)	`None`
`special_codes`	`list or array - like`	Special codes to handle differently (exact match)	`None`
`return_mask`	`bool`	If True, returns a boolean mask where True indicates errors; If False, returns filtered DataFrame	`True`
`status_column`	`str`	Name of the status column in DataFrame	`'Status'`
`status_type`	`str`	Type of status check: - 'bitwise': Use bitwise AND to check error codes (AE33, AE43, BC1054, MA350, TEOM) - 'numeric': Check if status != ok_value (Aurora, NEPH) - 'text': Check if status != ok_value as string (SMPS) - 'binary_string': Parse binary string and check if > 0 (APS)	`'bitwise'`
`ok_value`	`any`	The value indicating OK status (for 'numeric', 'text' types) - For 'numeric': typically 0 - For 'text': typically 'Normal Scan'	`None`
`ignored_values`	`list`	Whitelist of statuses to suppress (treat as OK) without editing the raw files — e.g. an operator-known benign warning. Interpretation is mode-specific; entries that don't fit a mode are silently skipped so a whitelist meant for one instrument is harmless if it reaches another. Defaults to None (no whitelist; behaviour unchanged). `'text'` : string tokens. The status is comma-split and a row passes when every token is `ok_value` or whitelisted. A token `'Low aerosol flow'` matches both the bare string and combined statuses like `'Low aerosol flow,Neutralizer not active'` (when both tokens are whitelisted). `'numeric'` : numeric status codes treated as OK in addition to `ok_value` (e.g. `[4, 16]`). `'bitwise'` : integer error codes/bits dropped from the error definition; a row is flagged only if a NON-whitelisted code still matches (token-level, mirroring text mode; e.g. `[4]`). `'binary_string'` : integer bit masks cleared before testing; a row is flagged only if a NON-whitelisted bit remains set (e.g. `[1, 2]`).	`None`

Returns:

Type	Description
`Union[DataFrame, Series]`	If return_mask=True: boolean Series with True for error points If return_mask=False: Filtered DataFrame with error points masked

spike_detection `classmethod`

spike_detection(df: DataFrame, max_change_rate: float = 3.0, min_abs_change: float = None) -> Series

Vectorized spike detection using change rate analysis.

Detects sudden unreasonable value changes while allowing legitimate gradual changes during events (pollution episodes, etc.).

This method is much faster than rolling window methods because it uses pure numpy vectorized operations.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data frame with time series	required
`max_change_rate`	`float`	Maximum allowed ratio of current change to median absolute change. Higher values = more permissive. A value of 3.0 means a change must be 3x larger than the median change to be flagged.	`3.0`
`min_abs_change`	`float`	Minimum absolute change required to be considered a spike. If None, uses 10% of the data's standard deviation.	`None`

Returns:

Type	Description
`Series`	Boolean mask where True indicates detected spikes

Notes

The algorithm:

1. Calculate the absolute difference between consecutive points
2. Calculate the median absolute change (robust baseline)
3. Flag points where change > max_change_rate * median_change
4. Also detect "reversals" (spike up then immediately down)

This approach allows gradual changes during events while catching sudden spikes that are likely instrument errors.
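The four steps can be sketched with NumPy as follows. This is an illustration under assumptions rather than the library code: `spike_sketch` is a hypothetical name, a reversal is taken to mean a large step immediately followed by a large step of opposite sign, and the two masks are returned separately because this page does not say how the method combines them.

```python
import numpy as np
import pandas as pd


def spike_sketch(s: pd.Series, max_change_rate: float = 3.0, min_abs_change=None) -> pd.DataFrame:
    """Return 'jump' and 'reversal' boolean columns for a single series."""
    values = s.to_numpy(dtype=float)
    diff = np.diff(values, prepend=np.nan)   # step 1: change from the previous point
    change = np.abs(diff)
    median_change = np.nanmedian(change)     # step 2: robust baseline
    if min_abs_change is None:
        min_abs_change = 0.1 * np.nanstd(values)  # documented default: 10% of the std

    # step 3: steps that are large relative to the baseline and in absolute terms
    jump = (change > max_change_rate * median_change) & (change > min_abs_change)

    # step 4: a large step followed at once by a large step in the other direction
    next_diff = np.append(diff[1:], np.nan)
    next_jump = np.append(jump[1:], False)
    reversal = jump & next_jump & (np.sign(diff) == -np.sign(next_diff))

    return pd.DataFrame({"jump": jump, "reversal": reversal}, index=s.index)


spiky = pd.Series([10, 11, 10, 11, 10, 60, 11, 10, 11, 10], dtype=float)
shifted = pd.Series([10, 11, 10, 11, 30, 31, 30, 31], dtype=float)

spiky_result = spike_sketch(spiky)      # jump at positions 5 and 6, reversal at 5
shifted_result = spike_sketch(shifted)  # jump at position 4, no reversal
```

The change-rate test alone also marks the point where the signal returns to normal (position 6), whereas the reversal test isolates the spike itself and ignores a sustained level shift.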

Examples:

>>> qc = QualityControl()
>>> spike_mask = qc.spike_detection(df, max_change_rate=3.0)

hourly_completeness_QC `classmethod`

hourly_completeness_QC(df: DataFrame, freq: str, threshold: float = 0.5) -> Series

Check whether each clock hour holds enough data to be representative.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input data frame with time series	required
`freq`	`str`	Data frequency (e.g. '6min')	required
`threshold`	`float`	Minimum required proportion of the points that hour could hold (0-1)	`0.5`

Returns:

Type	Description
`Series`	Boolean mask where True indicates insufficient data

Notes

This is a statement about representativeness, not validity: the readings in a sparse hour are perfectly good measurements, it is an average over that hour that would misrepresent it. Readers therefore raise it at severity='warning' — see QCRule.

The expectation is scaled by how much of each hour the data actually spans, which matters at the two ends of every read. An hour is compared against the points it could have held given the coverage, not against a full hour it never had the chance to fill: a read starting at 10:54 has six minutes in the 10 o'clock hour, so a full-hour expectation condemns it no matter how perfectly the instrument ran. That is what made short reads unusable — a 22-minute file had every row flagged — while leaving multi-day reads almost untouched, since the effect is always exactly two hours out of however many.

Interior hours are unaffected: they overlap the coverage completely, so their expectation is the full hour and a genuine outage is still caught.
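The coverage scaling described in these notes comes down to a little interval arithmetic, sketched below with the standard library. The helper names are hypothetical, and treating the read as covering the span from its first timestamp to one sampling interval past its last timestamp is an assumption made for this example. The numbers reproduce the case from the text: a 6-minute instrument whose read starts at 10:54.

```python
from datetime import datetime, timedelta


def expected_points(hour_start: datetime, first: datetime, last: datetime,
                    freq: timedelta) -> float:
    """Points a clock hour could hold, given that the read covers [first, last + freq)."""
    hour_end = hour_start + timedelta(hours=1)
    covered_start = max(hour_start, first)
    covered_end = min(hour_end, last + freq)
    covered = max(covered_end - covered_start, timedelta(0))
    return covered / freq


def is_insufficient(observed: int, expected: float, threshold: float = 0.5) -> bool:
    """True when the hour holds less than `threshold` of the points it could hold."""
    return expected > 0 and (observed / expected) < threshold


freq = timedelta(minutes=6)
first = datetime(2024, 1, 1, 10, 54)  # samples at 10:54, 11:00, 11:06, 11:12
last = datetime(2024, 1, 1, 11, 12)
hour_10 = datetime(2024, 1, 1, 10)
hour_11 = datetime(2024, 1, 1, 11)

scaled_10 = expected_points(hour_10, first, last, freq)  # six covered minutes -> 1.0
scaled_11 = expected_points(hour_11, first, last, freq)  # eighteen covered minutes -> 3.0
full_hour = timedelta(hours=1) / freq                    # 10.0 points in a complete hour

flag_scaled = is_insufficient(1, scaled_10)  # 1 of 1 possible points: sufficient
flag_full = is_insufficient(1, full_hour)    # 1 of 10 points: flagged
```

Against a full-hour expectation the single 10:54 sample reaches only 10% and is flagged; against the coverage-scaled expectation it is 100% complete.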

Related Documentation

AbstractReader - Base class that uses QualityControl methods
RawDataReader - Factory function with built-in QC options
Instrument Documentation - Instrument-specific QC procedures
