AbstractReader

The AbstractReader class is the foundation of AeroViz's data reading system, providing a standardized interface for reading and processing aerosol instrument data.

Core Architecture

AbstractReader serves as the base class for all instrument-specific readers in AeroViz. It defines the common interface and provides shared functionality for data processing, quality control, and output formatting.

Overview

The AbstractReader implements a consistent workflow for all aerosol instruments:

Data Ingestion - Read raw instrument files
Format Detection - Automatically identify data structure
Quality Control - Apply built-in validation and filtering
Standardization - Convert to unified output format
Metadata Handling - Preserve instrument and measurement metadata

Usage Pattern

While you can use AbstractReader directly, it's typically accessed through the RawDataReader factory function which automatically selects the appropriate reader based on your instrument type.

Key Features

Flexible Input Handling - Supports various file formats and structures
Built-in Quality Control - Configurable data validation and filtering
Metadata Preservation - Maintains instrument configuration and measurement context
Extensible Design - Easy to subclass for new instruments
Error Handling - Robust error reporting and recovery

Implementation Note

AbstractReader is an abstract base class. For actual data reading, use instrument-specific implementations or the RawDataReader factory function.

API Reference

AeroViz.rawDataReader.core.AbstractReader

AbstractReader(path: Path | str, reset: bool | str = False, qc: bool | str = True, **kwargs)

Bases: ABC

Abstract class for reading raw data from different instruments.

This class serves as a base class for reading raw data from various instruments. Each instrument should have a separate class that inherits from this class and implements the abstract methods. The abstract methods are _raw_reader and _QC.

The class handles file management, including reading from and writing to pickle files, and implements quality control measures. It can process data in both batch and streaming modes.

Attributes:

Name	Type	Description
`nam`	`str`	Name identifier for the reader class
`path`	`Path`	Path to the raw data files
`meta`	`dict`	Metadata configuration for the instrument
`logger`	`ReaderLogger`	Custom logger instance for the reader
`reset`	`bool`	Flag to indicate whether to reset existing processed data
`append`	`bool`	Flag to indicate whether to append new data to existing processed data
`qc`	`bool or str`	Quality control settings
`qc_freq`	`str or None`	Frequency for quality control calculations

Initialize the AbstractReader.

Parameters:

Name	Type	Description	Default
`path`	`Path or str`	Path to the directory containing raw data files	required
`reset`	`bool or str`	If True, forces re-reading of raw data. If 'append', appends new data to existing processed data.	`False`
`qc`	`bool or str`	If True, performs quality control. If a str, specifies the frequency for QC calculations.	`True`
`**kwargs`	`dict`	Additional keyword arguments. `raw_freq` (str): override the raw data frequency (e.g., '6min', '1h'); if not set, the frequency is auto-inferred from the data. `drop_outlier_dates` (bool, default False): stray timestamps far outside the data's bulk (e.g. a year-2000 row in 2023 data) are always detected and warned about, since they balloon the native grid; by default they are kept (the warning explains how to fix the source), set True to drop them automatically before the grid is built. `log_level` (str): logging level for the log file. `quiet` (bool): if True, suppresses all console output.	`{}`

Notes

Creates necessary output directories and initializes logging system. Sets up paths for pickle files, CSV files, and report outputs.
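The way `reset` and `qc` fan out into separate attributes is easy to misread, so here is a minimal sketch of it. The helper name `resolve_flags` is hypothetical; the four expressions inside it are copied from the attribute assignments listed under Attributes on this page.

```python
def resolve_flags(reset=False, qc=True):
    """Hypothetical helper mirroring the attribute assignments on this page."""
    return {
        'append': reset == 'append',   # only the literal string 'append' enables append mode
        'reset': reset is True,        # only the boolean True forces a re-read
        'qc': qc,                      # stored unchanged
        'qc_freq': qc if isinstance(qc, str) else None,  # a string doubles as the QC frequency
    }


defaults = resolve_flags()
appending = resolve_flags(reset='append', qc='1MS')
print(defaults)
print(appending)
```

Under these assignments `reset='append'` leaves `reset` itself False, so append mode and a forced re-read are mutually exclusive.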

Attributes

_freq_mixed `instance-attribute`

_freq_mixed = False

_n_files `instance-attribute`

_n_files = None

_output_folder `instance-attribute`

_output_folder = output_folder

_output_prefix `instance-attribute`

_output_prefix = kwargs.get('output_prefix') or f'output_{self.nam.lower()}'

_qc_summary `class-attribute` `instance-attribute`

_qc_summary = None

_resolved_freq `instance-attribute`

_resolved_freq = None

append `instance-attribute`

append = reset == 'append'

csv_nam `instance-attribute`

csv_nam = output_folder / f'_read_{self.nam.lower()}_qc.csv'

csv_nam_raw `instance-attribute`

csv_nam_raw = output_folder / f'_read_{self.nam.lower()}_raw.csv'

csv_out `instance-attribute`

csv_out = output_folder / f'{self._output_prefix}.csv'

fill_missing `instance-attribute`

fill_missing = kwargs.get('fill_missing', True)

kwargs `instance-attribute`

kwargs = kwargs

logger `instance-attribute`

logger = ReaderLogger(self.nam, output_folder, kwargs.get('log_level', 'INFO').upper(), quiet=self.quiet)

meta `instance-attribute`

meta = meta[self.nam]

nam `class-attribute` `instance-attribute`

nam = 'AbstractReader'

overall_rates `instance-attribute`

overall_rates = None

path `instance-attribute`

path = Path(path)

pkl_nam `instance-attribute`

pkl_nam = output_folder / f'_read_{self.nam.lower()}_qc.pkl'

pkl_nam_raw `instance-attribute`

pkl_nam_raw = output_folder / f'_read_{self.nam.lower()}_raw.pkl'

qc `instance-attribute`

qc = qc

qc_freq `instance-attribute`

qc_freq = qc if isinstance(qc, str) else None

qc_severity_overrides `instance-attribute`

qc_severity_overrides = dict(kwargs.get('flag_severity') or {})

quiet `instance-attribute`

quiet = kwargs.get('quiet', False)

raw_freq `instance-attribute`

raw_freq = kwargs.get('raw_freq', None)

report_dict `instance-attribute`

report_dict = {}

report_out `instance-attribute`

report_out = output_folder / 'report.json'

reset `instance-attribute`

reset = reset is True

save_intermediate_csv `instance-attribute`

save_intermediate_csv = kwargs.get('save_intermediate_csv', True)

save_pkl `instance-attribute`

save_pkl = kwargs.get('save_pkl', True)

save_report `instance-attribute`

save_report = kwargs.get('save_report', True)

Methods:

QC_control `staticmethod`

QC_control()

_QC `abstractmethod`

_QC(df: DataFrame) -> DataFrame

Abstract method for quality control processing.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame containing raw data	required

Returns:

Type	Description
`DataFrame`	Quality controlled data with QC_Flag column

Notes

Must be implemented by child classes to handle instrument-specific QC. This method should only check raw data quality (status, range, completeness). Derived parameter validation should be done in _process().

__call__

__call__(start: datetime = None, end: datetime = None, mean_freq: str = None) -> DataFrame

Process data for a specified time range.

Parameters:

Name	Type	Description	Default
`start`	`datetime`	Start time for data processing; defaults to the data's first timestamp	`None`
`end`	`datetime`	End time for data processing; defaults to the data's last timestamp	`None`
`mean_freq`	`str`	Frequency for resampling the output; if None, no resampling is done and the data is returned at its native resolution	`None`

Returns:

Type	Description
`DataFrame`	Processed and resampled data for the specified time range

Notes

The processed data is also saved to a CSV file.

_cache_is_current

_cache_is_current(df: DataFrame) -> bool

True if the cached frame was written by the current cache format.

_flag_outlier_dates

_flag_outlier_dates(df: DataFrame) -> DataFrame

Detect and warn about stray timestamps; drop them only if asked.

A single bad row — e.g. a 2000-01-01 stamp in otherwise-2023 data — stretches the canonical native grid (built over the data's own min->max in _read_raw_files, before any requested range applies) across the whole bogus span, inflating the cached frame to millions of NaN rows even when the caller only asked for 2023. Such stamps are almost always a source-data error, so by default we warn and tell the user how to fix it rather than silently changing their data; pass drop_outlier_dates=True to have them excluded automatically.
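To make the grid inflation concrete, the following standard-library sketch flags stray timestamps and compares the hourly grid size with and without them. The detection rule (distance from the median timestamp, 366-day threshold) and the name `find_outlier_dates` are assumptions for illustration; the page does not state the criterion AeroViz actually uses.

```python
from datetime import datetime, timedelta
from statistics import median

EPOCH = datetime(1970, 1, 1)


def find_outlier_dates(stamps, max_gap=timedelta(days=366)):
    """Return timestamps further than max_gap from the median timestamp.

    Both the median-distance rule and the threshold are assumptions of
    this sketch, not the library's documented behaviour.
    """
    if not stamps:
        return []
    seconds = [(s - EPOCH).total_seconds() for s in stamps]
    centre = median(seconds)
    limit = max_gap.total_seconds()
    return [s for s, sec in zip(stamps, seconds) if abs(sec - centre) > limit]


def hourly_grid_rows(stamps):
    """Number of rows an hourly grid needs to span min..max of the stamps."""
    return int((max(stamps) - min(stamps)) / timedelta(hours=1)) + 1


stamps = [datetime(2023, 1, 1) + timedelta(hours=h) for h in range(48)]
stamps.append(datetime(2000, 1, 1))          # one bad row in 2023 data

stray = find_outlier_dates(stamps)
kept = [s for s in stamps if s not in stray]
rows_with_stray = hourly_grid_rows(stamps)
rows_without_stray = hourly_grid_rows(kept)
print(stray, rows_with_stray, rows_without_stray)
```

Two days of hourly data need 48 grid rows; the single year-2000 stamp stretches the same grid to more than 200,000 rows.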

_generate_report

_generate_report(raw_data, qc_data, qc_flag=None) -> None

Calculate and log data quality rates for different time periods.

Parameters:

Name	Type	Description	Default
`raw_data`	`DataFrame`	Raw data before quality control	required
`qc_data`	`DataFrame`	Data after quality control	required
`qc_flag`	`Series`	QC flag series indicating validity of each row	`None`

Notes

Calculates rates for specified QC frequency if set. Updates the quality report with calculated rates.

_load_or_parse

_load_or_parse() -> tuple[DataFrame, DataFrame]

Return the canonical parsed (raw, qc) frames, using the pkl cache when it exists and is current.

Canonical = snapped to the native grid over the files' own coverage, NOT padded to any requested range. Parse provenance (n_files, raw_freq, freq_mixed) is persisted in df.attrs so a cache hit restores it onto self. The requested range / fill_missing is applied later, in _run.

_outlier_process

_outlier_process(_df)

Process outliers in the data.

Parameters:

Name	Type	Description	Default
`_df`	`DataFrame`	Input DataFrame containing potential outliers	required

Returns:

Type	Description
`DataFrame`	DataFrame with outliers processed

Notes

Implementation depends on specific instrument requirements.

_partition_compatible_scans

_partition_compatible_scans(df_list: list, files: list) -> list

Drop frames whose scan schema differs from the dominant group.

Default is a no-op — overridden by readers (currently SMPS) where the same instrument can export at different size-bin grids depending on the host software version (AIM 10.3 .TXT vs AIM 11.x .CSV). The outer join inside pd.concat happily concatenates frames with disjoint columns, but the NaN holes break per-bin completeness QC. The well-defined repair is to treat each grid as its own scan: keep the majority group, drop the minority and tell the user which files were skipped so they can re-run them in isolation if they want both.

df_list and files are aligned and contain only successfully parsed entries.
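The "keep the majority group" repair can be sketched without pandas as follows. Plain dicts stand in for DataFrames, and treating the sorted column names as the scan schema is an assumption of this sketch; the function name `partition_by_schema` and the file names are hypothetical.

```python
from collections import Counter


def partition_by_schema(frames, files):
    """Keep the entries whose column set matches the most common one.

    frames: list of dicts mapping column name -> values (DataFrame stand-ins)
    files:  list of file names aligned with frames
    Returns (kept_frames, kept_files, skipped_files).
    """
    if not frames:
        return [], [], []
    schemas = [tuple(sorted(frame)) for frame in frames]
    dominant, _count = Counter(schemas).most_common(1)[0]
    kept_frames, kept_files, skipped_files = [], [], []
    for frame, name, schema in zip(frames, files, schemas):
        if schema == dominant:
            kept_frames.append(frame)
            kept_files.append(name)
        else:
            skipped_files.append(name)   # reported so the user can re-run them alone
    return kept_frames, kept_files, skipped_files


frames = [
    {'10.0': [1], '20.0': [2]},   # size-bin grid A
    {'11.5': [1], '21.5': [2]},   # size-bin grid B (different export)
    {'10.0': [3], '20.0': [4]},   # size-bin grid A again
]
files = ['scan_a.txt', 'scan_b.csv', 'scan_c.txt']
kept_frames, kept_files, skipped_files = partition_by_schema(frames, files)
print(kept_files, skipped_files)
```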

_process

_process(df: DataFrame) -> DataFrame

Process data to calculate derived parameters.

This method is called after _QC() to calculate instrument-specific derived parameters (e.g., absorption coefficients, AAE, SAE).

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Quality-controlled DataFrame carrying `QC_Flag` and `QC_Invalid`	required

Returns:

Type	Description
`DataFrame`	DataFrame with derived parameters added and the QC columns updated

Notes

Default implementation returns the input unchanged. Override in child classes to implement instrument-specific processing.

The method should:

Calculate derived parameters for every row, flagged or not.
Judge the derived parameters and record the verdict with update_qc_flag(df, mask, name, severity=...).
Log the combined summary via extend_qc_summary + log_qc_summary when it added a rule of its own.

Do not skip rows that _QC already flagged. An earlier version of this docstring offered that as an optimisation; it is wrong twice over. L2's contract (rule R2 in docs/guide/data-levels.md) is to judge without destroying, so a derived value belongs in _read_*_qc.csv whatever the verdict — that file is where a user goes to see why a row was rejected, and a NaN there answers nothing. And since severity arrived, a flagged row is not necessarily an invalid one: skipping "flagged" rows would silently drop derived values for rows that are kept.

Rules are independent and deliberately overlap — one row can be both Status Error and Invalid AAE — so the per-rule counts in a QC summary do not sum to the total. Valid and Usable are the totals that mean something.
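A small worked example shows why per-rule counts cannot be summed. Representing each row's flags as a Python list is an assumption made for the sketch; the rule names are the two used in the text above.

```python
from collections import Counter

# Flags per row; an empty list means the row passed every rule.
rows = [
    [],
    ['Status Error'],
    ['Status Error', 'Invalid AAE'],   # one row, two rules
    ['Invalid AAE'],
]

per_rule = Counter(flag for flags in rows for flag in flags)
flagged_rows = sum(1 for flags in rows if flags)
valid_rows = len(rows) - flagged_rows

print(dict(per_rule))                                   # each rule counted on its own
print(sum(per_rule.values()), flagged_rows, valid_rows)
```

The per-rule counts add up to four although only three rows are flagged, because the third row is counted under both rules.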

_raw_reader `abstractmethod`

_raw_reader(file)

Abstract method to read raw data files.

Parameters:

Name	Type	Description	Default
`file`	`Path or str`	Path to the raw data file	required

Returns:

Type	Description
`DataFrame`	Raw data read from the file

Notes

Must be implemented by child classes to handle specific file formats.

_read_raw_files

_read_raw_files() -> tuple[DataFrame | None, DataFrame | None]

Read and process raw data files.

Returns:

Type	Description
`tuple[DataFrame \| None, DataFrame \| None]`	Tuple of (raw data DataFrame or None, quality-controlled DataFrame or None)

Notes

Handles file reading and initial processing.

_resample

_resample(df: DataFrame, mean_freq: str | None) -> DataFrame

Average df onto mean_freq; a no-op when none was requested.

mean() silently drops non-numeric columns, which is how text metadata (a status string, an instrument ID) vanishes between the native-resolution and resampled outputs. That is the right behaviour — there is no sensible mean of a status string — but it should not be silent, so the dropped columns are named in the log once.
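The idea of averaging the numeric columns while naming the ones that were left out can be sketched in plain Python. This is a pandas-free illustration, not the library's code: `hourly_mean`, the record layout, and the column names `BC` and `Status` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from numbers import Number


def hourly_mean(records):
    """Average numeric fields per hour and report the columns that were dropped.

    records: list of (timestamp, {column: value}) pairs.
    Returns (means_by_hour, sorted_list_of_dropped_columns).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    dropped = set()
    for stamp, row in records:
        hour = stamp.replace(minute=0, second=0, microsecond=0)
        for column, value in row.items():
            if isinstance(value, Number) and not isinstance(value, bool):
                buckets[hour][column].append(value)
            else:
                dropped.add(column)          # no sensible mean for text
    means = {
        hour: {column: sum(vals) / len(vals) for column, vals in columns.items()}
        for hour, columns in sorted(buckets.items())
    }
    return means, sorted(dropped)


records = [
    (datetime(2023, 1, 1, 0, 10), {'BC': 1.0, 'Status': 'OK'}),
    (datetime(2023, 1, 1, 0, 40), {'BC': 3.0, 'Status': 'OK'}),
    (datetime(2023, 1, 1, 1, 5), {'BC': 5.0, 'Status': 'WARN'}),
]
means, dropped = hourly_mean(records)
print(means)
print('dropped non-numeric columns:', dropped)   # named once, not silently lost
```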

_restore_parse_meta

_restore_parse_meta(df: DataFrame) -> None

Pull parse provenance off a cached frame back onto self (cache hit).

_run

_run(user_start, user_end)

Main execution method for data processing.

Parameters:

Name	Type	Description	Default
`user_start`	`datetime`	Start time for processing	required
`user_end`	`datetime`	End time for processing	required

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	Raw and quality-controlled frames for the requested range.

Notes

Two layers. _load_or_parse returns the canonical parsed frames (from the pkl cache when valid, else by reading the raw files). The presentation step below — grid placement to the requested range, with fill_missing — runs on every call, so a cache hit honours the current call's range/fill_missing instead of replaying whatever was stored. Parse provenance restored by _load_or_parse feeds the df.attrs stamp in __call__.
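The two-layer split can be illustrated with a small standard-library sketch: the expensive parse result is cached once, and the requested range is applied on every call. The function names, the dict used in place of a DataFrame, and the omission of the "cache is current" check are all simplifications assumed for this example.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_parse(cache_file, parse):
    """Layer 1: return the canonical data, from the pickle cache when present."""
    if cache_file.exists():
        with cache_file.open('rb') as fh:
            return pickle.load(fh)
    data = parse()
    with cache_file.open('wb') as fh:
        pickle.dump(data, fh)
    return data


def run(cache_file, parse, start, end):
    """Layer 2: apply the requested range on every call, cache hit or not."""
    canonical = load_or_parse(cache_file, parse)
    return {t: v for t, v in canonical.items() if start <= t <= end}


parse_calls = []


def parse_raw_files():
    parse_calls.append(1)                            # count how often the slow path runs
    return {hour: hour * 10 for hour in range(24)}   # hour -> fake reading


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'canonical.pkl'
    morning = run(cache, parse_raw_files, 6, 8)      # parses and writes the cache
    evening = run(cache, parse_raw_files, 20, 21)    # cache hit, different range

print(len(parse_calls), morning, evening)
```

Because the cache stores the canonical data rather than a sliced result, the second call honours its own range without re-parsing.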

_save_data

_save_data(raw_data: DataFrame, qc_data: DataFrame) -> None

Save processed data to files.

Parameters:

Name	Type	Description	Default
`raw_data`	`DataFrame`	Raw data to save	required
`qc_data`	`DataFrame`	Quality controlled data to save	required

Notes

Saves data in both pickle and CSV formats.

_stamp

_stamp(df: DataFrame, start, end, *, mean_freq=None, with_qc=False) -> DataFrame

Attach reader metadata to df.attrs just before returning.

Always records provenance (instrument, station, coverage, requested range, native frequency). When with_qc is True it additionally records the output frequency and the overall QC rates; the plain raw path (qc=False) gets provenance only.

See core.metadata for why this is the single, final stamping point.

_stamp_parse_meta

_stamp_parse_meta(df: DataFrame) -> None

Persist parse provenance into df.attrs so it survives the pkl cache.

_timeIndex_process

_timeIndex_process(_df, user_start=None, user_end=None, append_df=None)

Process time index of the DataFrame.

Parameters:

Name	Type	Description	Default
`_df`	`DataFrame`	Input DataFrame to process	required
`user_start`	`datetime`	User-specified start time	`None`
`user_end`	`datetime`	User-specified end time	`None`
`append_df`	`DataFrame`	DataFrame to append to	`None`

Returns:

Type	Description
`DataFrame`	DataFrame with processed time index

Notes

Frequency is resolved once per run in _read_raw_files (per-file detection, see self._resolved_freq); this method only places the data on that grid via to_grid — snapping off-grid timestamps to their nearest bin without the duplicate-fill of method='nearest'.
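Snapping to the nearest bin without duplicate-fill can be sketched as below. `to_grid_sketch` is a hypothetical stand-in written for this page, not the library's `to_grid`; the first-sample-wins tie rule is an assumption.

```python
from datetime import datetime, timedelta


def to_grid_sketch(samples, start, end, step):
    """Place (timestamp, value) samples on a regular grid from start to end.

    Each sample is snapped to its nearest bin. Bins that receive no sample
    stay None instead of borrowing a neighbour's value.
    """
    n_bins = int((end - start) / step) + 1
    grid = {start + i * step: None for i in range(n_bins)}
    for stamp, value in samples:
        snapped = start + round((stamp - start) / step) * step
        if snapped in grid and grid[snapped] is None:
            grid[snapped] = value        # assumption: first sample wins per bin
    return grid


start = datetime(2023, 1, 1, 0, 0)
step = timedelta(minutes=6)
samples = [
    (datetime(2023, 1, 1, 0, 0, 20), 1.0),    # 20 s late, snaps to 00:00
    (datetime(2023, 1, 1, 0, 6, 10), 2.0),    # snaps to 00:06
    (datetime(2023, 1, 1, 0, 17, 50), 4.0),   # 10 s early, snaps to 00:18
]
grid = to_grid_sketch(samples, start, start + 3 * step, step)
for stamp, value in grid.items():
    print(stamp.time(), value)
```

The 00:12 bin has no sample within reach and stays empty, whereas a nearest-neighbour reindex would have copied one of the adjacent readings into it.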

check_status_columns

check_status_columns(df: DataFrame, candidates) -> list[str]

Which of candidates are present, warning loudly when none are.

filter_error_status returns all-False for a column it cannot find, so a renamed status column degrades to "this instrument reported no errors, ever" — indistinguishable from a healthy instrument. Vendors do rename it between host-software versions (SMPS AIM 10.3 vs 11.x split the same information across differently-named columns), so the absence has to be visible.

Returns the present names so a caller can OR their masks together.
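A minimal sketch of the pattern, using the standard `logging` module: report which candidate columns exist, warn when none do, and OR the per-column masks together. The function name, the column names, and the "non-zero means error" convention are assumptions for illustration.

```python
import logging

logger = logging.getLogger('reader_sketch')


def present_status_columns(columns, candidates):
    """Return the candidates found in columns, warning when none are."""
    found = [name for name in candidates if name in columns]
    if not found:
        logger.warning(
            'None of the status columns %s were found; '
            'status errors cannot be detected for this file.',
            list(candidates),
        )
    return found


# Column name -> values; a dict stands in for a DataFrame here.
table = {
    'Status Flag': [0, 1, 0],
    'Instrument Errors': [0, 0, 2],
    'Conc': [10.0, 11.0, 12.0],
}
names = present_status_columns(table, ['Status Flag', 'Instrument Errors', 'Error Code'])

# OR the per-column masks: a row is in error if any present column says so.
error_mask = [any(table[name][i] != 0 for name in names) for i in range(3)]
print(names, error_mask)
```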

extend_qc_summary

extend_qc_summary(summary: DataFrame, df: DataFrame, rule: str, mask: Series, description: str = '', severity: str = ERROR) -> DataFrame

Add a _process-stage rule to a _QC summary and refresh the totals.

_QC builds the summary before derived quantities exist, so a rule like Invalid AAE can only be counted later. The new row is inserted above the trailing Valid / Usable totals, and both totals are then recomputed from df's QC columns so they account for the late rule.

Parameters:

Name	Type	Description	Default
`summary`	`DataFrame`	The table returned by `QCFlagBuilder.get_summary`.	required
`df`	`DataFrame`	The frame after `update_qc_flag` applied `rule`.	required
`rule`	`str`	The late rule's name.	required
`mask`	`Series`	Boolean mask of the rows the late rule flags.	required
`description`	`str`	The late rule's description.	`''`
`severity`	`str`	The late rule's severity.	`ERROR`
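The insert-then-recompute step can be sketched with plain lists. The summary layout (`Rule` and `Count` keys), the row layout (`flags` list plus `invalid` boolean), and the name `extend_summary` are assumptions of this sketch; the real table comes from `QCFlagBuilder.get_summary`.

```python
def extend_summary(summary, rows, rule, mask):
    """Insert a late rule above the trailing totals and recompute both totals.

    summary: list of {'Rule', 'Count'} dicts ending with 'Valid' and 'Usable'
    rows:    per-row state after the late rule was applied
    mask:    booleans marking the rows the late rule flagged
    """
    body = [entry for entry in summary if entry['Rule'] not in ('Valid', 'Usable')]
    body.append({'Rule': rule, 'Count': sum(mask)})
    body.append({'Rule': 'Valid', 'Count': sum(1 for r in rows if not r['flags'])})
    body.append({'Rule': 'Usable', 'Count': sum(1 for r in rows if not r['invalid'])})
    return body


# Summary as built before derived quantities existed.
summary = [
    {'Rule': 'Status Error', 'Count': 1},
    {'Rule': 'Insufficient', 'Count': 1},   # advisory rule, keeps its data
    {'Rule': 'Valid', 'Count': 2},
    {'Rule': 'Usable', 'Count': 3},
]
# Row state after the late rule 'Invalid AAE' hit the third row.
rows = [
    {'flags': [], 'invalid': False},
    {'flags': ['Status Error'], 'invalid': True},
    {'flags': ['Invalid AAE'], 'invalid': True},
    {'flags': ['Insufficient'], 'invalid': False},
]
updated = extend_summary(summary, rows, 'Invalid AAE', [False, False, True, False])
for entry in updated:
    print(entry)
```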

log_below_mdl

log_below_mdl(df: DataFrame, mdl: dict, *, top: int = 10) -> DataFrame

Report, per column, how much of it sits below its detection limit.

A value below the MDL is a valid measurement of a low concentration (or a non-detect), not a broken row — so this is a diagnostic, not a QC rule. Deliberately so: any non-Valid flag NaNs the whole row in __call__, and with tens of species/elements per row "any one below MDL" is true almost always, which would delete the dataset. Use this to see which species are near their limits, then decide per analysis.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The frame to inspect (columns not in `mdl` are ignored).	required
`mdl`	`dict`	`{column: limit}`; entries whose limit is `None` are skipped.	required
`top`	`int`	How many of the worst-affected columns to log.	`10`

Returns:

Type	Description
`DataFrame`	One row per column with `below`, `measured` and `percentage`, sorted worst-first. Empty if nothing could be evaluated.
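The diagnostic can be approximated in plain Python as follows. The function name, the dict-of-lists input, the use of `None` for missing values, and the strict "less than" comparison are assumptions; only the `below`, `measured` and `percentage` fields and the worst-first ordering come from the description above.

```python
def below_mdl_table(data, mdl, top=10):
    """Per column, count measured values that sit below the detection limit.

    data: {column: [values]}, with None marking a missing value
    mdl:  {column: limit}; a limit of None skips the column
    """
    rows = []
    for column, limit in mdl.items():
        if limit is None or column not in data:
            continue
        measured = [v for v in data[column] if v is not None]
        if not measured:
            continue
        below = sum(1 for v in measured if v < limit)
        rows.append({
            'column': column,
            'below': below,
            'measured': len(measured),
            'percentage': 100 * below / len(measured),
        })
    rows.sort(key=lambda r: r['percentage'], reverse=True)   # worst first
    for r in rows[:top]:
        print(f"{r['column']}: {r['below']}/{r['measured']} below MDL ({r['percentage']:.1f}%)")
    return rows


data = {'Fe': [0.5, 2.0, 3.0, None], 'Zn': [0.01, 0.02, 0.5], 'Pb': [1.0]}
mdl = {'Fe': 1.0, 'Zn': 0.05, 'Pb': None, 'Cu': 0.1}
report = below_mdl_table(data, mdl)
```

Nothing in `data` is modified: the rows stay in the dataset and the table only informs a later, per-analysis decision.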

log_qc_summary

log_qc_summary(summary: DataFrame) -> None

Log a QCFlagBuilder.get_summary table.

Advisory rules are marked so it is obvious which flags kept their data. Valid (passed everything) and Usable (nothing invalidating) are both reported — they differ by the rows carrying only advisory flags.

progress_reading

progress_reading(files: list) -> Generator

Context manager for tracking file reading progress.

Parameters:

Name	Type	Description	Default
`files`	`list`	List of files to process	required

Yields:

Type	Description
`Progress`	Progress bar object for tracking

Notes

Uses rich library for progress display.

qc_builder

qc_builder() -> QCFlagBuilder

A QCFlagBuilder carrying this run's severity overrides.

Readers should use this instead of QCFlagBuilder() directly so that flag_severity={'Insufficient': 'warning'} reaches their rules, and so a rule that raises is reported through the reader's log.

qc_columns `staticmethod`

qc_columns(df: DataFrame) -> list[str]

The QC bookkeeping columns present in df, in a stable order.

Readers that narrow their output to a fixed column list must carry both of them through — QC_Flag (the record) and QC_Invalid (the verdict the presentation layer masks on). Slicing with a hard-coded + ['QC_Flag'] silently drops the verdict, which would make every flag fatal again.
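The slicing pitfall is easiest to see side by side. In this sketch `qc_columns_sketch` is a hypothetical stand-in that works on a list of column names, and the measurement column names are invented.

```python
QC_COLUMNS = ('QC_Flag', 'QC_Invalid')   # the record and the verdict, in a fixed order


def qc_columns_sketch(columns):
    """Return the QC bookkeeping columns present in columns, in a stable order."""
    return [name for name in QC_COLUMNS if name in columns]


available = ['BC1', 'BC6', 'Status', 'QC_Invalid', 'QC_Flag']
wanted = ['BC1', 'BC6']

hard_coded = wanted + ['QC_Flag']                  # loses QC_Invalid
carried = wanted + qc_columns_sketch(available)    # keeps both

print(hard_coded)
print(carried)
```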

reorder_dataframe_columns `staticmethod`

reorder_dataframe_columns(df, order_lists: list[list], keep_others: bool = False)

Reorder DataFrame columns according to specified lists.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame	required
`order_lists`	`list[list]`	Lists specifying column order	required
`keep_others`	`bool`	If True, keeps unspecified columns at the end	`False`

Returns:

Type	Description
`DataFrame`	DataFrame with reordered columns
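As a rough illustration of the ordering logic, the sketch below works on a list of column names rather than a DataFrame. Skipping requested names that are absent and ignoring repeats are assumptions; the page does not say how the real method treats those cases.

```python
def reorder_columns(columns, order_lists, keep_others=False):
    """Return column names ordered by order_lists, optionally followed by the rest."""
    ordered = []
    for group in order_lists:
        for name in group:
            if name in columns and name not in ordered:   # assumption: skip missing/repeated
                ordered.append(name)
    if keep_others:
        ordered += [name for name in columns if name not in ordered]
    return ordered


columns = ['Status', 'BC6', 'BC1', 'QC_Flag']
strict = reorder_columns(columns, [['BC1', 'BC6'], ['QC_Flag']])
with_rest = reorder_columns(columns, [['BC1', 'BC6'], ['QC_Flag']], keep_others=True)
print(strict)
print(with_rest)
```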

update_qc_flag `staticmethod`

update_qc_flag(df: DataFrame, mask: Series, flag_name: str, severity: str = ERROR) -> DataFrame

Add a flag to QC_Flag for rows matching the mask, after _QC ran.

Used by _process to flag something that can only be judged once derived quantities exist (e.g. Invalid AAE).

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame with QC_Flag column	required
`mask`	`Series`	Boolean mask indicating rows to flag	required
`flag_name`	`str`	Name of the flag to add	required
`severity`	`(error, warning)`	`'error'` also marks the rows invalid, so they are masked in the public output; `'warning'` records the flag only.	`'error'`

Returns:

Type	Description
`DataFrame`	DataFrame with updated `QC_Flag` (and `QC_Invalid` when the flag is invalidating)
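The effect of `severity` on the two QC columns can be sketched with plain dicts. The storage format assumed here (a comma-joined string in `QC_Flag` that starts as 'Valid', and a boolean `QC_Invalid`) is an assumption of this sketch, as is the function name `update_flag`.

```python
def update_flag(rows, mask, flag_name, severity='error'):
    """Append flag_name to QC_Flag on masked rows; 'error' also sets QC_Invalid."""
    for row, hit in zip(rows, mask):
        if not hit:
            continue
        if row['QC_Flag'] == 'Valid':
            row['QC_Flag'] = flag_name
        else:
            row['QC_Flag'] = f"{row['QC_Flag']}, {flag_name}"
        if severity == 'error':
            row['QC_Invalid'] = True     # masked in the public output
    return rows


rows = [{'QC_Flag': 'Valid', 'QC_Invalid': False} for _ in range(3)]
update_flag(rows, [False, True, False], 'Invalid AAE', severity='error')
update_flag(rows, [False, False, True], 'Insufficient', severity='warning')

valid = sum(1 for r in rows if r['QC_Flag'] == 'Valid')   # passed everything
usable = sum(1 for r in rows if not r['QC_Invalid'])      # nothing invalidating
print(rows)
print(valid, usable)
```

The warning-severity row keeps its data, which is why Usable is one higher than Valid in this example.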

Related Documentation

RawDataReader Factory - High-level interface for instrument data reading
Quality Control - Data validation and filtering options
Supported Instruments - Available instrument implementations

Quick Example

from AeroViz import RawDataReader
from datetime import datetime

# Using the factory function (recommended)
data = RawDataReader(
    instrument='AE33',
    path='/path/to/data',
    start=datetime(2024, 1, 1),
    end=datetime(2024, 12, 31)
)

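# Direct usage (advanced - for custom implementations)
import pandas as pd
from AeroViz.rawDataReader.core import AbstractReader


class MyInstrumentReader(AbstractReader):
    nam = 'MyInstrument'

    def _raw_reader(self, file):
        # Custom file reading logic; a plain CSV parse is shown as a placeholder
        return pd.read_csv(file, parse_dates=[0], index_col=0)

    def _QC(self, df):
        # Custom QC logic; a real implementation returns the frame with a QC_Flag column
        return df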
