AbstractReader

The AbstractReader class is the foundation of AeroViz's data reading system, providing a standardized interface for reading and processing aerosol instrument data.

Core Architecture

AbstractReader serves as the base class for all instrument-specific readers in AeroViz. It defines the common interface and provides shared functionality for data processing, quality control, and output formatting.

Overview

The AbstractReader implements a consistent workflow for all aerosol instruments:

  1. Data Ingestion - Read raw instrument files
  2. Format Detection - Automatically identify data structure
  3. Quality Control - Apply built-in validation and filtering
  4. Standardization - Convert to unified output format
  5. Metadata Handling - Preserve instrument and measurement metadata
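
The five steps above follow a template-method shape: the base class fixes the order of operations and each subclass supplies the instrument-specific parts. The sketch below illustrates that shape using only the standard library. `SketchReader`, `CsvLineReader`, and the list-of-dicts rows are hypothetical stand-ins, not AeroViz classes; the real reader works on pandas DataFrames.

```python
from abc import ABC, abstractmethod


class SketchReader(ABC):
    """Hypothetical stand-in for a reader base class (not the AeroViz API)."""

    nam = 'SketchReader'

    def read(self, lines):
        rows = self._raw_reader(lines)   # steps 1-2: ingest and parse the raw text
        rows = self._QC(rows)            # step 3: quality control
        # steps 4-5: one output structure, with metadata kept alongside the data
        return {'instrument': self.nam, 'rows': rows}

    @abstractmethod
    def _raw_reader(self, lines):
        """Parse raw text lines into a list of row dicts."""

    @abstractmethod
    def _QC(self, rows):
        """Return the rows with a 'QC_Flag' key added."""


class CsvLineReader(SketchReader):
    nam = 'CsvLine'

    def _raw_reader(self, lines):
        return [{'time': t, 'value': float(v)}
                for t, v in (line.split(',') for line in lines)]

    def _QC(self, rows):
        # Flag negative readings; everything else counts as valid.
        return [{**r, 'QC_Flag': 'Valid' if r['value'] >= 0 else 'Invalid'}
                for r in rows]


result = CsvLineReader().read(['2024-01-01 00:00,1.5', '2024-01-01 00:01,-9'])
print([r['QC_Flag'] for r in result['rows']])
```

Because `read` lives in the base class, a new instrument only needs the two hook methods, which is the same division of labour the abstract `_raw_reader` and `_QC` methods describe.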

Usage Pattern

AbstractReader is typically accessed through the RawDataReader factory function, which automatically selects the appropriate reader based on your instrument type. Because AbstractReader is abstract, using it directly means subclassing it for a custom instrument.

Key Features

  • Flexible Input Handling - Supports various file formats and structures
  • Built-in Quality Control - Configurable data validation and filtering
  • Metadata Preservation - Maintains instrument configuration and measurement context
  • Extensible Design - Easy to subclass for new instruments
  • Error Handling - Robust error reporting and recovery

Implementation Note

AbstractReader is an abstract base class. For actual data reading, use instrument-specific implementations or the RawDataReader factory function.

API Reference

AeroViz.rawDataReader.core.AbstractReader

AbstractReader(path: Path | str, reset: bool | str = False, qc: bool | str = True, **kwargs)

Bases: ABC

Abstract class for reading raw data from different instruments.

This class serves as a base class for reading raw data from various instruments. Each instrument should have a separate class that inherits from this class and implements the abstract methods. The abstract methods are _raw_reader and _QC.

The class handles file management, including reading from and writing to pickle files, and implements quality control measures. It can process data in both batch and streaming modes.

Attributes:

Name Type Description
nam str

Name identifier for the reader class

path Path

Path to the raw data files

meta dict

Metadata configuration for the instrument

logger ReaderLogger

Custom logger instance for the reader

reset bool

Flag to indicate whether to reset existing processed data

append bool

Flag to indicate whether to append new data to existing processed data

qc bool or str

Quality control settings

qc_freq str or None

Frequency for quality control calculations

Initialize the AbstractReader.

Parameters:

Name Type Description Default
path Path or str

Path to the directory containing raw data files

required
reset bool or str

If True, forces re-reading of raw data. If 'append', appends new data to existing processed data.

False
qc bool or str

If True, performs quality control. If a str, specifies the frequency for QC calculations.

True
**kwargs dict

Additional keyword arguments:

  • raw_freq (str) - Override raw data frequency (e.g., '6min', '1h'). If not set, the frequency is auto-inferred from the data.
  • drop_outlier_dates (bool, default False) - Stray timestamps far outside the data's bulk (e.g. a year-2000 row in 2023 data) are always detected and warned about, since they balloon the native grid. By default they are kept (the warning explains how to fix the source); set True to drop them automatically before the grid is built.
  • log_level (str) - Logging level for the log file.
  • quiet (bool) - If True, suppresses all console output.

{}
Notes

Creates necessary output directories and initializes logging system. Sets up paths for pickle files, CSV files, and report outputs.

Attributes

_freq_mixed instance-attribute
_freq_mixed = False
_n_files instance-attribute
_n_files = None
_output_folder instance-attribute
_output_folder = output_folder
_output_prefix instance-attribute
_output_prefix = get('output_prefix') or f'output_{lower()}'
_resolved_freq instance-attribute
_resolved_freq = None
append instance-attribute
append = reset == 'append'
csv_nam instance-attribute
csv_nam = output_folder / f'_read_{lower()}_qc.csv'
csv_nam_raw instance-attribute
csv_nam_raw = output_folder / f'_read_{lower()}_raw.csv'
csv_out instance-attribute
csv_out = output_folder / f'{_output_prefix}.csv'
fill_missing instance-attribute
fill_missing = get('fill_missing', True)
kwargs instance-attribute
kwargs = kwargs
logger instance-attribute
logger = ReaderLogger(nam, output_folder, upper(), quiet=quiet)
meta instance-attribute
meta = meta[nam]
nam class-attribute instance-attribute
nam = 'AbstractReader'
overall_rates instance-attribute
overall_rates = None
path instance-attribute
path = Path(path)
pkl_nam instance-attribute
pkl_nam = output_folder / f'_read_{lower()}_qc.pkl'
pkl_nam_raw instance-attribute
pkl_nam_raw = output_folder / f'_read_{lower()}_raw.pkl'
qc instance-attribute
qc = qc
qc_freq instance-attribute
qc_freq = qc if isinstance(qc, str) else None
quiet instance-attribute
quiet = get('quiet', False)
raw_freq instance-attribute
raw_freq = get('raw_freq', None)
report_out instance-attribute
report_out = output_folder / 'report.json'
reset instance-attribute
reset = reset is True
save_intermediate_csv instance-attribute
save_intermediate_csv = get('save_intermediate_csv', True)
save_pkl instance-attribute
save_pkl = get('save_pkl', True)
save_report instance-attribute
save_report = get('save_report', True)
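
The assignments above show how the constructor arguments are normalized into flags and output paths. Below is a minimal sketch of that logic as a standalone function. It assumes `lower()` refers to the lowercased reader name and takes `output_folder` as given, since how AeroViz derives it is not shown here. `resolve_settings` is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def resolve_settings(nam, output_folder, reset=False, qc=True, **kwargs):
    """Hypothetical helper mirroring the attribute assignments listed above."""
    output_folder = Path(output_folder)
    name = nam.lower()  # assumption: lower() is applied to the reader name
    prefix = kwargs.get('output_prefix') or f'output_{name}'
    return {
        'append': reset == 'append',   # only the string 'append' enables append mode
        'reset': reset is True,        # only the literal True forces a re-read
        'qc': qc,
        'qc_freq': qc if isinstance(qc, str) else None,
        'pkl_nam': output_folder / f'_read_{name}_qc.pkl',
        'csv_nam': output_folder / f'_read_{name}_qc.csv',
        'csv_out': output_folder / f'{prefix}.csv',
    }


# '1h' is an arbitrary frequency string chosen for the demonstration.
settings = resolve_settings('AE33', 'out', reset='append', qc='1h')
print(settings['append'], settings['reset'], settings['qc_freq'])
print(settings['csv_out'].name)
```

One consequence worth noting: `reset='append'` leaves `reset` False and sets `append` True, so the two modes never overlap.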

Functions

QC_control staticmethod
QC_control()
_QC abstractmethod
_QC(df: DataFrame) -> DataFrame

Abstract method for quality control processing.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing raw data

required

Returns:

Type Description
DataFrame

Quality controlled data with QC_Flag column

Notes

Must be implemented by child classes to handle instrument-specific QC. This method should only check raw data quality (status, range, completeness). Derived parameter validation should be done in _process().

__call__
__call__(start: datetime = None, end: datetime = None, mean_freq: str = None) -> DataFrame

Process data for a specified time range.

Parameters:

Name Type Description Default
start datetime

Start time for data processing; defaults to the data's first timestamp

None
end datetime

End time for data processing; defaults to the data's last timestamp

None
mean_freq str

Frequency for resampling the output; if None, no resampling is done and the data is returned at its native resolution

None

Returns:

Type Description
DataFrame

Processed data for the specified time range, resampled to mean_freq when one is given

Notes

The processed data is also saved to a CSV file.

_cache_is_current
_cache_is_current(df: DataFrame) -> bool

True if the cached frame was written by the current cache format.

_flag_outlier_dates
_flag_outlier_dates(df: DataFrame) -> DataFrame

Detect and warn about stray timestamps; drop them only if asked.

A single bad row — e.g. a 2000-01-01 stamp in otherwise-2023 data — stretches the canonical native grid (built over the data's own min->max in _read_raw_files, before any requested range applies) across the whole bogus span, inflating the cached frame to millions of NaN rows even when the caller only asked for 2023. Such stamps are almost always a source-data error, so by default we warn and tell the user how to fix it rather than silently changing their data; pass drop_outlier_dates=True to have them excluded automatically.
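
To make the scale of the problem concrete, the sketch below counts grid points with and without one stray row, then separates the stray stamp. The detection rule used here (more than `max_days` away from the median timestamp) is an assumption chosen for illustration; the criterion AeroViz applies is not stated above. Both function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def grid_size(stamps, step):
    """Number of points on a regular grid spanning min..max of the stamps."""
    return (max(stamps) - min(stamps)) // step + 1


def split_outlier_dates(stamps, max_days=366):
    """Split stamps into (kept, stray) by distance from the median (assumed rule)."""
    base = min(stamps)
    offsets = [(s - base).total_seconds() for s in stamps]
    centre = base + timedelta(seconds=median(offsets))
    limit = timedelta(days=max_days)
    kept = [s for s in stamps if abs(s - centre) <= limit]
    stray = [s for s in stamps if abs(s - centre) > limit]
    return kept, stray


step = timedelta(minutes=1)
stamps = [datetime(2023, 6, 1) + i * step for i in range(1440)]  # one day of data
stamps.append(datetime(2000, 1, 1))                              # one bad row

kept, stray = split_outlier_dates(stamps)
print(grid_size(stamps, step))  # grid stretched back to the year 2000
print(grid_size(kept, step))    # grid over the 2023 data only
```

With one-minute data, the single bad row turns a 1440-point grid into one with more than twelve million points, which is why the warning is always emitted even when the row is kept.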

_generate_report
_generate_report(raw_data, qc_data, qc_flag=None) -> None

Calculate and log data quality rates for different time periods.

Parameters:

Name Type Description Default
raw_data DataFrame

Raw data before quality control

required
qc_data DataFrame

Data after quality control

required
qc_flag Series

QC flag series indicating validity of each row

None
Notes

Calculates rates for specified QC frequency if set. Updates the quality report with calculated rates.

_load_or_parse
_load_or_parse() -> tuple[DataFrame, DataFrame]

Return the canonical parsed (raw, qc) frames, using the pkl cache when it exists and is current.

Canonical = snapped to the native grid over the files' own coverage, NOT padded to any requested range. Parse provenance (n_files, raw_freq, freq_mixed) is persisted in df.attrs so a cache hit restores it onto self. The requested range / fill_missing is applied later, in _run.
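
The cache-or-parse decision can be sketched with the standard library alone. In the example below a plain dict stands in for the DataFrame and its `attrs`, and a version marker stands in for the currency check. `CACHE_FORMAT`, `load_or_parse`, and `parse_raw` are hypothetical names; how `_cache_is_current` actually decides is not documented here.

```python
import pickle
import tempfile
from pathlib import Path

CACHE_FORMAT = 2  # hypothetical version marker for the cache layout


def load_or_parse(cache_path, parse, reset=False):
    """Return (payload, from_cache); re-parse if the cache is absent, stale, or reset."""
    if not reset and cache_path.exists():
        with cache_path.open('rb') as fh:
            cached = pickle.load(fh)
        if cached.get('format') == CACHE_FORMAT:
            return cached, True
    payload = {'format': CACHE_FORMAT, **parse()}
    with cache_path.open('wb') as fh:
        pickle.dump(payload, fh)
    return payload, False


def parse_raw():
    # Stand-in for reading raw files; parse provenance rides along with the rows.
    return {'rows': [1.0, 2.0],
            'attrs': {'n_files': 1, 'raw_freq': '1min', 'freq_mixed': False}}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / '_read_demo_qc.pkl'
    first, hit_first = load_or_parse(cache, parse_raw)
    second, hit_second = load_or_parse(cache, parse_raw)
    third, hit_third = load_or_parse(cache, parse_raw, reset=True)

print(hit_first, hit_second, hit_third)
print(second['attrs'])
```

Storing the provenance inside the cached object is what lets the second call recover `n_files`, `raw_freq`, and `freq_mixed` without touching the raw files.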

_outlier_process
_outlier_process(_df)

Process outliers in the data.

Parameters:

Name Type Description Default
_df DataFrame

Input DataFrame containing potential outliers

required

Returns:

Type Description
DataFrame

DataFrame with outliers processed

Notes

Implementation depends on specific instrument requirements.

_partition_compatible_scans
_partition_compatible_scans(df_list: list, files: list) -> list

Drop frames whose scan schema differs from the dominant group.

Default is a no-op — overridden by readers (currently SMPS) where the same instrument can export at different size-bin grids depending on the host software version (AIM 10.3 .TXT vs AIM 11.x .CSV). The outer join inside pd.concat happily concatenates frames with disjoint columns, but the NaN holes break per-bin completeness QC. The well-defined repair is to treat each grid as its own scan: keep the majority group, drop the minority and tell the user which files were skipped so they can re-run them in isolation if they want both.

df_list and files are aligned and contain only successfully parsed entries.
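
A minimal sketch of the majority-group rule follows, with each frame represented as a dict of column name to values so that it runs without pandas. `partition_compatible` and the sample bin labels are hypothetical; the SMPS override in AeroViz may group schemas differently.

```python
from collections import Counter


def partition_compatible(frames, files):
    """Keep entries whose column set matches the most common one; list the rest."""
    schemas = [tuple(sorted(frame)) for frame in frames]
    dominant = Counter(schemas).most_common(1)[0][0]
    kept_frames, kept_files, skipped = [], [], []
    for frame, schema, name in zip(frames, schemas, files):
        if schema == dominant:
            kept_frames.append(frame)
            kept_files.append(name)
        else:
            skipped.append(name)  # reported so the user can re-run these alone
    return kept_frames, kept_files, skipped


frames = [
    {'11.8': [1.0], '12.2': [2.0]},   # grid A
    {'11.8': [3.0], '12.2': [4.0]},   # grid A
    {'11.5': [5.0], '12.0': [6.0]},   # grid B, different size bins
]
files = ['a.txt', 'b.txt', 'c.csv']

kept_frames, kept_files, skipped = partition_compatible(frames, files)
print(kept_files, skipped)
```

Returning the kept frames and kept file names together preserves the alignment between the two lists that the method requires.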

_process
_process(df: DataFrame) -> DataFrame

Process data to calculate derived parameters.

This method is called after _QC() to calculate instrument-specific derived parameters (e.g., absorption coefficients, AAE, SAE).

Parameters:

Name Type Description Default
df DataFrame

Quality-controlled DataFrame with QC_Flag column

required

Returns:

Type Description
DataFrame

DataFrame with derived parameters added and QC_Flag updated

Notes

Default implementation returns the input unchanged. Override in child classes to implement instrument-specific processing.

The method should:

  1. Skip calculation for rows where QC_Flag != 'Valid' (optional optimization)
  2. Calculate derived parameters
  3. Validate derived parameters and update QC_Flag if invalid
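
The three steps can be sketched on plain row dicts. The example computes an absorption Ångström exponent from two wavelengths, AAE = -ln(b1/b2) / ln(λ1/λ2). The column names `abs_370` and `abs_880`, the flag text `'Invalid AAE'`, and the validity bounds are all illustrative assumptions, not values taken from AeroViz.

```python
import math


def process(rows):
    """Add an AAE value per row and flag rows whose AAE is out of range."""
    out = []
    for row in rows:
        row = dict(row)  # work on a copy so the input is left untouched
        if row['QC_Flag'] != 'Valid':
            # 1. skip the calculation for rows that already failed raw QC
            row['AAE'] = None
            out.append(row)
            continue
        # 2. derived parameter from the 370 nm and 880 nm absorption values
        row['AAE'] = -math.log(row['abs_370'] / row['abs_880']) / math.log(370 / 880)
        # 3. validate the derived value (bounds are illustrative only)
        if not 0.0 < row['AAE'] < 3.0:
            row['QC_Flag'] = 'Invalid AAE'
        out.append(row)
    return out


rows = [
    {'abs_370': 30.0, 'abs_880': 10.0, 'QC_Flag': 'Valid'},
    {'abs_370': 5.0, 'abs_880': 10.0, 'QC_Flag': 'Valid'},
    {'abs_370': 30.0, 'abs_880': 10.0, 'QC_Flag': 'Status Error'},
]
processed = process(rows)
print([r['QC_Flag'] for r in processed])
```

The second row passes raw QC but yields a negative exponent, so it is only caught at the derived-parameter stage, which is the reason validation is split between `_QC` and `_process`.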

_raw_reader abstractmethod
_raw_reader(file)

Abstract method to read raw data files.

Parameters:

Name Type Description Default
file Path or str

Path to the raw data file

required

Returns:

Type Description
DataFrame

Raw data read from the file

Notes

Must be implemented by child classes to handle specific file formats.

_read_raw_files
_read_raw_files() -> tuple[DataFrame | None, DataFrame | None]

Read and process raw data files.

Returns:

Type Description
tuple[DataFrame | None, DataFrame | None]

Tuple containing:

  • Raw data DataFrame or None
  • Quality controlled DataFrame or None

Notes

Handles file reading and initial processing.

_restore_parse_meta
_restore_parse_meta(df: DataFrame) -> None

Pull parse provenance off a cached frame back onto self (cache hit).

_run
_run(user_start, user_end)

Main execution method for data processing.

Parameters:

Name Type Description Default
user_start datetime

Start time for processing

required
user_end datetime

End time for processing

required

Returns:

Type Description
tuple[DataFrame, DataFrame]

Raw and quality-controlled frames for the requested range.

Notes

Two layers. _load_or_parse returns the canonical parsed frames (from the pkl cache when valid, else by reading the raw files). The presentation step below — grid placement to the requested range, with fill_missing — runs on every call, so a cache hit honours the current call's range/fill_missing instead of replaying whatever was stored. Parse provenance restored by _load_or_parse feeds the df.attrs stamp in __call__.

_save_data
_save_data(raw_data: DataFrame, qc_data: DataFrame) -> None

Save processed data to files.

Parameters:

Name Type Description Default
raw_data DataFrame

Raw data to save

required
qc_data DataFrame

Quality controlled data to save

required
Notes

Saves data in both pickle and CSV formats.

_stamp
_stamp(df: DataFrame, start, end, *, mean_freq=None, with_qc=False) -> DataFrame

Attach reader metadata to df.attrs just before returning.

Always records provenance (instrument, station, coverage, requested range, native frequency). When with_qc is True it additionally records the output frequency and the overall QC rates; the plain raw path (qc=False) gets provenance only.

See core.metadata for why this is the single, final stamping point.

_stamp_parse_meta
_stamp_parse_meta(df: DataFrame) -> None

Persist parse provenance into df.attrs so it survives the pkl cache.

_timeIndex_process
_timeIndex_process(_df, user_start=None, user_end=None, append_df=None)

Process time index of the DataFrame.

Parameters:

Name Type Description Default
_df DataFrame

Input DataFrame to process

required
user_start datetime

User-specified start time

None
user_end datetime

User-specified end time

None
append_df DataFrame

DataFrame to append to

None

Returns:

Type Description
DataFrame

DataFrame with processed time index

Notes

Frequency is resolved once per run in _read_raw_files (per-file detection, see self._resolved_freq); this method only places the data on that grid via to_grid — snapping off-grid timestamps to their nearest bin without the duplicate-fill of method='nearest'.
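
The difference between snapping and nearest-fill is easiest to see on a three-bin grid. The sketch below is a standard-library illustration of the idea, not the `to_grid` implementation: each sample moves to its nearest bin, and a bin that receives no sample stays empty instead of borrowing a neighbour's value. The name `to_grid_sketch` and the first-sample-wins rule for collisions are assumptions.

```python
from datetime import datetime, timedelta


def to_grid_sketch(samples, start, end, step):
    """Place (timestamp, value) pairs on a regular grid; unfilled bins stay None."""
    n_bins = (end - start) // step + 1
    grid = [None] * n_bins
    for stamp, value in samples:
        idx = round((stamp - start) / step)   # index of the nearest bin
        if 0 <= idx < n_bins and grid[idx] is None:
            grid[idx] = value                 # assumption: first sample in a bin wins
    return [(start + i * step, v) for i, v in enumerate(grid)]


start = datetime(2024, 1, 1)
step = timedelta(minutes=1)
samples = [
    (start + timedelta(seconds=2), 1.0),     # 2 s late, snaps to 00:00
    (start + timedelta(seconds=118), 2.0),   # 2 s early, snaps to 00:02
]
gridded = to_grid_sketch(samples, start, start + 2 * step, step)
print([v for _, v in gridded])
```

The 00:01 bin stays empty here, whereas a nearest-neighbour reindex would copy one of the adjacent values into it and create a measurement that was never taken.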

progress_reading
progress_reading(files: list) -> Generator

Context manager for tracking file reading progress.

Parameters:

Name Type Description Default
files list

List of files to process

required

Yields:

Type Description
Progress

Progress bar object for tracking

Notes

Uses rich library for progress display.

reorder_dataframe_columns staticmethod
reorder_dataframe_columns(df, order_lists: list[list], keep_others: bool = False)

Reorder DataFrame columns according to specified lists.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame

required
order_lists list[list]

Lists specifying column order

required
keep_others bool

If True, keeps unspecified columns at the end

False

Returns:

Type Description
DataFrame

DataFrame with reordered columns
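
The ordering logic can be illustrated on a plain list of column names. This is a sketch of one plausible behaviour: names are taken from each list in turn, and names missing from the frame are skipped, which is an assumption about how the real method treats them. `ordered_columns` is a hypothetical helper; with pandas, the result would be applied as `df[ordered]`.

```python
def ordered_columns(columns, order_lists, keep_others=False):
    """Column order implied by order_lists, skipping names absent from columns."""
    present = set(columns)
    ordered = []
    for group in order_lists:
        for name in group:
            if name in present and name not in ordered:
                ordered.append(name)
    if keep_others:
        # Unspecified columns follow, in their original relative order.
        ordered += [c for c in columns if c not in ordered]
    return ordered


columns = ['BC1', 'AAE', 'abs_370', 'BC2']
order_lists = [['abs_370', 'abs_880'], ['BC1', 'BC2']]

print(ordered_columns(columns, order_lists))
print(ordered_columns(columns, order_lists, keep_others=True))
```

With the default `keep_others=False`, any column not named in `order_lists` is dropped from the result, so the flag matters whenever the lists are not exhaustive.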

update_qc_flag staticmethod
update_qc_flag(df: DataFrame, mask: Series, flag_name: str) -> DataFrame

Update QC_Flag column for rows matching the mask.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with QC_Flag column

required
mask Series

Boolean mask indicating rows to flag

required
flag_name str

Name of the flag to add

required

Returns:

Type Description
DataFrame

DataFrame with updated QC_Flag column
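
A sketch of the flag update on plain lists is shown below. How AeroViz combines a new flag with an existing one is not stated above, so the rule here is an assumption: a masked row marked 'Valid' takes the new flag name, and a masked row that already has a flag gets the new name appended after a comma. `update_flags` and the flag names are hypothetical.

```python
def update_flags(flags, mask, flag_name):
    """Return a new flag list with flag_name applied where mask is True."""
    updated = []
    for flag, hit in zip(flags, mask):
        if not hit:
            updated.append(flag)                    # row not selected: unchanged
        elif flag == 'Valid':
            updated.append(flag_name)               # first problem found on this row
        else:
            updated.append(f'{flag}, {flag_name}')  # keep earlier flags as well
    return updated


flags = ['Valid', 'Valid', 'Status Error']
mask = [False, True, True]
new_flags = update_flags(flags, mask, 'Out of Range')
print(new_flags)
```

Keeping earlier flags instead of overwriting them preserves the full reason a row was rejected, which helps when the QC report is read later.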

Quick Example

from AeroViz import RawDataReader
from datetime import datetime

# Using the factory function (recommended)
data = RawDataReader(
    instrument='AE33',
    path='/path/to/data',
    start=datetime(2024, 1, 1),
    end=datetime(2024, 12, 31)
)

# Direct usage (advanced - for custom implementations)
from AeroViz.rawDataReader.core import AbstractReader


class MyInstrumentReader(AbstractReader):
    nam = 'MyInstrument'

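    def _raw_reader(self, file):
        # Custom file reading logic; must return a DataFrame for this file
        raise NotImplementedError

    def _QC(self, df):
        # Custom QC logic; the returned frame should carry a QC_Flag column
        return df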