
RawDataReader Tutorial

RawDataReader is the core data reading component of AeroViz, providing a unified interface for reading various aerosol instrument data.

Behaviour change

mean_freq no longer defaults to '1h'. The default is now no resampling — data is returned at its native resolution. Pass mean_freq='1h' (or '30min', '1D') explicitly to average. start and end are also optional now (previously both were required); omit them to read the files' full coverage.

Basic Usage

from pathlib import Path
from AeroViz import RawDataReader

# Minimal: read the files' full coverage at native resolution
data = RawDataReader(
    instrument='AE33',           # Instrument type
    path=Path('/path/to/data'),  # Data path
)

from datetime import datetime

# Bounded range with hourly averaging
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
    start=datetime(2024, 1, 1),  # optional start time
    end=datetime(2024, 12, 31),  # optional end time
    mean_freq='1h'               # optional — resample to hourly means
)

Parameter Description

| Parameter | Type | Description | Default |
|---|---|---|---|
| instrument | str | Instrument name | Required |
| path | Path | Data folder path | Required |
| start | datetime | Start time (optional) | None (files' first record) |
| end | datetime | End time (optional) | None (files' last record) |
| mean_freq | str | Averaging frequency ('1h', '30min', '1D'); omit for native resolution | None (no resampling) |
| fill_missing | bool | True pads output to the requested range; False clamps to data coverage | True |
| reset | bool | Force re-read (ignore cache) | False |
| qc | bool/str | Quality control on/off, or QC report frequency ('1MS', '1D') | True |

Optional date range

start and end are independent. Omit both to read everything the files contain, or pass just one side to bound only that end:

# Full coverage of the files in the folder
data = RawDataReader('AE33', path)

# Everything from a start date onwards (no upper bound)
data = RawDataReader('AE33', path, start=datetime(2024, 6, 1))

# Everything up to an end date (no lower bound)
data = RawDataReader('AE33', path, end=datetime(2024, 6, 30))

Native resolution (mean_freq)

By default no resampling is applied and the data keeps the instrument's native time resolution (e.g. 1 min, 5 min, 1 h). Pass mean_freq only when you want averaged output:

# Native resolution (default) — no resampling
data = RawDataReader('AE33', path, start, end)

# Hourly means
hourly = RawDataReader('AE33', path, start, end, mean_freq='1h')

# 30-minute means
half_hourly = RawDataReader('AE33', path, start, end, mean_freq='30min')
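As a rough illustration of what hourly averaging does to the row count, the sketch below applies a plain pandas resample to a synthetic 1-minute series. It assumes mean_freq behaves like an ordinary time-bin mean; the column name BC and the values are made up, and AeroViz's internal handling may differ.

```python
import pandas as pd

# Synthetic 1-minute series standing in for a native-resolution result
index = pd.date_range('2024-01-01 00:00', periods=180, freq='1min')
native = pd.DataFrame({'BC': [float(i) for i in range(180)]}, index=index)

# Hourly means: each output row averages the 60 one-minute rows in that hour
hourly = native.resample('1h').mean()

print(len(native), len(hourly))  # 180 3
print(hourly['BC'].tolist())     # [29.5, 89.5, 149.5]
```

Keeping the native frame and averaging afterwards like this is an option when you need several output frequencies from one read.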

Reading result metadata (df.attrs)

Every result carries provenance and coverage metadata in df.attrs. Because the default fill_missing=True pads the frame to the requested range (so it may be mostly NaN), df.attrs['coverage_*'] is the quickest way to see what the files actually contained:

df = RawDataReader('AE33', path, start='2024-01-01', end='2024-12-31')

df.attrs['coverage_start']   # first row backed by real data
df.attrs['coverage_end']     # last row backed by real data (None if none in range)
df.attrs['requested_start']  # what you asked for (omitted when not given)
df.attrs['n_files']          # how many raw files were read
df.attrs['raw_freq']         # native resolution, auto-detected per file
df.attrs['total_rate']       # overall % valid (only present when qc is on)

| Key | When | Meaning |
|---|---|---|
| instrument, station, source_path, n_files | always | provenance |
| coverage_start / coverage_end | always | real data span (ignores NaN padding) |
| requested_start / requested_end | when given | the range you passed |
| raw_freq, freq_mixed | always | native frequency + whether files disagreed |
| fill_missing | always | grid padded to the request, or clamped to coverage |
| aeroviz_version, processed_at | always | build / run stamp |
| mean_freq, qc_applied, qc_freq | qc on | output frequency + QC mode |
| acquisition_rate, yield_rate, total_rate | qc on | overall rates (%) |

attrs survive to_pickle/read_pickle and resample (pandas >= 2) but are dropped by a concat of frames with conflicting attrs — re-stamp if you merge.
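One way to re-stamp after a merge is sketched below with stand-in frames. The attrs values and the 'sources' key are hypothetical choices for this example, not an AeroViz convention.

```python
import pandas as pd

# Two stand-in frames with conflicting metadata (hypothetical values)
idx = pd.date_range('2024-01-01', periods=3, freq='1h')
ae33 = pd.DataFrame({'BC': [1.0, 2.0, 3.0]}, index=idx)
ae33.attrs = {'instrument': 'AE33', 'n_files': 2}
neph = pd.DataFrame({'scat': [10.0, 20.0, 30.0]}, index=idx)
neph.attrs = {'instrument': 'NEPH', 'n_files': 5}

combined = pd.concat([ae33, neph], axis=1)

# Re-stamp explicitly: keep each source's attrs under its instrument name
combined.attrs = {
    'sources': {df.attrs['instrument']: dict(df.attrs) for df in (ae33, neph)}
}

print(combined.attrs['sources']['NEPH']['n_files'])  # 5
```

Nesting the per-instrument dictionaries avoids having to decide which frame's coverage_start or n_files should win.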

fill_missing: pad vs. clamp the time grid

# Default (True): pad the output to the full requested range,
# leaving NaN where the files have no data
padded = RawDataReader('AE33', path, start='2024-01-01', end='2024-12-31')

# False: clamp the grid to the data's actual coverage —
# no leading/trailing NaN rows, no mostly-empty frame from a short file
trimmed = RawDataReader(
    'AE33', path,
    start='2024-01-01', end='2024-12-31',
    fill_missing=False,
)
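The difference between the two modes can be pictured with plain pandas. This is a conceptual sketch on synthetic data, assuming padding is equivalent to reindexing onto the requested grid; it does not call AeroViz and is not its implementation.

```python
import pandas as pd

# Stand-in for file contents: 3 hourly rows in the middle of the requested day
data = pd.DataFrame(
    {'BC': [1.0, 2.0, 3.0]},
    index=pd.date_range('2024-01-01 10:00', periods=3, freq='1h'),
)

# Padded: reindex onto the full requested grid, NaN where no data exists
requested = pd.date_range('2024-01-01 00:00', '2024-01-01 23:00', freq='1h')
padded = data.reindex(requested)

# Clamped: keep only the span between the first and last rows with real data
real = padded.dropna(how='all')
trimmed = padded.loc[real.index[0]:real.index[-1]]

print(len(padded), len(trimmed))  # 24 3
```

The first and last index of `real` here play the role that coverage_start and coverage_end play in df.attrs.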

Supported Instruments

Black Carbon / Absorption

# AE33 - Magee Scientific 7-wavelength
ae33 = RawDataReader('AE33', path, start, end)

# AE43 - Real-time black carbon
ae43 = RawDataReader('AE43', path, start, end)

# BC1054 - MetOne high resolution
bc1054 = RawDataReader('BC1054', path, start, end)

# MA350 - AethLabs microAeth, 5-wavelength
ma350 = RawDataReader('MA350', path, start, end)

Scattering

# NEPH - TSI integrating nephelometer
neph = RawDataReader('NEPH', path, start, end)

# Aurora - Ecotech 3-wavelength
aurora = RawDataReader('Aurora', path, start, end)

Size Distribution

# SMPS - Scanning Mobility Particle Sizer
smps = RawDataReader('SMPS', path, start, end, size_range=(11.8, 593.5))

# APS - Aerodynamic Particle Sizer
aps = RawDataReader('APS', path, start, end)

# GRIMM - Optical Particle Sizer
grimm = RawDataReader('GRIMM', path, start, end)

Chemical Composition

# IGAC - Ion Chromatograph
igac = RawDataReader('IGAC', path, start, end)

# OCEC - Organic Carbon/Elemental Carbon Analyzer
ocec = RawDataReader('OCEC', path, start, end)

# Xact - Xact 625i XRF Analyzer
xact = RawDataReader('Xact', path, start, end)

# VOC - Volatile Organic Compounds Monitor
# (deprecated — read VOC CSVs with pandas and pass to voc_potentials; see VOC docs)
voc = RawDataReader('VOC', path, start, end)

Quality Control

QC Report

# Monthly QC report
data = RawDataReader(
    instrument='AE33',
    path=path,
    start=start,
    end=end,
    qc='1MS'  # Monthly report
)

Output example:

> Processing: 2024-01-01 to 2024-01-31
    > BC Mass Conc. (880 nm)
        +-- Sample Rate    :  100.0%
        +-- Valid  Rate    :   99.5%
        +-- Total  Rate    :   99.5%
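The numbers in this report are consistent with a total rate equal to valid points over expected points. The sketch below works through that arithmetic under assumed definitions: acquisition = received / expected, yield = valid / received, total = valid / expected. These definitions, the function name qc_rates, and the counts are assumptions for illustration; this page does not spell out how AeroViz computes the rates.

```python
def qc_rates(expected: int, received: int, valid: int) -> dict:
    """Return percentage rates under assumed (not AeroViz-confirmed) definitions."""
    acquisition = 100.0 * received / expected if expected else 0.0
    yield_rate = 100.0 * valid / received if received else 0.0
    total = 100.0 * valid / expected if expected else 0.0
    return {
        'acquisition_rate': round(acquisition, 1),
        'yield_rate': round(yield_rate, 1),
        'total_rate': round(total, 1),
    }

# January at hourly resolution: 31 days * 24 h = 744 expected points;
# suppose all were recorded and 4 were rejected by QC (hypothetical counts)
rates = qc_rates(expected=744, received=744, valid=740)
print(rates)  # {'acquisition_rate': 100.0, 'yield_rate': 99.5, 'total_rate': 99.5}
```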

Force Re-read

# Ignore cache, re-read raw files
data = RawDataReader(
    instrument='AE33',
    path=path,
    reset=True
)

Output Files

After processing, files are generated in {instrument}_outputs/:

| File | Description |
|---|---|
| _read_{inst}_raw.csv | Merged raw data |
| _read_{inst}_raw.pkl | Raw data (pickle) |
| _read_{inst}.csv | QC processed data |
| _read_{inst}.pkl | QC data (pickle) |
| Output_{inst} | Final processed data |
| {inst}.log | Processing log |

Advanced Usage

Specify Size Range (SMPS/APS)

smps = RawDataReader(
    instrument='SMPS',
    path=path,
    start=start,
    end=end,
    size_range=(10, 500)  # nm
)

Multi-instrument Integration

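# Read multiple instruments on a shared hourly grid so the rows align
# (at native resolution each instrument keeps its own time step)
ae33 = RawDataReader('AE33', path_ae33, start, end, mean_freq='1h')
neph = RawDataReader('NEPH', path_neph, start, end, mean_freq='1h')
smps = RawDataReader('SMPS', path_smps, start, end, mean_freq='1h')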

# Merge using pandas
import pandas as pd
combined = pd.concat([ae33, neph, smps], axis=1)
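If frames are already in memory at different native resolutions, they can be brought onto a common grid before joining. The sketch below uses synthetic stand-ins with made-up column names; passing a dict to pd.concat adds an instrument level to the columns, so identical column names from different instruments cannot collide.

```python
import pandas as pd

# Stand-ins for two instruments at different native resolutions
fast = pd.DataFrame(
    {'BC': [float(i) for i in range(120)]},
    index=pd.date_range('2024-01-01', periods=120, freq='1min'),
)
slow = pd.DataFrame(
    {'scat': [float(i) for i in range(24)]},
    index=pd.date_range('2024-01-01', periods=24, freq='5min'),
)

# Average both onto the same hourly grid, then join side by side
frames = {'AE33': fast, 'NEPH': slow}
hourly = {name: df.resample('1h').mean() for name, df in frames.items()}
combined = pd.concat(hourly, axis=1)

print(combined.shape)  # (2, 2)
```

Without the resampling step, a column-wise concat aligns on the union of both indexes and leaves NaN wherever one instrument has no row at that timestamp.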

Common Issues

Data Path Format

from pathlib import Path

# Absolute path to the folder holding the raw files
path = Path('/Users/name/data/AE33')

# A relative path also works
path = Path('./data/AE33')

Time Format

from datetime import datetime

# Correct
start = datetime(2024, 1, 1)
end = datetime(2024, 12, 31)

# Can also specify hours, minutes, seconds
start = datetime(2024, 1, 1, 0, 0, 0)
end = datetime(2024, 12, 31, 23, 59, 59)

Insufficient Memory

For large datasets, read in segments:

# Read by month
for month in range(1, 13):
    start = datetime(2024, month, 1)
    end = datetime(2024, month + 1, 1) if month < 12 else datetime(2025, 1, 1)
    data = RawDataReader('AE33', path, start, end)
    # Process...
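The month arithmetic in the loop can be moved into a small generator, which keeps the December rollover in one place. The helper name month_windows is hypothetical and uses only the standard library.

```python
from datetime import datetime


def month_windows(year: int):
    """Yield (start, end) datetimes for each calendar month of `year`."""
    for month in range(1, 13):
        start = datetime(year, month, 1)
        # After December, roll over to January of the following year
        end = datetime(year + month // 12, month % 12 + 1, 1)
        yield start, end


windows = list(month_windows(2024))
print(len(windows))  # 12
print(windows[-1])   # (datetime.datetime(2024, 12, 1, 0, 0), datetime.datetime(2025, 1, 1, 0, 0))
```

Each pair can be passed as start and end to one read. Consecutive windows share a boundary timestamp, so if the reader treats end as inclusive, drop the duplicate row or subtract one time step from end before combining the monthly results.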