USAAF-WWII-Datasets – Structured Historical Data on the United States Army Air Forces -WIP-

GitHub Repository

USAAF-WWII-Datasets is a curated collection of structured data derived from historical records concerning the United States Army Air Forces (USAAF) during the Second World War.

This project is a work in progress. Not all data has been fully validated, and some entries may contain errors.


Project Scope

The objective of this dataset is to make complex and unstructured archival information accessible in a machine-readable format for use in:

  • Historical research
  • Data analysis
  • Timeline reconstruction
  • Geospatial visualization of WWII events and movements

The current release includes two major data components:

  • USAAF Chronology of World War II: A structured, date-based record of significant events involving USAAF units and operations.
  • USAAF Squadrons History: A detailed dataset of squadron histories including activations, deployments, reassignments, and associated elements.

Source Materials

The data is derived from the following authoritative publications:

  • USAAF Chronology of World War II (official USAF historical record)
  • Combat Squadrons of the Air Force, World War II (Office of Air Force History)

Data Structure

USAAF Chronology (CSV)

Each row corresponds to a single event, including the following fields:

  • index: Unique entry ID
  • event#: Sequential event number from the source
  • date: Date of the event
  • theatre: The theater of operations, as identified in the source
  • old_text: Original OCR text, minimally processed
  • text: Cleaned and reformatted text following AI and procedural processing
  • locations: JSON-compatible string containing geographic data
  • name_location: Parsed name of the primary location
  • latitude / longitude: Coordinates, when available
  • source: Method used to determine coordinates (e.g., auto-GNS, auto-Wiki, manual)
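As an illustration of how these fields might be consumed, the following is a minimal sketch using only the Python standard library. The sample rows, the ISO date format, the theatre code, and the internal layout of the locations string (a JSON list of objects with a "name" key) are assumptions made for the example, not values taken from the dataset. The helper names load_chronology, parse_locations, and to_float are hypothetical.

```python
import csv
import io
import json

# Hypothetical two-row sample using the documented column names.
# All values and the `locations` layout are placeholders, not real data.
SAMPLE_CSV = '''index,event#,date,theatre,old_text,text,locations,name_location,latitude,longitude,source
0,1,1942-01-01,ETO,"raw ocr text","Cleaned text.","[{""name"": ""Example Town""}]",Example Town,10.0,20.0,auto-GNS
1,2,1942-01-02,ETO,"raw ocr text","Cleaned text.",,,,,
'''


def parse_locations(raw):
    """Decode the `locations` string; return [] when empty or not valid JSON."""
    if not raw or not raw.strip():
        return []
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []


def to_float(raw):
    """Convert a coordinate string to float, or None when it is missing."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def load_chronology(handle):
    """Read chronology rows from an open text handle into a list of dicts."""
    events = []
    for row in csv.DictReader(handle):
        row["locations"] = parse_locations(row["locations"])
        row["latitude"] = to_float(row["latitude"])
        row["longitude"] = to_float(row["longitude"])
        events.append(row)
    return events


events = load_chronology(io.StringIO(SAMPLE_CSV))
# Keep only events that have both coordinates available.
geocoded = [e for e in events
            if e["latitude"] is not None and e["longitude"] is not None]
print(len(events), len(geocoded))
```

Because coordinates are only present "when available", the sketch treats empty latitude / longitude cells as None instead of failing, and falls back to an empty list if the locations string cannot be decoded as JSON.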

USAAF Squadrons History (XML)

Each XML entry documents the full profile of a squadron, including:

  • name: Official unit name
  • lineage: Historical lineage and redesignations
  • assignments: Command structure and higher-level assignments
  • stations: All recorded duty stations
  • aircraft: List of aircraft flown
  • aircraft_types: Structured list of airframe types
  • operations: Descriptions of major operational roles
  • service_streamers, campaigns, decorations, emblem: Honors and insignia
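A squadron entry could be read along the following lines. This is a sketch only: the element layout is an assumption (a root element containing squadron elements, one child element per field listed above, and aircraft_types holding type children). The actual schema of the XML file may differ, and the sample values are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; tag nesting and values are assumptions for illustration.
SAMPLE_XML = """<squadrons>
  <squadron>
    <name>Example Squadron</name>
    <stations>Example Field, 1 Jan 1942; Example Base, 1 Jun 1943</stations>
    <aircraft_types>
      <type>B-17</type>
      <type>B-24</type>
    </aircraft_types>
  </squadron>
</squadrons>"""

# Free-text fields from the field list above.
TEXT_FIELDS = ("name", "lineage", "assignments", "stations",
               "aircraft", "operations")


def parse_squadrons(xml_text):
    """Return one dict per <squadron> element (assumed tag name)."""
    root = ET.fromstring(xml_text)
    profiles = []
    for node in root.iter("squadron"):
        # Missing fields become empty strings rather than raising.
        profile = {f: (node.findtext(f) or "").strip() for f in TEXT_FIELDS}
        types_node = node.find("aircraft_types")
        if types_node is not None:
            profile["aircraft_types"] = [(t.text or "").strip()
                                         for t in types_node]
        else:
            profile["aircraft_types"] = []
        profiles.append(profile)
    return profiles


profiles = parse_squadrons(SAMPLE_XML)
print(profiles[0]["name"], profiles[0]["aircraft_types"])
```

Absent elements are mapped to empty strings or empty lists, which keeps downstream code simple while the dataset is still being validated.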

Methodology

The datasets were created using a multi-stage pipeline:

  1. OCR Extraction from official PDF and scanned source documents
  2. Procedural Cleaning using regular expressions and parsing logic
  3. AI Processing for text normalization and feature extraction
  4. Location Identification with AI-supported entity recognition
  5. Geocoding through:
       • Geonames (GNS)
       • Wikipedia (automated matching)
       • Manual validation (when required)
  6. Ongoing Review to identify inconsistencies and improve accuracy
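The geocoding step can be pictured as a chain of lookups that records which method produced the coordinates. The sketch below is not the project's actual implementation: it replaces the real gazetteer and Wikipedia queries with in-memory dictionaries, and the lookup order (GNS, then Wikipedia, then manual) is an assumption. The function name geocode and the sample place names are hypothetical; the source labels reuse the values listed for the source field of the chronology CSV.

```python
def geocode(name, gns, wiki, manual):
    """Try each lookup table in turn and tag the result with its source.

    `gns`, `wiki`, and `manual` are plain dicts mapping a place name to a
    (latitude, longitude) tuple; they stand in for the real services.
    """
    for label, table in (("auto-GNS", gns),
                         ("auto-Wiki", wiki),
                         ("manual", manual)):
        coords = table.get(name)
        if coords is not None:
            lat, lon = coords
            return {"name_location": name, "latitude": lat,
                    "longitude": lon, "source": label}
    # No table knows the place: leave the coordinates empty for later review.
    return {"name_location": name, "latitude": None,
            "longitude": None, "source": None}


# Placeholder lookup tables with made-up names and coordinates.
gns_table = {"Example Town": (10.0, 20.0)}
wiki_table = {"Example Town": (10.1, 20.1), "Example Field": (30.0, 40.0)}
manual_table = {"Example Base": (50.0, 60.0)}

for place in ("Example Town", "Example Field", "Example Base", "Unknown"):
    print(geocode(place, gns_table, wiki_table, manual_table))
```

Recording the source label next to each coordinate pair makes it possible to target the review step at automatically matched locations first.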

Tools Used

  • Python (Pandas, OpenPyXL, JSON)
  • Excel (manual data auditing). I am also writing a Python app, CanvaMap, to facilitate the validation and correction process
  • ArcGIS / QGIS (geospatial review and mapping)

License

This dataset is released for research, educational, and non-commercial use.
Please cite original sources when referencing derived content.


Contact

For questions, feedback, or collaboration inquiries:

Andrea Siotto
LinkedIn Profile