USAAF-WWII-Datasets – Structured Historical Data on the United States Army Air Forces -WIP-
USAAF-WWII-Datasets is a curated collection of structured data derived from historical records concerning the United States Army Air Forces (USAAF) during the Second World War.
This project is a work in progress. Not all data has been fully validated, and some entries may contain errors.
Project Scope
The objective of this dataset is to make complex and unstructured archival information accessible in a machine-readable format for use in:
- Historical research
- Data analysis
- Timeline reconstruction
- Geospatial visualization of WWII events and movements
The current release includes two major data components:
- USAAF Chronology of World War II: A structured, date-based record of significant events involving USAAF units and operations.
- USAAF Squadrons History: A detailed dataset of squadron histories including activations, deployments, reassignments, and associated elements.
Source Materials
The data is derived from the following authoritative publications:
- USAAF Chronology of World War II (official USAF historical record)
- Combat Squadrons of the Air Force, World War II (Office of Air Force History)
Data Structure
USAAF Chronology (CSV)
Each row corresponds to a single event, including the following fields:
- index: Unique entry ID
- event#: Sequential event number from the source
- date: Date of the event
- theatre: The theater of operations, as identified in the source
- old_text: Original OCR text, minimally processed
- text: Cleaned and reformatted text following AI and procedural processing
- locations: JSON-compatible string containing geographic data
- name_location: Parsed name of the primary location
- latitude / longitude: Coordinates, when available
- source: Method used to determine coordinates (e.g., auto-GNS, auto-Wiki, manual)
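As a minimal sketch of how these columns might be read, the example below parses a small in-memory CSV with Python's standard library. The sample rows are placeholders and not records from the dataset; the date format, the assumption that locations holds a JSON array, and the helper names parse_row and load_events are all hypothetical.

```python
import csv
import io
import json

# Hypothetical two-row sample using the column names listed above.
# The values are placeholders, not records from the dataset.
SAMPLE_CSV = '''index,event#,date,theatre,old_text,text,locations,name_location,latitude,longitude,source
0,1,1942-01-01,EXAMPLE THEATRE,raw ocr text,Cleaned text.,"[{""name"": ""Example Town""}]",Example Town,10.5,-20.25,manual
1,2,1942-01-02,EXAMPLE THEATRE,raw ocr text,Cleaned text.,[],,,,
'''


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed Python values."""

    def to_float(value):
        # Coordinates are only present when available; empty cells become None.
        return float(value) if value.strip() else None

    return {
        "index": int(row["index"]),
        "event_number": int(row["event#"]),
        "date": row["date"],  # kept as a string; the date format is an assumption
        "theatre": row["theatre"],
        "text": row["text"],
        # Assumes a JSON array; an empty cell is treated as "no locations".
        "locations": json.loads(row["locations"]) if row["locations"].strip() else [],
        "name_location": row["name_location"] or None,
        "latitude": to_float(row["latitude"]),
        "longitude": to_float(row["longitude"]),
        "source": row["source"] or None,
    }


def load_events(handle):
    """Read every row from an open CSV file handle."""
    return [parse_row(row) for row in csv.DictReader(handle)]


events = load_events(io.StringIO(SAMPLE_CSV))

# Keep only events that carry usable coordinates, e.g. for mapping.
geocoded = [e for e in events if e["latitude"] is not None and e["longitude"] is not None]
```

To read the real file, the same load_events function could be given a handle from open(path, newline="", encoding="utf-8"), with the path and encoding adjusted to the actual release.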
USAAF Squadrons History (XML)
Each XML entry documents the full profile of a squadron, including:
- name: Official unit name
- lineage: Historical lineage and redesignations
- assignments: Command structure and higher-level assignments
- stations: All recorded duty stations
- aircraft: List of aircraft flown
- aircraft_types: Structured list of airframe types
- operations: Descriptions of major operational roles
- service_streamers, campaigns, decorations, emblem: Honors and insignia
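The sketch below shows one way such an entry might be read with Python's standard xml.etree.ElementTree module. Only the field names come from the list above; the squadrons and squadron wrapper elements, the item children used for lists, and every value are assumptions made for illustration, so the tag names will likely need adjusting to the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: the <squadrons>/<squadron> wrappers, the <item> children,
# and all values are placeholders. Only the field names come from the dataset.
SAMPLE_XML = """
<squadrons>
  <squadron>
    <name>Example Squadron</name>
    <lineage>Placeholder lineage text.</lineage>
    <stations>
      <item>Example Field</item>
      <item>Example Base</item>
    </stations>
    <aircraft_types>
      <item>Type A</item>
    </aircraft_types>
  </squadron>
</squadrons>
"""


def text_of(element, tag):
    """Return the stripped text of a direct child, or None if it is missing."""
    child = element.find(tag)
    return child.text.strip() if child is not None and child.text else None


def items_of(element, tag):
    """Return the text of every <item> under a direct child, or an empty list."""
    parent = element.find(tag)
    if parent is None:
        return []
    return [item.text.strip() for item in parent.findall("item") if item.text]


root = ET.fromstring(SAMPLE_XML.strip())

squadrons = [
    {
        "name": text_of(sq, "name"),
        "lineage": text_of(sq, "lineage"),
        "operations": text_of(sq, "operations"),  # absent in the sample, so None
        "stations": items_of(sq, "stations"),
        "aircraft_types": items_of(sq, "aircraft_types"),
    }
    for sq in root.findall("squadron")
]
```

Returning None or an empty list for missing fields keeps the loop tolerant of incomplete entries, which seems prudent while the data is still being validated.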
Methodology
The datasets were created using a multi-stage pipeline:
- OCR Extraction from official PDF and scanned source documents
- Procedural Cleaning using regular expressions and parsing logic
- AI Processing for text normalization and feature extraction
- Location Identification with AI-supported entity recognition
- Geocoding through:
  - Geonames (GNS)
  - Wikipedia (automated matching)
  - Manual validation (when required)
- Ongoing Review to identify inconsistencies and improve accuracy
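Two of these stages can be illustrated with a short sketch: a regex-based cleaning step and a geocoding fallback that records which method produced the coordinates. This is not the project's actual pipeline. The function names, the specific cleaning rules, the stub lookup tables, and the order in which sources are tried (GNS, then Wikipedia, then manual) are all assumptions; only the labels auto-GNS, auto-Wiki, and manual come from the source field described earlier.

```python
import re


def clean_ocr_text(raw):
    """Procedural cleaning: rejoin hyphenated line breaks and collapse whitespace.

    Note: this also joins genuine hyphenated compounds that happen to be
    split across lines, so the output may still need review.
    """
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)  # "squad-\nron" -> "squadron"
    text = re.sub(r"\s+", " ", text)  # newlines and runs of spaces -> one space
    return text.strip()


def geocode(name, gns_lookup, wiki_lookup, manual_table):
    """Try each source in order; return (lat, lon, source) or (None, None, None).

    Each lookup takes a place name and returns a (lat, lon) tuple or None.
    """
    steps = [
        ("auto-GNS", gns_lookup),
        ("auto-Wiki", wiki_lookup),
        ("manual", manual_table.get),
    ]
    for label, lookup in steps:
        coords = lookup(name)
        if coords is not None:
            lat, lon = coords
            return lat, lon, label
    return None, None, None


# Stub lookups standing in for real gazetteer and Wikipedia queries.
# All names and coordinates are placeholders.
gns_stub = {"Example Town": (10.5, -20.25)}.get
wiki_stub = {"Example Ridge": (1.0, 2.0)}.get
manual_stub = {"Example Hill": (3.0, 4.0)}

cleaned = clean_ocr_text("The squad-\nron   moved\nnorth.")
result = geocode("Example Ridge", gns_stub, wiki_stub, manual_stub)
```

Passing the lookups in as plain callables keeps the fallback logic independent of any particular geocoding service, so each source can be swapped or tested on its own.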
Tools Used
- Python (Pandas, OpenPyXL, JSON)
- Excel (manual data auditing). I am also writing a Python app, CanvaMap, to facilitate the validation and correction process.
- ArcGIS / QGIS (geospatial review and mapping)
License
This dataset is released for research, educational, and non-commercial use.
Please cite original sources when referencing derived content.
Contact
For questions, feedback, or collaboration inquiries:
Andrea Siotto
LinkedIn Profile