NAIRR Pilot Resources
This page aggregates government and government-supported resources aligned with the NAIRR Pilot goals, such as pre-trained models, AI-ready datasets, and relevant platforms. This list does not include allocatable computational resources; please see the call for allocations for those resources.
The Responsible AI (RAI) Toolkit is a collection of guidance, processes, and resources to aid in RAI development. It includes the interactive SHIELD assessment tool, as well as a large list of tools. The toolkit is supported by the Department of Defense Chief Digital and Artificial Intelligence Office (CDAO).
DOE’s Argonne Leadership Computing Facility (ALCF) will offer video lectures that introduce AI on high-performance computing to undergraduate and graduate students at U.S. universities and community colleges. The Spring 2024 training will focus on large language models (LLMs) and their scientific applications. Basic experience with Python is needed, but supercomputing or AI knowledge is not required.
This repository contains several NASA Earth Science AI training datasets, including synthetic aperture radar (SAR) raster imagery for various flood events in the United States and Bangladesh, tropical storm wind speed data from the Atlantic and East Pacific Oceans from 2000 to 2019, marine debris detection on the ocean surface, and field boundary segmentation for Rwanda.
HLS FM is a geospatial foundation model created in partnership with IBM using NASA’s Harmonized Landsat Sentinel-2 (HLS) dataset. The model is available on Hugging Face, and the source code is available on GitHub. See also the publication, the 2023 summer school training material on using HLS FM, and a YouTube video on HLS FM.
AI training dataset for satellite image-based burn scar detection.
AI training dataset for satellite image-based multi-temporal crop classification.
A set of principles and best practices for applying AI within scientific research and applications, built in partnership with the American Geophysical Union (AGU).
Created in collaboration with IBM, this specialized language model was trained on a scientific corpus collected from relevant publications (NASA ADS, AGU, AMS, PubMed). The initial version and a sentence-transformer model built upon this domain-adapted encoder are accessible on Hugging Face.
This repository of AI training datasets resides on AWS and includes datasets from all science divisions.
ImmPort from NIAID is a publicly accessible data sharing platform supporting immunology research and clinical studies. ImmPort offers curated datasets and reference datasets that adhere to the FAIR Principles and is certified by CoreTrustSeal.
MIDRC is a curated, open medical imaging data repository and commons to enable AI algorithm development. MIDRC is also available via the NCATS N3C secure data enclave (see N3C). Open MIDRC data, consisting of de-identified medical imaging data, are available directly via the link.
N3C is the largest open access longitudinal de-identified repository of row-level data on COVID-19 patients and matched controls in the USA. N3C currently contains data on 21 million patients in 30 billion rows from all 50 states. In addition, N3C clinical data are multimodal, with linkage to claims, mortality, viral variant, social determinants of health (SDOH), and registry data. N3C is associated with the NAIRR Secure Effort.
The NIH/NIMHD Science Collaborative for Health disparities and Artificial intelligence bias REduction (ScHARe) is a cloud-based platform for population science including social determinants of health (SDOH), and data sets designed to accelerate research in health disparities, health and healthcare delivery outcomes, and artificial intelligence (AI) bias mitigation strategies.
The NOAA Open Data Dissemination program includes hundreds of NOAA datasets for earth systems and AI/ML applications, updated on a quarterly basis. The datasets are organized by the NOAA Line Office and programmatic area that generated the original dataset.
TC PRIMED is a dataset that is suitable for multiple AI/ML applications centered around passive microwave observations of global tropical cyclones from low-Earth-orbiting satellites. Jupyter Notebooks are available for AI developers at: github.com/noaa-ncai/learning-journey/tree/main/tcprimed
The Census of Agriculture is a complete survey of U.S. farms and ranches conducted every five years by USDA’s National Agricultural Statistics Service. It covers land use and ownership, operator characteristics, production practices, income and expenditures, and more. Quick Stats allows searchable access to the 6 million data points in the 2022 census (released February 2024) and other NASS data products. Downloadable compressed files are also available.
These data consist of down-looking images of Lake Michigan benthos, collected in 2020 and 2021 with an autonomous underwater vehicle (AUV). Information about each image (i.e., latitude, longitude, depth from surface, altitude, roll, pitch, yaw, and creation time) can be found in the associated CSV file. Substrate type was divided into 9 classes based on the Coastal and Marine Ecological Classification Standard (CMECS), and each image was assigned a substrate class by at least 3 trained labelers.
These data were compiled for training machine learning (GeoAI) models to detect and delineate natural features. The natural feature classes include the Geographic Names Information System (GNIS) feature types Basins, Bays, Bends, Craters, Gaps, Guts, Islands, Lakes, Ridges, and Valleys, and are an areal representation of those GNIS point features. Features were produced using heads-up digitizing from 2018 to 2019 by Dr. Sam Arundel's team at the U.S. Geological Survey, Center of Excellence for Geospatial Information Science, Rolla, Missouri, USA, and Dr. Wenwen Li's team in the School of Geographical Sciences at Arizona State University, Tempe, Arizona, USA.
These image/label pairs were created to add to larger benchmark text datasets to enable the detection and recognition of spot elevations on historical topographic maps of the U.S. (the Historical Topographic Map Collection, HTMC). The pairs were created by manually heads-up digitizing a bounding box around each of the 350 spot elevations as a whole and one around each individual character in the spot elevation. These rectangles were used to request images from the HTMC service at https://services.arcgisonline.com/arcgis/rest/services/USA_Topo_Maps/MapServer and to automatically create the accompanying labels.
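The description above says that bounding rectangles were used to request images from the map service, but it does not say which request type was used. The following is a minimal sketch of one way this could work, assuming the service exposes the standard ArcGIS REST `export` operation. The function name `export_url`, the image size, the WGS84 spatial reference, and the sample coordinates are all illustrative assumptions, not details of the original workflow.

```python
from urllib.parse import urlencode

# Service URL quoted in the dataset description above.
HTMC_SERVICE = (
    "https://services.arcgisonline.com/arcgis/rest/services/"
    "USA_Topo_Maps/MapServer"
)


def export_url(xmin, ymin, xmax, ymax, width=256, height=256, wkid=4326):
    """Build an ArcGIS REST `export` request URL for one bounding box.

    Assumption: the box is expressed in the spatial reference `wkid`
    (4326 = WGS84 longitude/latitude). The coordinate system and image
    size used by the dataset authors are not stated in the description.
    """
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("bounding box must satisfy xmin < xmax and ymin < ymax")
    params = {
        "bbox": f"{xmin},{ymin},{xmax},{ymax}",
        "bboxSR": wkid,
        "size": f"{width},{height}",
        "format": "png",
        "f": "image",  # ask for the image itself rather than JSON metadata
    }
    return f"{HTMC_SERVICE}/export?{urlencode(params)}"


# Hypothetical bounding box; the coordinates are illustrative only.
url = export_url(-91.80, 37.94, -91.79, 37.95)
print(url)
```

The sketch only builds the request URL and does not contact the service. Fetching the image and deriving a label from the digitized rectangle would be separate steps.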
The National Geospatial Program (NGP) is contributing essential data holdings through The National Map (TNM). TNM offers a comprehensive range of topographic information vital for various applications. These resources encompass topographic maps and geographic information system (GIS) data covering a wide array of geographic features and attributes crucial for research, education, and decision-making processes.
The data include detailed information on elevation, hydrography, watersheds, geographic names, orthoimagery, governmental units/boundaries, transportation networks, and land cover. This wealth of data serves as a foundational resource for understanding the landscape and environment of the nation, facilitating analysis, visualization, and modeling across diverse disciplines.
Extracting useful and accurate information from scanned geologic and other earth science maps is a time-consuming and laborious process involving manual human effort. To address this limitation, the USGS partnered with the Defense Advanced Research Projects Agency (DARPA) to run the AI for Critical Mineral Assessment Competition, soliciting innovative solutions for automatically georeferencing and extracting features from maps. The competition opened for registration in August 2022 and concluded in December 2022. Training and validation data from the competition are provided here, as well as competition details and baseline solutions. The data are derived from published sources and are provided to the public to support continued development of automated georeferencing and feature extraction tools. References for all maps are included with the data.
The USGS collaborates with local, state, and federal partners to gather and incorporate water-use data with other datasets covering climate, population, geography, system characteristics, land use, social factors, and economics. This information is then integrated into a modeling framework to generate national estimates of water withdrawal and consumption (evapotranspiration of withdrawn water) from both groundwater and surface-water sources. These estimates are crucial for understanding how water is used and for assessing the balance between water supply and demand. Withdrawals for public supply water use, and withdrawals and consumptive use for irrigation water use, are estimated for each month of the period from 2000 to 2020 for all watersheds at the 12-digit hydrologic unit code level (HUC-12) in the conterminous United States. The withdrawal and consumptive use estimates for thermoelectric power water use are available for each month of the period from 2008 to 2020, by power plant. The models provide estimates at finer temporal and spatial resolution than previous annual, county-level estimates published by the USGS.
See https://www.usgs.gov/mission-areas/water-resources/science/water-use-united-states#overview for details.
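HUC-12 codes such as those used in the water-use estimates above are nested: each additional pair of digits identifies a smaller unit inside the preceding one, so a 12-digit code also encodes its enclosing 2-, 4-, 6-, 8-, and 10-digit units. The sketch below shows how a HUC-12 code might be split into these levels when aggregating estimates to larger watersheds. The function name `huc_parents` and the sample code are hypothetical and are not taken from the USGS data release.

```python
def huc_parents(huc12: str) -> dict[str, str]:
    """Return the enclosing hydrologic units of a HUC-12 code, keyed by level."""
    # Codes can begin with zero, so they must be kept as strings, not integers.
    if len(huc12) != 12 or not huc12.isdigit():
        raise ValueError("a HUC-12 code is exactly 12 digits; keep it as a string")
    return {f"HUC-{n}": huc12[:n] for n in (2, 4, 6, 8, 10, 12)}


# Hypothetical code, used only to show the slicing.
levels = huc_parents("071200040506")
print(levels["HUC-8"])
```

Keeping the codes as strings matters in practice: reading them as integers from a CSV file would drop leading zeros and break the prefix relationship.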
USPTO patents and trademarks offer a rich and diverse dataset of complex language, covering a wide range of technical topics in a structured format. This page provides a single repository for obtaining raw bulk downloads of patent and trademark publications and grants for use in the development of robust language models.
This page hosts a collection of patent and trademark datasets related to intellectual property, entrepreneurship, and innovation topics. The datasets and corresponding supplementary documentation are specifically formatted to aid academic researchers.