Dataerai Provides Input to DOE on “Partnerships for Transformational AI Models”
Dataerai supports DOE’s plan to establish a public-private consortium for AI-ready scientific data curation to accelerate the Genesis Mission. Dataerai recommends four measures: automating AI-ready data curation at the source, enforcing federal data-sharing mandates through performance metrics and budget evaluations, enabling multi-modal curation at scale, and providing secure platforms for cross-institutional collaboration. Together, these measures would transform siloed research data into a foundation for national AI leadership.
Dataerai is pleased to submit our response to the U.S. Department of Energy (DOE) and Advanced Scientific Computing Research (ASCR) RFI DE-ASCR-26-0001. As an organization that grew out of the Oak Ridge National Laboratory (ORNL) ecosystem and was co-developed with the academic community, Dataerai understands that the Genesis Mission requires a paradigm shift to enable collaborative discoveries that leverage the best of human and artificial intelligence.
About Dataerai
Dataerai (pronounced data-array) is a technology company dedicated to “Building Collective Intelligence” from research data. We develop platforms that help research institutions, agencies, and companies curate and share scientific data securely, efficiently, and in formats ready for AI analysis. Our goal is to provide the tools needed to deliver AI-powered translational science.
Dataerai was founded to address data management and storage gaps in scientific research. The company traces its origins to DOE’s Oak Ridge National Laboratory. In 2018, facing significant data management challenges in his research group, Dataerai co-founder and CTO Dr. Joshua Agar partnered with ORNL researchers to pilot DataFed, a federated scientific data management system, in a university setting for the first time. Seeing the need for improved data management at enterprise scale, Dr. Agar and entrepreneur Tim Keuhlhorn later founded Dataerai to provide a platform for scientific researchers to manage and store their data, transforming this critical resource into AI-ready, high-value assets for science and innovation.
Today, Dataerai’s cloud-based platform provides a one-stop data management solution for researchers by reducing the burden of data curation, collation, preservation, and publication, while ensuring FAIR compliance. This turns existing data into AI assets. At the same time, agencies can use the platform to track the productivity of their awardees, projects, and downstream data use to measure the full impact of data assets. The platform simplifies data curation and turns what is currently a burden into a badge of achievement, while simultaneously making enforcement efficient and facilitating the production of AI-ready data records from the start.
Summary
Dataerai commends DOE for its leadership in launching the Genesis Mission. By establishing a public-private consortium to curate scientific data across the National Laboratory complex, DOE is taking the necessary steps to ensure that the United States maintains its global dominance in AI as outlined in America’s AI Action Plan.
The fundamental bottleneck to transformational AI for Science is not a lack of compute or talent, but the lack of AI-ready, structured data. Dataerai proposes a strategy centered on automating data curation at the point of origin, leveraging federated architectures to bridge the gap between “locked” lab data and the broader scientific community.
Detailed Recommendations
Establish a public-private AI for Science consortium that includes data management and storage companies as key contributors. A successful AI for Science consortium must include more than just AI model developers; it requires every part of the supply chain, including data, hardware, and software, to be represented. Solving data curation, management, and storage challenges is critical to unlocking AI for Science models and systems. As such, companies providing data management and storage solutions must be an integral part of any such consortium.
Automate AI-ready data curation at the source, from shared scientific facilities and instruments at the National Laboratories. Science is a data-driven discipline. Typically, scientific researchers analyze and manage data manually on local file systems, which limits the scale and impact of that data. As recognized in America’s AI Action Plan, high-quality, AI-ready scientific datasets can enable automated and agentic measurement systems to provide crucial insight into complex phenomena. However, most scientists lack the resources and knowledge to preserve, curate, search, and compute with their data. Researchers need an optimized solution that provides a simple gateway to AI-ready data. By automating data curation at its source, DOE can ensure that data is AI-ready the moment it is generated.
Dataerai is working with scientific instrument manufacturers to create turnkey solutions that collect data automatically. These solutions extract metadata and transfer data to a central repository without manual steps, so that records are AI-ready from the moment they are generated. The same approach can be adapted to serve the needs of the National Laboratories.
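As a rough illustration of what metadata extraction at the source can look like, the Python sketch below writes a small JSON record next to a raw instrument file as soon as the file exists. It is a minimal sketch under stated assumptions, not a description of Dataerai’s platform: the function names, the record fields, the `.meta.json` suffix, and the `instrument_id` value are all hypothetical, and the transfer to a central repository is left out.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_record(data_file: Path, instrument_id: str) -> dict:
    """Describe one raw output file with basic, machine-readable metadata."""
    payload = data_file.read_bytes()
    stat = data_file.stat()
    return {
        "file_name": data_file.name,
        "size_bytes": stat.st_size,
        # A checksum lets a repository verify the file after transfer.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Hypothetical field: a real system would take this from the instrument.
        "instrument_id": instrument_id,
    }


def write_sidecar(data_file: Path, record: dict) -> Path:
    """Store the record beside the data file as '<name>.meta.json'."""
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar


def demo() -> dict:
    """Create a stand-in data file, describe it, and read the record back."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "scan_0001.dat"
        raw.write_bytes(b"example detector output")
        record = build_record(raw, instrument_id="microscope-01")
        sidecar = write_sidecar(raw, record)
        return json.loads(sidecar.read_text())


if __name__ == "__main__":
    print(json.dumps(demo(), indent=2))
```

Writing the record when the file is created, rather than asking a researcher to fill it in later, is the point of curating at the source: the checksum, timestamp, and instrument identifier travel with the data from the start.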
Enforce existing requirements for federally funded researchers to disclose non-proprietary scientific datasets by requiring 1) metrics for data sharing and reuse and 2) evaluations of budget justifications for data preservation and sharing. To realize the full potential of the Genesis Mission and maximize the return on investment for taxpayers, DOE must move beyond “suggested” data-sharing practices and actively enforce existing mandates for federally funded research. Doing so will maximize the potential for federal research investments to unlock AI-enabled scientific discovery.
Under the OPEN Government Data Act of 2019 (Title II of the Foundations for Evidence-Based Policymaking Act of 2018, Public Law 115-435), signed into law by President Trump, scientists who receive federal research funding are required to make non-proprietary scientific data public. Despite this requirement, enforcement has been weak across U.S. federal agencies. A significant portion of high-value scientific data remains siloed behind institutional firewalls, and institutions invoke alternative policies to justify minimal compliance. Stronger enforcement of this disclosure requirement ensures that taxpayer-funded research serves as a foundation for national AI dominance rather than an untapped resource.
DOE should require researchers to disclose non-proprietary scientific datasets created under their funded projects. By linking data disclosure directly to research output, DOE can build a self-sustaining ecosystem of AI-ready, FAIR data repositories that scale alongside the models they power. To enforce these policies, DOE should require metrics for data sharing and reuse as part of annual project reports, proposals, and current support, and should require reviewers to evaluate budget justifications to ensure that costs for data preservation and sharing are reasonable.
Dataerai’s platform provides a complete solution for scientists to comply with these federal requirements. With its automated systems for data curation, metadata extraction, data transfer and storage, and secure administrative controls, our solution is purpose-built to support DOE’s enforcement of this critical policy.
Enable multi-modal scientific data curation at scale. By its nature, scientific data is heterogeneous and multi-modal, ranging from a few megabytes of local experimental results to petabytes of data from national user facilities. A scientific data collection and management system for the National Laboratories must be flexible enough to handle this full range seamlessly. Dataerai’s platform relies on technology that transfers both large numbers of small files and very large files efficiently, making it a scalable solution for almost all scientific workflows.
Provide a platform for data-sharing across institutions with low barriers to access, while providing options for secure access to sensitive Federal data. To accelerate the Genesis Mission, DOE must implement a data-sharing platform that lowers the barrier to entry for the scientific community while maintaining rigorous controls over sensitive Federal assets.
For non-proprietary datasets, Dataerai provides an access gateway that adheres to FAIR principles, allowing academic and private-sector researchers to quickly ingest AI-ready data. Data can be shared seamlessly across the National Laboratory complex, universities, and industry partners, breaking down the silos that currently hinder U.S. AI dominance. For sensitive or proprietary Federal data, Dataerai restricts access to specified users or institutions. This ensures that transformational AI models can be trained on high-value data within DOE’s security perimeter.
We thank you for the opportunity to provide our input on this important undertaking. We look forward to working with DOE to ensure the Genesis Mission is a success.