{"id":862,"date":"2026-05-07T16:37:32","date_gmt":"2026-05-07T08:37:32","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=862"},"modified":"2026-05-07T16:37:32","modified_gmt":"2026-05-07T08:37:32","slug":"meta-ai-releases-neuralbench-a-unified-open-source-framework-to-benchmark-neuroai-models-across-36-eeg-tasks-and-94-datasets","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=862","title":{"rendered":"Meta AI Releases NeuralBench: A Unified Open-Source Framework to Benchmark NeuroAI Models Across 36 EEG Tasks and 94 Datasets"},"content":{"rendered":"<p>Evaluating AI models trained on brain signals has long been messy and inconsistent. Different research groups use different preprocessing pipelines, train models on different datasets, and report results on a narrow set of tasks, making it nearly impossible to know which model actually works best, or for what. A new framework from the Meta AI team is designed to fix that.<\/p>\n<p>Meta researchers have released <strong>NeuralBench<\/strong>, a unified, open-source framework for benchmarking AI models of brain activity. 
Its first release, <strong>NeuralBench-EEG v1.0<\/strong>, is the largest open benchmark of its kind: 36 downstream tasks, 94 datasets, 9,478 subjects, 13,603 hours of electroencephalography (EEG) data, and 14 deep learning architectures evaluated under a single standardized interface.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1820\" height=\"612\" data-attachment-id=\"79620\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/07\/meta-ai-releases-neuralbench-a-unified-open-source-framework-to-benchmark-neuroai-models-across-36-eeg-tasks-and-94-datasets\/screenshot-2026-05-07-at-1-34-54-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.34.54-AM-1.png\" data-orig-size=\"1820,612\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-07 at 1.34.54\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.34.54-AM-1-1024x344.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.34.54-AM-1.png\" alt=\"\" class=\"wp-image-79620\" \/><figcaption class=\"wp-element-caption\">https:\/\/ai.meta.com\/research\/publications\/neuralbench-a-unifying-framework-to-benchmark-neuroai-models\/<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Problem NeuralBench Solves<\/strong><\/h3>\n<p>The broader field of NeuroAI, where deep learning meets neuroscience, has grown rapidly in recent years. 
Self-supervised learning techniques originally developed for language, speech, and images are now being adapted to build <em>brain foundation models<\/em>: large models pretrained on unlabeled brain recordings and fine-tuned for downstream tasks ranging from clinical seizure detection to decoding what a person is seeing or hearing.<\/p>\n<p>But the evaluation landscape has been badly fragmented. Existing benchmarks like MOABB cover up to 148 brain-computer interfacing (BCI) datasets but limit evaluation to just 5 downstream tasks. Other efforts (EEG-Bench, EEG-FM-Bench, AdaBrain-Bench) are each constrained in their own ways. For modalities like magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI), there is no systematic benchmark at all.<\/p>\n<p>The result: claims about foundation models being \u201cgeneralizable\u201d or \u201cfoundational\u201d often rest on cherry-picked tasks with no common reference point.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What is NeuralBench?<\/strong><\/h3>\n<p>NeuralBench is built on <strong>three core Python packages that form a modular pipeline.<\/strong><\/p>\n<p><strong>NeuralFetch<\/strong> handles dataset acquisition, pulling curated data from public repositories including OpenNeuro, DANDI, and NEMAR. <strong>NeuralSet<\/strong> prepares data as PyTorch-ready dataloaders, wrapping existing neuroscience tools like MNE-Python and nilearn for preprocessing, and HuggingFace for extracting stimulus embeddings (for tasks involving images, speech, or text). <strong>NeuralTrain<\/strong> provides modular training code built on PyTorch-Lightning, Pydantic, and the <code>exca<\/code> execution and caching library.<\/p>\n<p>Once installed with <code>pip install neuralbench<\/code>, the framework is driven through a command-line interface (CLI). Running a task takes three commands: download the data, prepare the cache, and execute. 
Every task is configured through a lightweight YAML file that specifies the data source, train\/validation\/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1618\" height=\"1270\" data-attachment-id=\"79622\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/07\/meta-ai-releases-neuralbench-a-unified-open-source-framework-to-benchmark-neuroai-models-across-36-eeg-tasks-and-94-datasets\/screenshot-2026-05-07-at-1-35-55-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.35.55-AM-1.png\" data-orig-size=\"1618,1270\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-07 at 1.35.55\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.35.55-AM-1-1024x804.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.35.55-AM-1.png\" alt=\"\" class=\"wp-image-79622\" \/><figcaption class=\"wp-element-caption\">https:\/\/ai.meta.com\/research\/publications\/neuralbench-a-unifying-framework-to-benchmark-neuroai-models\/<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What NeuralBench-EEG v1.0 Covers<\/strong><\/h3>\n<p>The first release focuses on EEG and spans eight task categories: <strong>cognitive decoding<\/strong> (image, sentence, speech, typing, video, and word decoding), <strong>brain-computer interfacing (BCI)<\/strong>, <strong>evoked responses<\/strong>, <strong>clinical 
tasks<\/strong>, <strong>internal state<\/strong>, <strong>sleep<\/strong>, <strong>phenotyping<\/strong>, and <strong>miscellaneous<\/strong>.<\/p>\n<p><strong>Three classes of models are compared:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Task-specific architectures<\/strong> (~1.5K\u20134.2M parameters, trained from scratch): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, and CTNet.<\/li>\n<li><strong>EEG foundation models<\/strong> (~3.2M\u2013157.1M parameters, pretrained and fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, and REVE.<\/li>\n<li><strong>Handcrafted feature baselines<\/strong>: sklearn-style pipelines using symmetric positive definite (SPD) matrix representations fed into logistic or ridge regression.<\/li>\n<\/ul>\n<p>All foundation models are fine-tuned end-to-end using a shared training recipe: AdamW optimizer, learning rate of 10\u207b\u2074, weight decay of 0.05, cosine annealing with 10% warmup, and up to 50 epochs with early stopping (patience=10). The sole exception is BENDR, for which the learning rate is lowered to 10\u207b\u2075 and gradient clipping is applied at 0.5 to obtain stable learning curves. Apart from that exception, this intentional standardization removes model-specific optimization tricks (such as layer-wise learning rate decay, two-stage probing, or LoRA) so that architecture and pretraining methodology are what actually get evaluated.<\/p>\n<p>Data splitting is handled differently per task type to reflect real-world generalization constraints: predefined splits where the dataset authors provide them, <em>leave-concept-out<\/em> for cognitive decoding tasks (all subjects seen in training, but a held-out set of stimuli used for testing), cross-subject splits for most clinical and BCI tasks, and within-subject splits for datasets with very few participants. 
Each model is trained three times per task, once for each of three random seeds.<\/p>\n<p>Evaluation metrics are standardized by task type: balanced accuracy for binary and multiclass classification, macro F1-score for multilabel classification, Pearson correlation for regression, and top-5 accuracy for retrieval tasks. All results are additionally reported as normalized scores (s\u0303), where 0 corresponds to dummy-level performance and 1 corresponds to perfect performance, enabling fair cross-task comparisons regardless of metric scale.<\/p>\n<p>One important methodological note: some EEG foundation models were pretrained on datasets that overlap with NeuralBench\u2019s downstream evaluation sets. Rather than discarding these results, the benchmark flags them with hashed bars in result figures so readers can identify potential pretraining data leakage. No strong trend suggesting that leakage inflates performance was observed, but flagging the overlap keeps the comparison transparent.<\/p>\n<p>The benchmark offers two variants: <strong>NeuralBench-EEG-Core v1.0<\/strong>, which uses a single representative dataset per task for broad coverage, and <strong>NeuralBench-EEG-Full v1.0<\/strong>, which expands to up to 24 datasets per task to study within-task variability across recording hardware, labs, and subject populations. 
A Kendall\u2019s \u03c4 of 0.926 (p &lt; 0.001) between Core and Full rankings confirms that the Core variant is a reliable proxy \u2014 though a few model positions do shift, including CTNet overtaking LUNA when more datasets are included.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1594\" height=\"1066\" data-attachment-id=\"79624\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/07\/meta-ai-releases-neuralbench-a-unified-open-source-framework-to-benchmark-neuroai-models-across-36-eeg-tasks-and-94-datasets\/screenshot-2026-05-07-at-1-36-48-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.36.48-AM-1.png\" data-orig-size=\"1594,1066\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-07 at 1.36.48\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.36.48-AM-1-1024x685.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-07-at-1.36.48-AM-1.png\" alt=\"\" class=\"wp-image-79624\" \/><figcaption class=\"wp-element-caption\">https:\/\/ai.meta.com\/research\/publications\/neuralbench-a-unifying-framework-to-benchmark-neuroai-models\/<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Two Key Findings<\/strong><\/h3>\n<p><strong>Finding 1: Foundation models only marginally outperform task-specific models.<\/strong> The top-ranked models overall are REVE (69.2M parameters, mean normalized rank 0.20), LaBraM (5.8M, rank 0.21), and LUNA (40.4M, rank 0.30). 
But several task-specific models trained from scratch \u2014 CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43) \u2014 trail closely behind. CTNet actually overtakes the LUNA foundation model to rank third in the Full variant, despite having roughly 270\u00d7 fewer parameters. This shows the gap between task-specific and foundation models is narrow enough that expanding dataset coverage alone is sufficient to change global rankings.<\/p>\n<p><strong>Finding 2: Many tasks remain genuinely hard.<\/strong> Cognitive decoding tasks \u2014 recovering dense representations of images, speech, sentences, video, or words from brain activity \u2014 are particularly challenging, with even the best models scoring well below ceiling. Tasks like mental imagery, sleep arousal, psychopathology decoding, and cross-subject motor imagery and P300 classification frequently yield performance close to dummy level. These tasks represent the best benchmarks for stress-testing the next generation of EEG foundation models.<\/p>\n<p>Tasks approaching saturation include SSVEP classification, pathology detection, seizure detection, sleep stage classification, and phenotyping tasks like age regression and sex classification.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Beyond EEG: MEG and fMRI<\/strong><\/h3>\n<p>Even in this initial EEG-focused release, NeuralBench already supports MEG and fMRI tasks as proof of concept. Notably, the REVE model \u2014 pretrained exclusively on EEG data \u2014 achieves the best performance among all tested models on the typing decoding task in MEG. 
This is a striking early signal that EEG-pretrained representations may transfer meaningfully across brain recording modalities, a hypothesis the framework is positioned to rigorously test in future releases.<\/p>\n<p>The infrastructure is explicitly designed for expansion to intracranial EEG (iEEG), functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).<\/p>\n<h3 class=\"wp-block-heading\"><strong>How to Get Started<\/strong><\/h3>\n<p>Installation takes a single command: <code>pip install neuralbench<\/code>. From there, running the audiovisual stimulus classification task on EEG looks like this:<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">neuralbench eeg audiovisual_stimulus --download   # Download data\nneuralbench eeg audiovisual_stimulus --prepare    # Prepare cache\nneuralbench eeg audiovisual_stimulus              # Run the task<\/code><\/pre>\n<\/div>\n<\/div>\n<p>To run all 36 tasks against all 14 EEG models, add the <code>-m all_classic all_fm<\/code> flag, which handles the orchestration. 
The full benchmark is resource-intensive: approximately 11 TB of storage in total (~3.2 TB raw data, ~7.8 TB preprocessed cache, ~333 GB logged results), plus one GPU with at least 32 GB of VRAM per job, although average peak GPU memory usage measured across experiments is only ~1.3 GB (maximum ~30.3 GB).<\/p>\n<p>A complete NeuralBench-EEG-Full v1.0 run requires approximately 1,751 GPU-hours across 4,947 experiments.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li>Meta AI\u2019s NeuralBench-EEG v1.0 is an open EEG benchmark: 36 tasks, 94 datasets, 9,478 subjects, and 14 deep learning architectures under one standardized interface.<\/li>\n<li>Despite having far more parameters (LUNA has roughly 270\u00d7 as many as CTNet\u2019s 150K), EEG foundation models only marginally outperform lightweight task-specific models across the benchmark.<\/li>\n<li>Cognitive decoding tasks (speech, video, sentence, word decoding from brain activity) remain highly challenging, and tasks such as mental imagery, sleep arousal, and psychopathology decoding frequently yield performance close to dummy level.<\/li>\n<li>REVE, pretrained only on EEG data, outperformed all other tested models on MEG typing decoding, an early signal of meaningful cross-modality transfer.<\/li>\n<li>NeuralBench is MIT-licensed.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/ai.meta.com\/research\/publications\/neuralbench-a-unifying-framework-to-benchmark-neuroai-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong> and <strong><a href=\"https:\/\/github.com\/facebookresearch\/neuroai\/tree\/main\/neuralbench-repo\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/07\/meta-ai-releases-neuralbench-a-unified-open-source-framework-to-benchmark-neuroai-models-across-36-eeg-tasks-and-94-datasets\/\">Meta AI Releases NeuralBench: A Unified Open-Source Framework to Benchmark NeuroAI Models Across 36 EEG Tasks and 94 Datasets<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Evaluating AI models trained 
o&hellip;<\/p>\n","protected":false},"author":1,"featured_media":863,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-862","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/862","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=862"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/862\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/863"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=862"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=862"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=862"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}