Modern machine learning methods for interpreting the electrical activity of the brain — dissertation and author's abstract, VAK RF specialty 00.00.00, Candidate of Sciences Artur Tigranovich Petrosyan

  • Artur Tigranovich Petrosyan
  • Candidate of Sciences
  • 2023, National Research University Higher School of Economics (HSE University)
  • VAK RF specialty 00.00.00
  • Number of pages: 104
Petrosyan, Artur Tigranovich. Modern machine learning methods for interpreting the electrical activity of the brain: Candidate of Sciences dissertation, specialty 00.00.00 (Other specialties). National Research University Higher School of Economics, 2023. 104 pp.

Dissertation table of contents — Candidate of Sciences Artur Tigranovich Petrosyan

Contents

1 Introduction

1.1 Object of research

1.2 Aims and objectives of the research

1.3 Main ideas, results, and conclusions of the dissertation

1.4 Theoretical and practical significance of the presented results

1.5 The author's contribution to the research

1.6 Publications and approbation of the work

1.6.1 First-tier publications

1.6.2 Regular publications

1.6.3 Other publications

1.6.4 Talks at conferences and seminars


Introduction to the dissertation (part of the author's abstract) on the topic "Modern machine learning methods for interpreting the electrical activity of the brain"

2 Content of the work  13

2.1 A compact neural network architecture reflecting current scientific understanding of the origin of neuroelectrophysiological activity  13

2.1.1 Phenomenological model  13

2.1.2 Neural network architecture  15

2.1.3 Two regression problems and interpretation of the network weights  17

2.1.4 Realistic simulations  19

2.2 Decoding and interpreting cortical signals with a compact convolutional neural network  21

2.2.1 Overview of existing methods for decoding motor data  22

2.2.2 Decoding movements on the Berlin BCI Competition IV data  24

2.2.3 Decoding finger kinematics from ECoG data  25

2.2.4 Decoding movement classes from EEG data  25

2.3 Speech decoding from a small set of spatially segregated minimally invasive intracranial EEG electrodes with a compact and interpretable neural network  27

2.3.1 Introduction and existing methods  27

2.3.2 Neural network architecture and its interpretation  30

2.3.3 Studying the influence of the internal speech representation on decoding quality  32

2.3.4 Synchronous and asynchronous modes  33

3 Conclusion  37

3.1 List of results submitted for defense  37

Appendix  39

Appendix 1. Paper. Decoding and interpreting cortical signals with a compact convolutional neural network  39

Appendix 2. Paper. Speech Decoding From A Small Set Of Spatially Segregated Minimally Invasive Intracranial EEG Electrodes With A Compact And Interpretable Neural Network  58

Appendix 3. Paper. Linear Systems Theoretic Approach to Interpretation of Spatial and Temporal Weights in Compact CNNs: Monte-Carlo Study  82

Appendix 4. Paper. Decoding neural signals with a compact and interpretable convolutional neural network  89

Appendix 5. Paper. Compact and interpretable architecture for speech decoding from stereotactic EEG  99

Acknowledgements  104

1 Introduction

1.1 Object of research

Brain-computer interfaces (BCIs) directly connect the nervous system to external devices [51] or to another brain [41]. Although many applications exist [34], clinically relevant BCIs are of primary interest because they promise to rehabilitate patients with sensory, motor, and cognitive impairments [53, 31].

BCIs can operate on various signals reflecting the electrical activity of neurons in the brain [44, 27], such as electroencephalographic (EEG) potentials measured with electrodes placed on the scalp [49], or signals recorded invasively with intracortical electrodes penetrating the cerebral cortex [40] or placed on the cortical surface [48]. In general, methods of recording brain activity can be divided into invasive and non-invasive. The former involve a medical procedure of implanting electrodes onto the surface of the cortex (subdurally or epidurally), followed by recording and interpreting the activity signals of the neuronal population. At present, interfaces that record brain activity non-invasively do not provide the required information channel bandwidth, while the amount of information contained in invasively recorded signals significantly outweighs the difficulties and technical challenges associated with this technology.

A promising and minimally invasive way to gain direct access to the activity of the cerebral cortex is the use of stereo-EEG (sEEG) electrodes inserted stereotactically through a twist-drill or burr hole made in the skull. Recent advances in implantation techniques, including the use of 3D cerebral angiography, MRI, and robotic surgery, help reduce the risks of such implantation to nearly zero, making sEEG an ideal compromise for BCI applications [10]. ECoG electrode strips are another way to achieve direct electrical contact with cortical tissue with minimal discomfort to the patient [22].

An important component of neurointerface technology is the use of advanced machine learning methods. Among the many available approaches, both classical models and deep learning methods can be employed. Deep neural networks perform well in a range of mathematical and medical problems compared to other methods, which motivates attempts to apply these technologies to predictive tasks of decoding brain activity signals [50, 46]. However, one problem with decoding brain signals using deep learning algorithms is the low interpretability of the resulting decision rules, which often makes it impossible to vet the obtained solutions. Such vetting is needed, for example, to rule out the contribution of electrical correlates of neuromuscular activity to the automatically derived informative features.

The algorithms used to extract the relevant neuronal modulations are a key component of BCI systems. Most often they implement the stages of signal conditioning, feature extraction, and decoding. Modern machine learning prescribes performing the last two steps simultaneously using deep neural networks (DNNs) [46]. DNNs automatically discover meaningful features in the context of regression or classification tasks. A correct interpretation of the computations performed by a DNN makes it possible to verify that decoding is driven directly by brain activity. To support a physiologically meaningful interpretation, the structure of the DNN must meet certain requirements and be grounded in domain knowledge, which in this case prescribes the use of spatially distributed rhythmic brain activity [55] as the information substrate. Model interpretability in the tradition of Explainable AI can also benefit the automated knowledge discovery process [3].

Based on the above, the main object of this research is machine learning methods, and deep learning methods in particular, applied to brain signal decoding tasks, together with their interpretation and the construction of interpretable architectures.

1.2 Aims and objectives of the research

From the foregoing it is clear that applying machine learning methods to the task of decoding information from brain activity signals, and developing these approaches to ensure interpretability of the resulting decision rules, are pressing problems that directly affect the practical applicability of BCIs. The sought-after interpretability guarantees the reliability of the obtained results and opens new opportunities for studying the principles of brain function. Accordingly, the main goal of the research is to develop domain-informed neural network architectures together with algorithms for interpreting the corresponding weight coefficients, and to apply this machinery to the task of decoding neuronal activity in ideomotor and speech neurointerfaces. The dissertation research was carried out at the Center for Bioelectric Interfaces of HSE University, which develops invasive neurointerfaces for restoring motor and speech function. The following research objectives were formulated:

1. Develop a compact neural network architecture consistent with current scientific knowledge about the origin of electrophysiological activity, the mechanism of its propagation through tissue, and the physical principles of its registration.

2. Perform a comparative analysis of the quality of decoding finger kinematics and articulatory tract parameters from ECoG and stereo-EEG data achievable with the proposed compact neural network and with competing solutions.

3. Develop methods for interpreting the weight coefficients of the proposed neural network architecture in order to reveal the geometric characteristics of the key neuronal populations and the dynamical properties of their activity.

4. Implement real-time decoding of hand movement kinematics.

5. Implement speech decoding based on a minimal number of spatially segregated electrodes.

1.3 Main ideas, results, and conclusions of the dissertation

We proposed and comprehensively studied a compact convolutional network architecture for adaptive decoding of motor and speech phenomena from ECoG and stereo-EEG data. We also proposed a new, theoretically grounded approach to interpreting the spatial and temporal weights in our architecture and in similar ones that combine adaptation in both space and time. The resulting spatial and frequency patterns, characterizing the neuronal populations crucial for a particular decoding task, can be further analyzed with electromagnetic and dynamical models to characterize the location and activity parameters of the key neuronal populations.

We first tested our solution with realistic Monte Carlo simulations. Then, applied to the ECoG data from the Berlin BCI Competition IV dataset, our architecture performed on par with the competition winners while requiring no manual data preprocessing. Using the proposed weight interpretation approach, we were able to uncover the spatial and spectral patterns of the neuronal processes underlying the successful decoding of finger kinematics from an ECoG dataset recorded at the Center for Bioelectric Interfaces. Finally, we applied the method to a 32-channel EEG dataset of imagined movements and observed physiologically plausible spatial and frequency patterns of the key populations characteristic of the motor imagery task. We also deployed our architecture in real time with an actual patient and achieved high-quality decoding of the patient's finger kinematics exclusively from brain activity data. The corresponding details are described in [12, 13].

Next, we augmented our architecture with an LSTM layer and applied it to the task of decoding speech from invasive ECoG and stereo-EEG data. To this end, we collected 60 minutes of data (over two sessions) for each of two patients implanted with invasive electrodes. We then used only the contacts belonging to a single stereo-EEG shaft or a single ECoG strip to decode the neural activity into 26 words and one silence class. Interpretation of the network weights yielded a physiologically plausible result that agreed with the results of stimulation mapping.

We achieved on average 55% accuracy using only 6 channels of data recorded with a single minimally invasive sEEG electrode in the first patient, and 70% accuracy using only 8 channels of data recorded from a single ECoG strip in the second patient, in the classification of 26+1 spoken words. Our compact architecture required no pre-selected features, trained quickly, and produced a stable, interpretable, and physiologically meaningful decision rule. The spatial characteristics of the key neuronal populations corroborate the results of active and passive speech mapping and exhibit the inverse spatial-frequency relationship characteristic of neuronal activity. In comparisons with other architectures, our compact solution delivered higher classification accuracy than algorithms recently reported in the speech neural decoding literature, while using many times fewer minimally invasive electrodes and training on a compact amount of data.

Our study represents a first step toward ecologically valid invasive speech prostheses and demonstrates the principal feasibility of building them on top of minimally invasive brain activity recording technology. The details of this study are described in [2].

1.4 Theoretical and practical significance of the presented results

From a theoretical standpoint, we:

• For the first time, justified a neural network architecture on the basis of the model, generally accepted in electrophysiology, of observing the brain's electrical activity with a distributed set of electrodes.

• For the first time, proposed a theoretically grounded methodology for interpreting the weights of a compact neural network with factorized spatial-temporal processing, and performed the simulations needed to demonstrate that the proposed methodology works.

• Demonstrated the physiological plausibility of the obtained spatial and frequency patterns characterizing the key neuronal populations. The obtained information fully agreed with the results of active exploration of the patients' cortex aimed at localizing the speech cortex. In the motor task, the somatotopy observed in the spatial patterns fully matches the established view of the organization of the motor cortex.

From a practical standpoint, we:

• Implemented a prototype of a real-time invasive motor neurointerface.

• Proposed an architecture and a weight interpretation methodology that can be used to build classifiers for neurophysiological research. Interpreting the weight coefficients of such classifiers makes it possible to mine new knowledge about the neurophysiological processes under study.

• Implemented and tested a system for decoding speech from ECoG data. Our algorithm operated in the causal mode, i.e., it used only data from the past relative to the decoded time point. This lets us hope for a successful transfer of the achieved decoder performance to a real patient with impaired speech function.

• Investigated the ability of our speech interface to operate in the asynchronous mode, which is of great practical importance for translating our solution into clinical practice.

1.5 The author's contribution to the research

The author of this research developed the proposed methodology and neural network architecture as applied to the analysis of simulated and real data. The developed approach to interpreting the weights of a broad family of architectures was studied in detail by the author in Monte Carlo simulations. The author obtained all the results concerning the accuracy of the proposed algorithms on real data. The results of this work are described in two articles published in first-tier international journals and in three conference papers. The author is the first and lead author of all these works.

1.6 Publications and approbation of the work

1.6.1 First-tier publications

• Petrosyan A. et al. Decoding and interpreting cortical signals with a compact convolutional neural network // Journal of Neural Engineering (Q1). 2021. Vol. 18, No. 2. P. 026019 [7].

• Petrosyan A. et al. Speech Decoding From A Small Set Of Spatially Segregated Minimally Invasive Intracranial EEG Electrodes With A Compact And Interpretable Neural Network // Journal of Neural Engineering (Q1). 2022 [2].

1.6.2 Regular publications

• Petrosyan A., Lebedev M., Ossadtchi A. Linear Systems Theoretic Approach to Interpretation of Spatial and Temporal Weights in Compact CNNs: Monte-Carlo Study // Biologically Inspired Cognitive Architectures Meeting (Q4). Springer, Cham, 2020. P. 365-370 [13].

• Petrosyan A., Lebedev M., Ossadtchi A. Decoding neural signals with a compact and interpretable convolutional neural network // International Conference on Neuroinformatics (Q4). Springer, Cham, 2020. P. 420-428 [12].

• Petrosyan A., Voskoboinikov A., Ossadtchi A. Compact and interpretable architecture for speech decoding from stereotactic EEG // 2021 Third International Conference Neurotechnologies and Neurointerfaces. IEEE, 2021. P. 79-82 [5].

1.6.3 Other publications

• Petrosyan A. et al. Compact and Interpretable Architecture for Speech Decoding From iEEG // International Journal of Psychophysiology. 2021. Vol. 168. P. S195 [6].

• Volkova K., Petrosyan A., Dubyshkin I., Ossadtchi A. Decoding movement time-course from ECoG using deep learning and implications for bidirectional brain-computer interfacing [30].

1.6.4 Talks at conferences and seminars

• 2020 Annual International Conference on Brain-Inspired Cognitive Architectures for Artificial Intelligence (BICA*AI 2020), "Linear systems theoretic approach to interpretation of spatial and temporal weights in compact CNNs: Monte-Carlo study" (2020).

• XXII International Conference "Neuroinformatics-2020", "Decoding neural signals with a compact and interpretable convolutional neural network" (2020).

• BCI Samara, "Decoding neural signals with a compact and interpretable convolutional neural network" (2020).

• Report at the "Center for Bioelectric Interfaces" forum (2020).

• BCI Samara, "Compact and interpretable architecture for speech decoding from sEEG" (2021).

• 20th World Congress of Psychophysiology, "Compact and interpretable architecture for speech decoding from sEEG" (2021).

• The Third International Conference "Neurotechnologies and Neurointerfaces", "Compact and interpretable architecture for speech decoding from sEEG" (2021).

2 Content of the work

2.1 A compact neural network architecture reflecting current scientific understanding of the origin of neuroelectrophysiological activity

This section presents the main ideas of the paper [13].

Author's contribution: developed the neural network architecture, developed the method of its interpretation, and implemented the computer simulations (including Monte Carlo simulations).

2.1.1 Phenomenological model

Figure 1 illustrates a possible relationship between motor behavior (hand movements), brain activity, and ECoG recordings. The activity $e[n] = [e_1[n], \dots, e_I[n]]^T \in \mathbb{R}^I$ of a set of $I$ neuronal populations $G_1 - G_I$ involved in movement control is transformed into the movement trajectory $z[n]$ by a nonlinear transformation $H$: $z[n] = H(e[n])$, where $e[n]$ is the vector of envelopes of the source activity $s[n]$. The activity of another set of $J$ populations, $A_1 - A_J$, is unrelated to the movement. The recording of this activity with a set of $L$ sensors at time instant $n$ is represented by the $L \times 1$ vector of sensor signals $x[n] \in \mathbb{R}^L$. At each time instant $n$ this vector can be modeled as a linear mixture obtained by applying the forward-model matrices $G = [g_1, \dots, g_I] \in \mathbb{R}^{L \times I}$ and $A = [a_1, \dots, a_J] \in \mathbb{R}^{L \times J}$ to the vectors of task-related source activity $s[n] = [s_1[n], \dots, s_I[n]]^T$ and task-unrelated source activity $f[n] = [f_1[n], \dots, f_J[n]]^T$, respectively:

$$x[n] = G s[n] + A f[n] = \sum_{i=1}^{I} g_i s_i[n] + \sum_{j=1}^{J} a_j f_j[n] = \sum_{i=1}^{I} g_i s_i[n] + \eta[n]. \quad (1)$$

The column vectors $g_i$, $i = 1, \dots, I$, and $a_j$, $j = 1, \dots, J$, are the topographies of the task-related and task-unrelated sources, respectively. We refer to the noisy, task-unrelated component of the recording as $\eta[n] = \sum_{j=1}^{J} a_j f_j[n] \in \mathbb{R}^L$. A similar generative model was recently described in [14].

Given the linear generative model of the electrophysiological data, the inverse mapping used to recover the source activity from the sensor signals is also typically sought in linear form: $\hat{s}[n] = W^T x[n]$, where the columns of $W$ form spatial filters that counteract the effect of volume conduction and reduce the contribution of the noisy, task-unrelated sources.
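The forward model (1) and its linear inverse can be sketched in a few lines of numpy. This is an illustrative toy simulation, not the dissertation's actual data: the topographies are random Gaussian vectors standing in for the physics of volume conduction, and the spatial filters $W$ are obtained by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
L, I, J, T = 16, 2, 4, 2000          # sensors, task sources, noise sources, samples

# Hypothetical forward model: random topographies (assumption for illustration)
G = rng.standard_normal((L, I))      # task-related topographies g_i
A = rng.standard_normal((L, J))      # task-unrelated topographies a_j
s = rng.standard_normal((I, T))      # task-related source activity s[n]
f = rng.standard_normal((J, T))      # task-unrelated activity f[n]

# Equation (1): x[n] = G s[n] + A f[n] = G s[n] + eta[n]
x = G @ s + A @ f

# Linear inverse mapping: least-squares spatial filters W such that W^T x ~ s
W = np.linalg.lstsq(x.T, s.T, rcond=None)[0]      # shape (L, I)
s_hat = W.T @ x

# With more sensors than sources, the filters undo the mixing almost exactly
r = np.corrcoef(s[0], s_hat[0])[0, 1]
print(r > 0.95)
```

With $L = 16$ sensors and only $I + J = 6$ sources, the task-related activity lies in the span of the sensor signals, so the least-squares filters recover it almost perfectly; real recordings add sensor noise and model mismatch that degrade this.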

Figure 1: Phenomenological model

The neural correlates of motor planning and execution have been studied extensively [60]. In the cortical rhythm domain, the alpha and beta components of the sensorimotor rhythm desynchronize immediately before movement onset and rebound with a significant overshoot after the motor act is completed [35]. The magnitude of these modulations correlates with a person's ability to control a motor imagery BCI [36]. Moreover, the rate of beta bursts in the primary somatosensory cortex inversely correlates with the ability to detect tactile stimuli and also affects other motor functions. Intracranial recordings such as ECoG allow reliable measurement of faster gamma-band activity, which is specific in time and space to movement patterns [20] and is believed to accompany movement control and execution. Overall, based on a very solid body of research, the rhythmic components of brain sources $s[n]$ appear to be useful for BCI implementations. Given the linearity of the generative model (1), these rhythmic signals, reflecting the activity of specific neuronal populations, can be computed as linear combinations of narrow-band filtered sensor data $x[n]$.

The simplest approach to extracting the kinematics $z[n]$ from the brain recordings $x[n]$ is to use simultaneously recorded data and directly learn the mapping $z[n] = H(x[n])$. To implement it in practice, this mapping must be described parametrically. For this purpose we used a specific neural network architecture. The architecture was built in close correspondence with the observation equation (1) and the neurophysiological description of the observed phenomena illustrated in Figure 1, which improved our ability to interpret the results.

2.1.2 Neural network architecture

The compact and adaptable architecture (ED-net) that we used here is shown in Figure 2. The architecture consists of $M$ branches. Each branch is an adaptive envelope detector with its own pair of temporal filters, preceded by a branch-specific spatial filter. Our envelope detector approximates the extracted envelope as the absolute value of the analytic signal computed via the Hilbert transform of the input signal. The processing we use mimics an analog detector receiver. It has been employed in other similar compact convolutional neural network architectures that process the spatial and temporal dimensions separately [28, 21]. Each branch of our network can extract the instantaneous power of the input signal and adapt to a specific neuronal population and frequency band by appropriately tuning the spatial and temporal filter weights.

As shown in the diagram in Figure 2, the envelope detector can be implemented with modern DNN primitives, namely a pair of convolutional operations performing band-pass and low-pass filtering with a single ReLU(−1) nonlinearity between them, which corresponds to computing the absolute value of the output of the first 1-D convolutional layer. This step rectifies the signal, and the subsequent low-pass filter smooths the output $b_m[n]$ to produce the envelope approximation $e_m[n]$. Note that ReLU(a) is now a standard nonlinearity in modern neural networks, defined as $\mathrm{ReLU}(x, a) = \{x,\ x \geq 0;\ ax,\ x < 0\}$. To make this nonlinearity an explicit extraction of signal power, we used a non-trainable batch-norm layer. We can thus harness the optimization tools implemented within the deep learning framework to tune the parameters of our network, which uses spatial filters followed by envelope estimation as its feature extraction block.
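The band-pass → rectify → low-pass chain of one branch can be sketched with plain numpy convolutions. This is a fixed-weight toy (a crude 20 Hz FIR and a Hamming smoothing kernel, both assumptions for illustration); in the network, both kernels are learned 1-D convolutions.

```python
import numpy as np

fs = 250.0
n = np.arange(int(fs * 4))                              # 4 s of samples
mod = 1 + 0.5 * np.sin(2 * np.pi * 0.5 * n / fs)        # slow amplitude modulation
sig = mod * np.sin(2 * np.pi * 20 * n / fs)             # modulated 20 Hz "beta" rhythm

# Branch temporal filters (fixed here; learned in the actual architecture)
t = np.arange(-32, 33) / fs
bandpass = np.sin(2 * np.pi * 20 * t) * np.hamming(t.size)   # crude 20 Hz FIR
lowpass = np.hamming(65)
lowpass /= lowpass.sum()                                     # smoothing kernel

b = np.convolve(sig, bandpass, mode="same")    # 1-D conv no. 1: band-pass
rect = np.abs(b)                               # |.| = ReLU(x) + ReLU(-x)
env = np.convolve(rect, lowpass, mode="same")  # 1-D conv no. 2: low-pass -> envelope

# The recovered envelope should track the imposed slow modulation
r = np.corrcoef(env[100:-100], mod[100:-100])[0, 1]
print(r > 0.9)
```

Rectification shifts the rhythm's power down to baseband, so the low-pass output follows the amplitude modulation, which is exactly the envelope feature each branch feeds to the decoder.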

Figure 2: The architecture based on a compact convolutional neural network comprises several branches of adaptive envelope detectors receiving spatially unmixed input signals and outputting envelopes, whose $N$ most recent values with indices $n - N + 1, \dots, n$ are combined into the decoded variable $z$ by a fully connected layer.

In our architecture, the envelope detector of the $m$-th branch takes as input the spatially filtered sensor signal $s_m[n]$ computed by a pointwise convolutional layer. This layer is intended to invert the volume conduction processes represented by the forward-model matrices $G$ and $A$ in our phenomenological model (Figure 1). We then approximated the operator $H$ as a linear combination of the lagged instantaneous power (envelopes) of the narrow-band source time series $s(t) = [s_1(t), s_2(t), \dots, s_I(t)]$ with coefficients from the matrix $U = \{u_{ml}\}$, $m = 1, \dots, M$, $l = 1, \dots, N$. This was done with a fully connected layer that mixed the envelopes $e_m[n]$ into a single estimate of the kinematic parameter $z[n] = \sum_{m=1}^{M} \sum_{l=1}^{N} e_m[n - l] u_{ml} + u_0$, where $u_0$ models the DC offset that may be present in the kinematic profile.
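The fully connected readout above is just a weighted sum over branches and lags. A minimal sketch with random envelopes and weights (all values hypothetical; the lag indexing here runs over the $N$ most recent samples, one plausible convention):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, T = 3, 5, 50                 # branches, lags, time points

e = rng.random((M, T))             # branch envelopes e_m[n]
U = rng.standard_normal((M, N))    # fully connected weights u_ml
u0 = 0.1                           # DC offset of the kinematic profile


def decode(e, U, u0, n):
    """z[n] = sum_m sum_l e_m[n - l] u_ml + u0, over the N most recent lags."""
    # Columns ordered e_m[n], e_m[n-1], ..., e_m[n-N+1] to match U's lag axis
    lags = e[:, n - N + 1:n + 1][:, ::-1]
    return float(np.sum(lags * U) + u0)


z = [decode(e, U, u0, n) for n in range(N - 1, T)]
print(len(z))
```

In the trained network these weights are learned jointly with the spatial and temporal filters by the same gradient-based optimizer.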

2.1.3 Two regression problems and interpretation of the neural network weights

The described architecture processes data in chunks of a predefined length of $N$ samples. Assume the chunk length equals the filter length of the 1-D convolution. Consider a chunk of input data from $L$ channels observed over an interval of $N$ time instants, which can be represented by the Toeplitz matrix $X[n] = [x[n], x[n-1], \dots, x[n-N+1]] \in \mathbb{R}^{L \times N}$. The processing of $X[n]$ by the first two layers, performing spatial and temporal filtering, can be described for the $m$-th branch as

$$b_m[n] = w_m^T X[n] h_m, \quad (2)$$

where $w_m \in \mathbb{R}^L$ are the spatial weights and $h_m \in \mathbb{R}^N$ are the temporal weights of branch $m$. The ReLU(−1) nonlinearity, combined with the low-pass filtering performed by the second convolutional layer, extracts the envelopes of the rhythmic signals.

The analytic signal maps one-to-one to its envelope [57], and for real-valued input data the imaginary part of the analytic signal is uniquely computed via the Hilbert transform. Consequently, the original real-valued signal maps uniquely to its envelope. Our envelope detector computes a close approximation of the absolute value of the analytic signal, and we can therefore argue that $e_m[n]$ is uniquely determined by $b_m[n]$. Thus, to obtain the proper envelope $e_m[n]$, it suffices to obtain the proper $b_m[n]$, which is achieved by adjusting the spatial and temporal convolution weights of each branch of the convolutional neural network.
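Equation (2) says that applying the spatial filter and then the temporal filter to a Toeplitz data chunk is a single bilinear form. A quick numerical check of this equivalence on random data (shapes chosen arbitrarily for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
L, N, T = 8, 16, 300

x = rng.standard_normal((L, T))     # multichannel recording
w = rng.standard_normal(L)          # spatial weights w_m
h = rng.standard_normal(N)          # temporal weights h_m

n = 100
# Toeplitz data chunk X[n] = [x[n], x[n-1], ..., x[n-N+1]], shape L x N
X = np.stack([x[:, n - k] for k in range(N)], axis=1)

# Equation (2): branch output as a bilinear form
b_direct = w @ X @ h

# The same value via sequential processing: spatial filtering first,
# then FIR (temporal) filtering of the resulting virtual channel
virtual = w @ x
b_seq = sum(h[k] * virtual[n - k] for k in range(N))

print(np.isclose(b_direct, b_seq))
```

This factorization is what makes the weights interpretable: $w_m$ can be related to a spatial pattern and $h_m$ to a spectral profile of the underlying population.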


Bibliography of the dissertation research — Artur Tigranovich Petrosyan, Candidate of Sciences, 2023

[Figure 12 image: scatter plots of word classification accuracy vs. ISR decoding correlation; panels include the LPC-based ISR]

Figure 12: Dependence of the final word classification accuracy on the decoded vs. true ISR correlation. The red line is a third-order trend fitted to the data to facilitate visual perception.

4.4 Comparative analysis

In this work we employed the compact architecture, see Figure 3, that comprises multiple branches of envelope detectors (ED) of spatially filtered data whose output is fed into the LSTM layer followed by a fully connected network. This architecture uses factorized spatial and temporal filters that get adapted during training and allows for interpretation of the filter weights into the spatial and spectral patterns as demonstrated in Figure 10. These patterns can then be used to infer location and dynamical properties of the underlying neuronal populations.

Here we compared this network to several other architectures. We found that, out of several neural networks, only ResNet-18 offers comparable, although significantly worse, performance when used instead of the ED block in our architecture, see Figure 3. The LSTM layer also appears to be very useful in capturing the dynamics of features extracted either with ED or ResNet blocks, see Figure 11.a. We hypothesize that this may be due to an adequate balance between the number of parameters to be tuned in the ED-based network and the amount of data available for training, as compared to several other, more sophisticated architectures.

The word decoding accuracy results reported in Figure 9 correspond to the case when 40 LMSCs were used to train the front-end ISR decoder network, see Figure 3. We also experimented with several other ISRs, as described in Section 3.3, and present the results in Figure 11.b.

Interestingly, the differences in the individual ISR decoding fidelity, see Figure 7, do not transfer into the corresponding word classification accuracy, where all of the ISRs yield more or less comparable performance. A possible explanation is that some ISRs, in addition to information about the sequence of articulatory tract configurations (corresponding to a specific sequence of phonemes and invariant to pitch, timbre, loudness, etc.), contain information about purely acoustic features of the utterance, such as fundamental frequency, voice timbre, and local volume, which may be easier to decode than the articulatory tract parameters critical for the word classification task. The subsequent word classification largely requires only the first type of information and may therefore yield comparable word classification performance for the different ISRs, as long as all of them contain this essential information.

The reported ISR and word decoding accuracy results are presented for the causal processing mode, i.e., when the data window strictly precedes the time point the prediction is made for. We also experimented with the anti-causal mode (the window is strictly in the future w.r.t. the predicted time point) and the non-causal mode (the data window covers pre- and post-intervals around the point in question). These results are plotted in Figure 11.c. In both patients we see the best performance when the data window is allowed to be located both in the future and in the past w.r.t. the point to be predicted. This result is expected, since in the non-causal setting the algorithm can use information about the cortical activity that occurs in response to the uttered word.
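The three windowing modes above differ only in where the data window sits relative to the predicted time point. A small sketch of the index bookkeeping (function name and conventions are illustrative, not from the paper's code):

```python
import numpy as np


def window_indices(n, width, mode):
    """Sample indices feeding the decoder that predicts time point n (a sketch)."""
    if mode == "causal":          # strictly in the past
        return np.arange(n - width, n)
    if mode == "anti-causal":     # strictly in the future
        return np.arange(n + 1, n + 1 + width)
    if mode == "non-causal":      # straddles n
        half = width // 2
        return np.arange(n - half, n - half + width)
    raise ValueError(mode)


causal = window_indices(1000, 128, "causal")
future = window_indices(1000, 128, "anti-causal")
both = window_indices(1000, 128, "non-causal")
print(causal.max() < 1000, future.min() > 1000, 1000 in both)
```

Only the causal variant is usable in a real-time prosthesis, which is why the reported accuracies deliberately stick to it despite the non-causal mode scoring higher.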

In this work we mainly focused on decoding from a small number of contacts confined either to a single stereo-EEG shaft or to an ECoG strip. In both cases the electrodes can be implanted without a full-blown craniotomy, via a drill hole in the skull. We chose the particular subset of contacts using the mutual information (MI) metric, see Figure 1, which closely matched the stimulation-based mapping results. Both of our patients were implanted with several sEEG shafts or ECoG strips, see Figure 1. In Figure 11.d we show the results of a similar analysis using other subsets of electrodes located on the other shafts or strips. Notably, the MI-based selection yielded significantly better performance than the other spatially segregated electrode groups.

According to Figure 1.d, electrodes 25-27 also show increased MI values between the ECoG and the acoustic envelope. This corroborates the results in Figure 11.d, where the use of the stripe with these electrodes yielded the second-best decoding accuracy. The stripe is placed in the inferior region of the left anterior temporal cortex, and the MNI coordinates of the first (25) and the last (30) electrodes from this stripe are given in Table 1. According to [53], these areas appear to be active during the implicit comprehension of spoken and written language. Note that the sentences we used slightly deviate from the standard sentences used in daily life and are likely to require some additional effort and a very mild emotional response beyond just mechanical reading. According to Figure 1 of [24], our electrodes 25-30 fall in area 6e, which appears to host representations of emotional words, see their Table 2. Finally, based on [10], the temporal pole region where electrodes 25-30 are placed could be a part of the network that links the temporal pole with posterior structures to support thematic semantic processing during language production. When interpreting these results we cannot discount the mounting evidence that speech production and comprehension share neural representations and that speech production processes are not only localized to the left hemisphere but also involve a bilaterally distributed linguistic network [50], which may explain the superior decoding accuracy in speech decoding settings reliant on bilaterally distributed electrodes [23].

4.5 Asynchronous decoding of words

Traditionally, BCI can be used in two different settings: synchronous and asynchronous. In the synchronous setting a command is to be issued within a specific time window. Usually, a synchronous BCI user is prompted at the start of such a time window and has to produce a command (alter his or her brain state) within a specified time frame. Therefore, the decoding algorithm is aware of the specific segment of data to process in order to extract the information about the command. In the asynchronous mode the BCI needs to not only decipher the command but also determine the fact that the command is actually being issued. The delineation between synchronous and asynchronous modes is most clearly pronounced in BCIs with discrete commands implying the use of a categorical decoder.

In BCIs that decode a continuous variable, e.g. hand kinematics, the delineation between synchronous and asynchronous modes is less clear. The first part of our BCI implements a continuous decoder of the internal speech representation (ISR) features. Should this decoding prove sufficiently accurate, it could simply be used as an input to a voice synthesis engine. Such a scenario has already been implemented in several reports [4, 3], but these solutions use a large number of electrodes, which may explain the better quality of ISR decoding. In our setting we aimed at building a decoder operating with a small number of ecologically implanted electrodes and decided to focus on decoding individual words. We first used the continuously decoded ISRs to classify 26 discrete words and one silence state in a synchronous manner. To implement this, we cut the decoded ISR time series around each word's utterance and use the resulting segments as data samples for our classification engine.
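The cutting of the decoded ISR time series around each utterance can be sketched like this; the epoch lengths and function names are assumptions for illustration only.

```python
import numpy as np

def epoch_isr(isr, onsets, pre, post):
    """Cut fixed-length segments of the decoded ISR time series
    `isr` (n_samples x n_features) around each word onset (sample
    indices), returning (n_words, pre + post, n_features)."""
    epochs = []
    for t0 in onsets:
        if t0 - pre < 0 or t0 + post > len(isr):
            continue  # skip words too close to the recording edges
        epochs.append(isr[t0 - pre:t0 + post])
    return np.stack(epochs)
```

Each resulting epoch then serves as one labeled sample for the 26-word-plus-silence classifier.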

To gain insight into the ability of our BCI to operate in a fully asynchronous mode, we performed the additional analysis described in section 3.3.2. Figure 13.b illustrates the performance of our BCI operating in a fully asynchronous mode, when the decoder runs over a succession of overlapping time windows of continuously decoded ISRs and a decision about the specific word being uttered is made for each such window, see Figure 2. To quantify the


Figure 13: a) For each i-th word we compute smoothed probability profiles p_i(t) for each time instance t. A decision about a word being pronounced is made only at time points corresponding to the local maxima of p_i(t) that cross the threshold θ. In case the chosen i-th word matches the one currently being uttered, we mark this event as a true positive (TP). If after such a detection p_i(t) remains above the threshold and exhibits another local maximum which exceeds the values of all other smoothed probability profiles, the i-th word is "uttered" again, but this event is marked as a false positive (FP) even if t belongs to the time range corresponding to the actual i-th word. b) PR curves for the asynchronous word decoding task. As in a regular binary classification problem, to obtain the PR curves we vary the detection threshold from 0 to 1 and for every fixed threshold value compute the corresponding precision-recall pair. The detection threshold affects how many words will be "uttered" by our algorithm: a low detection threshold "utters" many words and leads to high recall and low precision; conversely, a high detection threshold "utters" only high-confidence words and leads to low recall and high precision. Note that the definitions of precision and recall differ slightly from those in conventional binary classification PR curves (see equation 1, Figure 13.a and section 3.5 for details). We also show a chance-level PR curve.

performance of our asynchronous speech decoder we used precision-recall curves as detailed in section 3.5 and Figure 13.a.
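A minimal sketch of the detection rule described in the caption of Figure 13.a (smooth each word's probability profile, then "utter" the winning word at local maxima that cross the threshold) might look as follows; the moving-average smoother and all names are illustrative assumptions.

```python
import numpy as np

def detect_words(probs, threshold, smooth=5):
    """Asynchronous detection sketch. `probs` is (n_frames, n_words)
    per-frame word probabilities. Each profile is smoothed with a
    moving average; word i is 'uttered' at frames where its smoothed
    profile is the largest among all words, exceeds `threshold`, and
    forms a local maximum. Returns a list of (frame, word) detections."""
    kernel = np.ones(smooth) / smooth
    sm = np.apply_along_axis(
        lambda p: np.convolve(p, kernel, mode="same"), 0, probs)
    detections = []
    for t in range(1, len(sm) - 1):
        i = int(np.argmax(sm[t]))                      # winning word
        if (sm[t, i] > threshold
                and sm[t, i] >= sm[t - 1, i]
                and sm[t, i] > sm[t + 1, i]):          # local maximum
            detections.append((t, i))
    return detections
```

Sweeping `threshold` from 0 to 1 and scoring the detections against the true utterance intervals yields precision-recall pairs of the kind plotted in Figure 13.b.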

Although the observed performance significantly exceeds the chance level, it is not yet sufficient for building a full-blown asynchronous speech interface operating with a small number of minimally invasive electrodes. In our view, and based on the experience with motor interfaces, specific protocols to train the patient, including those with immediate feedback to the user [6], are likely to significantly improve the decoding accuracy in such systems, which will boost the overall feasibility of minimally invasive speech prosthetic solutions.

5 Conclusion

We have explored the possibility of building a practically feasible speech prosthesis solution operating on the basis of neural activity recorded with a small set of minimally invasive electrodes. Implantation of such electrode systems does not require a full craniotomy and combined with algorithmic solutions equipped with a joint human-machine training protocol may form a basis for the future minimally invasive speech prosthesis.

There exist several reports exploiting intracortical activity recorded with Utah-array-like systems for speech prosthesis purposes [60, 58, 18]. These recordings give access to the activity of individual neurons but remain potentially harmful to the cortical tissue. In contrast, stentrodes [43], electrodes located inside blood vessels and implanted using stent technology, offer a potentially plausible solution for obtaining high-quality brain activity signals without any kind of craniotomy. These electrodes, however, unlike the intracortical arrays, register the superposition of neuronal activity stemming from a large number of neuronal populations. Also, unlike the ECoG grids used in the majority of speech prosthesis research, these stent electrodes are confined to a relatively small volume. The signals measured in our setting with a small number of spatially confined sEEG and ECoG contacts can be considered a proxy of the data collected by stentrodes, and the signal processing approaches developed here could potentially be applied to stentrode data in order to pave the road towards craniotomy-free speech BCI solutions.

We build our decoder using a two-step procedure. First we construct an interpretable architecture to decode the continuous internal speech representation (ISR) profiles from the neural activity and fix the weights of this compact neural network. In this case the particular ISR (LMSC, MFCC, LPC coefficients) is merely a target to train this front-end network. Then, when applying this network to neural activity data, we take its hidden state before the last fully connected layer and use its activation as an input to a discrete classifier distinguishing between neural activity patterns corresponding to 26 words and one silence state. This approach resembles [36]. However, based on our experiments we found that replacing the concurrent training of two classifiers with such a two-step process improved the achieved decoding accuracy in our setting.
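The two-step scheme can be caricatured in a few lines of numpy; here a fixed linear map with a rectifier stands in for the frozen front-end network, and a nearest-centroid rule stands in for the discrete word classifier. Both are purely illustrative substitutes, not the models used in the study.

```python
import numpy as np

def frontend_features(window, W):
    """Stand-in for the frozen front-end: a fixed linear feature map
    followed by a rectifier (the real front-end is a compact CNN
    whose weights are frozen after step one)."""
    return np.maximum(W @ window.ravel(), 0.0)

class NearestCentroid:
    """Illustrative discrete classifier operating on the features
    produced by the (frozen) front-end."""
    def fit(self, feats, labels):
        self.classes_ = np.unique(labels)
        self.centroids_ = np.stack(
            [feats[labels == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, feats):
        d = ((feats[:, None, :] - self.centroids_[None]) ** 2).sum(-1)
        return self.classes_[np.argmin(d, axis=1)]
```

The key point is the separation: the feature map is trained on the continuous ISR target, frozen, and only then is the discrete classifier fit on its activations.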

We have also paid particular attention to interpreting the obtained decision rule. Our main concern here was to exclude the possibility of using non-neural activity patterns in the overt speech decoding setting. To do so, we exploited the concept of spatial and frequency domain patterns that pertain to the neuronal populations that each branch of our front-end network got tuned to. Several reports [16, 31, 7] explored the spatial and frequency domain patterns that manifest muscular activity in the subdural space. These are typically hallmarked by high-frequency spectra and a large spatial extent, which is the opposite of neural activity, where we expect higher-frequency activity to be more spatially confined than signals in the lower frequency bands. We applied the methodology described in [45] to recover the spatial and frequency patterns of the underlying pivotal activity and found that they adhered well to the described properties of neural activity. We also did not find any evidence of the microphone effect [47] in our data.

The accuracy we obtained in the synchronous mode appears sufficient to make such a system usable in a real-life scenario where each word is "uttered" within a specific time slot, starting, for example, with a beep prompt. The extent to which the observed accuracy transfers to a patient who lacks the ability to speak greatly depends on the specific medical case. Although we explored various arrangements of the data time window around the decision point, our main results correspond to the decoder operating causally, i.e. utilizing neural activity strictly from the past, which is expected not to depend on the perceived speech, see also [32]. This ensures that the observed accuracy can potentially transfer to real patients with speech function deficits, provided that appropriate patient training tools are developed.

The asynchronous BCI setting is clearly the more natural one for speech prosthesis operation. We experimented with our decoder in this scenario and observed reasonable performance which, however, needs to be improved before it can be used in practice: we recalled 40% of the moments when one of the 26 words was uttered, and in 60% of those cases we correctly identified the word out of the 26 possible alternatives.

The use of a language model is known to improve speech decoding accuracy [55] and can also be added to improve the performance of the final consumer solution. However, our goal here was to assess to which extent the neural activity alone can be informative with regard to individual words classification and therefore we have deliberately refrained from using any language model in this study.

Overall, our study showcases the possibility of building a speech prosthesis with a small number of electrodes, based on a compact, feature-engineering-free decoder derived from several tens of minutes' worth of training data. To be translated into clinical practice, this solution needs to be augmented with patient training procedures and a methodology to non-invasively determine the implantation sites that would yield the best speech decoding accuracy.

Acknowledgment

This work is supported by the Center for Bioelectric Interfaces NRU HSE, RF Government grant, AG. No. 075-152021-624

References

[1] Sarah N Abdulkader, Ayman Atia, and Mostafa-Sami M Mostafa. Brain computer interfacing: Applications and challenges. Egyptian Informatics Journal, 16(2):213-230, 2015.

[2] Abidemi B Ajiboye and Robert F Kirsch. Invasive brain-computer interfaces for functional restoration. In Neuromodulation, pages 379-391. Elsevier, 2018.

[3] Hassan Akbari, Bahar Khalighinejad, Jose L Herrero, Ashesh D Mehta, and Nima Mesgarani. Towards reconstructing intelligible speech from the human auditory cortex. Scientific reports, 9(1):1-12, 2019.

[4] Miguel Angrick, Christian Herff, Emily Mugler, Matthew C Tate, Marc W Slutzky, Dean J Krusienski, and Tanja Schultz. Speech synthesis from ecog using densely connected 3d convolutional neural networks. Journal of neural engineering, 16(3):036019, 2019.

[5] Miguel Angrick, Maarten Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sofoklis Goulis, Jeremy Saal, Albert J Colon, Louis Wagner, Dean J Krusienski, et al. Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity. bioRxiv, 2020.

[6] Miguel Angrick, Maarten Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Albert J Colon, Louis Wagner, Dean J Krusienski, Pieter L Kubben, et al. Towards closed-loop speech synthesis from stereotactic eeg: A unit selection approach. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1296-1300. IEEE, 2022.

[7] Tonio Ball, Markus Kern, Isabella Mutschler, Ad Aertsen, and Andreas Schulze-Bonhage. Signal quality of simultaneously recorded invasive and non-invasive eeg. Neuroimage, 46(3):708-716, 2009.

[8] Richard Bellman and Robert Kalaba. On adaptive control processes. IRE Transactions on Automatic Control, 4(2):1-9, 1959.

[9] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289-300, 1995.

[10] Deena Schwen Blackett, Jesse Varkey, Janina Wilmskoetter, Rebecca Roth, Keeghan Andrews, Natalie Busby, Ezequiel Gleichgerrcht, Rutvik Harshad Desai, Nicholas Riccardi, Alexandra Basilakos, et al. Neural network bases of thematic semantic processing in language production. Cortex, 2022.

[11] Peter Brunner, Anthony L Ritaccio, Timothy M Lynch, Joseph F Emrich, J Adam Wilson, Justin C Williams, Erik J Aarnoutse, Nick F Ramsey, Eric C Leuthardt, Horst Bischof, et al. A practical procedure for real-time functional mapping of eloquent cortex using electrocorticographic signals in humans. Epilepsy & Behavior, 15(3):278-286, 2009.

[12] Gyorgy Buzsaki. Rhythms of the Brain. Oxford University Press, 2006.

[13] Gyorgy Buzsaki, Costas A Anastassiou, and Christof Koch. The origin of extracellular fields and currents: EEG, ECoG, LFP and spikes. Nature reviews neuroscience, 13(6):407-420, 2012.

[14] Ujwal Chaudhary, Niels Birbaumer, and Ander Ramos-Murguialday. Brain-computer interfaces for communication and rehabilitation. Nature Reviews Neurology, 12(9):513, 2016.

[15] Jacquelyn A. Corley, Pouya Nazari, Vincent J. Rossi, Nora C. Kim, Louis F. Fogg, Thomas J. Hoeppner, Travis R. Stoub, and Richard W. Byrne. Cortical stimulation parameters for functional mapping. Seizure, 45:36-41, 2017.

[16] Andrey Eliseyev and Tatiana Aksenova. Stable and artifact-resistant decoding of 3d hand trajectories from ecog signals using the generalized additive model. Journal of neural engineering, 11(6):066005, 2014.

[17] Michael J Fagan, Stephen R Ell, James M Gilbert, E Sarrazin, and Peter M Chapman. Development of a (silent) speech recognition system for patients following laryngectomy. Medical engineering & physics, 30(4):419-425, 2008.

[18] Ananya Ganesh, Andre J Cervantes, and Philip R Kennedy. Slow firing single units are essential for optimal decoding of silent speech. Frontiers in human neuroscience, 16, 2022.

[19] Joris Guerin, Stephane Thiery, Eric Nyiri, Olivier Gibaru, and Byron Boots. Combining pretrained cnn feature extractors to enhance clustering of complex natural images. Neurocomputing, 423:551-571, 2021.

[20] Nicholas G Hatsopoulos and John P Donoghue. The science of neural interface systems. Annual review of neuroscience, 32:249-266, 2009.

[21] Stefan Haufe, Frank Meinecke, Kai Görgen, Sven Dähne, John-Dylan Haynes, Benjamin Blankertz, and Felix Bießmann. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage, 87:96-110, 2014.

[22] Christian Herff, Lorenz Diener, Miguel Angrick, Emily Mugler, Matthew C Tate, Matthew A Goldrick, Dean J Krusienski, Marc W Slutzky, and Tanja Schultz. Generating natural, intelligible speech from brain activity in motor, premotor, and inferior frontal cortices. Frontiers in neuroscience, 13:1267, 2019.

[23] Christian Herff, Dean J Krusienski, and Pieter Kubben. The potential of stereotactic-eeg for brain-computer interfaces: current progress and future directions. Frontiers in neuroscience, 14:123, 2020.

[24] Ingo Hertrich, Susanne Dietrich, and Hermann Ackermann. The margins of the language network in the brain. Frontiers in Communication, 5:519955, 2020.

[25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.

[26] Mark L Homer, Arto V Nurmikko, John P Donoghue, and Leigh R Hochberg. Sensors and decoding for intracortical brain computer interfaces. Annual review of biomedical engineering, 15:383-405, 2013.

[27] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700-4708, 2017.

[28] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, and Raj Reddy. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, USA, 1st edition, 2001.

[29] Prasanna Jayakar, Jean Gotman, A. Simon Harvey, André Palmini, Laura Tassi, Donald Schomer, Francois Dubeau, Fabrice Bartolomei, Alice Yu, Pavel Krek, Demetrios Velis, and Philippe Kahane. Diagnostic utility of invasive eeg for epilepsy surgery: Indications, modalities, and techniques. Epilepsia, 57(11):1735-1747, 2016.

[30] Rachel Kaye, Christopher G Tang, and Catherine F Sinclair. The electrolarynx: voice restoration after total laryngectomy. Medical Devices (Auckland, NZ), 10:133, 2017.

[31] Christopher K Kovach, Naotsugu Tsuchiya, Hiroto Kawasaki, Hiroyuki Oya, Mathew A Howard III, and Ralph Adolphs. Manifestation of ocular-muscle emg contamination in human intracranial recordings. Neuroimage, 54(1):213-233, 2011.

[32] Jan Kubanek, Peter Brunner, Aysegul Gunduz, David Poeppel, and Gerwin Schalk. The tracking of speech envelope in the human cortex. PloS one, 8(1):e53398, 2013.

[33] Mikhail A Lebedev and Miguel AL Nicolelis. Brain-machine interfaces: From basic science to neuroprostheses and neurorehabilitation. Physiological reviews, 97(2):767-837, 2017.

[34] Sergio Machado, Fernanda Araujo, Flavia Paes, Bruna Velasques, Mario Cunha, Henning Budde, Luis F Basile, Renato Anghinah, Oscar Arias-Carrion, Mauricio Cagy, et al. Eeg-based brain-computer interfaces: an overview of basic concepts and clinical applications in neurorehabilitation. Reviews in the Neurosciences, 21(6):451-468, 2010.

[35] Joseph N Mak and Jonathan R Wolpaw. Clinical applications of brain-computer interfaces: current state and future prospects. IEEE reviews in biomedical engineering, 2:187-199, 2009.

[36] Joseph G Makin, David A Moses, and Edward F Chang. Machine translation of cortical activity to text with an encoder-decoder framework. Nature Neuroscience, 23(4):575-582, 2020.

[37] L. Marple. A new autoregressive spectrum analysis algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28:441-454, 1980.

[38] Brian McFee, Colin Raffel, Dawen Liang, Daniel P Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, volume 8, pages 18-25. Citeseer, 2015.

[39] David A Moses, Sean L Metzger, Jessie R Liu, Gopala K Anumanchipalli, Joseph G Makin, Pengfei F Sun, Josh Chartier, Maximilian E Dougherty, Patricia M Liu, Gary M Abrams, et al. Neuroprosthesis for decoding speech in a paralyzed person with anarthria. New England Journal of Medicine, 385(3):217-227, 2021.

[40] Emily M Mugler, James L Patton, Robert D Flint, Zachary A Wright, Stephan U Schuele, Joshua Rosenow, Jerry J Shih, Dean J Krusienski, and Marc W Slutzky. Direct classification of all american english phonemes using signals from functional speech motor cortex. Journal of neural engineering, 11(3):035015, 2014.

[41] Klaus-Robert Muller, Matthias Krauledat, Guido Dornhege, Gabriel Curio, and Benjamin Blankertz. Machine learning techniques for brain-computer interfaces. Biomed. Tech, 49(1):11-22, 2004.

[42] Luis Fernando Nicolas-Alonso and Jaime Gomez-Gil. Brain computer interfaces, a review. Sensors, 12(2):1211-1279, 2012.

[43] Thomas J Oxley, Nicholas L Opie, Sam E John, Gil S Rind, Stephen M Ronayne, Tracey L Wheeler, Jack W Judy, Alan J McDonald, Anthony Dornom, Timothy JH Lovell, et al. Minimally invasive endovascular stent-electrode array for high-fidelity, chronic recordings of cortical neural activity. Nature biotechnology, 34(3):320-327, 2016.

[44] Miguel Pais-Vieira, Mikhail Lebedev, Carolina Kunicki, Jing Wang, and Miguel Nicolelis. A brain-to-brain interface for real-time sharing of sensorimotor information. Scientific reports, 3:1319, 02 2013.

[45] Artur Petrosyan, Mikhail Sinkin, Mikhail Lebedev, and Alexei Ossadtchi. Decoding and interpreting cortical signals with a compact convolutional neural network. Journal of Neural Engineering, 18(2):026019, 2021.

[46] Nick F Ramsey, Efraïm Salari, Erik J Aarnoutse, Mariska J Vansteensel, Martin G Bleichner, and ZV Freudenburg. Decoding spoken phonemes from sensorimotor cortex with high-density ecog grids. Neuroimage, 180:301-311, 2018.

[47] Philemon Roussel, Gael Le Godais, Florent Bocquelet, Marie Palma, Jiang Hongjie, Shaomin Zhang, AnneLise Giraud, Pierre Megevand, Kai Miller, Johannes Gehrig, et al. Observation and assessment of acoustic contamination of electrophysiological brain signals during speech production and sound perception. Journal of Neural Engineering, 17(5):056028, 2020.

[48] Philemon Roussel, Gael Le Godais, Florent Bocquelet, Marie Palma, Jiang Hongjie, Shaomin Zhang, Philippe Kahane, Stephan Chabardes, and Blaise Yvert. Acoustic contamination of electrophysiological brain signals during speech production and sound perception. BioRxiv, page 722207, 2019.

[49] Gerwin Schalk and Eric C Leuthardt. Brain-computer interfaces using electrocorticographic signals. IEEE reviews in biomedical engineering, 4:140-154, 2011.

[50] Lauren J Silbert, Christopher J Honey, Erez Simony, David Poeppel, and Uri Hasson. Coupled neural systems underlie the production and comprehension of naturalistic narrative speech. Proceedings of the National Academy of Sciences, 111(43):E4687-E4696, 2014.

[51] M Sinkin, A Osadchiy, M Lebedev, K Volkova, M Kondratova, I Trifonov, et al. High resolution passive speech mapping in dominant hemisphere glioma surgery. Russ. J. Neurosurg, 21:12-18, 2019.

[52] Frank Soong and B Juang. Line spectrum pair (lsp) and speech data compression. In ICASSP'84. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 9, pages 37-40. IEEE, 1984.

[53] Galina Spitsyna, Jane E Warren, Sophie K Scott, Federico E Turkheimer, and Richard JS Wise. Converging language streams in the human temporal lobe. Journal of Neuroscience, 26(28):7328-7336, 2006.

[54] Stanley Smith Stevens, John Volkmann, and Edwin Broomell Newman. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8(3):185-190, 1937.

[55] Pengfei Sun, Gopala K Anumanchipalli, and Edward F Chang. Brain2char: a deep architecture for decoding text from brain recordings. Journal of Neural Engineering, 17(6):066015, 2020.

[56] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1-9, 2015.

[57] Ksenia Volkova, Mikhail A Lebedev, Alexander Kaplan, and Alexei Ossadtchi. Decoding movement from electrocorticographic activity: A review. Frontiers in neuroinformatics, 13:74, 2019.

[58] Sarah K Wandelt, Spencer Kellis, David A Bjanes, Kelsie Pejsa, Brian Lee, Charles Liu, and Richard A Andersen. Decoding grasp and speech signals from the cortical grasp circuit in a tetraplegic human. Neuron, 2022.

[59] Francis R Willett, Donald T Avansino, Leigh R Hochberg, Jaimie M Henderson, and Krishna V Shenoy. Highperformance brain-to-text communication via handwriting. Nature, 593(7858):249-254, 2021.

[60] Guy H Wilson, Sergey D Stavisky, Francis R Willett, Donald T Avansino, Jessica N Kelemen, Leigh R Hochberg, Jaimie M Henderson, Shaul Druckmann, and Krishna V Shenoy. Decoding spoken english from intracortical electrode arrays in dorsal precentral gyrus. Journal of Neural Engineering, 17(6):066007, 2020.

[61] Min Xu, Ling-Yu Duan, Jianfei Cai, Liang-Tien Chia, Changsheng Xu, and Qi Tian. Hmm-based audio keyword generation. In Pacific-Rim Conference on Multimedia, pages 566-574. Springer, 2004.

Appendix 3. Article: Linear Systems Theoretic Approach to Interpretation of Spatial and Temporal Weights in Compact CNNs: Monte-Carlo Study

Authors: Petrosyan A., Lebedev M., Ossadtchi A.

Published in: Advances in Intelligent Systems and Computing. 2021. Vol. 1310. P. 365-370.

DOI: 10.1007/978-3-030-65596-9_44

Copying permission: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021


Linear Systems Theoretic Approach to Interpretation of Spatial and Temporal Weights in Compact CNNs: Monte-Carlo Study

Artur Petrosyan, Mikhail Lebedev, and Alexey Ossadtchi

National Research University Higher School of Economics, Moscow, Russia petrosuanartur@gmail.com, aossadtchi@hse.ru https://bioelectric.hse.ru/en/

Abstract. Interpretation of neural network architectures for decoding brain signals is usually reduced to the analysis of spatial and temporal weights. We propose a theoretically justified method for their interpretation within a simple architecture based on a priori knowledge of the subject area. This architecture is comparable in decoding quality to the winner of the BCI Competition IV and allows for automatic engineering of physiologically meaningful features. To demonstrate the operation of the algorithm, we performed Monte Carlo simulations and obtained a significant improvement in the recovery of patterns for different noise levels; we also investigated the relation between decoding quality and pattern reconstruction fidelity.

Keywords: ECoG • Weights interpretation • Limb kinematics decoding • Deep learning • Machine learning • Monte Carlo

1 Introduction

A step towards improving the performance of neurointerfaces is the use of advanced machine learning methods, in particular Deep Neural Networks (DNNs). DNNs learn a complete signal processing pipeline and do not require hand-crafted features. Interpretation of the DNN solution plays a crucial role in order to 1) identify the optimal spectral and temporal patterns that appear pivotal in providing the decoding quality (knowledge discovery) and 2) ensure that the decoding relies on neural activity and not on unrelated physiological or external artefacts.

Recently, a range of compact neural architectures has been suggested for EEG, ECoG and MEG data analysis: EEGNet [4], DeepConvNet [10], VAR-CNN and LF-CNN [11]. The weights of these architectures are readily interpretable using standard linear estimation theoretic approaches [3]. However, special attention is needed to correctly interpret the weights in architectures with simultaneously adaptable temporal and spatial weights.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 A. V. Samsonovich et al. (Eds.): BICA 2020, AISC 1310, pp. 365-370, 2021. https://doi.org/10.1007/978-3-030-65596-9_44

2 Generating Data Model

The data generative model is illustrated in Fig. 1. Neural populations G_1, ..., G_I, which are responsible for movement, generate activity e(t) = [e_1(t), ..., e_I(t)]^T that is further translated into a movement trajectory by some non-linear transform H, i.e. z(t) = H(e(t)). We also assume the presence of populations A_1, ..., A_J whose activity is unrelated to movement; their activity is mixed into the sensors as well. At each time step t we observe the K-dimensional vector x(t) of sensor signals instead of the firing intensities e(t) of the individual populations. The vector x(t) is traditionally modelled as a linear mixture with matrices A and G, reflecting the local field potentials f(t) and s(t) formed around the two sets of populations:

x(t) = A f(t) + G s(t) = Σ_{j=1..J} a_j f_j(t) + Σ_{i=1..I} g_i s_i(t)    (1)

The local field potentials (LFPs) result from the nearby populations' activity, and their characteristic frequency is typically related to the size [1] of each population. The firing intensity of the proximal neuronal population is approximated by the envelope of the LFP. To counter the volume conduction effect, we seek to obtain estimates of the LFPs as ŝ(t) = W^T x(t), where the columns of W = [w_1, ..., w_M] are referred to as spatial filters.

Fig. 1. Phenomenological model

Our regression task is to decode the kinematics z(t) from the simultaneously recorded neural population activity x(t). Generally, we do not have any true knowledge of G, the other parameters of the forward mapping, or the transform H; therefore we need to parameterize and learn the entire mapping z(t) = F(x(t)).

3 Network Architecture


Figure 2 demonstrates our adaptable and compact CNN architecture based on the idea of (1). It consists of spatial filtering, an adaptive envelope extractor, and a fully-connected layer. Spatial filtering is done via a pointwise convolution layer

Fig. 2. Proposed compact DNN: spatial filtering (pointwise 1D convolution over the N most recent samples), an adaptive envelope extractor (band-pass and low-pass 1D convolutions with a rectifying nonlinearity in between), and a fully-connected output layer.

used to unmix the sources. The adaptive envelope extractor takes the form of two depthwise convolutions implementing band-pass and low-pass filtering, with a non-trainable batch normalization (for explicit power extraction) and an absolute-value nonlinearity in between. The fully-connected layer models the envelope-to-kinematics transformation z(t) = H(·) as a function of the lagged envelopes extracted by the preceding layers. The architecture is implemented using standard DNN layers. In principle, the temporal filtering layers can be replaced by a sinc-layer [8].
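Under the assumptions spelled out in the comments (explicit FIR filters standing in for the trained depthwise convolutions, a plain absolute value for rectification), a numpy sketch of the forward pass of this architecture could read:

```python
import numpy as np

def compact_net_forward(x, W_spat, h_band, h_low, W_fc, b_fc):
    """Forward pass of the compact architecture (numpy sketch; the
    real model is built from standard trainable DNN layers).
    x:       (n_channels, n_times) chunk of sensor data.
    W_spat:  (n_branches, n_channels) pointwise (spatial) filters.
    h_band, h_low: (n_branches, n_taps) per-branch FIR filters.
    W_fc, b_fc: fully-connected readout over the lagged envelopes."""
    s = W_spat @ x                                     # spatial unmixing
    band = np.stack([np.convolve(s[m], h_band[m], mode="same")
                     for m in range(len(s))])          # band-pass
    env = np.stack([np.convolve(np.abs(band[m]), h_low[m], mode="same")
                    for m in range(len(band))])        # |.| + low-pass = envelope
    return W_fc @ env.ravel() + b_fc                   # envelope -> kinematics
```

In the actual network every matrix and filter above is a trainable layer; the sketch only fixes the computational graph that the text describes.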

4 Spatial and Temporal Weights Interpretation

Assume that the data are processed in chunks of size N equal to the length of the temporal convolutional layer weights h_m, i.e. X(t) = [x(t), x(t - 1), ..., x(t - N + 1)]. Since the set of envelopes maps isomorphically onto the set of analytic signals [2], perhaps up to a sign, the task of tuning the weights of the first three layers of our architecture to predict the envelopes e_m(t) can be replaced with a regression problem of learning spatial and temporal weights to obtain the analytic signal b_m(t) giving rise to the envelope. Assume that the temporal weights are fixed at their optimal value h_m; then the optimal spatial filter weights can be obtained as

w_m = argmin_{w_m} || b_m(t) - w_m^T X(t) h_m ||^2    (2)

and therefore, assuming statistical independence of the rhythmic LFPs {s_m(t)}, m = 1, ..., M, the spatial pattern of the underlying neuronal population is [3]

g_m = E{ Y(t) Y^T(t) } w_m = R_m w_m,    (3)

where Y(t) = X(t) h_m is a temporally filtered chunk of multichannel data and R_m = E{ Y(t) Y^T(t) } is a branch-specific K x K covariance matrix of temporally filtered data, assuming that y_k(t), k = 1, ..., K are zero-mean processes.
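A sketch of computing the spatial pattern via (3), with an FIR convolution standing in for the branch's temporal filtering (the function name and the sample covariance estimator are illustrative):

```python
import numpy as np

def spatial_pattern(X, w, h):
    """Compute the spatial pattern g = R w of eq. (3): temporally
    filter each channel of X (n_channels x n_times) with the branch's
    temporal weights h, form the channel covariance R of the filtered
    data, and multiply by the spatial filter w."""
    Y = np.stack([np.convolve(X[k], h, mode="valid") for k in range(len(X))])
    Y = Y - Y.mean(axis=1, keepdims=True)   # enforce zero mean
    R = (Y @ Y.T) / Y.shape[1]              # K x K sample covariance
    return R @ w
```

For a single dominant source, R w points along the source's true mixing vector regardless of the exact spatial filter, which is what makes the pattern (rather than the weights) physiologically interpretable.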

Symmetrically we can write an expression for the temporal weights interpretation as

qm = E{V(t)V^T(t)} hm = Rm hm,    (4)


where V(t) = X^T(t)wm is a chunk of spatially filtered data and Rm = E{V(t)V^T(t)} is a branch-specific N × N covariance matrix of the spatially filtered data, again assuming that xk(t), k = 1, . . . , K are all zero-mean processes. To make sense of the temporal pattern we explore it in the frequency domain, i.e. Qm[f] = Σ_k qm[k] e^(−j2πfk), where qm[k] is the k-th element of qm.
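A small numerical illustration of (4) and its frequency-domain reading (an assumed single-branch setup with a hypothetical FIR filter): the temporal pattern q, obtained by multiplying the lag-domain covariance of the spatially filtered signal by the filter weights, peaks in the Fourier domain at the frequency of the underlying rhythm rather than reflecting the filter's full passband:

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import firwin

rng = np.random.default_rng(1)
fs, N, T = 250, 65, 50000

# Spatially filtered branch signal v(t): a 20 Hz rhythm in white noise.
v = np.sin(2 * np.pi * 20 * np.arange(T) / fs) + 0.5 * rng.standard_normal(T)

# Branch temporal filter h (hypothetical 15-25 Hz FIR of length N).
h = firwin(N, [15, 25], pass_zero=False, fs=fs)

# R_V: branch-specific N x N covariance of lagged chunks of v,
# built from the sample autocovariance (Toeplitz structure).
ac = np.array([v[: T - k] @ v[k:] / (T - k) for k in range(N)])
R = toeplitz(ac)

# Temporal pattern, eq. (4), and its frequency-domain representation.
q = R @ h
freqs = np.fft.rfftfreq(512, 1 / fs)
Q = np.abs(np.fft.rfft(q, 512))

peak = freqs[np.argmax(Q)]  # expected near the rhythm's 20 Hz
```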

Importantly, just as the spatial patterns gm can usually be used to fit dipolar models [6] and locate the corresponding sources [3], the temporal patterns qm found according to (4) can be used to fit dynamical models such as those implemented, for example, in [7].

[Figure: four panels (low, medium, high, very high noise) of scatter plots; x-axis: envelope decoding accuracy, y-axis: correlation with the true pattern; legend: Weights, Patterns naive, Patterns.]

Fig. 3. Monte Carlo simulations. Point coordinates reflect the envelope decoding accuracy achieved at each Monte Carlo trial (x-axis) and the correlation coefficient with the true pattern (y-axis). Each point corresponds to a single Monte Carlo trial, and its color codes the method used to compute the patterns: Weights — direct weights interpretation; Patterns naive — spatial pattern interpretation without taking branch-specific temporal filters into account; Patterns — the proposed method.

Table 1. Correlation between the true and predicted kinematics for the winning solution of the BCI competition IV dataset (Winner) and the proposed architecture (NET). Each entry lists the values for subjects 1—2—3.

         Thumb         Index         Middle        Ring          Little
Winner   .58—.51—.69   .71—.37—.46   .14—.24—.58   .53—.47—.58   .29—.35—.63
NET      .54—.50—.71   .70—.36—.48   .20—.22—.50   .58—.40—.52   .25—.23—.61

5 Comparative Decoding Accuracy

In the context of electrophysiological data processing, the main benefit of deep learning solutions is their end-to-end learning, which does not require task-specific feature preparation [9]. To make sure that our implementation of a simple CNN is capable of learning the needed mapping, we applied it to the publicly available data collected by Kubanek et al. for the BCI Competition IV and compared its performance to that of the winning solution [5].

The results of both algorithms are listed in Table 1. Our simple neural network achieves decoding quality comparable to that of the linear model [5] but does not require upfront feature engineering, instead learning the features itself.

6 Monte-Carlo Simulations

We followed the setting described in Fig. 1 to generate the data. We simulated I = 4 task-related sources with rhythmic LFPs occupying different ranges: the 170-220 Hz, 120-170 Hz, 80-120 Hz and 30-80 Hz bands. The target kinematics z(t) was simulated as a linear combination of the four envelopes of the rhythmic LFPs with a vector of random coefficients. We used J = 40 task-unrelated rhythmic LFP sources in the 180-210 Hz, 130-160 Hz, 90-110 Hz and 40-70 Hz bands, ten sources per band. The matrices G and A, which model the volume conduction effects, were randomly generated at each Monte Carlo trial according to the N(0, 1) distribution. We created 20 min worth of data sampled at 1000 Hz.
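A condensed sketch of this generation procedure (shortened to 30 s and with a hypothetical sensor count K = 5, since the text does not fix K) could look as follows:

```python
import numpy as np
from scipy.signal import firwin, lfilter, hilbert

rng = np.random.default_rng(2)
fs = 1000
T = 30 * fs          # 30 s sketch (the paper uses 20 min)
K = 5                # number of sensors (hypothetical)

def narrowband(band, n_src):
    """n_src Gaussian sources band-pass filtered into `band` (Hz)."""
    h = firwin(257, band, pass_zero=False, fs=fs)
    return lfilter(h, 1.0, rng.standard_normal((n_src, T)), axis=1)

# I = 4 task-related rhythmic LFPs and J = 40 task-unrelated ones.
task_bands = [(170, 220), (120, 170), (80, 120), (30, 80)]
noise_bands = [(180, 210), (130, 160), (90, 110), (40, 70)]
S = np.vstack([narrowband(b, 1) for b in task_bands])
F = np.vstack([narrowband(b, 10) for b in noise_bands])

# Volume conduction: mixing matrices drawn from N(0, 1), eq. (1).
G = rng.standard_normal((K, 4))
A = rng.standard_normal((K, 40))
X = G @ S + A @ F

# Target kinematics: random linear mix of the task-source envelopes.
envelopes = np.abs(hilbert(S, axis=1))
z = rng.standard_normal(4) @ envelopes
```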

For neural network training we used the Adam optimizer and made about 15k steps, halving the learning rate at steps 5k and 10k to obtain more accurate patterns. In total, we performed more than 3k simulations.

We performed a Monte-Carlo study with a different spatial configuration of sources at each trial. For each realisation of the generated data we trained the DNN to predict the kinematic variable z(t) and then computed the patterns of the sources that the individual branches of our architecture got "connected" to as a result of training.

Figure 3 shows that only the spatial patterns interpreted using branch-specific temporal filters (Patterns) match well the simulated topographies of the true underlying sources. Moreover, the correlation achieved by Patterns naive and Weights decreases as the noise rises, while Patterns recovers the true patterns almost perfectly for all noise level settings.

The spectral patterns recovered using the proposed approach also match well the true spectral profiles of the underlying sources, while directly taking the Fourier coefficients of the temporal convolution layer weights results in erroneous spectral profiles. Using the proper spectral patterns of the underlying neuronal populations it is now possible to fit biologically plausible models, e.g. [7], and recover the true neurophysiological mechanisms underlying the decoded process.

7 Conclusion

We proposed a theoretically justified method for the interpretation of spatial and temporal weights of a CNN architecture composed of simple envelope extractors. This result extends the existing approaches [3] to weights interpretation. With Monte-Carlo simulations we demonstrated that the proposed approach accurately recovers both the spatial and temporal patterns of the underlying phenomenological model for a broad range of signal-to-noise ratio values.

Acknowledgments. This work is supported by the Center for Bioelectric Interfaces NRU HSE, RF Government grant, ag. No. 14.641.31.0003.

References

1. Buzsaki, G.: Rhythms of the Brain. Oxford University Press, New York (2006)

2. Hahn, S.L.: On the uniqueness of the definition of the amplitude and phase of the analytic signal. Sig. Process. 83(8), 1815-1820 (2003)

3. Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.D., Blankertz, B., Bießmann, F.: On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage 87, 96-110 (2014)

4. Lawhern, V.J., Solon, A.J., Waytowich, N.R., Gordon, S.M., Hung, C.P., Lance, B.J.: EEGNet: a compact convolutional network for EEG-based brain-computer interfaces. arXiv preprint arXiv:1611.08024 (2016)

5. Liang, N., Bougrain, L.: Decoding finger flexion from band-specific ECoG signals in humans. Front. Neurosci. 6, 91 (2012). https://doi.org/10.3389/fnins.2012.00091

6. Mosher, J., Leahy, R., Lewis, P.: EEG and MEG: forward solutions for inverse methods. IEEE Trans. Biomed. Eng. 46(3), 245-259 (1999). https://doi.org/10.1109/10.748978

7. Neymotin, S.A., Daniels, D.S., Caldwell, B., McDougal, R.A., Carnevale, N.T., Jas, M., Moore, C.I., Hines, M.L., Hämäläinen, M., Jones, S.R.: Human Neocortical Neurosolver (HNN), a new software tool for interpreting the cellular and network origin of human MEG/EEG data. eLife 9, e51214 (2020). https://doi.org/10.7554/eLife.51214

8. Ravanelli, M., Bengio, Y.: Speaker recognition from raw waveform with SincNet. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021-1028. IEEE (2018)

9. Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T.H., Faubert, J.: Deep learning-based electroencephalography analysis: a systematic review. J. Neural Eng. 16(5), 051001 (2019)

10. Schirrmeister, R.T., Springenberg, J.T., Fiederer, L.D.J., Glasstetter, M., Eggensperger, K., Tangermann, M., Hutter, F., Burgard, W., Ball, T.: Deep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human EEG. arXiv preprint arXiv:1703.05051 (2017)

11. Zubarev, I., Zetter, R., Halme, H.L., Parkkonen, L.: Adaptive neural network classifier for decoding MEG signals. NeuroImage 197, 425-434 (2019)

Appendix 4. Paper. Decoding neural signals with a compact and interpretable convolutional neural network

Authors: Petrosyan A., Lebedev M., Ossadtchi A.

Published in: Studies in Computational Intelligence. 2021. SCI, Vol. 925. P. 420-428. DOI: 10.1007/978-3-030-60577-3_50

Copyright permission: © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021


Decoding Neural Signals with a Compact and Interpretable Convolutional Neural Network

Artur Petrosyan(B), Mikhail Lebedev, and Alexey Ossadtchi

National Research University Higher School of Economics, Moscow, Russia petrosuanartur@gmail.com, aossadtchi@hse.ru https://bioelectric.hse.ru/en/

Abstract. In this work, we motivate and present a novel compact CNN. For architectures that combine adaptation in both space and time, we describe a theoretically justified approach to interpreting the temporal and spatial weights. We apply the proposed architecture to the Berlin BCI IV competition dataset and our own datasets to decode the electrocorticogram into finger kinematics. Without feature engineering, our architecture delivers decoding accuracy similar to or better than that of the BCI competition winner. After training the network, we interpret the solution (spatial and temporal convolution weights) and extract physiologically meaningful patterns.

Keywords: Limb kinematics decoding · ECoG · Machine learning · Convolutional neural network

1 Introduction

The algorithms used to extract relevant neural modulations are a key component of a brain-computer interface (BCI) system. Most often, they implement signal conditioning, feature extraction, and decoding steps. Modern machine learning prescribes performing the last two steps simultaneously with Deep Neural Networks (DNNs) [5]. DNNs automatically derive features in the context of the assigned regression or classification tasks. Interpretation of the computations performed by a DNN is an important step to ensure that the decoding is based on brain activity and not on artifacts only indirectly related to the neural phenomena at hand. A proper interpretation of the features obtained from the first several layers of a DNN can also benefit the automated knowledge discovery process. In the case of BCI development, one way to enable this is to use specific DNN architectures that reflect prior knowledge about the neural substrate of the specific neuromodulation used in a particular BCI.

Several promising and compact neural architectures have been developed in the context of EEG, MEG and ECoG data analysis over recent years: EEGNet [4], DeepConvNet [8], LF-CNN and VAR-CNN [9]. By design, the weights of these DNNs are readily interpretable with the use of well-known approaches for understanding linear model weights [3]. However, to make such interpretations correct, extra care is needed.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
B. Kryzhanovsky et al. (Eds.): NEUROINFORMATICS 2020, SCI 925, pp. 420-428, 2021. https://doi.org/10.1007/978-3-030-60577-3_50

Here we present another compact architecture, technically very similar to LF-CNN but motivated by somewhat different arguments than those in [9]. We also provide a theoretically grounded approach to the interpretation of the temporal and spatial convolution weights and illustrate it using realistically simulated and real data.

2 Methods

We assume the phenomenological setting presented in Fig. 1. The activity e(t) of a complex set of neural populations G1–GI, responsible for performing a movement act, gets translated into a movement trajectory by means of some, most likely non-linear, transformation H, i.e. z(t) = H(e(t)). There are also populations A1–AJ whose activity is not related to movement but impinges onto the sensors. We do not have direct access to the firing intensity e(t) of the individual populations. Instead, we observe a K-dimensional vector of sensor signals x(t), which is traditionally modeled as a linear mixture of local field potentials (LFPs) s(t) formed around task-relevant populations and task-irrelevant LFPs f(t). The task-relevant and task-irrelevant LFPs impinge onto the sensors with forward model matrices G and A correspondingly, i.e.

x(t) = Gs(t) + Af(t) = Σ_{i=1}^{I} g_i s_i(t) + Σ_{j=1}^{J} a_j f_j(t)    (1)

We will refer to the task-irrelevant term recorded by our K sensors as n(t) = Σ_{j=1}^{J} a_j f_j(t).

The LFPs are thought to result from the activity of the nearby populations, and the characteristic frequency of an LFP is related to the population size [1]. The envelope of an LFP then approximates the firing intensity of the proximal neuronal population. The inverse mapping is also most commonly sought in the linear form, so that the estimates of the LFPs are obtained as a linear combination of the sensor signals, i.e. ŝ(t) = W^T x(t), where the columns of W = [w1, . . . , wM] are the spatial filters that aim to counteract the volume conduction effect and tune away from the activity of interference sources.

Our goal is to approximate the kinematics z(t) using concurrently obtained indirect records x(t) of the activity of the neural populations. In general, we do not know G, and the most straightforward approach is to learn the direct mapping z(t) = F(x(t)).

3 Network Architecture

Fig. 1. Phenomenological model

Based on the above considerations, we have developed the compact adaptable architecture shown in Fig. 2. The key component of this architecture is an adaptive envelope extractor. Interestingly, the envelope extractor, a typical module widely used in signal processing, can be readily implemented using deep learning primitives. It comprises several convolutions used for band-pass and low-pass filtering and the computation of the absolute value. We also use a non-trainable batch-norm before the activation and standardize the input signals.

[Figure: block diagram — PointwiseConv → 1DConv → ReLu(−1) → 1DConv → fully connected readout, with stages labeled spatial filtering, adaptive envelope extractor, and N most recent samples.]

Fig. 2. The proposed compact DNN architecture

The envelope detectors receive spatially filtered sensor signals sm obtained by the pointwise convolution layer, which counteracts the volume conduction processes modeled by the forward model matrix G, see Fig. 1. Then, as mentioned earlier, we approximate the operator H as some function of the lagged power of the source time series by means of a fully connected layer that mixes the lagged samples of the envelopes [em(n), . . . , em(n − N + 1)] from all branches into a single prediction of the kinematics z(n).
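Putting the pieces together, one forward pass of the architecture — pointwise spatial filtering, per-branch band-pass, absolute value, low-pass, normalization, and the fully connected readout over the N most recent envelope samples — can be sketched in plain numpy. All shapes, filter bands, and the shared low-pass are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.signal import firwin, lfilter

rng = np.random.default_rng(5)
fs, K, M, N, T = 250, 6, 2, 50, 2000

def net_forward(X, W, Hbp, hlp, u, N):
    """One forward pass of the compact architecture (numpy sketch).

    X   : K x T chunk of sensor signals
    W   : K x M spatial (pointwise-convolution) filters
    Hbp : M x L band-pass FIR weights, one row per branch
    hlp : low-pass FIR weights (shared across branches for simplicity)
    u   : M*N weights of the fully connected readout
    """
    S = W.T @ X                                  # spatial filtering
    E = np.empty_like(S)
    for m in range(S.shape[0]):                  # depth-wise convolutions
        band = lfilter(Hbp[m], 1.0, S[m])        # band-pass
        E[m] = lfilter(hlp, 1.0, np.abs(band))   # |.| then low-pass -> envelope
    # Non-trainable normalization (stand-in for the batch-norm).
    E = (E - E.mean(axis=1, keepdims=True)) / E.std(axis=1, keepdims=True)
    lags = E[:, -N:].ravel()                     # N most recent envelope samples
    return u @ lags                              # predicted kinematics sample

X = rng.standard_normal((K, T))
W = rng.standard_normal((K, M))
Hbp = np.vstack([firwin(101, [30, 50], pass_zero=False, fs=fs),
                 firwin(101, [70, 90], pass_zero=False, fs=fs)])
hlp = firwin(101, 5, fs=fs)
u = rng.standard_normal(M * N)
z_hat = net_forward(X, W, Hbp, hlp, u, N)
```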

4 Two Regression Problems and DNN Weights Interpretation

The proposed architecture processes data in chunks X(t) = [x(t), x(t − 1), . . . , x(t − N + 1)] of some prespecified duration of N samples. In the case when the chunk size N equals the length of the first convolution layer weight vector hm, the processing of X(t) by the first two layers applying spatial and temporal filtering can simply be presented as

bm(n) = wm^T X(t) hm    (2)

By design, the ReLu(−1) non-linearity followed by the low-pass filtering performed by the second convolution layer extracts the envelopes of the estimates of the underlying rhythmic LFPs.

Given the one-to-one mapping between the analytic signal and its envelope [2], we can mentally replace the task of optimizing the parameters of the first three layers of the architecture in Fig. 2 to predict the envelopes em(t) with a simple regression task of adjusting the spatial and temporal filter weights to obtain the envelope's generating analytic signal bm(t), see Fig. 2. Fixing the temporal weights to their optimal value hm, the optimal spatial weights can be obtained as a solution to the following convex optimization problem:

wm = argmin_wm {|| bm(n) − wm^T X(t) hm ||^2}    (3)

and similarly for the temporal convolution weights:

hm = argmin_hm {|| bm(t) − wm^T X(t) hm ||^2}    (4)

If we assume statistical independence of the neural sources sm(t), m = 1, . . . , M, then (given the regression problem (3) and the forward model (1)) their topographies can be assessed as:

gm = E{Y(t)Y^T(t)} wm = Rm wm,    (5)

where Rm = E{Y(t)Y^T(t)} is the K × K covariance matrix of the temporally filtered multi-channel data Y(t) = X(t)hm, under the assumption that xk(t), k = 1, . . . , K are all zero-mean processes [3].

We then observe an exactly symmetric recipe for interpreting the temporal weights. The temporal pattern can be found as:

qm = E{V(t)V^T(t)} hm = Rm hm    (6)

where V(t) = X^T(t) wm is a chunk of the input signal passed through the spatial filter and Rm = E{V(t)V^T(t)} is a branch-specific N × N covariance matrix of the spatially filtered data. Here we again assume that xk(t), k = 1, . . . , K are zero-mean processes. Commonly, to make sense of the temporal pattern we explore it in the frequency domain, i.e. Qm(f) = Σ_t qm(t) e^(−j2πft), where qm(t) is the t-th element of the temporal pattern vector qm.

When the chunk of data is longer than the filter length, equation (2) has to be written with the convolution operation and will result not in a scalar but in a vector. In this case, using the standard Wiener filtering arguments, we arrive at

Qm(f) = Pm^v(f) Hm(f)    (7)

as the expression for the Fourier domain representation of the LFP activity pattern in the m-th branch. Hm(f) in equation (7) is simply the Fourier transform of the temporal convolution weights vector hm, and Pm^v(f) is the power spectral density of the spatially filtered data.
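Equation (7) — the product of the spatially filtered signal's power spectral density and the filter's frequency response — can be checked numerically with standard spectral estimators (an assumed toy setup; the PSD is estimated with Welch's method and the filter response with `freqz`):

```python
import numpy as np
from scipy.signal import firwin, freqz, welch

rng = np.random.default_rng(4)
fs = 250
T = 60 * fs

# Spatially filtered branch signal: a 20 Hz rhythm plus white noise.
v = np.sin(2 * np.pi * 20 * np.arange(T) / fs) + rng.standard_normal(T)

# Branch temporal convolution weights (hypothetical 10-40 Hz FIR).
h = firwin(65, [10, 40], pass_zero=False, fs=fs)

# Eq. (7): spectral pattern = PSD of the spatially filtered data
# times the magnitude of the filter's frequency response.
f, Pv = welch(v, fs=fs, nperseg=512)
_, H = freqz(h, worN=f, fs=fs)
Q = Pv * np.abs(H)

peak = f[np.argmax(Q)]  # expected near the rhythm's 20 Hz
```

The resulting pattern peaks at the rhythm's frequency rather than spanning the filter's full 10-40 Hz passband, which is exactly why patterns, not raw weights, should be interpreted.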

5 Simulated and Real Data

In order to generate the simulated data, we precisely followed the setup described in our phenomenological diagram in Fig. 1 with the following parameters. We generated four task-related sources with rhythmic LFPs si(t) as narrow-band processes that resulted from filtering Gaussian pseudo-random sequences into the 30-80 Hz, 80-120 Hz, 120-170 Hz and 170-220 Hz bands using FIR filters. We added 10 task-unrelated sources per band, with activation time series located in four bands: 40-70 Hz, 90-110 Hz, 130-160 Hz and 180-210 Hz. The kinematics z(t) was generated as a linear combination of the four envelopes. To simulate the volume conduction effect we randomly generated the 5 × 4 dimensional forward matrix G and the 5 × 40 dimensional forward matrix A. We simulated 15 min of synthetic data sampled at 1000 Hz and then split it into equal contiguous train and test parts.

We used the open-source ECoG + kinematics data set from BCI Competition IV, collected by Kubanek et al., to compare our compact DNN's decoding quality to that of linear models with pre-engineered features. The winning solution provided by Liang and Bougrain [6] was chosen as the baseline in this comparison. Another data set is our own CBI (Center for Bioelectric Interfaces) ECoG data recorded with a 64-channel microgrid during self-paced flexion of each individual finger over 1 min. The ethics research committee of the National Research University Higher School of Economics approved the experimental protocol of this study.

6 Simulated Data Results

We trained the algorithm on the simulated data to decode the kinematics z(t) and then recovered the patterns of the sources that were found to be important for this task. Figure 3 shows that only the Patterns approach, which uses branch-specific temporal filters, produces a good match with the simulated topographies based on the true underlying sources. The spectral characteristics of the trained temporal filtering weights demonstrate characteristic dips in the bands that correspond to the activity of the interference sources. Using the estimation-theoretic approach (7), we acquire spectral patterns that closely match the simulated ones, with the dips compensated.

[Figure: three rows of panels showing temporal patterns (spectra over roughly 50-250 Hz) and spatial patterns; legend: True, Patterns, Patterns naive, Weights.]

Fig. 3. Temporal and spatial patterns acquired for a noisy case, SNR = 1.5. See the main text for a more detailed description.

7 Real Data Results: BCI Competition IV

In the context of processing electrophysiological data, the main advantage of deep learning based architectures is their ability to perform automatic feature selection in regression or classification tasks [7]. We found that the architecture with the adaptive envelope detectors applied to the Berlin BCI Competition IV data set performs on par with or better than the winning solution [6], see Table 1.

8 Real Data Results: CBI Data

Table 2 shows the accuracy achieved with the proposed architecture for the four fingers of the two patients.

In Fig. 4 we applied the interpretation of the obtained spatial and temporal weights similarly to the way we analysed the realistically simulated data. Below we show the interpretation plots for Patient 1's index finger.

[Figure: per-branch spatial filter weights, vanilla patterns, and recovered patterns, alongside frequency-domain line plots (Learned Weights, Output Spectrum) over roughly 75-125 Hz.]

Fig. 4. The interpretation of network weights for the index finger decoder for patient 1 from the CBI data set. Each line of plots corresponds to one of the three trained decoder branches. The leftmost column shows the spatial filter weights mapped into colours, while the second and third columns correspond to the vanilla spatial patterns and the properly recovered ones. The line graphs interpret the temporal filter weights in the Fourier domain. The filter weights are presented by the solid line, and the power spectral density (PSD) pattern of the underlying LFP is marked by the blue dashed line. The orange dashed line, which is more similar to the filter weights' Fourier coefficients, is the PSD of the signal at the output of the temporal convolution block.

Table 1. Comparative performance of our model architecture (NET) and the winning solution (Winner) of the BCI IV competition, Data set 4: «finger movements in ECoG».

            Thumb  Index  Middle  Ring  Little
Subject 1
  Winner    0.58   0.71   0.14    0.53  0.29
  NET       0.53   0.69   0.19    0.57  0.24
Subject 2
  Winner    0.51   0.37   0.24    0.47  0.35
  NET       0.49   0.35   0.23    0.39  0.22
Subject 3
  Winner    0.69   0.46   0.58    0.58  0.63
  NET       0.72   0.49   0.49    0.53  0.60

Table 2. Decoding performance obtained in two CBI patients. The entries are the correlation coefficients between the actual and decoded finger trajectories for four fingers in two patients.

            Thumb  Index  Ring  Little
Subject 1   0.47   0.80   0.62  0.33
Subject 2   0.74   0.54   0.77  0.80

The DNN architecture for the CBI data had three branches, each tuned to a specific spatial-temporal pattern. We demonstrate the spatial filter weights, and the vanilla and properly recovered patterns, interpreted using the expressions described in the Methods section. As can be seen in Fig. 4, the temporal filter weights (solid line) clearly emphasize the frequency range above 100 Hz in the first two branches, while the actual spectral pattern of the source (dashed line), in addition to the gamma-band content, has peaks at around 11 Hz (in the first and second branches) and in the 25-50 Hz range (the second branch). These may correspond to the sensory-motor rhythm and the lower components of the gamma rhythm, respectively. The third branch appears to be focused on a lower frequency range. Its spatial pattern is notably more diffuse than the patterns focused on the higher frequency components in the first two branches. This is consistent with the relation between the characteristic activation frequency and the size of a neural population.

9 Conclusion

We introduced a novel compact and interpretable architecture motivated by the knowledge present in the field. We also extended the weights interpretation approach described earlier in [3] to the interpretation of the temporal convolution weights. We performed experiments with the proposed approach using both simulated and real data. On the simulated data set, the proposed architecture was able to almost exactly recover the underlying neuronal substrate contributing to the kinematic time series it was trained to decode.

We also applied the proposed architecture to the real data set of the BCI IV competition. Our neural network delivered decoding accuracy similar to that of the winning solution of the BCI competition [6]. Unlike the traditional approach, our DNN model does not require any feature engineering. On the contrary, after training the network to decode finger kinematics, we are able to interpret both the temporal and spatial convolution weights and extract the physiologically meaningful patterns they correspond to.

Acknowledgement. This work is supported by the Center for Bioelectric Interfaces NRU HSE, RF Government grant, ag. No. 14.641.31.0003.

References

1. Buzsaki, G.: Rhythms of the Brain. Oxford University Press, New York (2006)

2. Hahn, S.L.: On the uniqueness of the definition of the amplitude and phase of the analytic signal. Signal Process. 83(8), 1815-1820 (2003)

3. Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.D., Blankertz, B., Bießmann, F.: On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage 87, 96-110 (2014)

4. Lawhern, V.J., Solon, A.J., Waytowich, N.R., Gordon, S.M., Hung, C.P., Lance, B.J.: EEGNet: a compact convolutional network for EEG-based brain-computer interfaces. arXiv preprint arXiv:1611.08024 (2016)

5. Lemm, S., Blankertz, B., Dickhaus, T., Müller, K.R.: Introduction to machine learning for brain imaging. Neuroimage 56(2), 387-399 (2011)

6. Liang, N., Bougrain, L.: Decoding finger flexion from band-specific ECoG signals in humans. Front. Neurosci. 6, 91 (2012). https://doi.org/10.3389/ fnins.2012.00091

7. Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T.H., Faubert, J.: Deep learning-based electroencephalography analysis: a systematic review. J. Neural Eng. 16(5), 051001 (2019)

8. Schirrmeister, R.T., Springenberg, J.T., Fiederer, L.D.J., Glasstetter, M., Eggensperger, K., Tangermann, M., Hutter, F., Burgard, W., Ball, T.: Deep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human EEG. arXiv preprint arXiv:1703.05051 (2017)

9. Zubarev, I., Zetter, R., Halme, H.L., Parkkonen, L.: Adaptive neural network classifier for decoding MEG signals. NeuroImage 197, 425-434 (2019)

Appendix 5. Paper. Compact and interpretable architecture for speech decoding from stereotactic EEG

Authors: Petrosyan A., Voskoboinikov A., Ossadtchi A.

Published in: Third International Conference Neurotechnologies and Neurointerfaces (CNN). IEEE. 2021. P. 79-82. DOI: 10.1109/CNN53494.2021.9580381

Copyright permission: © 2021 IEEE. Reprinted, with permission, from [5].

Compact and interpretable architecture for speech decoding from stereotactic EEG

Artur Petrosyan, Alexey Voskoboynikov, Alexei Ossadtchi

Center for Bioelectric Interfaces, Higher School of Economics, Moscow, Russia, 101000

Email: apetrosyan@hse.ru, avoskoboinikov@hse.ru, aossadtchi@hse.ru

Abstract—Background: Brain-computer interfaces (BCIs) decode neural activity and extract from it information that can be meaningfully interpreted. One of the most intriguing opportunities is to employ BCIs for decoding speech, a uniquely human trait, which opens up plentiful applications, from the rehabilitation of patients to direct and seamless communication between humans. Complex deep neural networks have furnished only limited success in deciphering the neuronal code: in such solutions an iffy performance gain is achieved with uninterpretable decision rules characterised by thousands of parameters that must be identified from a limited amount of training data. Our recent experience shows that, when applied to neural activity data, compact neural networks with trainable and physiologically meaningful feature extraction layers [1] deliver comparable performance, ensure robustness of the learned decision rules, and offer the exciting opportunity of automatic knowledge discovery.

Methods: We collected approximately one hour of data (from two sessions) in which we recorded stereotactic EEG (sEEG) activity during overt speech (6 different randomly shuffled phrases and rest). We also recorded a synchronized audio speech signal. The sEEG recording was carried out in an epilepsy patient implanted for medical reasons with an sEEG electrode passing through Broca's area, with 6 contacts spaced at 5 mm. We then used a compact convolutional network-based architecture to recover speech mel-cepstrum coefficients, followed by a 2D convolutional network to classify individual words. We then interpreted the former network's weights using the theoretically justified approach devised by us earlier [1].

Results: We achieved on average 44% accuracy in classifying 26+1 words (3.7% chance level) using only 6 channels of data recorded with a single minimally invasive sEEG electrode. We compared the performance of our compact convolutional network to that of the DenseNet-like architecture that has recently been featured in the neural speech decoding literature and did not find statistically significant performance differences. Moreover, our architecture learned faster and resulted in a stable, interpretable and physiologically meaningful decision rule successfully operating over a contiguous data segment non-overlapping with the training data interval. The spatial characteristics of the neuronal populations pivotal to the task corroborate the results of the active speech mapping procedure, and the frequency domain patterns show primary involvement of high-frequency activity.

Conclusions: Most of the speech decoding solutions available to date either use potentially harmful intracortical electrodes or rely on data recorded with impractically massive multi-electrode grids covering a large cortical area. Here we for the first time achieved practically usable decoding accuracy for a vocabulary of 26 words + 1 silence class backed by only 6 channels of cortical activity sampled with a single sEEG shaft. The decoding was implemented using a compact and interpretable architecture, which ensures robustness of the solution and requires a small amount of training data. The proposed approach is the first step towards a minimally invasive implantable BCI solution for restoring speech function.

I. Introduction

Brain-computer interfaces (BCIs) directly link the nervous system to external devices [2] or even other brains [3]. While there exist many applications of BCIs [4], clinically relevant BCIs are of primary interest since they hold promise to rehabilitate patients with sensory, motor, and cognitive disabilities [5],[6].

BCIs can deal with a variety of neural signals [7], [8] such as, for example, electroencephalographic (EEG) potentials sampled with electrodes placed on the surface of the head [9], or neural activity recorded invasively with intracortical electrodes penetrating the cortex [10] or placed onto the cortical surface [11]. A promising and minimally invasive way to directly access cortical activity is to use stereotactic EEG (sEEG) electrodes inserted stereotactically via a twist drill or a burr hole made in the skull. Recent advances in implantation techniques, including the use of the brain's 3D angiography, MRI and robot-assisted surgery, help to further reduce the risks of such an implantation and make sEEG technology an ideal trade-off for BCI applications [12].

The current study deals with the restoration of the speech function, one of the most exciting potential applications of BCI technology. Several attempts have already been made, and certain progress has been achieved in decoding both individual words [13], [14] and phonetic features [15] with practically usable accuracy. However, these studies relied on heavily multichannel brain activity measurements implemented either with intracortical arrays [16] or with massive ECoG grids [17], [13], [18] covering a significant cortical area. Both solutions for reading off brain activity are not intended for long-term use and are associated with significant risks to the patient [19]. sEEG is a promising alternative that has already been tried for the speech decoding task [20] with some success. That study, however, relies on a high count of sEEG channels distributed over a large part of the left frontal and left superior temporal lobes, which hinders practical applications.

