
Accepted Papers

Representation Learning 1

[19] Robust Domain Adaptation: Representations, Weights and Inductive Bias  [PDF] [VIDEO] [SLIDES]
  Victor Bouvier, Philippe Very, Clément Chastagnol, Myriam Tami and Céline Hudelot
Abstract: Unsupervised Domain Adaptation (UDA) has attracted a lot of attention in the last ten years. The emergence of Domain Invariant Representations (IR) has drastically improved the transferability of representations from a labelled source domain to a new and unlabelled target domain. However, a potential pitfall of this approach, namely the presence of label shift, has been brought to light. Some works address this issue with a relaxed version of domain invariance obtained by weighting samples, a strategy often referred to as Importance Sampling. From our point of view, the theoretical aspects of how Importance Sampling and Invariant Representations interact in UDA have not been studied in depth. In the present work, we present a bound on the target risk which incorporates both weights and invariant representations. Our theoretical analysis highlights the role of inductive bias in aligning distributions across domains. We illustrate it on standard benchmarks by proposing a new learning procedure for UDA. We observe empirically that a weak inductive bias makes adaptation more robust. The elaboration of stronger inductive biases is a promising direction for new UDA algorithms.
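As a concrete, hedged illustration of the weighting strategy discussed above (a minimal sketch of the general idea, not the authors' procedure), the snippet below estimates label-shift importance weights from a classifier's hard pseudo-labels on the target domain and uses them to reweight the source risk; the inputs `ys`, `target_probs` and `losses` are hypothetical.

```python
# Sketch: importance weights under label shift, w[c] = p_target(c) / p_source(c).
import numpy as np

def label_shift_weights(ys, target_probs, n_classes):
    """ys: integer source labels; target_probs: [n_t, C] predicted probabilities."""
    p_source = np.bincount(ys, minlength=n_classes) / len(ys)
    p_target = np.bincount(target_probs.argmax(axis=1),
                           minlength=n_classes) / len(target_probs)
    return p_target / np.maximum(p_source, 1e-12)

def weighted_source_risk(losses, ys, w):
    # Each source sample counts proportionally to its class frequency in the target.
    return np.mean(w[ys] * losses)
```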
 
[20] Towards a Better Understanding of Meta-Learning Methods through the Theory of Multi-Task Representation Learning  [PDF] [VIDEO] [SLIDES]
  Quentin Bouniot, Ievgen Redko, Romaric Audigier and Angélique Loesch
Abstract: In this paper, we consider the framework of multi-task representation learning, where the goal is to use source tasks to learn a representation that reduces the sample complexity of solving a target task. We start by reviewing recent advances in multi-task learning theory and show that they can provide novel insights into popular meta-learning algorithms when analyzed within this framework. In particular, we highlight a fundamental difference between gradient-based and metric-based algorithms and propose a theoretical analysis to explain it. Finally, we use the derived results to improve the generalization capacity of meta-learning methods via a new spectral regularization term and confirm its efficiency through experimental studies on classic few-shot image classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent bounds from the theory of multi-task representation learning into practice.
 
[37] Linear Program Powered Attack  [PDF] [VIDEO]
  Ismaila Seck, Gaëlle Loosli and Stéphane Canu
Abstract: Finding the exact robust test error is a good way to assess neural networks, but it is a difficult task even on small networks and datasets like MNIST. On the one hand, comprehensive methods such as Mixed Integer Programming (MIP) give the exact robust test accuracy but are time-consuming. On the other hand, many popular attacks are fast but tend to perform poorly against robust networks and only provide a bound on the robust test error. The purpose of this paper is to present a fast and novel attack method called LiPPA, which gives better bounds than previous attacks. This method exploits the algebraic properties of networks with piecewise-linear activation functions to partition the input space in such a way that, for each subset of the partition, finding the locally optimal adversarial example amounts to solving a linear program. Switching from one subset to another is done using classic gradient-based attack tools. The empirical evidence reported on MNIST illustrates the advantage of LiPPA over state-of-the-art fast attacks.
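The following sketch illustrates the core mechanism described above for a toy case (one hidden ReLU layer, a scalar decision score, L-infinity norm); it is our hedged reading of the idea, not the authors' implementation. Inside the activation region of a point x0 the network is affine, so the closest score-flipping input within that region is the solution of a linear program.

```python
# Sketch: locally optimal adversarial example inside one linear region of a ReLU net.
import numpy as np
from scipy.optimize import linprog

def local_adversarial_lp(W1, b1, w2, b2, x0):
    """Minimize ||x - x0||_inf so that the sign of w2.relu(W1 x + b1) + b2 flips,
    restricted to the activation region of x0."""
    d, h = x0.size, b1.size
    s = (W1 @ x0 + b1 > 0).astype(float)        # activation pattern at x0
    a = (w2 * s) @ W1                           # affine slope inside the region
    c0 = float((w2 * s) @ b1 + b2)              # affine intercept
    sign = np.sign(a @ x0 + c0)                 # current decision sign

    A, b = [], []
    A.append(np.append(sign * a, 0.0)); b.append(-sign * c0)   # flip the decision
    for i in range(h):                          # stay inside the activation region
        A.append(np.append((1 - 2 * s[i]) * W1[i], 0.0))
        b.append((2 * s[i] - 1) * b1[i])
    for j in range(d):                          # |x_j - x0_j| <= t, t = last variable
        e = np.zeros(d + 1); e[j], e[-1] = 1.0, -1.0
        A.append(e); b.append(x0[j])
        e = np.zeros(d + 1); e[j], e[-1] = -1.0, -1.0
        A.append(e); b.append(-x0[j])

    cost = np.zeros(d + 1); cost[-1] = 1.0      # minimize t = ||x - x0||_inf
    res = linprog(cost, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * d + [(0, None)], method="highs")
    return res.x[:d] if res.success else None
```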
 

Statistical Learning Theory

[15] Learning Majority Votes by Minimizing a PAC-Bayesian C-Bound
  Paul Viallard, Pascal Germain and Emilie Morvant
Abstract: In the PAC-Bayesian literature, the C-bound is a loss function that takes into account the first two statistical moments of the margin of a model expressed as a majority vote. The C-bound-based learning algorithms developed so far only minimize the empirical version of the C-bound, instead of explicitly minimizing a (PAC-Bayesian) generalization bound. In this paper, we derive the first algorithms that minimize existing generalization bounds on the C-bound. Our algorithms, based on gradient descent, learn majority votes with low risk and strong, non-trivial guarantees.
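For concreteness, here is a minimal sketch of the empirical C-bound itself, which only involves the first two moments of the rho-weighted margin; the inputs `votes` (voter outputs in {-1, +1}) and `rho` are hypothetical.

```python
# Sketch: empirical C-bound of a rho-weighted majority vote.
import numpy as np

def empirical_c_bound(votes, y, rho):
    """votes[i, j] = h_j(x_i) in {-1, +1}; y in {-1, +1}; rho: voter weights."""
    margins = y * (votes @ rho)            # rho-weighted margin of each example
    m1, m2 = margins.mean(), (margins ** 2).mean()
    assert m1 > 0, "the C-bound requires a positive first margin moment"
    return 1.0 - m1 ** 2 / m2              # upper-bounds the majority-vote risk
```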
 
[16] Derandomization of PAC-Bayesian Bounds
  Paul Viallard, Pascal Germain and Emilie Morvant
Abstract: PAC-Bayesian generalization bounds are known to be tight and informative when studying the generalization ability of stochastic classifiers. However, when applied to a family of deterministic models such as neural networks, they require a costly derandomization step. To avoid this step, we introduce three new PAC-Bayesian generalization bounds that have the originality of being pointwise, meaning that they provide guarantees for a single hypothesis drawn from a (learned) distribution. Our bounds are general, potentially parameterizable, and offer new perspectives for various machine learning settings involving deterministic models. We illustrate this result with a generalization analysis of neural networks.
 
[51] Implicit Regularization via Neural Feature Alignment  
  Aristide Baratin, Thomas George, César Laurent, R Devon Hjelm, Guillaume Lajoie, Pascal Vincent and Simon Lacoste-Julien
Abstract: We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al., along a small number of task-relevant directions. This can be interpreted as a combined mechanism of feature selection and compression. By extrapolating a new analysis of Rademacher complexity bounds for linear models, we motivate and study a heuristic complexity measure that captures this phenomenon, in terms of sequences of tangent kernel classes along optimization paths. This article was published at AISTATS 2021.
 

NLP

[3] Let’s Stop Incorrect Comparisons in End-to-end Relation Extraction!  [PDF]
  Bruno Taillé, Vincent Guigue, Geoffrey Scoutheeten and Patrick Gallinari
Abstract: Article accepted at EMNLP 2020. Despite efforts to distinguish three different evaluation setups (Bekoulis et al., 2018), numerous end-to-end Relation Extraction (RE) articles present unreliable performance comparisons to previous work. In this paper, we first identify several patterns of invalid comparisons in published papers and describe them to avoid their propagation. We then propose a small empirical study to quantify the impact of the most common mistake and show that it leads to overestimating the final RE performance by around 5% on ACE05. We also seize this opportunity to study the unexplored ablations of two recent developments: the use of language model pretraining (specifically BERT) and span-level NER. This meta-analysis emphasizes the need for rigor in reporting both the evaluation setting and the dataset statistics. We finally call for unifying the evaluation setting in end-to-end RE.
 
[11] A Neural Few-Shot Text Classification Reality Check  
  Thomas Dopierre, Christophe Gravier and Wilfried Logerais
Abstract: Modern classification models tend to struggle when the amount of annotated data is scarce. To overcome this issue, several neural few-shot classification models have emerged, yielding significant progress over time, both in Computer Vision and Natural Language Processing. In the latter, such models used to rely on fixed word embeddings before the advent of transformers. Additionally, some models used in Computer Vision are yet to be tested in NLP applications. In this paper, we compare all these models, first adapting those made in the field of image processing to NLP, and second providing them access to transformers. We then test these models, equipped with the same transformer-based encoder, on the intent detection task, known for having a large number of classes. Our results reveal that while methods perform almost equally on the ARSC dataset, this is not the case for the intent detection task, where the most recent and supposedly best competitors perform worse than older and simpler ones (while all are given access to transformers). We also show that a simple baseline is surprisingly strong. All the newly developed models, as well as the evaluation framework, are made publicly available.
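As a hedged illustration of the metric-based family compared in this paper, the sketch below shows the core of a prototypical-network classifier: class prototypes are mean embeddings of the support set, and queries are scored by negative distance to each prototype. The embeddings are assumed to be precomputed by some encoder.

```python
# Sketch: nearest-prototype classification on precomputed embeddings.
import torch

def proto_logits(support, support_labels, queries, n_classes):
    """support: [S, D] embeddings, support_labels: [S] ints, queries: [Q, D]."""
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(n_classes)])   # [C, D] class means
    return -torch.cdist(queries, prototypes)  # higher logit = closer prototype
```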
 
[6] Representing Authors with Gaussian Distributions in a Dynamic Context
  Antoine Gourru, Julien Velcin and Julien Jacques
Abstract: Authors publish documents in a dynamic manner. Their topics of interest and writing style may change over time. Tasks such as author classification, author identification or link prediction are difficult to solve in such complex data contexts. We propose a new representation learning model that captures temporal evolution. We formulate a general generative hypothesis: the representation of an author at time $t$ is a Gaussian distribution that generates document vectors and depends on the publications observed up to time $t$. We propose two models that fit within this framework. The first is based on a Markovian model and is optimized using Kalman equations, while the second relies on a recurrent neural network. We evaluate our method on several quantitative tasks (author identification, classification, and co-author prediction) on two datasets written in English. Moreover, our model is language-agnostic since it uses pre-trained document vectors.
 

Representation Learning 2

[22] Beneficial effect of combined replay for continual learning  
  Miguel A. Solinas, Marion Mainsant, Marina Reyboz, Stephane Rousset and Martial Mermillod
Abstract: While deep learning has yielded remarkable results in a wide range of applications, artificial neural networks suffer from catastrophic forgetting of old knowledge as new knowledge is learned. Rehearsal methods overcome catastrophic forgetting by replaying an amount of previously learned data stored in dedicated memory buffers. Alternatively, pseudo-rehearsal methods generate pseudo-samples to emulate the previously learned data, thus alleviating the need for dedicated buffers. Unfortunately, up to now, these methods have shown limited accuracy. In this work, we combine these two approaches and employ the data stored in tiny memory buffers as seeds to enhance the pseudo-sample generation process. We then show that pseudo-rehearsal can outperform rehearsal methods for small buffer sizes, thanks to an improvement in the retrieval process of previously learned information. Our combined replay approach consists of a hybrid architecture that generates pseudo-samples through a reinjection sampling procedure (i.e. iterative sampling). The generated pseudo-samples are then interlaced with the new data to acquire new knowledge without forgetting the previous one. We evaluate our method extensively on the MNIST, CIFAR-10 and CIFAR-100 image classification datasets, and present state-of-the-art performance using tiny memory buffers.
 
[35] Adversarial dictionary learning  [PDF] [VIDEO] [SLIDES]
  Jordan Frecon, Lucas Anquetil, Gilles Gasso and Stéphane Canu
Abstract: This work frames the learning of multiple adversarial perturbations as a sparse dictionary learning problem, bridging the gap between specific and universal attacks. On the one hand, this framework makes it possible to attack new examples by learning only the coding vectors, provided that the dictionary is known. On the other hand, an a posteriori study of the atoms unveils the most common patterns used to attack the classifier. Numerical experiments conducted on CIFAR-10 illustrate that our approach, termed Sparse Coding of ADversarial Attacks (SCADA), achieves higher fooling rates on the deep model than state-of-the-art attacks, with smaller adversarial perturbations.
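A minimal sketch of the decomposition at the heart of this framework, with illustrative shapes only: each adversarial perturbation is a sparse combination of shared dictionary atoms, clipped to the attack budget. The optimization of the dictionary and of the codes is omitted.

```python
# Sketch: an attack as a sparse code over a shared dictionary, delta = D @ v.
import numpy as np

rng = np.random.default_rng(0)
d, n_atoms, eps = 3 * 32 * 32, 10, 8 / 255     # CIFAR-10-like dimensions, L-inf budget
D = rng.normal(size=(d, n_atoms))              # dictionary of atoms (here random, not learned)
v = np.zeros(n_atoms)                          # sparse coding vector for one example
v[rng.choice(n_atoms, 2, replace=False)] = 0.5
delta = np.clip(D @ v, -eps, eps)              # sparse-coded, budget-clipped perturbation
```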
 
[29] Target Consistency for Domain Adaptation: when Robustness meets Transferability  [PDF]
  Yassine Ouali, Victor Bouvier, Myriam Tami and Céline Hudelot
Abstract: Learning Invariant Representations has been successfully applied for reconciling a source and a target domain in Unsupervised Domain Adaptation. In this work, we start by investigating the robustness of such methods through the prism of the cluster assumption, bringing new empirical evidence that invariance with a low source risk does not guarantee a well-performing target classifier. More precisely, we show that the cluster assumption is violated in the target domain despite being maintained in the source domain, indicating a lack of robustness of the target classifier. To address this problem, we demonstrate the importance of enforcing the cluster assumption in the target domain, which we name Target Consistency (TC), especially when paired with a loss that promotes (class-level) invariance. Our new approach results in a significant improvement on image classification and segmentation benchmarks over state-of-the-art methods based on invariant representations. Importantly, our method is flexible and easy to implement, making it a complementary technique to existing approaches for improving the transferability of representations.
 

Optimization

[41] Anderson acceleration of coordinate descent  [PDF]
  Quentin Bertrand and Mathurin Massias
Abstract: Acceleration of first-order methods is mainly obtained via inertia à la Nesterov or via nonlinear extrapolation. The latter has seen a recent surge of interest, with successful applications to gradient and proximal gradient techniques. On multiple machine learning problems, coordinate descent achieves performance significantly superior to full-gradient methods. Speeding up coordinate descent in practice is not easy: inertially accelerated versions of coordinate descent are theoretically accelerated but might not always lead to practical speed-ups. We propose an accelerated version of coordinate descent using extrapolation, showing considerable speed-ups in practice compared to inertially accelerated coordinate descent and extrapolated (proximal) gradient descent. Experiments on least squares, Lasso, elastic net and logistic regression validate the approach.
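As a hedged illustration, the sketch below implements the standard form of Anderson extrapolation on a sequence of iterates (with a small ridge term for numerical stability); how the iterates are produced by coordinate descent, and the safeguards used in the paper, are omitted.

```python
# Sketch: Anderson extrapolation — combine the last iterates with weights c
# solving (U^T U) z = 1, c = z / sum(z), where U stacks iterate differences.
import numpy as np

def anderson_extrapolate(iterates, reg=1e-10):
    """iterates: list of K+1 parameter vectors x_0, ..., x_K."""
    X = np.array(iterates)
    U = np.diff(X, axis=0)                     # K x d matrix of differences
    G = U @ U.T + reg * np.eye(U.shape[0])     # small ridge for stability
    z = np.linalg.solve(G, np.ones(U.shape[0]))
    c = z / z.sum()                            # extrapolation coefficients
    return c @ X[1:]                           # extrapolated point
```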
 
[47] Private Empirical Risk Minimization via Coordinate Descent  [VIDEO] [SLIDES]
  Paul Mangold, Aurélien Bellet, Joseph Salmon and Marc Tommasi
Abstract: It is now well known that machine learning models can leak the data on which they were trained. When this data is confidential, this can raise serious security issues or violate individuals' privacy, and the learning procedure must be adapted accordingly. In this paper, we propose a new algorithm to solve this problem: a private coordinate descent algorithm, which updates the model parameters one coordinate at a time, from a single noisy gradient coordinate. We analyze the convergence of this algorithm theoretically and quantify its utility by bounding, in expectation, the gap to the optimal value of the loss function. We show that the resulting privacy-utility trade-off matches the known lower bounds for the problem at hand. We then derive a rule to adapt, in practice, the clipping thresholds of the gradients associated with each coordinate, avoiding tedious tuning. We compare our private coordinate descent algorithm with the most popular algorithm for private empirical risk minimization, DP-SGD, and show that in some regimes our algorithm outperforms it. Finally, we run numerical simulations on synthetic data that validate our theoretical results and demonstrate the relevance of the proposed approach.
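A minimal sketch of one private coordinate-descent step, as we read the idea described above (not the authors' code): a single gradient coordinate is clipped to a per-coordinate threshold and perturbed with Gaussian noise calibrated to that threshold. All names and the noise calibration are illustrative.

```python
# Sketch: one differentially private coordinate descent update.
import numpy as np

def dp_cd_step(theta, grad_fn, j, step, clip, sigma, rng):
    """theta: parameters; grad_fn: full-gradient oracle; j: coordinate index;
    step, clip: per-coordinate step sizes and clipping thresholds."""
    g_j = grad_fn(theta)[j]                        # j-th gradient coordinate
    g_j = np.clip(g_j, -clip[j], clip[j])          # bound the sensitivity
    g_j += rng.normal(scale=sigma * clip[j])       # noise calibrated to clip[j]
    theta = theta.copy()
    theta[j] -= step[j] * g_j                      # coordinate update
    return theta
```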
 
[30] Smooth optimization of orthogonal wavelet basis  [PDF]
  Jordan Frecon, Riccardo Grazzi, Saverio Salzo and Massimiliano Pontil
Abstract: Wavelets are a powerful tool for signal and image processing tasks. They make it possible to analyze the noise level separately at multiple scales and to adapt the denoising algorithm accordingly. However, performance strongly relies on the choice of the wavelet basis. The aim of this work is to learn the wavelet basis that is adapted to both the denoising task and the class of images at hand. We tackle this problem with a smooth bilevel approach where the wavelet coefficients are optimized at the lower level and the wavelet filters are learned at the upper level. Numerical experiments support the added benefits over classical wavelets.
 

Reinforcement Learning / Bandits

[18] PARENTing via Model-Agnostic Reinforcement Learning to Correct Pathological Behaviors in Data-to-Text Generation  
  Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten and Patrick Gallinari
Abstract: In language generation models conditioned on structured data, classical training via maximum likelihood almost always leads models to pick up on dataset divergences (i.e., hallucinations or omissions) and to incorporate them erroneously in their own generations at inference time. In this work, we build on top of previous Reinforcement Learning based approaches and show that a model-agnostic framework relying on the recently introduced PARENT metric is efficient at reducing both hallucinations and omissions. Evaluations on the widely used WikiBIO and WebNLG benchmarks demonstrate the effectiveness of this framework compared to state-of-the-art models.
 
[28] Epsilon Best Arm Identification in Spectral Bandits  
  Tomáš Kocák and Aurélien Garivier
Abstract: We propose an analysis of PAC identification of an $\epsilon$-best arm in graph bandit models with Gaussian distributions. We consider finite but potentially very large bandit models where the set of arms is endowed with a graph structure, and we assume that the arms' expectations $\mu$ are smooth with respect to this graph. Our goal is to identify an arm whose expectation is at most $\epsilon$ below the largest of all means. We focus on the fixed-confidence setting: given a risk parameter $\delta$, we consider sequential strategies that yield an $\epsilon$-optimal arm with probability at least $1-\delta$. All such strategies use at least $T^*_{R,\epsilon}(\mu)\log(1/\delta)$ samples, where $R$ is the smoothness parameter. We identify the complexity term $T^*_{R,\epsilon}(\mu)$ as the solution of a min-max problem for which we give a game-theoretic analysis and an approximation procedure. This procedure is the key element required by the asymptotically optimal Track-and-Stop strategy.
 
[7] Ranking Items with Unimodal Bandits on Parametric Graphs  [VIDEO1] [VIDEO2] [SLIDES]
  Camille-Sovanneary Gauthier, Romaric Gaudel, Elisa Fromont and Aser Boammani Lompo
Abstract: We aim to solve the online problem of optimally placing (ranking) K items at L predetermined positions on a web page so as to maximize the number of user clicks. We propose an original algorithm, easy to implement and with strong theoretical guarantees, for the position-based user model (PBM), which is well suited to settings where items are placed on a grid and no position is a priori better than another. Our algorithm produces an optimal recommendation by learning a compact graph representing the different permutations of L items among the K positions. The logarithmic regret bound of our bandit algorithm is a direct consequence of the unimodality property of this bandit with respect to the learned graph. Experiments comparing our method to state-of-the-art algorithms for the same user model show that it is much more efficient, while providing regret performance on par with the best known algorithms, on both synthetic and real data. The long (English) version of this article was published at ICML 2021 under the title "Parametric Graph for Unimodal Ranking Bandit".
 
[25] Adaptive Sampling for Best-Policy Identification in MDPs  [VIDEO] [SLIDES]
  Aymen Al Marjani and Alexandre Proutiere
Abstract: We study the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the agent has access to a generative model. The objective is to design a learning algorithm that returns the optimal policy as early as possible. We first establish a problem-specific lower bound on the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sampling allocation that solves a non-convex program, and is therefore hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound on the sample-complexity lower bound, whose corresponding near-optimal sampling allocation becomes explicit. This upper bound depends on specific functionals of the MDP, such as the sub-optimality gaps and the variance of the next-state value function, and thus truly captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this near-optimal allocation, and provide asymptotic guarantees for its sample complexity.
 

Interpretability

[38] Deep GONet: Self-explainable deep neural network based on Gene Ontology for phenotype prediction from gene expression data  [PDF] [VIDEO1] [VIDEO2] [SLIDES]
  Victoria Bourgeais, Farida Zehraoui, Mohamed Ben Hamdoune and Blaise Hanczar
Abstract: Background: With the rapid advancement of genomic sequencing techniques, massive production of gene expression data is becoming possible, which prompts the development of precision medicine. Deep learning is a promising approach for phenotype prediction (clinical diagnosis, prognosis, and drug response) based on gene expression profiles. Existing deep learning models are usually considered black boxes that provide accurate predictions but are not interpretable. However, accuracy and interpretability are both essential for precision medicine. In addition, most models do not integrate domain knowledge. Hence, making deep learning models interpretable for medical applications using prior biological knowledge is the main focus of this paper. Results: We propose a new self-explainable deep learning model, called Deep GONet, integrating the Gene Ontology into the hierarchical architecture of the neural network. This model is based on a fully-connected architecture constrained by the Gene Ontology annotations, such that each neuron represents a biological function. The experiments on cancer diagnosis datasets demonstrate that Deep GONet is both easily interpretable and highly accurate in discriminating cancer from non-cancer samples. Conclusions: Our model provides an explanation of its predictions by identifying the most important neurons and associating them with biological functions, making the model understandable for biologists and physicians.
 
[12] Scalable and Accurate Subsequence Transform for Time Series Classification  [HAL] [VIDEO1] [VIDEO2] [SLIDES]
  Michael Franklin Mbouopda and Engelbert Mephu Nguifo
Abstract: Time series classification using phase-independent subsequences called shapelets is one of the best approaches in the state of the art. This approach is especially appreciated for its interpretability and its fast prediction time. However, given a dataset of $n$ time series of length at most $m$, learning shapelets requires a computation time of $O(n^2m^4)$, which is too high for practical datasets. In this paper, we exploit the fact that shapelets are shared by the members of the same class to propose the SAST (Scalable and Accurate Subsequence Transform) algorithm, which has a time complexity of $O(nm^3)$. SAST is accurate, interpretable and does not learn redundant shapelets. The experiments we conducted on the UCR archive datasets show that SAST is more accurate than the state-of-the-art Shapelet Transform algorithm while being significantly more scalable.
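For concreteness, the following sketch shows the transform step shared by shapelet-based methods such as SAST: a time series is featurized by its minimal distance to a given subsequence. Normalization choices are omitted.

```python
# Sketch: minimal distance between a subsequence and all windows of a series.
import numpy as np

def subsequence_feature(series, shapelet):
    """Min Euclidean distance between `shapelet` and every window of `series`."""
    l = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, l)
    return np.min(np.linalg.norm(windows - shapelet, axis=1))
```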
 
[8] Co-clustering for fair recommendation  
  Gabriel Frisch, Yves Grandvalet and Jean-Benoist Leger
Abstract: Collaborative filtering relies on a sparse rating matrix, where each user rates a few products, to propose recommendations. The approach consists in approximating the sparse rating matrix with a simple model whose regularities make it possible to fill in the missing entries. The latent block model is a generative co-clustering model that can provide such an approximation. In this paper, we show that exogenous sensitive attributes can be incorporated in this model to ensure fair recommendations. Since users are only characterized by their ratings and their sensitive attribute, fairness is measured here by a parity criterion. Introducing the sensitive attribute in the latent block model leads to a classification of users that is independent of the sensitive attribute. We propose a definition of fairness for the recommender system expressing that the ranking of items should be independent of the sensitive attribute. We show that our model ensures approximately fair recommendations provided that the classification of users approximately respects statistical parity.
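As a hedged illustration of the parity criterion mentioned above, the sketch below computes a demographic-parity gap: the difference between the rates at which an item is recommended to the two groups defined by a binary sensitive attribute. Names and inputs are illustrative.

```python
# Sketch: demographic-parity gap of a recommendation decision.
import numpy as np

def parity_gap(recommended, sensitive):
    """recommended: boolean array (item recommended to user or not);
    sensitive: binary group labels; returns |P(rec | g=0) - P(rec | g=1)|."""
    recommended, sensitive = np.asarray(recommended), np.asarray(sensitive)
    return abs(recommended[sensitive == 0].mean()
               - recommended[sensitive == 1].mean())
```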
 

Papers Accepted as Posters

[2] Narcissist! Do you need so much attention?  [PDF]
  Gaëtan Caillaut, Nicolas Dugué and Nathalie Camelin
Abstract: After the rise of word2vec came the BERT era, with large architectures able to deal with polysemy by taking contextual information into account. This burst led to great performance improvements on classic NLP tasks such as part-of-speech tagging or named entity recognition. BERT systems are considered universal: they can be fine-tuned to address any task efficiently. However, these systems are costly to deploy, not trivial to fine-tune, and may not be suited to some corpora, e.g. small, domain-specific ones. For instance, we consider the DEFT 2018 corpus of tweets and show that CamemBERT is not appropriate for this corpus and task. Following Occam's razor, we designed MiniBERT, a tiny BERT architecture, and show in this preliminary paper the benefits of such a system: it is interpretable, greener, and easily trainable and deployable.
 
[4] Naturally Constrained Online Expectation Maximization  
  Daniela Pamplona and Antoine Manzanera
Abstract: With the rise of big datasets, learning algorithms must be adapted to piece-wise mechanisms to tackle the time and memory costs of large-scale computations. Furthermore, for most learning embedded systems, the input data are fed sequentially and contingently: one by one, and possibly class by class. Thus, learning algorithms should not only run online but cope with time-varying, non-independent, and non-balanced training data for the system's entire life. Online Expectation-Maximization is a well-known algorithm for learning probabilistic models in real time, due to its simplicity and convergence properties. However, these properties are only valid in the case of large, independent and identically distributed samples. In this paper, we propose to constrain online Expectation-Maximization with the Fisher distance between the parameters. After presenting the algorithm, we make a thorough study of its use in Probabilistic Principal Components Analysis. First, we derive the update rules, and then we analyze the effect of the constraint on the major problems of online and sequential learning: convergence, forgetting and interference. Furthermore, we use several algorithmic protocols: iid vs sequential data, and constraint parameters updated step-wise vs class-wise. Our results show that this constraint increases the convergence rate of online Expectation-Maximization, decreases forgetting and introduces a slight positive transfer.
 
[9] DeeREKt: Deep Recognition of Emotions using Body Kinematics  
  Victor Brossard, Thomas Peel and Yvonne Delevoye-Turrell
Abstract: Imagine seeing your best friend walking down the street. As you see her approaching, you can already tell whether she is happy or sad. In this work, we present DeeREKt, a deep learning algorithm able to recognize emotions from body kinematics. We use published motion-capture data from 22 actors performing six emotions (anger, disgust, fear, happiness, sadness and surprise) and one neutral condition. From the 58-marker skeleton, we selected 22 emotionally relevant markers to create a lighter skeleton. We based the backbone architecture on DD-Net, a fast and efficient action recognition network. We augmented DeeREKt with a domain-adaptive discriminant head to avoid possible bias from physical differences between the actors. DeeREKt achieves an accuracy of 54% across all seven emotions, against a baseline of 15.76% given by the most represented class (disgust). Top accuracy is achieved for anger (61.3%) while the lowest accuracy is obtained for surprise (29.9%). The lightweight skeleton accelerates the training of the network by almost 50% while maintaining the same level of accuracy. The discriminant head did not increase accuracy, as the emotional information contained in the data might have been sufficient to perform the classification task. In future studies, we aim to apply DeeREKt to a dataset incorporating a wider variety of emotional expressions in individuals with contrasting physical morphologies. The transdisciplinary approach reported here provides valuable insights into how machine learning can provide the tools to test current psychological models of emotional motor behaviors.
 
[10] Manifold exploration of industrial processes with Variational AutoEncoders  [PDF] [SLIDES]
  Brendan L'Ollivier, Sonia Tabti and Julien Budynek
Abstract: In this article, a computationally efficient manifold learning algorithm combining a variational autoencoder and a nearest-neighbor graph is proposed. Using a variational autoencoder to compute an approximation of the underlying data distribution allows our method to tackle some shortcomings of neighbor-graph construction methods, namely their difficulty in dealing with noisy and high-dimensional data. This method aims to extend the range of application of graph-based manifold learning techniques to the complexity of industrial process data. Once a graph is computed, it provides a condensed representation of the behavior of the process. The graph framework also makes it convenient to incorporate industrial metrics, such as product quality, through weight customization. The final graph can be used to assist the operator in selecting optimal process parameter values. The proposed approach is tested on both synthetic and real data.
 
[17] A PAC-Bayesian Analysis of Adversarial Robustness
  Guillaume Vidot, Paul Viallard and Emilie Morvant
Abstract: In this paper, we adapt the setting of adversarial robustness to the PAC-Bayesian framework, known for providing bounds that are tight in average over the hypothesis set (rather than a worst-case analysis). This approach allows us to derive generalization bounds on the risk of models expressed as majority votes, which estimate the extent to which the model is invariant to imperceptible perturbations of the input. This theoretical analysis has the advantage of providing bounds that are (i) independent of the type of perturbation (i.e., the adversarial attack), (ii) tight, and (iii) directly minimizable during learning. We demonstrate empirically that the resulting models are robust to various attacks at classification time.
 
[23] Sample Complexity and Visual Question Answering Systems  [PDF]
  Corentin Kervadec, Grigory Antipov, Moez Baccouche, Madiha Nadri and Christian Wolf
Abstract: Methods addressing the task of visual question answering, which consists in answering a question about an image, are known for their tendency to exploit biases in the data rather than to reason. Recently, it has been shown that reasoning patterns can be encouraged to emerge in the attention layers of state-of-the-art deep learning models by training them on perfect visual data. These models are thus capable of reasoning when training conditions are favorable. However, transferring these reasoning patterns to deployable models, which do not have access to a perfect visual representation, remains difficult. We propose a transfer method based on a regularization mechanism involving the supervision of the sequences of operations needed to answer the question. We provide a theoretical analysis based on PAC-learning, showing that, under certain conditions, such supervision reduces the sample complexity. Finally, we experimentally demonstrate the effectiveness of our method on the GQA dataset, as well as its complementarity with BERT-inspired pre-training methods.
 
[26] A Non-Asymptotic Approach to Best-Arm Identification for Gaussian Bandits  [HAL]
  Antoine Barrier, Aurélien Garivier and Tomáš Kocák
Abstract: We propose a new strategy for the Best-Arm Identification problem with unit-variance Gaussian variables with bounded means. This strategy, called Exploration-Biased Sampling, is asymptotically optimal, but also benefits from non-asymptotic bounds holding with high probability. To our knowledge, it is the first strategy with such guarantees. Its main advantage over other algorithms such as Track-and-Stop, however, is an improved exploration behavior: Exploration-Biased Sampling is slightly biased in favor of exploration in a subtle and natural way that makes the sampling strategy more stable and interpretable. These improvements rely on a new analysis of the sample-complexity optimization problem, which also yields faster solving of this optimization problem and several quantitative regularity results that we believe to be of independent interest.
 
[27] Reconciling partially hidden Markov-switching models with local autoregressive dynamics.  
  Fatoumata Dama and Christine Sinoquet
Abstract: Time series subject to regime changes are encountered in multiple applications, such as those involving industrial processes, machine health monitoring, and econometrics. For discrete-value regimes, the renowned Hidden Markov Model (HMM) describes time series whose states are unknown at all time-steps. In some cases, an annotation function makes it possible to label the time series. Thus, another category of models addresses the case where regimes are known at all time-steps. We present a novel model to handle the intermediate case: the backbone of our model is a Partially Hidden Markov Chain (PHMC). However, in addition to this global nonlinear dynamics, real-world time series would be best described if we could specify autoregressive local state-specific dynamics. Thus, our proposal, the Partially Hidden Markov Chain Linear AutoRegressive (PHMC-LAR) model, aims at reconciling the PHMC framework with LAR local dynamics. We develop a specific instance of the Expectation-Maximization algorithm to learn the parameters of the PHMC-LAR model. We also tackle the inference problem, that is, finding the most probable states for an observed time series and its partial state annotation. We conduct an experimental analysis on simulated data to evaluate how incorporating partial knowledge of states impacts inference performance. Our results highlight that partial information about hidden states may substantially improve inference performance and speed up convergence of the learning algorithm.
 
[31] A Framework using Contrastive Learning for Classification with Noisy Labels  [PDF] [VIDEO]
  Madalina Ciortan, Romain Dupuis and Thomas Peel
Abstract: We propose a framework using contrastive learning as a pre-training task to perform image classification in the presence of noisy labels. Recent strategies, such as pseudo-labelling, sample selection with Gaussian Mixture models, and weighted supervised contrastive learning, are combined into a fine-tuning phase following the pre-training. This paper provides an extensive empirical study showing that a preliminary contrastive learning step brings a significant gain in performance when using different loss functions: non-robust, robust, and early-learning regularized. Our experiments performed on standard benchmarks and real-world datasets demonstrate that: i) the contrastive pre-training increases the robustness of any loss function to noisy labels and ii) the additional fine-tuning phase can further improve accuracy, but at the cost of additional complexity.
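One of the ingredients named above, sample selection with a Gaussian Mixture model, can be sketched as follows: fit a two-component mixture on per-sample losses and keep the samples assigned to the low-loss mode, which are more likely to carry clean labels. This is a generic sketch, not the paper's exact recipe.

```python
# Sketch: GMM-based selection of probably-clean samples from per-sample losses.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_clean(losses, threshold=0.5):
    """losses: 1-D numpy array of per-sample training losses."""
    gmm = GaussianMixture(n_components=2).fit(losses.reshape(-1, 1))
    clean_comp = np.argmin(gmm.means_.ravel())          # low-loss component
    p_clean = gmm.predict_proba(losses.reshape(-1, 1))[:, clean_comp]
    return p_clean > threshold                          # mask of kept samples
```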
 
[33] EXtremely PRIvate supervised Learning  [PDF]
  Armand Lacombe, Saumya Jetley and Michèle Sebag
Abstract: This paper presents a new approach called ExPriL for learning from extremely private data. Iteratively, the learner supplies a candidate hypothesis and the data owner only releases the marginals of the error incurred by the hypothesis. Using the marginals as the supervisory signal, the goal is to learn a hypothesis that fits the target data as well as possible. The privacy of the mechanism is provably enforced, assuming that the overall number of iterations is known in advance.
 
[34] Multiview Artificial Generation Engine (MAGE): a Controlled Data Generator for Multi-View Learning  [VIDEO]
  Baptiste Bauvin, Sokol Koço, Dominique Benielli, Cecile Capponi and François Laviolette
Abstract: Multi-view learning has been a thriving research field for several years, and many approaches have been proposed for a variety of learning problems. However, to the best of our knowledge, most works propose their own definition of supervised multi-view learning and their own experimental frameworks. In order to give a more formal setting for multi-view learning with more than two views, we propose a toolbox for generating native multi-view datasets. Our contributions are two-fold: first we propose three definitions of view interactions, then we introduce MAGE, a Python-based toolbox for dataset generation with view interactions. Theoretical and empirical justifications are provided for each contribution.
 
[42] Self-supervised learning for anomaly detection on time series: application to cellular data  
  Romain Bailly, Marielle Malfante, Cédric Allier, Lamya Ghenim and Jérôme Mars
Abstract: This paper presents a new method for anomaly detection in time series and its application to cellular data. These time series are computed from cell images acquired through lens-free microscopy. In the context of cellular biology, detecting abnormal cells is valuable for any further analysis. Indeed, cells that deviate from healthy trajectories can further drive tissues toward diseases [RAG+20]. It would be both time-consuming and costly to manually analyse each cell in a dataset of ten thousand cells. To avoid this manual process, we present a deep self-supervised approach to automatically detect abnormal cells from their dry-mass time series. A 1D-convolutional neural network is trained to predict the dry mass of cells. An anomaly is detected if the mean squared error (MSE) between prediction and ground truth is above a fixed threshold. This process based on self-supervised learning is tested on a dataset of 9,100 dry-mass time series. The method succeeds in detecting abnormal time series with a precision of 96.6%.
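The detection rule described above can be made explicit with a short, hedged sketch (names are illustrative): a series is flagged as abnormal when the forecaster's mean squared error exceeds the fixed threshold.

```python
# Sketch: MSE-threshold anomaly rule on a forecaster's output.
import numpy as np

def is_abnormal(prediction, ground_truth, threshold):
    """Flag a time series as abnormal when the prediction error is too large."""
    mse = np.mean((prediction - ground_truth) ** 2)
    return mse > threshold
```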
 
[43] Self-Learning for Received Signal Strength Map Reconstruction with Neural Architecture Search  [PDF]
  Aleksandra Malkova, Loic Pauletto, Christophe Villien, Benoit Denis and Massih-Reza Amini
Abstract: In this paper, we present a Neural Network (NN) model based on Neural Architecture Search (NAS) and self-learning for received signal strength (RSS) map reconstruction from sparse single-snapshot input measurements, in the case where data augmentation via side deterministic simulations cannot be performed. The approach first finds an optimal NN architecture and simultaneously trains the resulting model on some ground-truth measurements of a given RSS map. These ground-truth measurements, along with the predictions of the model over a set of randomly chosen points, are then used to train a second NN model having the same architecture. Experimental results show that the signal predictions of this second model outperform non-learning-based interpolation state-of-the-art techniques and NN models with no architecture search on five large-scale maps of RSS measurements.
 
[44] Data-Efficient Information Extraction from Documents with Pre-Trained Language Models  [PDF]
  Clément Sage, Thibault Douzon, Alex Aussem, Véronique Eglin, Haytham Elghazel, Stefan Duffner, Christophe Garcia and Jérémy Espinas
Abstract: As for many text understanding and generation tasks, pre-trained language models have emerged as a powerful approach for extracting information from business documents. However, their performance has not been properly studied in data-constrained settings, which are often encountered in industrial applications. In this paper, we show that LayoutLM, a pre-trained model recently proposed for encoding 2D documents, reveals a high sample efficiency when fine-tuned on public and real-world Information Extraction (IE) datasets. Indeed, LayoutLM reaches more than 80% of its full performance with as few as 32 documents for fine-tuning. When compared with a strong baseline learning IE from scratch, the pre-trained model needs 4 to 30 times fewer annotated documents in the toughest data conditions. Finally, LayoutLM performs better on the real-world dataset when it has first been fine-tuned on the full public dataset, indicating valuable knowledge-transfer abilities. We therefore advocate the use of pre-trained language models for tackling practical extraction problems.
 
[45] Bounds for the Multi-Class Majority Vote in the Presence of Class-Label Noise  [PDF]
  Vasilii Feofanov, Emilie Devijver and Massih-Reza Amini
Abstract: In this work, we consider the multi-class classification setting with training examples whose class labels are imperfect. We model this imperfection with a probabilistic error model. On this basis, we derive theoretical guarantees for a majority-vote classifier by extending the multi-class C-bound, a second-order upper bound. Finally, we empirically illustrate the behavior of the bound and discuss its application to semi-supervised approaches based on pseudo-labeling, in particular self-training.
 
[48] Efficient Preparation of Training Data: Application to Image Classification for Breast Cancer Detection  [PDF] [VIDEO]
  Mouna Mayouf and Florence Dupin De Saint Cyr
Abstract: Measuring the "informativeness" of a dataset for a classification task is a difficult question. In this article, we attempt to circumscribe this notion by introducing new measures and stating some principles for data preparation. We test the usefulness of these measures and the validity of these principles through several protocols designed to compare different ways of preparing the data. We conclude by relating the efficiency of data preparation to its theoretical diversity.
 
[50] Sequential Learning of User Preferences for Recommender Systems  [PDF]
  Aleksandra Burashnikova, Yury Maximov, Marianne Clausel, Charlotte Laclau, Franck Iutzeler and Massih-Reza Amini
Abstract: In this article, we present a sequential strategy for learning large-scale recommender systems from implicit feedback, mainly in the form of clicks. The proposed approach consists in minimizing the ranking error over blocks of consecutive items, each block consisting of a sequence of non-clicked items followed by a clicked one, for each user. To avoid updating the model parameters on an abnormally high number of clicks (mainly due to bots), we introduce upper and lower thresholds on the number of parameter updates for each user. These thresholds are estimated from the distribution of the number of blocks in the training set. We provide a convergence analysis of the algorithm and demonstrate its efficiency empirically on six collections, with respect to both various performance measures and computation time.
 