April 30, 2016

Trainlets: cropped wavelet decomposition for high-dimensional learning

It has been a long time: element 120 of the aperiodic table of wavelets is the trainlet, from Jeremias Sulam, Boaz Ophir, Michael Zibulevsky, and Michael Elad, Trainlets: Dictionary Learning in High Dimensions:
Abstract: Sparse representations has shown to be a very powerful model for real world signals, and has enabled the development of applications with notable performance. Combined with the ability to learn a dictionary from signal examples, sparsity-inspired algorithms are often achieving state-of-the-art results in a wide variety of tasks. Yet, these methods have traditionally been restricted to small dimensions mainly due to the computational constraints that the dictionary learning problem entails. In the context of image processing, this implies handling small image patches. In this work we show how to efficiently handle bigger dimensions and go beyond the small patches in sparsity-based signal and image processing methods. We build our approach based on a new cropped wavelet decomposition, which enables a multi-scale analysis with virtually no border effects. We then employ this as the base dictionary within a double sparsity model to enable the training of adaptive dictionaries. To cope with the increase of training data, while at the same time improving the training performance, we present an Online Sparse Dictionary Learning (OSDL) algorithm to train this model effectively, enabling it to handle millions of examples. This work shows that dictionary learning can be up-scaled to tackle a new level of signal dimensions, obtaining large adaptable atoms that we call trainlets.
Trainlets offer a base dictionary used within a double sparsity model to enable the training of adaptive dictionaries. The associated package is here, from Michael Elad's software page. The cropped wavelet decomposition enables a multi-scale analysis with virtually no border effects. An entry on trainlets has been added to WITS, the aperiodic table of wavelets.

But things always end up with a song! Two of my favorite train songs, by Porcupine Tree (Trains) and the Nits (The train).

April 24, 2016

M-band 2D dual-tree (Hilbert) wavelet multicomponent image denoising

The toolbox implements a parametric nonlinear estimator that generalizes several wavelet shrinkage denoising methods. Dedicated to additive Gaussian noise, it adopts a multivariate statistical approach to take into account both the spatial and the inter-component correlations existing between the different wavelet subbands, using a Stein Unbiased Risk Estimator (SURE) principle to derive optimal parameters. The wavelet choice is a slightly redundant multi-band geometrical dual-tree wavelet frame. In experiments on multispectral remote sensing images, the method outperforms conventional wavelet denoising techniques (including curvelets). Since the wavelets are based on MIMO (multi-input, multi-output) filter banks, in a multi-band fashion, we can quite safely call them MIMOlets. The dual-tree wavelet consists of two directional wavelet trees, displayed below for a 4-band filter:

4-band directional dual-tree wavelets

The set of wavelet functions implements:
The demonstration script is Init_Demo.m, and the functions for M-band dual-tree wavelets are provided in the directory TOOLBOX_DTMband_solo. For instance, the clean multispectral image (port of Tunis, only one channel):

The (very) noisy version:

The denoised one:
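
To give a flavor of the SURE principle mentioned above, here is a minimal sketch of the classical scalar case: choosing a soft threshold that minimizes Stein's unbiased risk estimate for coefficients corrupted by Gaussian noise of known standard deviation. It is not the toolbox's multivariate, multicomponent estimator; the function names (soft_threshold, sure_soft, best_threshold) and the toy data are hypothetical, chosen for illustration only.

```python
def soft_threshold(y, t):
    """Shrink each coefficient toward zero by t (illustrative helper)."""
    return [v - t if v > t else v + t if v < -t else 0.0 for v in y]


def sure_soft(y, t, sigma):
    """Stein's unbiased risk estimate for soft thresholding at level t,
    assuming i.i.d. Gaussian noise with known standard deviation sigma."""
    n = len(y)
    small = sum(1 for v in y if abs(v) <= t)          # coefficients set to zero
    clipped = sum(min(abs(v), t) ** 2 for v in y)     # data-fidelity part
    return n * sigma ** 2 - 2 * sigma ** 2 * small + clipped


def best_threshold(y, sigma):
    """Pick the threshold minimizing SURE among 0 and the observed magnitudes."""
    candidates = [0.0] + sorted(abs(v) for v in y)
    return min(candidates, key=lambda t: sure_soft(y, t, sigma))


# Toy coefficients (assumed data): three small noisy values, two large ones.
noisy = [0.1, -0.2, 0.05, 5.0, -4.0]
t_star = best_threshold(noisy, sigma=1.0)
denoised = soft_threshold(noisy, t_star)
```

The risk estimate only needs the noisy data and the noise level, which is what makes such data-driven parameter selection possible without access to the clean image.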

November 10, 2015

BRANE Cut: Biologically-Related Apriori Network Enhancement with Graph cuts

[BRANE Cut featured on RNA-Seq blog][Omic tools][bioRxiv preprint][PubMed/Biomed Central][BRANE Cut code]

Gene regulatory networks are somewhat difficult to infer. This first piece of an ongoing work (termed BRANE *, for Biologically Related Apriori Network Enhancement) introduces an optimization method (based on Graph cuts, borrowed from computer vision/image processing) to infer graphs based on biologically-related a priori (including sparsity). It is successfully tested on DREAM challenge data and an Escherichia coli network, with a specific effort to derive optimization parameters from gene network cardinality properties. And it is quite fast.

Background: Inferring gene networks from high-throughput data constitutes an important step in the discovery of relevant regulatory relationships in organism cells. Despite the large number of available Gene Regulatory Network inference methods, the problem remains challenging: the underdetermination in the space of possible solutions requires additional constraints that incorporate a priori information on gene interactions.

Methods: Weighting all possible pairwise gene relationships by a probability of edge presence, we formulate the regulatory network inference as a discrete variational problem on graphs. We enforce biologically plausible coupling between groups and types of genes by minimizing an edge labeling functional coding for a priori structures. The optimization is carried out with Graph cuts, an approach popular in image processing and computer vision. We compare the inferred regulatory networks to results achieved by the mutual-information-based Context Likelihood of Relatedness (CLR) method and by the state-of-the-art GENIE3, winner of the DREAM4 multifactorial challenge.

Our BRANE Cut approach infers more accurately the five DREAM4 in silico networks (with improvements from 6 % to 11 %). On a real Escherichia coli compendium, an improvement of 11.8 % compared to CLR and 3 % compared to GENIE3 is obtained in terms of Area Under Precision-Recall curve. Up to 48 additional verified interactions are obtained over GENIE3 for a given precision. On this dataset involving 4345 genes, our method achieves a performance similar to that of GENIE3, while being more than seven times faster. The BRANE Cut code is available at: http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-cut.html.

Conclusions: BRANE Cut is a weighted graph thresholding method. Using biologically sound penalties and data-driven parameters, it improves three state-of-the-art GRN inference methods. It is applicable as a generic network inference post-processing, due to its computational efficiency.
Keywords:  Network inference, Reverse engineering, Discrete optimization, Graph cuts, Gene expression data, DREAM challenge.

September 19, 2015

Big data, fishes and cooking: fourteen shades of "V"

[In this short post, you can access the 14 "V" often glued to Big Data, including vacuity]

To Lao Tzu is often attributed (I cannot access the original meaning):
Govern a great nation as you would cook a small fish. Do not overdo it.
Today's wisdom could be:
Deal with Big data as you would process a small signal. Do not over-expect from it, do not over-fit it, do not over-interpret it.

Luckily, Big data does not exist, according to a post that advocates Making The Most Of Small Data. It is a bit like teenage sex:
“Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

In the What exactly is Big Data StackExchange question, I have listed all 14 "V" that could describe big data, including... vacuity. They are: Validity, Value, Variability/Variance, Variety, Velocity, Veracity/Veraciousness, Viability, Virtuality, Visualization, Volatility, Volume and Vacuity.

August 28, 2015

Hugo Steinhaus, or K-means clustering in French

Kernel clustering
[Modern transcription of the Hugo Steinhaus paper in 1956 (in French), at the source of k-means clustering algorithms, published first in a french-written post]

Data clustering, or clustering analysis, belongs to statistical data analysis methods. It aims at forming groups of objects that are similar in some way. Those groups are named clusters. The word cluster is related to clot, a thick mass of coagulated liquid or of material stuck together.

The whole set of objects contains heterogeneous data, that ought to be grouped into subsets possessing a greater inner homogeneity. Such methods rely on similarity criteria or proximity measures. They are related to classification, machine learning, segmentation, pattern recognition, and have applications ranging from image processing to bioinformatics.
One of the most popular clustering methods is known as K-means (k-moyennes in French), with a variation called dynamic clustering (beautifully called nuées dynamiques in French; for an application in biology: Kinetic transcriptome analysis reveals an essentially intact induction system in a cellulase hyper-producer Trichoderma reesei strain, Dante Poggi et al., Biotechnology and Biofuels, 2014).

A history of K-means can be found in Data Clustering: 50 Years Beyond K-Means, Anil K. Jain, Pattern recognition letters, 2010. Other historic bits can be found in Origins and extensions of the k-means algorithm in cluster analysis, Hans-Hermann Bock, Electronic Journ@l for History of Probability and Statistics, 2008. This algorithm is deeply linked to the Lloyd-Max algorithm, developed by Lloyd in 1957, and rediscovered by Max three years later. It is useful for optimal scalar quantizer design.
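
As a minimal sketch of the link between K-means and Lloyd-Max quantizer design, the following alternates the two classical steps on one-dimensional samples: assign each sample to its nearest reproduction level, then move each level to the mean of its cell. The function name lloyd_max, the initial levels and the toy samples are assumptions made for illustration; this is not a transcription of any of the cited papers.

```python
def lloyd_max(samples, levels, iterations=50):
    """One-dimensional K-means (Lloyd iteration) on a list of samples.

    `levels` holds the initial reproduction levels (the cluster centers).
    """
    levels = sorted(levels)
    for _ in range(iterations):
        # Assignment step: each sample joins the cell of its nearest level.
        cells = [[] for _ in levels]
        for s in samples:
            nearest = min(range(len(levels)), key=lambda i: (s - levels[i]) ** 2)
            cells[nearest].append(s)
        # Update step: each level moves to the mean of its cell
        # (an empty cell keeps its previous level).
        updated = [sum(c) / len(c) if c else lv for c, lv in zip(cells, levels)]
        if updated == levels:
            break  # fixed point reached
        levels = updated
    return levels


# Toy data (assumed): two well-separated groups around 0.1 and 1.0.
samples = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
quantizer_levels = lloyd_max(samples, levels=[0.0, 0.5])
```

On this toy example the two levels settle near the two group means. In general the iteration only reaches a local optimum that depends on the initial levels.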

Sur la division des corps matériels en parties (pdf)
The K-means technique is a little older. It was published in French by Hugo Steinhaus in 1956 in the Bulletin de l’Académie Polonaise des Sciences (Bulletin of the Polish Academy of Sciences). Hugo Steinhaus (1887-1972) was a Polish mathematician, sometimes known as the discoverer of Stefan Banach. He contributed to numerous branches of mathematics, and is considered an early founder of probability theory.

He also contributed to applied mathematics, working jointly with engineers, biologists, physicians, economists, geologists, and "even lawyers". Lacking trustworthy information during World War II, he invented a statistical tool to estimate German losses, using obituary notices of German soldiers killed on the front. He notably used the mention that the soldier killed was the first, second or third child of a family. He is thus a precursor of data science.
This paper is called "Sur la division des corps matériels en parties" (On the division of material bodies into parts). It is the first to explicitly formulate, in finite dimension, the principle of k-means clustering. It is much more constructive than the paradoxical Banach-Tarski theorem, which deals with cutting a ball into two balls of double total volume. The paper is pleasant to read, and evokes practical uses ranging from type classification in anthropology to the normalization of industrial objects.

Being in French, in a journal whose archives cannot be accessed easily, it has not been read as much as it deserved. Here it is, transduced by Maciej Denkowski, transmitted by Jérôme Bolte, and transcribed in LaTeX by myself, with some effort to preserve the original typesetting and composition.

  Title                    = {Sur la division des corps mat\'eriels en parties},
  Author                   = {Steinhaus, H.},
 File                     = {Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means.pdf:Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means:PDF},
  Journal                  = {Bulletin de l’Acad\'emie Polonaise des Sciences},
  Number                   = {12},
  Pages                    = {801--804},
  Volume                   = {Cl. {III} --- Vol. {IV}},
  Year                     = {1956},

  Owner                    = {duvall},
  Timestamp                = {2015.}

Hugo Steinhaus: k-means classification, dynamic clustering

Kernel clustering
[Making available the 1956 paper by Hugo Steinhaus, at the origin of the k-means clustering algorithm (available in English)]

Data clustering (or clustering analysis) is a "statistical" data analysis method that aims at grouping, within a set of heterogeneous data, subsets of these data into more homogeneous clusters or bundles. Each subset should thus exhibit similar characteristics, quantified by similarity criteria or various proximity measures. These techniques belong to the families of classification, machine learning or segmentation, and are used in a phenomenal number of applications, from image processing to bioinformatics.

One of the most popular clustering or aggregation methods is k-means (k-moyennes in French), a combinatorial optimization problem, one version of which bears the pretty French name of nuées dynamiques (for an application I am interested in: Kinetic transcriptome analysis reveals an essentially intact induction system in a cellulase hyper-producer Trichoderma reesei strain, Dante Poggi et al., Biotechnology and Biofuels, 2014).

A history of k-means is available in Data Clustering: 50 Years Beyond K-Means, Anil K. Jain, Pattern recognition letters, 2010. Another point of view is given in Origins and extensions of the k-means algorithm in cluster analysis, Hans-Hermann Bock, Journ@l Electronique d'Histoire des Probabilités et de la Statistique. This algorithm is deeply linked to the so-called Lloyd-Max algorithm, developed by Lloyd in 1957 and rediscovered by Max three years later. It notably allows the design of an optimal scalar quantizer.

Sur la division des corps matériels en parties (pdf)
This technique nevertheless has a slightly earlier source, published in French by Hugo Steinhaus in 1956 in the Bulletin de l’Académie Polonaise des Sciences. Hugo Steinhaus (1887-1972) was a Polish mathematician who contributed to numerous branches of mathematics, and is considered one of the precursors of probability theory. He also worked in applied mathematics, collaborating with engineers, geologists, economists, physicists and biologists. Lacking reliable information on the course of World War II, he "invented" a statistical tool to estimate German losses, using sporadic death notices, starting from a computation of the relative frequency of obituaries of dead soldiers mentioning whether they were the first, second, third, etc. son of a family. He is thus a precursor of data science.

This paper is entitled "Sur la division des corps matériels en parties" (On the division of material bodies into parts), and is the first to formulate explicitly, in finite dimension, the k-means clustering problem. It is thus more constructive than the paradoxical Banach-Tarski theorem, which deals with cutting a ball into two balls of double total volume. It is pleasantly written, and aims at practical uses ranging from the classification of types in anthropology to the normalization of industrial objects.

Being in French, in a journal whose archives are scarcely available online, this paper has not been read as much as it deserved. Here it is, transduced by Maciej Denkowski, transmitted by Jérôme Bolte, and transcribed in LaTeX by yours truly, with a major effort to preserve the original pagination.

  Title                    = {Sur la division des corps mat\'eriels en parties},
  Author                   = {Steinhaus, H.},
 File                     = {Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means.pdf:Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means:PDF},
  Journal                  = {Bulletin de l’Acad\'emie Polonaise des Sciences},
  Number                   = {12},
  Pages                    = {801--804},
  Volume                   = {Cl. {III} --- Vol. {IV}},
  Year                     = {1956},

  Owner                    = {duvall},
  Timestamp                = {2015.}

An English version of this post is entitled: Hugo Steinhaus, or K-means clustering in French.

July 8, 2015

Sparse seismic data restoration: a PhD defense

Smoothed $\ell_1/\ell_2$ function for a sparse $\ell_0$ surrogate
Mai Quyen PHAM has defended her PhD thesis on July 15th, 2015 at 10.00 am, on the topic of "Seismic wave field restoration using sparse representations and quantitative analysis" (manuscript in pdf), at Université Paris-Est, bâtiment Copernic, amphithéâtre Maurice Gross, 5 boulevard Descartes (RER A, Noisy-Champs), 77420 Champs-sur-Marne.

Its focus is twofold:
1) sparse adaptive filtering with approximate templates in redundant and geometric wavelet frames (akin to echo cancellation in speech),
2) sparse blind deconvolution for parsimonious reflectivity signals with l1/l2 norm ratio penalty

This work has notably been published in two journal papers:
*Euclid in a Taxicab: Sparse Blind Deconvolution with Smoothed l_1/l_2
Audrey Repetti, Mai Quyen Pham, Laurent Duval, Émilie Chouzenoux, Jean-Christophe Pesquet
IEEE Signal Processing Letters, May 2015, Volume 22, Number 5, pages 539-543.
Abstract: The l1/l2 ratio regularization function has shown good performance for retrieving sparse signals in a number of recent works, in the context of blind deconvolution. Indeed, it benefits from a scale invariance property much desirable in the blind context. However, the l1/l2 function raises some difficulties when solving the nonconvex and nonsmooth minimization problems resulting from the use of such a penalty term in current restoration methods. In this paper, we propose a new penalty based on a smooth approximation to the l1/l2 function. In addition, we develop a proximal-based algorithm to solve variational problems involving this function and we derive theoretical convergence results. We demonstrate the effectiveness of our method through a comparison with a recent alternating optimization strategy dealing with the exact l1/l2 term, on an application to seismic data blind deconvolution.
*A Primal-Dual Proximal Algorithm for Sparse Template-Based Adaptive Filtering: Application to Seismic Multiple Removal
Mai-Quyen Pham, Laurent Duval, Caroline Chaux, Jean-Christophe Pesquet
IEEE Transactions on Signal Processing, August 2014, Volume 62, Issue 16, pages 4256-4269
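
The scale invariance of the l1/l2 ratio mentioned in the first abstract above can be checked numerically with a short sketch. The smoothed variant below is only one plausible way to smooth the ratio, written here as an assumption: the parameters alpha, beta and eta, their default values, and the exact functional form are illustrative and should be checked against the paper before any reuse.

```python
import math


def l1_over_l2(x):
    """Exact l1/l2 ratio: small for sparse vectors, invariant to rescaling."""
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return l1 / l2  # undefined (division by zero) for the all-zero vector


def smoothed_log_l1_over_l2(x, alpha=1e-3, beta=1e-3, eta=1e-3):
    """Illustrative smooth surrogate (assumed form, hypothetical parameters):
    a smoothed l1 term over a smoothed l2 term, taken in logarithm."""
    l1_smooth = sum(math.sqrt(v * v + alpha * alpha) - alpha for v in x)
    l2_smooth = math.sqrt(sum(v * v for v in x) + eta * eta)
    return math.log((l1_smooth + beta) / l2_smooth)


sparse = [0.0, 0.0, 3.0, 0.0]   # a single spike: ratio equals 1
dense = [1.0, 1.0, 1.0, 1.0]    # fully spread out: ratio equals sqrt(4) = 2
ratio_sparse = l1_over_l2(sparse)
ratio_dense = l1_over_l2(dense)
ratio_scaled = l1_over_l2([10.0 * v for v in dense])  # same as ratio_dense
```

Unlike the exact ratio, the smoothed surrogate stays finite and differentiable at the origin, which is the kind of property that makes gradient-based or proximal schemes easier to apply.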

PhD Committee:
Reporter Jean-Francois Aujol Prof. Université de Bordeaux
Reporter Mauricio D Sacchi Prof. University of Alberta
Examiner Jérôme Mars Prof. Grenoble-INP
Examiner Mai K. Nguyen Prof. Université de Cergy-Pontoise
PhD supervisor Jean-Christophe Pesquet Prof. Université Paris-Est Marne-la-Vallée
PhD co-supervisor Laurent Duval Dr. IFP Energies nouvelles (IFPEN)
PhD co-supervisor Caroline Chaux CNRS researcher, I2M, Aix-Marseille Université

Map to Copernic building

Abstract: This thesis deals with two different problems within the framework of convex and non convex optimization. The first one is an application to multiple removal in seismic data with adaptive filters and the second one is an application to blind deconvolution problem that produces characteristics closest to the Earth layers. More precisely: Unveiling meaningful geophysical information from seismic data requires to deal with both random and structured “noises”. As their amplitude may be greater than signals of interest (primaries), additional prior information is especially important in performing efficient signal separation. We address here the problem of multiple reflections, caused by wave-field bouncing between layers. Since only approximate models of these phenomena are available, we propose a flexible framework for time-varying adaptive filtering of seismic signals, using sparse representations, based on inaccurate templates. We recast the joint estimation of adaptive filters and primaries in a new convex variational formulation. This approach allows us to incorporate plausible knowledge about noise statistics, data sparsity and slow filter variation in parsimony-promoting wavelet transforms. The designed primal-dual algorithm solves a constrained minimization problem that alleviates standard regularization issues in finding hyperparameters. The approach demonstrates significantly good performance in low signal-to-noise ratio conditions, both for simulated and real field seismic data. In seismic exploration, a seismic signal (e.g. primary signal) is often represented as the results of a convolution between the “seismic wavelet” and the reflectivity series. The second goal of this thesis is to deconvolve them from the seismic signal which is presented in Chapter 6. The main idea of this work is to use an additional premise that the reflections occur as sparsely restricted, for which a study on the “sparsity measure” is considered. 
Some well known methods that fall in this category are proposed such as (Sacchi et al., 1994; Sacchi, 1997). We propose a new penalty based on a smooth approximation of the ℓ1/ℓ2 function that makes a difficult nonconvex minimization problem. We develop a proximal-based algorithm to solve variational problems involving this function and we derive theoretical convergence results. We demonstrate the effectiveness of our method through a comparison with a recent alternating optimization strategy dealing with the exact ℓ1/ℓ2 term. 

June 2, 2015

Facebook FAIR(ies) in Paris

Paris is buzzing about the announcement of Facebook's new European research center in Paris. This was already known around April/May. Six "fairies" are supposed to have joined Facebook FAIR, or Facebook Artificial Intelligence (AI) Research center. They will do some magical data science (or dédoménologie), under the guidance of Yann LeCun. He was the host of the day "Data Science and Massive Data Analysis" on the campus of ESIEE Paris, Ecole des Ponts ParisTech and Université Paris-Est Marne-la-Vallée (Paris at large) on June 12th 2014. It's not eerie, it's ESIEE.

After Menlo Park and New York, this center, the third overall and the first outside the US, has been attracted to the City of Lights, where they will bring their TORCH for our enlightenment. They were attracted, like moths around a flame, by the local talents and excellent education facilities in artificial intelligence and computer science, and probably by substantial financial incentives. They are also close to Google's headquarters in Paris. Google's HQ inauguration enjoyed the presence of former rightist president Nicolas Sarkozy. Will there be a celebrity appearance by leftist François Hollande? Internet giants (GAFA, ABTX) are waging wars over our data. Remember Bertrand Russell, for whom "War does not determine who is right - only who is left".

Camille Couprie (NYU, on leave from IFPEN)
Florent Perronnin (form. Panasonic and Xerox)
Hervé Jégou (INRIA)
Holger Schwenk (DeepLingo, université du Maine). Nota: the Facebook link is not active (201506032340), and points to G. Synnaeve's as if Holger was not in charge yet. His other page.
Gabriel Synnaeve (Ecole normale supérieure)
So a large portion of the fairies are male, which may not be fair to the magical creatures, knowing that female fairies usually have more power. Two handfuls of additional talents should join by the end of the year, and 25 to 50 in the years to come. Figures vary with sources, depending on whether permanent positions or PhD students and postdocs are counted. That is big data.

Google news full coverage
Facebook opens an artificial intelligence research lab in Paris
Intelligence artificielle : Facebook écrit une partie du futur à Paris.
Facebook mise sur Paris 
Intelligence Artificielle : Facebook ouvre un centre de recherche à Paris

April 10, 2015

Dedomenology: the science of data processing (signals, images, etc.)

[Where the neologism dédoménologie (dedomenology) is proposed to designate the technique, the practice, the science of signal processing and image analysis, at the heart of the nascent field of data science, by way of Euclid]
It is words that exist; what has no name does not exist. The word light exists, light does not exist. (Francis Picabia, or Francis-Marie Martinez de Picabia, Écrits)
Which image analyst, which signal processor has never had trouble describing their job? Not in detail, of course:
Actually, I am interested in the cyclostationary properties of wavelet coefficients of fractional Brownian motions in texture images. Well, when I say cyclostationary, I mean periodically correlated, mind you, I am not talking about almost periodic processes.

Signal, image or data processing very often calls for periphrases. Some telling examples:
You know Photoshop?
Very bad example. The listener quickly pictures a day of "work" spent moving the mouse around to change a hue or select objects with the magic wand.
You know mp3? The JPEG format?
And off we go into sordid details about redundant digital data and quantization, from which one tries to extract only the useful perceptible part. With the risk of unwelcome remarks:
The sound of mp3, I think it's rubbish compared to vinyl! [We're not there yet...]
The signal analyst, the image processor, what noise does he make? It is often a story of noise, by the way: a clean signal, a sharp image, obvious data, nobody ever asks us to analyze or process those. It is a bit like doctors in the West: they see few healthy people, apart from hypochondriacs, who of course have a little health problem, at another level. Signal processing and data analysis are not taught in the early grades, at least not directly. They are not part of the baccalauréat. So most people never had to endure this subject and, having little contact with experimental measurements, few have had to admit that they did not know how to process dearly acquired samples. Yet digital data is at the heart of the real world, and it is quite possible that this will only get worse.
Image processing for the science of materials physics

Ah, mathematics, we get it. Physics, biology, more or less clear. Chemistry, obvious. From afar, one senses that it is complicated, but that those people understand each other. There are boxes for them at the Académie des sciences. Sometimes they get Nobel prizes. Not in mathematics, but everybody knows there is a kind of Nobel prize for mathematics, the Fields medal. And yet the meaning of these words is poorly known: mathematics comes from Greek through Latin, with an original sense of "science, knowledge". Physics comes from the "knowledge of nature". In biology, there is the living, organic ("bio") products... Chemistry is a bit more complicated (mixtures, black magic, Arabic and Greek). But in statistics, one finds the trace of the State: the science whose purpose is to make known the extent, the population, the agricultural and industrial resources of a State. Electronics: one can picture the little electrons moving in the wires.

Two-dimensional chromatographic signal analysis
And yet scientific reality is far more scattered: a biologist who studies a bacterium often has little to share with a specialist in fungi. Nothing in common, like a dolphin specialist and an earthworm farmer. Cédric Villani, who has been making a great effort at popularization lately (Les mathématiques sont un art comme les autres), and who will be a guest at the next conference of the French-speaking signal and image processing community and research group (GRETSI 2015) in Lyon, shows a natural caution when asked about mathematics other than his own. Particle physicists and fluid mechanicists rarely meet. The map of science is far more complex than taxpayers generally think.

A map of science as a graph

Like doctors who do not only treat their own aches, signal processors often treat the problems of other disciplines: images for materials physics, chromatography signals for analytical chemists, seismic reflections for geophysicists. Signal processors and image analysts must understand a bit of these disciplines. They must know how to use various techniques (spectral analysis, statistics, linear algebra, parametric modeling, optimization, normalization) and put them into practice through algorithms and programs, even hardware.
Blind deconvolution of seismic reflections

So there is practice (praxis) and technique (tekné) in this composite discipline, at the border of other sciences and disciplines. Experimental measurements had to exist before they could be processed. Personally, I feel like a practitioner of certain types of data. I know their basic pathologies, and I have a few treatments that sometimes work. Data technician, sample practitioner: in French, that would yield professions such as infopracteur/infopractrice, infopraticien/infopraticienne or infotechnicien/infotechnicienne. A priori, I like the idea of restoring the nobility of the terms praxis and tekné. So "infopraxie" or "infotechnie" as the name of the discipline? But -technie, -praxie sounds like a bonesetter or a car mechanic. And info is too "computer science". Datalogie (datalogy) could have been nice: it is a composite Latin-Hellenic root, which well reflects the interdisciplinary aspect of this science. Unfortunately, it is already taken, by computer science again. So something more serious is needed.

And by a turn of serendipity, I stumbled upon a text by Euclid, Dedomena, whose title was translated into Latin as "data". It is a text on the nature and implications of information given to solve a geometric problem. There we are. It defines how an object is described in form, in position, in magnitude. These criteria are very clearly those we extract daily from digital data. And Euclid makes for a fine tribute: Euclidean spaces underlie numerous algorithmic concepts, and minimizing the Euclidean distance is our daily bread. Just watch this interlude by Martin Vetterli, From Euclid to Hilbert, Digital Signal Processing.

So I propose to rename signal or image processing, and data processing in general, as dédoménologie (dedomenology), or the art of those who talk about, who analyze data. This places signal and image processing directly at the center of data science in general. It would look great on a business card, wouldn't it? A French mnemonic to remember this term: des domaines aux logis (from domains to dwellings).

From there, the vocabulary can be extended to dédoménotaxie, for data classification and sorting operations (to put Euclid in a taxi, it is here), to dédoménotropie for data flows, and to dédoménomancie for prediction aspects (in the manner of predictive analytics). Dédoménonomie, in the manner of astronomy, might be hard to bear. So much for the science. As for societal aspects, the trend is to fear the use of data to manipulate opinion and an increased governance by data: dédoménodoxie and dédoménocratie. So beware of dédoménophobie.

P.S.: after further research, in English, the word dedomenology to designate the concept of data science seems to have already been put forward, without much success.
P.S. 2: I am told (Hervé T.) that the homophony with demonology is suspicious. One: the devil is in the details, in data science too. Two: Maxwell's demon, with no other quality than that of a filter, is the paragon of sorting the wheat from the chaff. Three: in computer science, a daemon is:
A daemon (pronounced /ˈdiːmən/ or /ˈdeɪmən/, from Greek δαιμων) or demon designates a type of computer program, a process or a set of processes that runs in the background rather than under the direct control of a user.
Enough said.

February 16, 2015

Let data (science) speak

The Doctoral College at IFPEN (IFP Energies nouvelles) organizes seminars for PhD students. The next one on 30 March 2015 is about Data Science: "Faire parler les mesures, de la capture (acquisition) aux premiers mots (apprentissage) : la science des données, une discipline émergente" or "Let data speak: from its capture (acquisition) to its  first words (learning): data science, an emerging discipline". 

The invitees are Igor Carron (Nuit Blanche), Laurent Daudet (Institut Langevin) and Stéphane Mallat (École normale supérieure). Abstracts and slides follow, with two musical interludes:
For those who could not attend, or for a second shot in video:
Laurent Daudet
Stéphane Mallat :
Laurent Duval, Aurélie Pirayre, IFPEN

*Title: Introduction to data science
*Abstract: Data science (or dedomenology) is an emerging discipline: the term "data science" appeared in 2001 and designates a corpus of techniques borrowed from information sciences, mathematics, statistics, computer science, visualization and machine learning. It aims at extracting knowledge from experimental data, potentially complex, huge or heterogeneous, by revealing weakly explicit patterns or structures. This field is notably driven by the GAFA companies (Google, Apple, Facebook, Amazon), and plays an increasing role in biology, medicine, social sciences, astronomy and physics.

The talks shed light on some facets of this discipline: how to exploit a form of randomness in measurements, how to see through paint, how to learn to classify with non-linearities?

Igor Carron

*Title: "Ca va être compliqué": Islands of knowledge, Mathematician-Pirates and the Great Convergence
*Abstract: In this talk, we will survey the different techniques that have led to recent changes in the way we do sensing and how to make sense of that information. In particular, we will talk about problem complexity and attendant algorithms, compressive sensing, advanced matrix factorization, sensing hardware and machine learning and how all these seemingly unrelated issues are of importance to the practising engineer. In particular, we'll draw some parallel between some of the techniques currently used in machine learning as used by internet companies and the upcoming convergence that will occur in many fields of Engineering and Science as a result.

Laurent Daudet, Institut Langevin, Ondes et images

*Title: Compressed Sensing Imaging through multiply scattering materials (Un imageur compressé utilisant les milieux multiplement diffusants)
*Abstract: The recent theory of compressive sensing leverages upon the structure of signals to acquire them with much fewer measurements than was previously thought necessary, and certainly well below the traditional Nyquist-Shannon sampling rate. However, most implementations developed to take advantage of this framework revolve around controlling the measurements with carefully engineered material or acquisition sequences. Instead, we use the natural randomness of wave propagation through multiply scattering media as an optimal and instantaneous compressive imaging mechanism. Waves reflected from an object are detected after propagation through a well-characterized complex medium. Each local measurement thus contains global information about the object, yielding a purely analog compressive sensing method. We experimentally demonstrate the effectiveness of the proposed approach for optical imaging by using a 300-micrometer thick layer of white paint as the compressive imaging device. Scattering media are thus promising candidates for designing efficient and compact compressive imagers.
(joint work with I. Carron, G. Chardon, A. Drémeau, S. Gigan, O. Katz, F. Krzakala, G. Lerosey, A. Liutkus, D. Martina, S. Popoff)
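
To make the idea of recovering a sparse object from a few random measurements more concrete, here is a minimal sketch in plain Python. It is not the optical setup or the reconstruction method of the talk: the random Gaussian matrix `A` merely stands in for the scattering medium, and plain matching pursuit (a simple greedy algorithm) is swapped in as the decoder. All names and sizes (`matching_pursuit`, `n`, `m`, `k`) are assumptions chosen for illustration.

```python
import math
import random


def matching_pursuit(A, y, n_iter):
    """Greedy sparse approximation of y with columns of A (A: list of m rows)."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    norms2 = [sum(v * v for v in c) for c in cols]
    x = [0.0] * n
    r = list(y)  # residual, starts as the measurements themselves
    for _ in range(n_iter):
        # Correlate the residual with every column (atom) of A.
        corr = [sum(c[i] * r[i] for i in range(m)) for c in cols]
        # Pick the atom that is most correlated after normalization.
        j = max(range(n), key=lambda q: abs(corr[q]) / math.sqrt(norms2[q]))
        step = corr[j] / norms2[j]
        x[j] += step
        # Remove the contribution of the selected atom from the residual.
        r = [r[i] - step * cols[j][i] for i in range(m)]
    return x, r


def norm(v):
    return math.sqrt(sum(t * t for t in v))


rng = random.Random(0)
n, m, k = 64, 24, 3  # signal length, number of measurements, sparsity (assumed)

# Random measurement matrix: a stand-in for the "natural randomness" of the medium.
A = [[rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(n)] for _ in range(m)]

# A k-sparse object and its m < n measurements.
x_true = [0.0] * n
for j in rng.sample(range(n), k):
    x_true[j] = rng.choice([-1.0, 1.0])
y = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(m)]

x_hat, residual = matching_pursuit(A, y, n_iter=20)
print("measurement norm:", norm(y), "residual norm:", norm(residual))
```

Each measurement mixes all entries of the object, which is the "global information in each local measurement" mentioned in the abstract. Matching pursuit is only the simplest possible decoder; whether it finds the exact support depends on `m`, `n` and `k`.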

Stéphane Mallat, École Normale Supérieure

*Title: Learning Signals, Images and Physics with Deep Neural Networks
*Abstract: Big data, huge memory and computational capacity are opening a scientific world which did not seem reachable just few years ago. Besides brute-force computational power, algorithms are evolving quickly. In particular, deep neural networks provide impressive classification results for many types of signals, images and data sets. It is thus time to wonder what type of information is extracted by these network architectures, and why they work so well.

Learning does not seem to be the key element of this story. Multirate filter banks together with non-linearities can compute multiscale invariants, which appear to provide stable representations of complex geometric structures and random processes.  This will be illustrated through audio and image classification problems. We also show that such architectures can learn complex physical functionals, such as quantum chemistry energies.
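
As a rough illustration of "filter banks together with non-linearities" producing invariants, the following plain-Python sketch cascades wavelet-like filtering, a modulus and an averaging. It is a toy under stated assumptions, not the architecture of the talk: Haar-like filters replace the usual wavelets, convolutions are circular, and the averaging is global. The names `haar_wavelet`, `circular_conv` and `scattering` are made up for this example.

```python
import math


def haar_wavelet(scale):
    """Zero-mean Haar-like filter of length 2 * scale (an assumed, simplistic wavelet)."""
    w = 1.0 / (2 * scale)
    return [w] * scale + [-w] * scale


def circular_conv(x, h):
    """Circular convolution of signal x with filter h."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(len(h))) for i in range(n)]


def scattering(x, scales=(1, 2, 4)):
    """Order 0, 1 and 2 coefficients: filter, take the modulus, average globally."""
    coeffs = [sum(x) / len(x)]  # order 0: plain average
    for s1 in scales:
        u1 = [abs(v) for v in circular_conv(x, haar_wavelet(s1))]
        coeffs.append(sum(u1) / len(u1))  # order 1
        for s2 in scales:
            if s2 > s1:  # only coarser scales at the second layer
                u2 = [abs(v) for v in circular_conv(u1, haar_wavelet(s2))]
                coeffs.append(sum(u2) / len(u2))  # order 2
    return coeffs


# A test signal and a circularly shifted copy of it.
signal = [math.sin(0.3 * i) + (1.0 if i == 10 else 0.0) for i in range(32)]
shifted = signal[5:] + signal[:5]

s_signal = scattering(signal)
s_shifted = scattering(shifted)
print(s_signal)
print(s_shifted)
```

Because the convolutions are circular and the averaging is global, both outputs agree up to floating-point rounding: no learning is involved, yet the descriptors are invariant to translation.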

December 26, 2014

The law of excessive gardening in education

Once upon a time, the term proletarian (or its Latin equivalent) referred to a Roman citizen so poor that his children were considered his only property. The term proles stands here for descendants (or litter). The term evolved through Marxism to denote workers without capital or means of production, who must sell their own labour power. It can be further extended to the loss of knowledge of the production tools. An example lies in the comparison between the craftsman or artisan, who masters his tools, may even be able to repair them, and manages a series of processes, and the factory worker, whose work has been taylorized and who lacks knowledge of the overall meaning of the chain, or of the functioning of the robots he "controls".

Dilbert: Wally, the Boss and the screen-saver effect.
The concept of proletarianization has been used more recently by Bernard Stiegler (Ars Industrialis) to describe a pauperization, not in terms of wealth, but in terms of the ability to do, to make and to live: an externalization, and therefore a loss, of knowledge and memory.

I believe proletarianization, in that sense, is rampant, and pervades all strata of society: at every level of work, at least in institutions or companies, there is a loss of sense and understanding about the way stuff works. A corollary to the Peter principle. To make a long story short, strategic visions have progressively been replaced by indicator-based management. In science evaluation, the infamous impact factor and h-index are examples of such indicators. Stock-value and quality-based management are other examples.
Peter's principle: (success>advancement)^n>failure

The peak of the spread of proletarianization was exemplified by Alan Greenspan, the former chairman of the Federal Reserve, who once told CNBC that he did not fully understand the scope of the subprime mortgage market until well into 2005 and could not make sense of the complex derivative products created out of mortgages. Alan Greenspan, the chief banker of the world, was mystified...

So what's maths got to do with it?

About two years ago, in April 2012, there was a conference called Fixing mathematical education. There I learned the effective law of excessive learning in mathematics from Alexandre V. Borovik:
To be able to use maths at certain level it is necessary to learn it at the next level
A week ago, Alexandre Borovik sent me a link to his preprint, Calling a spade a spade: Mathematics in the new pattern of division of labour:
The growing disconnection of the majority of population from mathematics is becoming a phenomenon that is increasingly difficult to ignore. This paper attempts to point to deeper roots of this cultural and social phenomenon. It concentrates on mathematics education, as the most important and better documented area of interaction of mathematics with the rest of human culture.
    I argue that new patterns of division of labour have dramatically changed the nature and role of mathematical skills needed for the labour force and correspondingly changed the place of mathematics in popular culture and in the mainstream education. The forces that drive these changes come from the tension between the ever deepening specialisation of labour and ever increasing length of specialised training required for jobs at the increasingly sharp cutting edge of technology.
    Unfortunately these deeper socio-economic origins of the current systemic crisis of mathematics education are not clearly spelt out, neither in cultural studies nor, even more worryingly, in the education policy discourse; at the best, they are only euphemistically hinted at.
    This paper is an attempt to describe the socio-economic landscape of mathematics education without resorting to euphemisms. 
The pdf is here. Alexandre claims that "The communist block was destroyed by a simple sentiment: If they think they pay me let them think I am working. Mathematics education in the West is being destroyed by a quiet thought (or even a subconscious impulse): If they think they teach me something useful, let them think I am learning." My experience as a part-time teacher is that it is increasingly difficult to have students learn things. Most of them are especially disgusted by the concept of "learning by heart" (wait, everything is on the Internet, isn't it?), while they could just try to understand.

For a long time, people could proudly claim: "I have never been good at mathematics, but I live happily without it". The 20th century was rich in discoveries that have pervaded society (basically, from the transistor to the mp3). The pitfall is that many users have no idea how these technologies work, and happily put their agenda, thoughts, entertainment time and finally life (pictures, films) in those hands.

The world is seemingly entering an era where data becomes important (a new gold, a novel oil?), and big data or data science are slowly emerging as potentially disruptive technologies. Meanwhile, Alexandre states that "the West is losing the ability to produce competitively educated workers for mathematically intensive industries". It is high time to rethink (mathematical) education, as it may change much more slowly than technologies currently evolve.

Nota bene: and Igor, just for you, there is a phase transition ("The crystallisation of a mathematical concept (say, of a fraction), in a child's mind could be like a phase transition in a crystal") and an aha moment ("An aha! moment is a sudden jump to another level of abstraction")

December 20, 2014

Learning meets compression: small-data-science internship (IFPEN)

Internship subject: [french/english]
Many experimental designs acquire continuous or burst signals or images, which are characteristic of a specific phenomenon. Examples at IFPEN include seismic data/images, NDT/NDE acoustic emissions (corrosion, battery diagnosis), engine test benches (cylinder pressure data, fast camera) and high-throughput screening in chemistry. Very often, such data is analyzed with standardized, a priori indices. Comparisons between different experiments (difference- or classification-based) are often based on the same indices, without going back to the initial measurements.

The increasing data volume, the variability in sensor and sampling, the possibility of different pre-processing yield two problems: the management and access to data (« big data ») and their optimal exploitation by dimension reduction methods, supervised or unsupervised learning (« data science »). This project aims at the analysis of the possibility of a joint compressed representation of data and the extraction of pertinent indicators, at different characteristic scales, and the relative impact of the first aspect (lossy compression degradation) over the second aspect (precision and robustness of extracted feature indicators).

The internship has a dual goal. The first aspect deals with scientific research on sparse signal/image representations with convolution networks based on multiscale wavelet techniques, called scattering networks. Their descriptors (or footprints) possess fine translation, rotation and scale invariance. Those descriptors will be employed for classification and detection. The second aspect addresses the impact of lossy compression on the preceding results, and the development of novel sparse representations for joint compression and learning.

J. Bruna, S. Mallat, Invariant Scattering Convolution Networks, IEEE Trans. on Patt. Anal. and Mach. Int., 2013
L. Jacques, L. Duval, C. Chaux, G. Peyré, A Panorama on Multiscale Geometric Representations, Intertwining Spatial, Directional and Frequency Selectivity, Signal Processing, 2011
C. Couprie, C. Farabet, L. Najman, Y. LeCun, Convolutional Nets and Watershed Cuts for Real-Time Semantic Labeling of RGBD Videos, Journal of Machine Learning Research, 2014

A PhD thesis (Characteristic fingerprint computation and storage for high-throughput data flows and their on-line analysis, with J.-C. Pesquet, Univ. Paris-Est) is proposed, starting September 2015.

Information update: http://www.laurent-duval.eu/lcd-2015-intern-learning-compression.html

November 29, 2014

Haiku (free-form): semantic and general

La trahison des images (The Treachery of Images), René Magritte
People who confuse
The map and the territory
Tire me a little

Big data and Data science: LIX colloquium 2014

Sketch of the Hype Cycle for Emerging Technologies
Data science and Big data are two concepts at the tip of the tongue and the top of the Gartner Hype Cycle for Emerging Technologies. Close to the peak of inflated expectations. The Data science LIX colloquium 2014 at Ecole Polytechnique, organized by Michalis Vazirgiannis from DaSciM, was held yesterday on the Plateau de Saclay, which may have prevented some from attending the event. Fortunately, it was webcast.

The talks covered a wide range of topics pertaining to Datacy (data + literacy). The keynote on community detection in graphs (with survey) promoted local optimization (OSLOM, with order statistics). It was said that "We should define validation procedures even before starting developing algorithms", including negative tests: on random graphs, a clustering method should find no prominent cluster (except the whole graph), in other words no signal in noise. But there was no mention of phase transitions in clustering. The variety of text data (SMS, tweets, chats, emails, news web pages, books, 100 languages, OCR and spelling mistakes) and its veracity were questioned, with Facebook estimating that between 5% and 11% of accounts are fake, and 68.8 percent of the Internet being spam (how did they get the three-figure precision?). News-hungry people would be interested in EMM News, a media watch tool aggregating 10000 RSS feeds and 4000 news aggregators. With all these sources, some communities are concerned with virtual ghost town effects, and look for ways to spark discussions (retweets and the like) to keep social activity alive. The choice between flat approaches and hierarchical grouping is still a debated challenge in large-scale classification and web-scale taxonomies. Potentially novel graph structures (hypernode graphs, related to hypergraphs or maybe n-polygraphs) with convex stability and spectral theory were also proposed in the first part of the colloquium.

Big Data Crap Gap: the space between all and relevant data
While the Paris-Saclay Center for Data Science has opened its website, the issue of unbalanced data was exposed around the HiggsML data-driven challenge: fewer than 100 Higgs bosons are expected to be detected out of 10^10 events yearly. Big-data analogs of the Greek Pythia, as well as efficient indexing and mining methods, would be necessary to harness the data beast. More industry-oriented talks, given by AXA, Amazon and Google representatives, concluded the colloquium. I could not attend them, and left with the so-called "crap gap" in mind, i.e. the gap between Relevant Data and Big Data.
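
A back-of-the-envelope computation shows why such unbalanced data is hard. The sketch below takes the figures quoted above (at most 100 signal cases out of 10^10) and combines them with a purely hypothetical detector: the true positive rate of 90% and the false positive rate of one in a million are assumptions for illustration, not numbers from the talk.

```python
def precision(n_signal, n_total, tpr, fpr):
    """Fraction of flagged cases that are true signal, given detector rates."""
    true_pos = tpr * n_signal
    false_pos = fpr * (n_total - n_signal)
    return true_pos / (true_pos + false_pos)


n_signal = 100    # upper bound quoted in the text
n_total = 10**10  # yearly total quoted in the text
tpr = 0.9         # hypothetical true positive rate
fpr = 1e-6        # hypothetical false positive rate

p = precision(n_signal, n_total, tpr, fpr)
print(f"precision: {p:.4%}")
```

Even with a false positive rate as low as one in a million, fewer than 1% of the flagged cases would be genuine under these assumptions: the background simply outnumbers the signal.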

Innovation driven by large data sets still requires, at least, vague goals in mind. In Latin, "Ignoranti quem portum petat nullus suus uentus (ventus) est", wrote Seneca in his 71st letter to Lucilius. A possible translation in English: "When a man does not know what harbour he is making for, no wind is the right wind". In German, "Dem weht kein Wind, der keinen Hafen hat, nach dem er segelt". And "Il n'y a point de vent favorable pour celui qui ne sait dans quel port il veut arriver" in French.

All of the information, and possibly the information you need, may be found in the following program and videos. As the videos are not split into talks, the time codes are provided, thanks to the excellent suggestion (and typo corrections) by Igor Carron.

LIX colloquium 2014 on Data Science LIVE part 1
  • 00:00:00 > 00:22:22: Introduction and program
  • 00:22:22 > 01:22:18: Keynote speech: Community detection in networks, Santo Fortunato, Aalto University
  • 01:22:18 > 01:57:30: Text and Big Data, Gregory Grefenstette, Inria Saclay - Île de France
  • 01:57:30 > 02:29:23: Accessing Information in Large Document Collections: classification in web-scale taxonomies, Eric Gaussier, Université Joseph Fourier (Grenoble I)
  • 02:29:23 > 03:01:32: Shaping Social Activity by Incentivizing Users, Manuel Gomez Rodriguez, Max Planck Institute for Software Systems
  • 03:01:32 > 03:38:00: Machine Learning on Graphs and Beyond, Marc Tommasi, Inria Lille

LIX colloquium 2014 on Data Science LIVE part 2
  • 00:00:00 > 00:33:57: Learning to discover: data science in high-energy physics and the HiggsML challenge, Balázs Kégl, CNRS
  • 00:34:11 > 01:06:15: Big Data on Big Systems: Busting a Few Myths, Peter Triantafillou, University of Glasgow
  • 01:06:15 > 01:38:29: Big Sequence Management, Themis Palpanas, Paris Descartes University

LIX colloquium 2014 on Data Science LIVE part 3 
  • 00:00:00 > 00:38:11: Understanding Videos at YouTube Scale, Richard Washington, Google
  • 00:38:11 > 01:05:48: AWS's big data/HPC innovations, Stephan Hadinger, Amazon Web Services
  • 01:05:48 > 02:02:41: Big Data in Insurance - Evolution or disruption? Stéphane Guinet, AXA 
  • 02:02:41 > 02:06:35: Closing words on a word cloud (with time, series, graph and classification are the big four)

November 1, 2014

Cédric Villani: mathematics is an art like any other (podcast)

Mathematics is an art like any other (Les mathématiques sont un art comme les autres) is a series of five interviews with Cédric Villani (professor at Université de Lyon and director of the Institut Henri Poincaré), on the programme "Un autre jour est possible" on France Culture. On poetry, music, design, street arts and cinema. This inquiring mind does a great deal, and does it well, for the popularization of mathematics and its transfer, through innovation, to related disciplines. A laudable effort. "No one can be a mathematician without having the soul of a poet", said Sophie Kowalevskaia. On 15 and 16 December 2014, the Horizon Maths forum takes place at IFP Energies nouvelles in Rueil-Malmaison (theme: "Les mathématiques se dévoilent aux industriels", mathematics unveils itself to industry); the program in pdf is here. More details follow, after the podcast break on Cédric Villani.

Session « Methods for ab initio chemistry »
  • Pascal Raybaud (IFPEN) « Numerical performance challenges for ab initio calculations in catalysis »
  • Thierry Deutsch (CEA Grenoble) « Wavelets, a flexible basis allowing fine control of accuracy and the development of order-N methods for electronic structure calculations via BigDFT »
  • Benjamin Stamm (UPMC) « A posteriori estimation for non-linear eigenvalue problems in the context of DFT methods »
  • Filippo Lipparini (Universität Mainz) « Large, polarizable QM/MM/Continuum computations: ancient wishes and recent advances »
  • Eric Cancès (Ecole des Ponts ParisTech) « Mathematical aspects of density functional theory (DFT) »
Session « Derivative-free optimization »
  • Delphine Sinoquet (IFPEN) « Applications of derivative-free optimization in the oil sector and in the field of marine renewable energies »
  • Emmanuel Vazquez (SUPELEC) « New loss functions for Bayesian optimization »
  • Serge Gratton (CERFACS) « Derivative-free optimization: stochastic algorithms and complexity »
  • Wim van Ackooij (EDF) « Optimization under probabilistic constraints and applications in energy management »
  • Marc Schoenauer (INRIA) « Comparison-based continuous optimization: surrogate models and automatic adaptation »
  • Session summary by Josselin Garnier (Université Paris Diderot)
Session « Meshes and industrial applications »
  • Jean-Marc Daniel (IFPEN) « Needs for meshing complex geological media »
  • Paul-Louis George and Houman Borouchaki (INRIA and UTT - INRIA respectively) « Overview of generic mesh generation methods and specific meshing methods in geosciences »
  • Jean-François Remacle (UCL - Rice University) « An indirect approach to hex mesh generation »
  • Thierry Coupez (Ecole Centrale de Nantes) « Implicit boundaries and anisotropic mesh adaptation »
  • Pascal Tremblay (Michelin) « The challenges of moving from hexahedral to tetrahedral meshing for industrial applications »
  • Session summary by Frédéric Hecht (UPMC)
Session « Visualization »
  • Sébastien Schneider (IFPEN) « A short introduction to visualization for geosciences at IFPEN »
  • Julien Jomier (Kitware) « Scientific Visualization with Open-Source Tools »
  • Emilie Chouzenoux (UPEM) « A random block-coordinate primal-dual proximal algorithm with application to 3D mesh denoising »
  • Jean-Daniel Fekete (INRIA) « Network visualization using adjacency matrices »
  • Marc Antonini (CNRS) « Compression and visualization of massive 3D data »
  • Session summary by Julien Tierny (CNRS - UPMC)