About
I have obtained my PhD in Computer Science at the Toulouse Institute of Computer Science Research, Université Paul Sabatier (France). My work took place under the supervision of Drs Mouna Kamel and Nathalie Aussenac-Gilles, and focused on Natural Language Processing, Document Analysis and Machine Learning fields. More particularly, I was interested in the acquisition of lexical relations using the layout and the formatting of documents. For this work, I received the ATALA Best PhD Thesis Award.
Previously, I graduated in Natural Language Processing at the Catholic University of Louvain (Belgium) in 2012. This Master's degree programme was delivered by the CENTAL laboratory.
Recent (2024)
- MM1: McKinzie, B., Gan, Z., Fauconnier, J., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Weers, F. and Belyi, A., et. al., MM1: Methods, Analysis & insights from Multimodal LLM Pre-training., Milano, ECCV 2024
- RLHF: Amirloo, E., Fauconnier, J., Roesmann, C., Kerl, C., Boney, R., Qian, Y., Wang, Z., Dehghan, A., Yang, Y., Gan, Z. and Grasch, P. Understanding Alignment in Multimodal LLMs: A Comprehensive Study., 2024.
- Instruction-Following: Qian, Y., Ye, H., Fauconnier, J., Grasch, P., Yang, Y. and Gan, Z. MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs. 2024.
Research Interests
Natural Language Processing
- Analysis: morpho-lexical analysis and syntactic parsing
- Corpora: development of annotation tools, annotation campaigns, and inter-rater agreement evaluation
- Meaning: ontology learning from text and Word embeddings
Document Analysis
- Logical modeling: links between physical layout structures and logical ones
- LR Parsing: statistical and context-free grammar parsing for building logical tree
Statistics
- Predictive modeling: supervised classification, structured prediction and unsupervised feature selection
- Descriptive analytics: statistical relationships and multivariate analysis
Works
National journal papers
- M. Kamel, B. Rothenburger, J.-P. Fauconnier. Identification de relations sémantiques portées par les structures énumératives paradigmatiques : une approche symbolique et une approche par apprentissage supervisé. Revue d'Intelligence Artificielle, Hermès Science, Numéro spécial Ingénierie des Connaissances. Nouvelles évolutions., Vol. 28, N. 2-3, p. 271-296, 2014.
International conference papers
- J.-P. Fauconnier, M. Kamel. Discovering Hypernymy Relations using Text Layout (regular paper). The Fourth Joint Conference on Lexical and Computational Semantics (*SEM 2015), Denver, Colorado, 2015. PDF
- J.-P. Fauconnier, M. Kamel, B. Rothenburger. A Supervised Machine Learning Approach for Taxonomic Relation Recognition through Non-linear Enumerative Structures (short paper). ACM Symposium on Applied Computing (SAC 2015), Salamanque, 2015. PDF
- J.-P. Fauconnier, M. Kamel, B. Rothenburger. Une typologie multi-dimensionnelle des structures énumératives pour l'identification des relations termino-ontologiques (regular paper). Conférence Internationale sur la Terminologie et l'Intelligence Artificielle (TIA 2013), Paris, Université Paris 13, p. 137-144, 2013. PDF
National conference papers
- J.-P. Fauconnier, L. Sorin, M. Kamel, Mustapha Mojahid, N. Aussenac-Gilles. Détection automatique de la structure organisationnelle de documents à partir de marqueurs visuels et lexicaux (regular paper). Traitement Automatique des Langues Naturelles (TALN 2014), Marseille, Association pour le Traitement Automatique des Langues (ATALA), p. 340-351, 2014. PDF
- J.-P. Fauconnier, M. Kamel, B. Rothenburger, N. Aussenac-Gilles. Apprentissage supervisé pour l'identification de relations sémantiques au sein de structures énumératives parallèles (regular paper). Traitement Automatique des Langues Naturelles (TALN 2013), Les Sables d'Olonne, Association pour le Traitement Automatique des Langues (ATALA), p. 132-145, 2013. PDF
Phd Thesis
- J.-P. Fauconnier, Acquisition de liens sémantiques à partir d'éléments de mise en forme des textes : exploitation des structures énumératives (PhD Thesis). Université de Toulouse, 2016. PDF HAL (Best PhD Thesis Award - ATALA 2017)
Talks
- J.-P. Fauconnier. La mise en forme des textes : un indice supplémentaire pour l'identification des relations hiérarchiques (talk). Séminaire de l'équipe TALEP, Laboratoire d'Informatique Fondamentale de Marseille, 20-05-16.
- J.-P. Fauconnier. Mise en forme et indices linguistiques de surface pour l'extraction de connaissances (talk). Journées d'étude internationales S'caladis, Université Toulouse Jean Jaurès, Toulouse, 19-11-15. Abstract
- J.-P. Fauconnier, M. Kamel, N. Aussenac-Gilles. Acquisition de relations sémantiques à partir d’éléments de mise en forme des textes (talk). Séminaires du CENTAL, Université Catholique de Louvain, Louvain-la-Neuve, 21-11-14. PDF
- J.-P. Fauconnier. Métriques pour l’évaluation de l’annotation (talk). Séminaires de l’équipe MELODI, Université Paul Sabatier, Toulouse, 25-11-13. Link
- J.-P. Fauconnier, M. Kamel, N. Aussenac-Gilles. A Supervised Learning for the Identification of Semantic Relations in Parallel Enumerative Structures (poster). The 10th Summer School on Ontology Engineering and the Semantic Web (SSSW 2013), Cercédilla, 10-07-13. PDF
- J.-P. Fauconnier. Classifieur d’Entropie Maximale (MaxEnt) (talk). Séminaires de l’équipe MELODI, Université Paul Sabatier, Toulouse, 15-02-13. Link
- A. Urieli, J.-P. Fauconnier. PosTagger et Parseur Talismane (talk). Séminaires de l’Axe TAL, CLLE-ERSS, Toulouse, 20-06-12. PDF
- J.-P. Fauconnier, J. Roumier, F. Estiévenart. Musonto - A Semantic Search Engine dedicated to Music and Musicians (talk). Music Linked Data Workshop, JISC, Londres, 12-05-11. Link
Teaching
2015-2016
- Algorithms and C++ Programming - 54 hours - Université Jean Jaurès (Toulouse, France)
- Relational Database - 44 hours - Université Jean Jaurès (Toulouse, France)
- Web Integration - 40 hours - Université Jean Jaurès (Toulouse, France)
- C2I Certification - 36 hours - Université Jean Jaurès (Toulouse, France)
- XML technologies - 12 hours - Université Jean Jaurès (Toulouse, France)
- Semantic Web - 4 hours - Université Jean Jaurès (Toulouse, France)
2014-2015
- Web Integration - 24 hours - Institut Universitaire de Technologie (Tarbes, France)
- Relational Database - 18 hours - Institut Universitaire de Technologie (Tarbes, France)
- Web Integration - 18 hours - Institut Universitaire de Technologie (Tarbes, France)
2013-2014
- Web Integration - 36 hours - Institut Universitaire de Technologie (Tarbes, France)
- Relational Database - 12 hours - Institut Universitaire de Technologie (Tarbes, France)
- Information System - 12 hours - Institut Universitaire de Technologie (Tarbes, France)
2012-2013
- Office Automation - 40 hours - Institut Universitaire de Technologie (Tarbes, France)
- Office Automation - 24 hours - Institut Universitaire de Technologie (Tarbes, France)
Software
Most of resources are located on my github repository. The fast way to download a given resource is to use git:
mkdir resource
cd resource
git clone https://github.com/fauconnier/resource
Personal softwares
AMI (Another Maxent Implementation) is a R implementation of multinomial logistic regression, also known as Maximum Entropy classifier. This implementation deals with binary and real-valued features and uses standard R functions to optimize the objective. Then, it is possible to use several iterative methods: LM-BFGS, Conjugate Gradient, Gradient Descent and Generalized Iterative Scaling.
LARAt (Layout Annotation for Relation Acquisition tool), pronounced /laʁa/, is an annotation tool which supports the layout and the formatting of HTML documents. LARAt was used during an annotation campaign in 2013 and, in his current state, is dedicated to the annotation of enumerative structures. The typology implemented is the one described in the TIA 2013 paper.
LaToe (Layout Annotation for Textual Object Extraction) is a tool which extracts the text layout from HTML, MediaWiki, or PDF documents for identifying specific textual objects (such as enumerative structures). Currently, the CRF model used for the PDF analyzer was trained on a small corpus (LING_GEOP). This implies that LaToe could be not efficient for unseen PDF documents with specific formatting.
Source code reviews
Code review of a C++ library for maximum entropy classification. On his website, Tsuruoka proposed a fast implementation of a multinomial logistic regression. In order to get a better and deeper understanding of implementation details, I propose a simple code review. The code base is relatively small (around 2500 lines of code). Those notes are primary intended for my personal use and reflect my current understanding. I propose them here, in case it could help someone. Note that this document is currently a work in progress.
Open source contributions
Some open source contributions:
- 2016
- neoterm (vim module) Fixed documentation. PR
- sc (spreadsheet calculator) Updated sc 7.16. commit
- sc (spreadsheet calculator) Fixed conflicting malloc. PR
- 2015
- LAPDFText (Layout-Aware PDF Analyser) Led a student for improving bottom-up parsing and building a GUI. fork
- 2014
- FREJ (Fuzzy Regular Expressions) Client-server architecture for spreading the load across a cluster. github
- YaTeA (terminology extractor) Adaptation to Talismane POS-tagset and Java client. github
- ACABIT (terminology extractor) Adaptation to Talismane POS-tagset and Java client. github
- Talismane (statistical dependency parser for French) Java client for treating corpora "on-the-fly". github
- 2013-12
Data
French word embeddings models
I propose here some pre-trained word2vec models for French. Their format is the original binary format proposed by word2vec v0.1c. Depending on your needs, you may want to convert those models. A simple way to convert them into text can be:
git clone https://github.com/marekrei/convertvec
cd convertvec/
make
./convertvec bin2txt frWiki_no_phrase_no_postag_700_cbow_cut100.bin output.txt
Alternatively, you can load a binary model directly into a few Python libraries. Below I give a minimal usage example with Gensim:
pip install gensim
python
>>> from gensim.models import KeyedVectors
>>> model = KeyedVectors.load_word2vec_format("frWac_postag_no_phrase_700_skip_cut50.bin", binary=True, unicode_errors="ignore")
>>> model.most_similar("intéressant_a")
[('très_adv' , 0.5967904925346375),
('intéresser_v' , 0.5439727902412415),
('peu_adv' , 0.5426771640777588),
('assez_adv' , 0.5398581027984619),
('certainement_adv' , 0.5246292352676392),
('plutôt_adv' , 0.5234975814819336),
('instructif_a' , 0.5230029225349426),
('trouver_v' , 0.5131329894065857),
('aussi_adv' , 0.505642294883728),
('beaucoup_adv' , 0.5034803152084351)]
For this model, we can see that the adjective 'intéressant' has a lot of shared contexts with adverbs. Note that the color code and the layout are mine.
Please check (Mikolov et al., 2013) to gain insight into the model hyper-parameters.
Thanks to Tim V. C., Adam B., Claude C., Sascha R., Philipp D., Nirina R., Ian W. and Antoine V. who all helped in retrieving some of the original models.
frWac2Vec
FrWac corpus, 1.6 billion words.
frWiki2Vec
FrWiki dump (raw file), 600 millions words.
How to cite those models?
Given the attribution is provided and according to the licence CC-BY 3.0, you are free to copy, distribute, remix and tweak those models for any purpose. The attribution must be made by quoting my name with a link to this page, or by using the bibtex entry below. Those models were trained during my PhD Thesis, and are in no way linked to my current or any future activities. Note also that those models are shared without any guarantees or support.
@misc{fauconnier_2015,
author = {Fauconnier, Jean-Philippe},
title = {French Word Embeddings},
url = {http://fauconnier.github.io},
year = {2015}}
Below, public projects and papers using those models:
- Di Tella et al., (2023), Keep your Enemies Closer: Strategic Platform Adjustments during U.S. and French Elections. in National Bureau of Economic Research (NBER), NBER Working Paper Series. PDF
- Rennard, V., Shang, G., Grari, D., Hunter, J. and Vazirgiannis, M. (2023), FREDSum: A Dialogue Summarization Corpus for French Political Debates. in EMNLP 2023. PDF
- Ait-Saada, M. and Nadif, M. (2023), Unsupervised Anomaly Detection in Multi-Topic Short-Text Corpora. in 17th Conference of the European Chapter of the Association for Computational Linguistic (EACL), Croatia. PDF
- Ivan, I., and Chometton, N. (2022). Alignement des embeddings des définitions et du contexte pour un assistant de lecture sensible au contexte. In Journées Jointes des Groupements de Recherche Linguistique Informatique, Formelle et de Terrain (LIFT) et Traitement Automatique des Langues (TAL). PDF
- Descampe, A., Massart, C., Poelman, S., Standaert, F. X., and Standaert, O. (2022). Automated news recommendation in front of adversarial examples and the technical limits of transparency in algorithmic accountability.. In AI and SOCIETY. Université Catholique de Louvain. PDF
- Eshkol-Taravella, I., Barbedette, A., Liu, X., Soumah, V.-G. (2022) Classification automatique de questions spontanées vs. préparées dans des transcriptions de l’oral. In TALN 2022. PDF
- bpisano (2022). Dall-E Guesser. — Application
- Guillem, F. (2022). Analyse des Programmes de la Présidentielle 2022 avec Word2vec. Le Blog de François Guillem. Blog
- Benamar, A, Grouin, C., Bothua, M. and Vilnat, A. (2022). Etude des stéréotypes genrés dans le théâtre français du XVIe au XIXe siècle à travers des plongements lexicaux. In TALN 2022. PDF
- Enigmatix (2022). Cémantix. Trouvez le mot secret — Application
- Tardy, P. (2021). Approches neuronales pour le résumé abstractif de transcriptions de parole. PhD Thesis Le Mans Université. PDF
- Schoffeniels, A. (2021). NLP Methods for Insurance Document Comparison. Master Thesis University of Liège. PDF
- Mallart, C., Le Nouy, M., Gravier, G., and Sébillot, P. (2021). Active Learning for Interactive Relation Extraction in a French Newspaper's Articles. Recent Advances in Natural Language Processing. PDF
- Barbedette, A., and Eshkol-Taravella, I. (2021). Quand les questions en disent plus que les réponses: classification automatique des intentions dans les questions. Discours (28). OpenEdition
- Wang, X and Liu, X. (2021). Text Classification: du TF-IDF aux word embeddings en passant par features expertes. NLP for French. Blog
- Abdine, H., Xypolopoulos, C., Kamal Eddine, M. and Vazirgiannis, M. (2021). Evaluation of Word Embeddings from Large-Scale French Web Content. arXiv:2105.01990v2. arXiv
- Louis, A., Spanakis, G., and Van Dijck, G. (2021). A Statutory Article Retrieval Dataset in French. arXiv preprint arXiv:2108.11792. arXiv
- Gerardin, M. and Ranvier, M. (2021). Enrichment of the Banque de France’s monthly business survey: lessons from textual analysis of business leaders’ comments. Working Paper — Banque de France. PDF
- Crouzet, O. (2021). Outils computationnels pour l'étude des mécanismes d'adaptation à la variation en perception de la parole. La lettre de I’InSHS — CNRS. PDF
- Dénigot, Q. and Burnett, H. (2021). Using Word Embeddings to Uncover Discourses. In Proceedings of Computation in Linguistics. PDF
- Lambrey, L. (2021). Génération automatique de mots-valises : approche distributionnelle de la classification sémantique de construits. In Journée d'Étude Sciences du Langage, Nancy. PDF
- Billami, M. B., Nicolaieff, L., Gosset, C. and Bortolaso, C. (2021). Participation de Berger-Levrault (BL. Research) à DEFT 2021: de l’apprentissage des seuils de validation à la classification multi-labels de documents. In TALN 2021. PDF
- Hemamou, L. (2021). Analyse automatique des comportements multimodaux lors d’entretiens vidéo différés pour le recrutement. PhD Thesis Université Paris-Saclay. PDF
- Le, N. L. (2020). French Language DRS Parsing. PhD Thesis Mines-Telecom Atlantique. PDF
- Tardy, P., Janiszek, D., Estève, Y., and Nguyen, V. (2020). Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation. In Proceedings of LREC 2020. PDF
- Blandin, A., Lecorvé, G., Battistelli, D., and Étienne, A. (2020). Age Recommendation for Texts. In Proceedings of LREC 2020. PDF
- Kang, H. J. and Eshkol-Taravella, I. (2020). Les avis sur les restaurants à l’épreuve de l’apprentissage automatique. In TALN 2020. PDF
- Leprieur, L., Crouzet, O., and Gaudrain, E. (2020). Une base de données de phrases en français pour l'étude du rôle conjoint des incertitudes sémantique et acoustique dans la perception de la parole. In TALN 2020. PDF
- Benamar, A. (2020). Segmentation de texte non-supervisée pour la détection de thématiques à l'aide de plongements lexicaux. In TALN 2020. PDF
- Bourdois, L. (2020). L'augmentation de données en NLP. from Chaudhary, A., A Visual Survey of Data Augmentation in NLP. Blog
- Pelloin, V., and Prouteau, T. (2020). Apprentissage de plongements de mots sur des corpus en langue de spécialité: une étude d’impact. In TALN 2020. PDF
- Coulombe, C. (2020). Techniques d’amplification des données textuelles pour l’apprentissage profond. PhD Thesis TÉLUQ. PDF
- Billami, M. (2020). DEESSE: A Generic Search Engine based on Artificial Intelligence. Recherche Berger-Levrault. Blog
- Gilly, A. (2019). The Semantic Architecture of Social Unrest. Blog
- Hemamou, L., Felhi, G., Vandenbussche, V., Martin, J. C., and Clavel, C. (2019). Hirenet: A hierarchical attention model for the automatic analysis of asynchronous video job interviews. In Proceedings of the AAAI Conference on Artificial Intelligence. PDF
- Le Ngoc, L., and Haralambous, Y. (2019). CCG Supertagging Using Morphological and Dependency Syntax Information. In Computational Linguistics and Intelligent Text Processing. PDF
- Chow, J. (2019). Lost in translation: fidelity-focused machine translation evaluation. Master Thesis, Imperial College London. PDF
- Vincent, R. (2019). Multilingual Poetry Generation. Master Thesis Norwegian University of Science and Technology. PDF
- Poelman, S. (2019). Faking news recommendation: an exploration of content representation.. Master Thesis UCL. PDF
- Bladier, T., Waszczuk, J., Kallmeyer, L., and Janke, J. H. (2019). From partial neural graph-based LTAG parsing towards full parsing. Computational Linguistics in the Netherlands Journal. PDF
- Dufter, P., and Schütze, H. (2019). Analytical Methods for Interpretable Ultradense Word Embeddings.. In Proceedings of EMNLP 2019. PDF
- Bardet, A., Loubère, L., Dugué, N., Hu, S. and Delpech, Y. (2019). Proposer des analyses sémantiques ou discursives des contributions du GDN. In Hackatal 2019
- Lettria (2019) — Word embedding. Blog
- De Caigny, K. Coussement, K.W. De Bock, S. Lessmann (2019). Incorporating Textual Information in Customer Churn Prediction Models Based on a Convolutional Neural Network. In International Journal of Forecasting
- Hemamou, L., Felhi, G., Vandenbussche, V., Martin, J.-C. and Clavel, C. (2019). HireNet: a Hierarchical Attention Model for the Automatic Analysis of Asynchronous Video Job Interviews. In AAAI 2019. PDF
- Hmida, F., Billami, M. B., François, T. and Gala, N. (2018). Assisted Lexical Simplification for French Native Children with Reading Difficulties. In Proceedings of the Workshop Automatic Text Adaptation, 11th International Conference on Natural Language Generation (INLG 2018), Tilburg. PDF
- Moot, R. (2017). Combining logical and distributional methods in type-logical grammars. Journal of Language Modelling. HAL
- Dusserre, E., and Padró, M. (2017). Bigger does not mean better! We prefer specificity. In IWCS 2017. Montpellier. PDF
- Linz, N., Tröger, J., Alexandersson, J., and König, A. (2017). Using Neural Word Embeddings in the Analysis of the Clinical Semantic Verbal Fluency Task. In IWCS 2017. Montpellier
- Lefebvre-Brossard, A., Spaeth, A., and Desmarais, M. C. (2017). Encoding User as More Than the Sum of Their Parts: Recurrent Neural Networks and Word Embedding for People-to-people Recommendation. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization ACM
- Wang, R. (2017). Second Attempt at Building a Language Translator. Blog
- Sehikh, I., Fohr, D., & Illina, I. (2017). Topic Segmentation in ASR Transcripts using Bidirectional RNNS for Change Detection. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). PDF
- Ebert, S. (2017). Artificial Neural Network Methods Applied to Sentiment Analysis. PhD Thesis Ludwig-Maximilians-Universität München. PDF
- Practical Deep Learning For Coders (2017) — Lesson 13 — Fast.ai project
- Povolny, F., Matejka, P., Hradis, M., Popková, A., Otrusina, L., Smrz, P. (2016). Multimodal Emotion Recognition for AVEC 2016 Challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge
- EU Project MixedEmotions (2016) — Social Semantic Emotion Analysis for Innovative Multilingual Big Data Analytics Markets
- Rothe, S., Ebert, S. and Schütze, H. (2016). Ultradense Word Embeddings by Orthogonal Transformation. NAACL 2016. San Diego
- Fauconnier, J. P., Kamel, M. (2015). Discovering Hypernymy Relations using Text Layout. *SEM 2015. Denver. PDF
Annotated copora
Annotated corpora built during my PhD Thesis:
- corpus-LARA: The corpus LARA is a French corpus of Wikipedia pages annotated with enumerative structures.
- corpus-LING_GEOP: LING_GEOP is a corpus with visual and logical clues. The PDF documents come from ANNODIS-ME.
Experience
Laboratory life
- Member of the IRIT doctoral committee
- Member of the JéTou 2015 organization committee
Review
Jobs & internships
- Axe TAL, CLEE-ERSS, Université Toulouse II, 2012. Evaluation of the French Talismane parser.
- Division ICT, Federal Agency for Medicines and Health Products, Belgium, 2011. Treatment of EudraCT data.
- Software & Technologies, CETIC, Belgium, 2010. Development of the semantic search engine MusOnto.
- Division PRE, Federal Agency for Medicines and Health Products, Belgium, 2009. Database integrity verification.
Links