Jean-Philippe Fauconnier

I am a Software Engineer at Apple.

Previously, I obtained my PhD in Computer Science at the Université Paul Sabatier (Toulouse, France) and completed my Master's degree in Natural Language Processing at the Catholic University of Louvain (Belgium).


I obtained my PhD in Computer Science at the Toulouse Institute of Computer Science Research, Université Paul Sabatier (France). My work was supervised by Drs. Mouna Kamel and Nathalie Aussenac-Gilles, and focused on Natural Language Processing, Document Analysis, and Machine Learning. In particular, I was interested in acquiring lexical relations using the layout and formatting of documents. For this work, I received the ATALA Best PhD Thesis Award.

Previously, I graduated in Natural Language Processing from the Catholic University of Louvain (Belgium) in 2012. This Master's degree programme was delivered by the CENTAL laboratory.

Research Interests

Natural Language Processing

Document Analysis



National journal papers

International conference papers

National conference papers

PhD Thesis








Most resources are located in my GitHub repository. The fastest way to download a given resource is to use git:

mkdir resource
cd resource
git clone

Personal software


AMI (Another Maxent Implementation) is an R implementation of multinomial logistic regression, also known as a Maximum Entropy classifier. This implementation handles binary and real-valued features and uses standard R functions to optimize the objective, so several iterative methods are available: L-BFGS, Conjugate Gradient, Gradient Descent, and Generalized Iterative Scaling.
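As an illustration of the optimization setting AMI targets, here is a minimal sketch of a multinomial logistic regression (maximum entropy) classifier trained with L-BFGS. It uses NumPy and SciPy rather than R, and the dataset is synthetic; this is a sketch of the technique, not AMI's actual code.

```python
# Sketch: maximum entropy (multinomial logistic regression) classifier
# trained by handing the objective and its gradient to an L-BFGS optimizer.
import numpy as np
from scipy.optimize import minimize

def nll_and_grad(w_flat, X, y, n_classes):
    """Negative log-likelihood and its gradient for a maxent classifier."""
    n, d = X.shape
    W = w_flat.reshape(n_classes, d)
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(n), y]).sum()
    onehot = np.zeros_like(probs)
    onehot[np.arange(n), y] = 1.0
    grad = (probs - onehot).T @ X                 # gradient w.r.t. W
    return nll, grad.ravel()

# Synthetic data: three Gaussian clusters, with a bias column appended.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]])
X = np.vstack([rng.normal(c, 1.0, size=(30, 2)) for c in centers])
X = np.hstack([X, np.ones((90, 1))])              # bias feature
y = np.repeat([0, 1, 2], 30)

res = minimize(nll_and_grad, np.zeros(3 * 3), args=(X, y, 3),
               jac=True, method="L-BFGS-B")
W = res.x.reshape(3, 3)
acc = ((X @ W.T).argmax(axis=1) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

In R, the same pattern is presumably what AMI builds on: `optim` exposes the equivalent solvers through `method = "L-BFGS-B"`, `"CG"`, and so on.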


LARAt (Layout Annotation for Relation Acquisition tool), pronounced /laʁa/, is an annotation tool that supports the layout and formatting of HTML documents. LARAt was used during an annotation campaign in 2013 and, in its current state, is dedicated to the annotation of enumerative structures. The typology implemented is the one described in the TIA 2013 paper.


LaToe (Layout Annotation for Textual Object Extraction) is a tool that extracts the text layout from HTML, MediaWiki, or PDF documents in order to identify specific textual objects (such as enumerative structures). Currently, the CRF model used by the PDF analyzer was trained on a small corpus (LING_GEOP), so LaToe may not perform well on unseen PDF documents with specific formatting.

Source code reviews


Code review of a C++ library for maximum entropy classification. On his website, Tsuruoka proposes a fast implementation of multinomial logistic regression. In order to gain a better and deeper understanding of the implementation details, I propose a simple code review. The code base is relatively small (around 2,500 lines of code). These notes are primarily intended for my personal use and reflect my current understanding; I share them here in case they help someone. Note that this document is currently a work in progress.

Open source contributions

Some open source contributions:


French word embeddings models

I propose here some pre-trained word2vec models for French. Their format is the original binary format introduced with word2vec v0.1c. Depending on your needs, you may want to convert these models. A simple way to convert them into text format is:

git clone
cd convertvec/
./convertvec bin2txt frWiki_no_phrase_no_postag_700_cbow_cut100.bin output.txt

Below I give a minimal usage example in Python:

pip install word2vec
>>> import word2vec
>>> model = word2vec.load('frWac_postag_no_phrase_700_skip_cut50.bin')
>>> indexes, scores = model.cosine('intéressant_a')
>>> model.generate_response(indexes, scores).tolist()
[('très_adv'        , 0.5967900206395151),
('intéresser_v'     , 0.5439725695003301),
('peu_adv'          , 0.542676993533696),
('assez_adv'        , 0.5398579170306232),
('certainement_adv' , 0.5246291122355085),
('plutôt_adv'       , 0.5234975073833474),
('instructif_a'     , 0.5230028009476526),
('trouver_v'        , 0.5131327677418707),
('aussi_adv'        , 0.5056422730726639),
('beaucoup_adv'     , 0.5034801589883425)]

For this model, we can see that the adjective 'intéressant' shares many contexts with adverbs. Note that the colour code and the layout are mine. Please check (Mikolov et al., 2013) to gain insight into the model hyper-parameters.
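The `cosine` call above ranks the vocabulary by cosine similarity to the query vector. A minimal sketch of that computation with NumPy, on hypothetical toy vectors (the words and values below are made up for illustration):

```python
# Sketch: nearest neighbours by cosine similarity, as model.cosine does.
import numpy as np

def most_similar(query, vectors, topn=3):
    """Rank the words in `vectors` by cosine similarity to `query`."""
    q = vectors[query]
    q = q / np.linalg.norm(q)
    scores = {w: float(v @ q / np.linalg.norm(v))
              for w, v in vectors.items() if w != query}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

# Hypothetical 3-dimensional embeddings (real models use 200-1000 dims).
toy = {
    "intéressant_a": np.array([1.0, 0.2, 0.0]),
    "très_adv":      np.array([0.9, 0.3, 0.1]),
    "chat_n":        np.array([0.0, 0.1, 1.0]),
}
print(most_similar("intéressant_a", toy))
```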


FrWac corpus, 1.6 billion words.

Lem Pos Phrase Train Dim Cutoff
bin (2.7Gb) - - - cbow 200 0
bin (120Mb) - - - cbow 200 100
bin (120Mb) - - - skip 200 100
bin (298Mb) - - - skip 500 100
bin (202Mb) - - - skip 500 200
bin (229Mb) - - cbow 500 100
bin (229Mb) - - skip 500 100
bin (494Mb) - - skip 700 50
bin (577Mb) - skip 700 50
bin (520Mb) - skip 1000 100
bin (2Gb) - cbow 500 10
bin (289Mb) - cbow 500 100

FrWiki dump (raw file), 600 million words.

Lem Pos Phrase Train Dim Cutoff
bin (253Mb) - - - cbow 1000 100
bin (195Mb) - - - cbow 1000 200
bin (253Mb) - - - skip 1000 100
bin (195Mb) - - - skip 1000 200
bin (128Mb) - - cbow 500 10
bin (106Mb) - - cbow 700 100
bin (151Mb) - - skip 1000 100
bin (121Mb) - - skip 1000 200

How to cite those models?

In accordance with the CC-BY 3.0 licence, feel free to copy, distribute, remix, and tweak these models for any purpose. Attribution must be made by quoting my name with a link to this page, or by using the BibTeX entry below. These models were trained during my PhD thesis and are in no way linked to my current or any future activities. Note also that these models are shared without any guarantee or support.

@misc{fauconnier2015,
	author = {Fauconnier, Jean-Philippe},
	title = {French Word Embeddings},
	url = {},
	year = {2015}}

Below, projects and papers using those models:

To see your work listed, contact me.

Annotated corpora

Annotated corpora built during my PhD Thesis:


Laboratory life


Jobs & internships