Fuwen Tan


Samsung AI Center, Cambridge

50/60 Station Road, Cambridge, UK


About me

I am a Researcher at the Samsung AI Center, Cambridge (SAIC-Cambridge), working with Dr. Brais Martinez and Dr. Georgios Tzimiropoulos. I received my PhD in Computer Science from the University of Virginia, where I worked with Dr. Vicente Ordóñez Román on Vision and Language. Here is my CV.


[07/2021] Our RRT paper was accepted to ICCV 2021. Code and pretrained models are released in RerankingTransformer.

[06/2021] I started working as a Researcher at the Samsung AI Center, Cambridge (SAIC-Cambridge).

[05/2021] I was recognized as an Outstanding Reviewer for CVPR 2021.

[04/2021] I successfully defended my PhD Dissertation: Learning Local Representations of Images and Text.


Instance-level Image Retrieval using Reranking Transformers

Fuwen Tan, Jiangbo Yuan, Vicente Ordonez
International Conference on Computer Vision (ICCV), 2021.

Instance-level image retrieval is the task of searching in a large database for images that match an object in a query image. To address this task, systems usually rely on a retrieval step that uses global image descriptors, and a subsequent step that performs domain-specific refinements or reranking by leveraging operations ...

[ preprint ]    [ code ]    [ bibtex ]

Curriculum Labeling: Self-paced Pseudo-Labeling for Semi-Supervised Learning

Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez
AAAI Conference on Artificial Intelligence (AAAI), 2021.

Semi-supervised learning aims to take advantage of a large amount of unlabeled data to improve the accuracy of a model that only has access to a small number of labeled examples. We propose curriculum labeling, an approach that exploits pseudo-labeling for propagating labels to unlabeled samples in an iterative and self-paced fashion. This approach is surprisingly simple and effective and surpasses or is comparable with the best methods proposed in the recent literature across all the standard benchmarks for image classification. ...

[ paper ]    [ code ]    [ bibtex ]

Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez
Conf. on Neural Information Processing Systems (NeurIPS), 2019

This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. ...

[ paper ]    [ code ]    [ poster ]    [ bibtex ]

Text2Scene: Generating Compositional Scenes from Textual Descriptions

Fuwen Tan, Song Feng, Vicente Ordonez
Conf. on Computer Vision and Pattern Recognition (CVPR), 2019 (Oral presentation + Best Paper Finalist)
Posts from NVIDIA Developer News, IBM Research Blog

We propose Text2Scene, a model that interprets input natural language descriptions in order to generate various forms of compositional scene representations; from abstract cartoon-like scenes to synthetic images. Unlike recent works, our method does not use generative adversarial networks, but a combination of an encoder-decoder model with a semi-parametric retrieval-based approach. ...

[ paper ]    [ code ]    [ poster ]    [ slides ]    [ bibtex ]

Where and Who? Automatic Semantic-Aware Person Composition

Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, Connelly Barnes
Winter Conf. on Applications of Computer Vision (WACV), 2018

Image compositing is a popular and successful method used to generate realistic yet fake imagery. Much previous work in compositing has focused on improving the appearance compatibility between a given object segment and a background image. However, most previous work does not investigate the topic of automatically selecting compatible segments and predicting their locations and sizes given a background image. ...

[ paper ]     [ supplemental PDF ]     [ code ]     [ video ]     [ bibtex ]

FaceCollage: A Rapidly Deployable System for Real-time Head Reconstruction for On-The-Go 3D Telepresence

Fuwen Tan, Chi-Wing Fu, Teng Deng, Jianfei Cai, Tat Jen Cham
ACM Multimedia (ACM MM, full paper), 2017

This paper presents FaceCollage, a robust and real-time system for head reconstruction that can be used to create easy-to-deploy telepresence systems, using a pair of consumer-grade RGBD cameras that provide a wide range of views of the reconstructed user. A key feature is that the system is very simple to rapidly deploy, with autonomous calibration and requiring minimal intervention from the user. ...

[ paper ]    [ video ]    [ poster ]    [ bibtex ]

High-Quality Kinect Depth Filtering For Real-time 3D Telepresence

Mengyao Zhao, Fuwen Tan, Chi-Wing Fu, Chi-Keung Tang, Jianfei Cai, Tat Jen Cham
Conf. on Multimedia and Expo (ICME), 2013

3D telepresence is a next-generation multimedia application, offering remote users an immersive and natural video-conferencing environment with real-time 3D graphics. Kinect sensor, a consumer-grade range camera, facilitates the implementation of some recent 3D telepresence systems. However, conventional data filtering methods are insufficient to handle Kinect depth error because such error is quantized ...

[ IEEE Xplore ]     [ bibtex ]

Field-guided Registration for Feature-conforming Shape Composition

Hui Huang, Minglun Gong, Daniel Cohen-Or, Yaobin Ouyang, Fuwen Tan, Hao Zhang
ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2012

We present an automatic shape composition method to fuse two shape parts which may not overlap and possibly contain sharp features, a scenario often encountered when modeling man-made objects. At the core of our method is a novel field-guided approach to automatically align two input parts in a feature-conforming manner. ...

[ project ]    [ paper ]    [ bibtex ]


PhD Dissertation: Learning Local Representations of Images and Text

Images and text inherently exhibit hierarchical structures, e.g. scenes built from objects, sentences built from words. In many computer vision and natural language processing tasks, learning accurate prediction models requires analyzing the correlation of the local primitives of both the input and output data. In this thesis, we develop techniques for learning local representations of images and text and demonstrate their effectiveness on visual recognition, retrieval, and synthesis. ...

[ thesis ]    [ slides ]