Abstract
In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph. This allows them to be easily manipulated by users and executed by agents. To this end, we build a model that is trained to learn a joint embedding between recipes and food images via self-supervision and to jointly generate a program from this embedding as a sequence. To validate our idea, we crowdsource programs for cooking recipes and show that: (a) projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results; (b) generating programs from images leads to better recognition results compared to predicting raw cooking instructions; and (c) we can generate food images by manipulating programs via optimizing the latent code of a GAN. Code, data, and models are available online at http://cookingprograms.csail.mit.edu.
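The abstract describes the method only at a high level: a self-supervised joint image-recipe embedding, plus a decoder that generates a program from that embedding as a token sequence. The snippet below is a minimal, illustrative PyTorch sketch of that general idea, not the authors' implementation; all module choices, dimensions, the toy vocabulary, and the combined contrastive-plus-decoding loss are assumptions made for illustration only.

```python
# Minimal sketch (NOT the paper's code) of the high-level idea: a shared
# image-recipe embedding trained with a contrastive objective, plus a
# sequence decoder that emits program tokens from that embedding.
# All module names, sizes, and the toy vocabulary are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, VOCAB = 256, 1000  # assumed embedding size and program-token vocabulary

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # stand-in for a CNN backbone; a real model would use e.g. a ResNet
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMB_DIM))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class RecipeEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, EMB_DIM, batch_first=True)
    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return F.normalize(h[-1], dim=-1)

class ProgramDecoder(nn.Module):
    """Autoregressively decodes program tokens, seeded by the joint embedding."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, EMB_DIM, batch_first=True)
        self.out = nn.Linear(EMB_DIM, VOCAB)
    def forward(self, z, program_tokens):
        h0 = z.unsqueeze(0)                       # embedding initializes the decoder state
        out, _ = self.rnn(self.embed(program_tokens), h0)
        return self.out(out)                      # logits over program tokens

def contrastive_loss(img_z, txt_z, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning image and recipe embeddings."""
    logits = img_z @ txt_z.t() / temperature
    targets = torch.arange(img_z.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy forward pass with random data
imgs = torch.randn(4, 3, 64, 64)
recipe = torch.randint(0, VOCAB, (4, 20))
program = torch.randint(0, VOCAB, (4, 15))

img_enc, txt_enc, dec = ImageEncoder(), RecipeEncoder(), ProgramDecoder()
img_z, txt_z = img_enc(imgs), txt_enc(recipe)
logits = dec(img_z, program[:, :-1])              # teacher forcing on shifted tokens
loss = contrastive_loss(img_z, txt_z) + F.cross_entropy(
    logits.reshape(-1, VOCAB), program[:, 1:].reshape(-1))
print(loss.item())
```

The sketch only shows how a single shared embedding can serve both purposes mentioned in the abstract: cross-modal retrieval (via the contrastive term) and program generation (via the autoregressive decoding term). The encoders, program representation, and training objectives in the actual paper are more involved.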
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition |
| Publisher | IEEE |
| Publication date | 2022 |
| Pages | 16538-16548 |
| ISBN (Electronic) | 9781665469463 |
| DOIs | |
| Publication status | Published - 2022 |
| Event | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, United States. Duration: 19 Jun 2022 → 24 Jun 2022 |
Conference

| Conference | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition |
| --- | --- |
| Country/Territory | United States |
| City | New Orleans |
| Period | 19/06/2022 → 24/06/2022 |
Keywords
- Categorization
- Datasets and evaluation
- Recognition: detection
- Retrieval
- Vision + language