In this thesis, a simple and effective model based on deep convolutional neural networks has been proposed for annotating images. For this purpose, a Siamese/Triplet network has been designed that consists of two separate neural networks: the upper network is a deep convolutional neural network that extracts features from images, and the bottom network is a simple MLP. In each training cycle, one image is fed to the network together with the word embedding vectors of one positive word (a correct label) and one negative word (a wrong label), and a triplet loss is used to learn the distance between image features and the corresponding word embedding vectors. Evaluation on standard benchmark datasets, in comparison with related previous work, shows an improvement of about 5 percent in recall and 0.2 percent in the F1 measure of the annotation process.
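
A minimal sketch of this two-branch setup and one training step is given below, assuming a PyTorch implementation; the ResNet-18 backbone, the 300-dimensional word embeddings, the hidden layer size, and the triplet margin are illustrative assumptions, not choices fixed by the thesis.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ImageBranch(nn.Module):
    """Upper network: a deep CNN mapping an image to a feature vector
    (a ResNet-18 backbone is assumed here for illustration)."""
    def __init__(self, embed_dim=300):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, images):
        return self.backbone(images)


class WordBranch(nn.Module):
    """Bottom network: a simple MLP projecting a word embedding vector
    into the shared space (layer sizes are assumptions)."""
    def __init__(self, word_dim=300, embed_dim=300):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(word_dim, 512),
            nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, word_vectors):
        return self.mlp(word_vectors)


# One training cycle: the image acts as the anchor, the embedding of the
# correct label as the positive, and the embedding of a wrong label as the
# negative; the triplet loss pulls matching pairs together in the shared space.
image_net, word_net = ImageBranch(), WordBranch()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

images = torch.randn(8, 3, 224, 224)   # batch of images (anchor)
pos_words = torch.randn(8, 300)        # word embeddings of correct labels
neg_words = torch.randn(8, 300)        # word embeddings of wrong labels

anchor = image_net(images)
positive = word_net(pos_words)
negative = word_net(neg_words)
loss = triplet_loss(anchor, positive, negative)
loss.backward()
```

With this arrangement, annotation at test time amounts to ranking candidate labels by the distance between their projected word embeddings and the image's feature vector, which is consistent with the distance-learning objective described above.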