An iterative approach for the global estimation of sentence similarity
Tomoyuki Kajiwara 1 2
Danushka Bollegala 2
Yuichi Yoshida 0 2
Ken-ichi Kawarabayashi 0 2
0 National Institute of Informatics, Tokyo, Japan
1 Tokyo Metropolitan University, Tokyo, Japan
2 University of Liverpool, Liverpool, United Kingdom
Editor: Gajendra P. S. Raghava, Institute of Microbial Technology CSIR, INDIA
Measuring the similarity between two sentences is often difficult due to their small lexical overlap. Instead of focusing only on the features of the two sentences being compared, we propose a sentence similarity method that considers two types of constraints that must be satisfied by all pairs of sentences in a given corpus. Namely, (a) if two sentences share many features in common, then it is likely that the remaining features in each sentence are also related, and (b) if two sentences contain many related features, then those two sentences are themselves similar. The two constraints are utilized in an iterative bootstrapping procedure that simultaneously updates both word and sentence similarity scores. Experimental results on the SemEval-2015 Task 2 dataset show that the proposed iterative approach for measuring sentence semantic similarity is significantly better than its non-iterative counterparts.
Data Availability Statement: All relevant data are
within the paper.
Funding: The authors received no specific funding
for this work.
Competing interests: The authors have declared
that no competing interests exist.
Measuring the similarity between short textual units such as sentences, tweets or chat messages
is a commonplace task in numerous natural language processing (NLP) applications such as
information retrieval [1], text clustering, or classification [2–4]. Compared to measuring the
similarity between longer textual units such as documents that contain many words,
measuring the similarity between short sentences is challenging because they share few common
features. Consequently, similarity measures based on word overlap, such as cosine similarity,
often fail to detect the similarity between sentences [5]. To overcome this feature sparseness
problem, prior work on sentence similarity has proposed methods that use external lexical
resources such as thesauri [6], or that project sentences into lower-dimensional dense spaces in
which similarity is subsequently computed [7–12].
We propose a complementary approach for measuring the similarity between two sentences
in a corpus that considers not only the features that occur in those two sentences, but also
features that occur in all pairs of sentences in the corpus. Specifically, we require sentence
similarity scores to satisfy two important types of constraints: (a) if two sentences share many
common features, then it is likely that the remaining features in each sentence are also related,
and (b) if two sentences contain many related features, then those two sentences are
themselves similar.
To motivate the role played by these constraints, consider the following three example
sentences:
(i) I love dogs and cats.
(ii) I love dogs and rabbits.
(iii) My favorite pet is a cat.
Sentences (i) and (ii) share many common content words such as I, love, and dog. Thus, we
can infer that cat and rabbit must also be semantically related. The confidence of our inference
grows with (a) the proportion of the overlap, and (b) the number of different sentence pairs in
which we observe similar overlaps. Consider now that we are further required to compare
sentences (ii) and (iii), between which we have no common words. Without the knowledge that
cat and rabbit are related from our previous comparison, we would predict a zero similarity
score between sentences (ii) and (iii). However, if we use the knowledge obtained from (i) and
(ii), and consider cat and rabbit to be similar (i.e. pets in this case), then we could predict a
non-zero similarity score for (ii) and (iii). Therefore, we can benefit from the constraints
derived from other pairs of sentences in a corpus (such as (i) and (ii)), when measuring the
similarity between two given sentences selected from that corpus (such as (ii) and (iii)).
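The overlap argument above can be checked numerically. The following is a minimal sketch, assuming a simple lower-cased letter tokenizer and binary bag-of-words vectors (both are illustrative choices, not the tokenization used in the experiments); the helper names `bag_of_words` and `cosine_overlap` are hypothetical.

```python
import math
import re

def bag_of_words(sentence):
    """Return the set of lower-cased word tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def cosine_overlap(a, b):
    """Cosine similarity between the binary bag-of-words vectors of a and b."""
    wa, wb = bag_of_words(a), bag_of_words(b)
    if not wa or not wb:
        return 0.0
    # For binary vectors the dot product is the number of shared words and
    # each norm is the square root of the number of distinct words.
    return len(wa & wb) / (math.sqrt(len(wa)) * math.sqrt(len(wb)))

s1 = "I love dogs and cats."
s2 = "I love dogs and rabbits."
s3 = "My favorite pet is a cat."

overlap_12 = cosine_overlap(s1, s2)  # four of the five distinct words are shared
overlap_23 = cosine_overlap(s2, s3)  # no word is shared
```

With pure word overlap, sentences (ii) and (iii) receive a score of exactly zero, which is the failure case the corpus-level constraints are meant to address.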
Our proposed method iterates over two stages.
· First, we align each sentence in a corpus with all the other similar sentences to build a
word-alignment matrix. We compute the similarity between two words based on two factors: (a)
pointwise mutual information between the two words according to their alignment
frequency in the word-alignment matrix, and (b) prior similarity between words measured
using pre-trained word embeddings. Using the computed word similarity scores, we
measure the similarity between two sentences using three sentence alignment methods.
· Second, we update the word similarity scores using the word-alignment matrix computed in
the first stage. Specifically, we propose two update rules for this purpose: an additive update,
and a multiplicative update. The proposed method iterates multiple times over the corpus,
measuring similarities between all pairs of sentences. In practice, the proposed method
converges in fewer than 3 iterations. However, computing all sentence pair similarities can be
time-consuming for large text corpora. To overcome this problem, we propose an efficient
method based on SimHash [13] that identifies the most similar sentence pairs in a corpus,
namely those that contribute to the similarity score update, and thereby obviates
all-pair comparisons.
Our proposed method is unsupervised in the sense that it does not require any labeled data
for sentence similarity. Moreover, we do not use external resources such as thesauri, which
might not be available for resource poor languages or specialised domains. The proposed
method does not assume a specific sentence representation method, and can be used with
different sentence representations such as bag-of-words, or parse trees. Moreover, it is
complementary to the sentence embedding methods, and can be used in conjunction in an ensemble
setting as yet another sentence similarity measure.
We evaluate the proposed sentence similarity method using the SemEval-2015 Task 2
sentence similarity benchmark dataset. Our experimental results show that the proposed iterative
approach for measuring sentence semantic similarity is significantly better than the
non-iterative counterparts.
Measuring the similarity between sentences is an omnipresent step in various NLP tasks such
as paraphrase detection, recognizing textual entailment, sentence simplification and text
In paraphrase detection, we must determine whether two sentences express the same
meaning. Socher et al. [14] used recursive autoencoders to learn feature vectors for phrases. The
feature vectors are then used to compute word- and phrase-wise similarity between sentences. A
dynamic pooling layer is used to create a fixed-size representation for sentences of varying
lengths. Finally, a supervised classifier is trained using this lower-dimensional embedding of
sentences. Ji and Eisenstein [11] proposed a discriminative KL-divergence-based term
weighting method and used matrix factorization to obtain lower-dimensional representations of
sentences. A supervised classifier is then trained using those sentence representations to detect
similar sentence pairs. Cheng and Kartsaklis [15] used recursive neural networks for
embedding a sentence in a latent space, in which the similarity between sentences was
measured. Representing sentences using latent features is an effective method to overcome the
feature sparseness problem encountered when measuring the similarity between two
sentences. Although we represent sentences using explicit lexical features, our proposed method
does not depend on a particular sentence representation method, and can be applied with any
of the representations proposed in prior work.
For recognising textual entailment, we must compare two sentences and decide whether
one statement entails the other [16]. Sentence similarity measures have been used as features
for recognizing entailment [17]. However, unlike similarity, entailment is an asymmetric
relation [18]. In sentence simplification [19], for a given sentence, we must find a sentence that is
simpler in terms of grammatical structure, word usage, etc. than the original sentence. We
believe that the word-alignment methods we propose in this paper will be useful for finding
simplification candidates that preserve most of the information in the sentences to be simplified.
A benchmark dataset for sentence similarity was created via crowdsourcing in
SemEval-2015 Task 2 [20]. Both supervised methods [ ] that require sentence pairs annotated with
similarity ratings, as well as unsupervised methods [ ] have been proposed. Instead of using
all the words in the two sentences, first selecting a subset of words from each sentence has
been an effective technique [22–25]. Following this observation, we propose maximum
similarity and bipartite graph matching for selecting two subsets of words to be aligned between
the two sentences.
Pre-trained word embeddings have been successfully used in prior work to overcome
feature sparseness. Sultan et al. [ ] used cosine similarity between word embeddings trained by
[ ] and lexical substitution features from PPDB [ ] for measuring sentence
similarity. Hänig et al. [ ] used cosine similarity between word embeddings trained by SGNS [ ]
and features such as synonyms from WordNet [ ] and ConceptNet [ ] for measuring
sentence similarity. Han et al. [ ] used cosine similarity between distributional word
representations and features from WordNet for word-alignment. These top-performing systems from
SemEval-2015 Task 2 are either supervised or depend on external resources. In contrast, our
proposed method is unsupervised and does not use external resources. The main point of this
paper is that the global sentence similarity computation method we propose can be used with
any method for computing word similarity and for representing words or sentences as embeddings.
An alternative method for measuring sentence similarity is to first embed each sentence
into a space, and then measure cosine similarity in the embedded space. Skip-thought vectors
[ ] and FastSent [ ] are such sentence embedding methods that use consecutive triplets of
sentences selected from books. In contrast to sentence embedding methods, our proposed
method operates directly on pre-trained word embeddings to compute sentence similarity,
without requiring us to learn sentence embeddings. This is particularly useful in situations
where learning sentence embeddings is computationally expensive, or text corpora with
sequential sentences are unavailable.
Iterative similarity computation
Our proposed method iterates between two stages. First, we use the similarity between words
to align pairs of sentences in a corpus. Following Song and Roth [
], we extend three sentence
similarity measures for iterative similarity computation. Second, we update the word similarity
scores considering the sentence alignments produced in the first stage. Two update rules are
proposed for this purpose.
Let us denote a sentence x by a vector $\mathbf{x} = (x_1, x_2, \ldots, x_{|V|})$, where the i-th element $x_i$ is set to 1
if the i-th word occurs in the sentence x, and otherwise to 0. Here, the vocabulary V is the set of
words that occur in a corpus, and |V| denotes the number of unique words in that corpus.
Given a word-alignment method, A, the similarity, $S_A(\mathbf{x}, \mathbf{y})$, between two sentences x and y
can then be calculated using a word similarity measure ϕ(xi, yj). We use the following three
word-alignment methods to define three sentence similarity measures.
Average similarity. The average similarity, Save(x, y), between two sentences x and y is
computed by averaging the similarities between all pairs of words taken from the two
sentences as follows:
$$S_{ave}(\mathbf{x}, \mathbf{y}) = \frac{1}{\|\mathbf{x}\| \, \|\mathbf{y}\|} \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} x_i \, y_j \, \phi(x_i, y_j) \tag{1}$$
Here, ||x|| denotes the ℓ2 norm of the vector x. In particular, if we set ϕ(xi, yj) = 1 when i = j and 0
otherwise, Save reduces to the popular cosine similarity.
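A minimal sketch of the average similarity follows, assuming it is the sum of ϕ over all word pairs divided by the product of the ℓ2 norms of the binary sentence vectors, which is the form that reduces to cosine similarity as stated above. The function names `s_ave` and `exact_match` are hypothetical, and sentences are passed as lists of tokens for simplicity.

```python
import math

def s_ave(x_words, y_words, phi):
    """Average similarity between two sentences given as token lists.

    phi(w, v) is any word similarity function. The norm of a binary
    bag-of-words vector is the square root of its number of distinct words.
    """
    x, y = set(x_words), set(y_words)
    if not x or not y:
        return 0.0
    total = sum(phi(w, v) for w in x for v in y)
    return total / (math.sqrt(len(x)) * math.sqrt(len(y)))

def exact_match(w, v):
    """Word similarity that is 1 for identical words and 0 otherwise."""
    return 1.0 if w == v else 0.0

x = ["i", "love", "dogs", "and", "cats"]
y = ["i", "love", "dogs", "and", "rabbits"]
score = s_ave(x, y, exact_match)  # equals the word-overlap cosine similarity
```

Swapping `exact_match` for a graded word similarity (for instance, cosine similarity between word embeddings) lets related but non-identical words such as cats and rabbits contribute to the score.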
Maximum similarity. Instead of averaging the word similarity scores, maximum
similarity, Smax(x, y), considers for each word xi the most similar word yj, as follows:
$$S_{max}(\mathbf{x}, \mathbf{y}) = \frac{1}{\|\mathbf{x}\| \, \|\mathbf{y}\|} \sum_{i=1}^{|V|} x_i \max_{j} \, y_j \, \phi(x_i, y_j) \tag{2}$$
Smax can be considered as a sentence similarity measure based on a one-to-many
word-alignment. We consider a word-pair (xi, yj) to be aligned if
$$j = \arg\max_{j'} \, y_{j'} \, \phi(x_i, y_{j'}) \tag{3}$$
We create a word-alignment matrix Amax where the (i, j) element denotes the number of sentence pairs
in which the i-th word of the first sentence was aligned with the j-th word of the second
sentence.
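The one-to-many alignment and its count matrix can be sketched as follows. This is an illustration under assumptions: the normalisation mirrors the average similarity, the alignment counts are kept in a `Counter` keyed by word pairs instead of a dense |V| × |V| matrix, and the optional threshold `theta` (the similarity threshold θ that the text introduces later) gates only which alignments are recorded. The names `s_max`, `toy_phi` and `TOY_SIM` are hypothetical.

```python
import math
from collections import Counter

def s_max(x_words, y_words, phi, theta=0.0, counts=None):
    """Maximum similarity: each word of x is scored against its best match in y.

    If `counts` is given, every aligned pair whose similarity exceeds
    `theta` is recorded in it (a sparse stand-in for the matrix A_max).
    """
    x, y = sorted(set(x_words)), sorted(set(y_words))
    if not x or not y:
        return 0.0
    total = 0.0
    for w in x:
        best = max(y, key=lambda v: phi(w, v))  # argmax over the words of y
        score = phi(w, best)
        total += score
        if counts is not None and score > theta:
            counts[(w, best)] += 1
    return total / (math.sqrt(len(x)) * math.sqrt(len(y)))

# Toy word similarities, made up for illustration only.
TOY_SIM = {("cats", "rabbits"): 0.6}

def toy_phi(w, v):
    if w == v:
        return 1.0
    return TOY_SIM.get((w, v), TOY_SIM.get((v, w), 0.0))

a_max = Counter()
x = ["i", "love", "dogs", "and", "cats"]
y = ["i", "love", "dogs", "and", "rabbits"]
score = s_max(x, y, toy_phi, theta=0.4, counts=a_max)
```

Calling `s_max` over many sentence pairs with the same `Counter` accumulates the alignment frequencies that later feed the word similarity update.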
Bipartite matching. We can represent the two sentences x and y by a bipartite graph
where the vertices in each part correspond respectively to the two sets $\{i : i \in V, x_i = 1\}$ and
$\{j : j \in V, y_j = 1\}$ consisting of words that occur in each sentence. Each vertex in the first part
(corresponding to the words in the first sentence) is connected to all the vertices in the second
part (corresponding to the words in the second sentence) using an undirected weighted edge.
The weight of the edge connecting i to j is set to the word similarity ϕ(xi, yj). This bipartite
graph can be constructed in $O(|V|^2)$ time.
Next, we can model the problem of measuring the similarity between the two sentences x
and y as a problem of bipartite graph matching. Specifically, we would like to find the
one-to-one mapping between the two parts that maximises the sum of edge-weights from x to y.
Formally, let M be a boolean matrix where Mi,j = 1 if word xi is aligned to word yj. Then the
optimal word alignment has weight
$$\max_{M} \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} M_{i,j} \, x_i \, y_j \, \phi(x_i, y_j) \tag{4}$$
such that each word xi is aligned to at most one word yj. This maximum-matching problem
can be solved using the Hungarian algorithm [ ], a bipartite matching algorithm with time
complexity $O(|V|^3)$. For each word xi, let us denote its optimum alignment target under the
Hungarian method by yj = yh(i).
We define a similarity, Shun(x, y), based on this optimum alignment as follows:
$$S_{hun}(\mathbf{x}, \mathbf{y}) = \frac{1}{\|\mathbf{x}\| \, \|\mathbf{y}\|} \sum_{i=1}^{|V|} x_i \, y_{h(i)} \, \phi(x_i, y_{h(i)}), \qquad h(i) = \arg\max_{j'} M_{i,j'} \tag{5}$$
Shun can be considered as a sentence similarity measure based on a one-to-one
word-alignment. We create a word-alignment matrix Ahun where the (i, j) element denotes the number of
sentence pairs in which the i-th word of the first sentence was aligned with the j-th word of the
second sentence according to the Hungarian algorithm.
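The one-to-one alignment can be illustrated on a toy example. Note the substitution: the sketch below finds the optimal matching by exhaustive search over permutations, which is only feasible for very short sentences and stands in for the Hungarian algorithm (in practice a routine such as `scipy.optimize.linear_sum_assignment` could solve the same assignment problem). The function names and the toy similarity values are hypothetical.

```python
import math
from itertools import permutations

def best_one_to_one(x_words, y_words, phi):
    """Return (weight, pairs) of the maximum-weight one-to-one alignment.

    Brute force over all injective mappings from the shorter sentence into
    the longer one; pairs are always reported as (x_word, y_word).
    """
    x, y = sorted(set(x_words)), sorted(set(y_words))
    swapped = len(x) > len(y)
    short, long_ = (y, x) if swapped else (x, y)
    best_weight, best_pairs = -math.inf, []
    for perm in permutations(range(len(long_)), len(short)):
        pairs = [(long_[j], short[i]) if swapped else (short[i], long_[j])
                 for i, j in enumerate(perm)]
        weight = sum(phi(w, v) for w, v in pairs)
        if weight > best_weight:
            best_weight, best_pairs = weight, pairs
    return best_weight, best_pairs

def s_hun(x_words, y_words, phi):
    """Sentence similarity from the optimal one-to-one alignment."""
    x, y = set(x_words), set(y_words)
    if not x or not y:
        return 0.0
    weight, _ = best_one_to_one(x_words, y_words, phi)
    return weight / (math.sqrt(len(x)) * math.sqrt(len(y)))

# Toy word similarities, made up for illustration only.
SIM = {("cat", "cat"): 1.0, ("kitten", "cat"): 0.8,
       ("kitten", "dog"): 0.3, ("cat", "dog"): 0.4}

def toy_phi(w, v):
    return SIM.get((w, v), SIM.get((v, w), 0.0))

weight, pairs = best_one_to_one(["cat", "kitten"], ["cat", "dog"], toy_phi)
score = s_hun(["cat", "kitten"], ["cat", "dog"], toy_phi)
```

In this example the one-to-many alignment would map both cat and kitten to cat, whereas the one-to-one constraint forces kitten onto dog because cat is already taken.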
Incremental update rule
In many text similarity computation tasks such as finding similar documents in information
retrieval, or document clustering, we must compare not only one pair of texts (documents)
selected from a given collection, but compute the similarities between all pairs of texts.
Likewise, when calculating the similarity between sentences, it is often the case that we are given a
large collection of sentences (a corpus) from which a pair of sentences is selected. As we
already described, we can exploit the information available in all the sentences in the corpus
when measuring the similarity between two given sentences. Instead of considering the
similarity between two words, ϕ(xi, yj), to be a fixed value, we update word similarities considering
their alignments in sentences. Because the sentence similarity measures given by Eqs (1), (2)
and (5) depend on the word similarity scores, this results in an update procedure that iterates
between measuring sentence similarities (thereby word-alignments), and updating word
similarities.
Let us denote the similarity between two words xi and yj after the t-th iteration by ϕ(t)(xi, yj),
and the word-alignment matrix computed using the maximum similarity or the bipartite
matching by A(t). Note that the word-alignment matrix A is an asymmetric matrix.
Therefore, we define a symmetric word co-occurrence matrix C(t), where its (i, j)-th element is
$$C^{(t)}_{ij} = A^{(t)}_{ij} + A^{(t)}_{ji} \tag{6}$$
Let B(t) be the word similarity matrix where its (i, j) element $B^{(t)}_{ij}$ denotes the similarity
between the two words i and j computed using co-occurrence counts $C^{(t)}_{ij}$. Different word
association measures can be used to compute similarity scores from co-occurrence counts.
In this work, we use the positive pointwise mutual information (PPMI) computed as
$$B^{(t)}_{ij} = \max\left(0, \; \log \frac{C^{(t)}_{ij} \, Z}{\sum_{i'} C^{(t)}_{i'j} \, \sum_{j'} C^{(t)}_{ij'}}\right), \qquad Z = \sum_{i'} \sum_{j'} C^{(t)}_{i'j'} \tag{7}$$
PPMI is frequently used for measuring word similarity in various NLP tasks [ ].
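A minimal sketch of turning alignment counts into PPMI scores follows. It rests on two assumptions that should be read as such: the symmetric matrix is formed by adding the alignment matrix to its transpose, and PPMI takes its standard form with Z equal to the total count. The function names `symmetrise` and `ppmi` are hypothetical, and the toy counts are made up.

```python
import math

def symmetrise(A):
    """Assumed symmetrisation: C[i][j] = A[i][j] + A[j][i]."""
    n = len(A)
    return [[A[i][j] + A[j][i] for j in range(n)] for i in range(n)]

def ppmi(C):
    """Positive pointwise mutual information of a square count matrix."""
    n = len(C)
    row = [sum(C[i]) for i in range(n)]
    col = [sum(C[i][j] for i in range(n)) for j in range(n)]
    Z = sum(row)  # total number of recorded alignments
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if C[i][j] > 0:  # zero counts keep a PPMI of 0
                B[i][j] = max(0.0, math.log(C[i][j] * Z / (row[i] * col[j])))
    return B

# Toy alignment counts over a 3-word vocabulary.
A = [[0, 2, 0],
     [0, 0, 1],
     [0, 0, 0]]
C = symmetrise(A)
B = ppmi(C)
```

Because most word pairs are never aligned, B is sparse in practice, a property the later comparison of the two update rules relies on.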
We propose two update rules for updating the word similarity scores using the
word-alignment counts: the additive update rule defined by Eq (8), and the multiplicative update rule
defined by Eq (9).
Here, η(t) is the update rate in the t-th iteration. Because we require word similarity scores to
be in the range [0, 1], we scale ϕ(t+1)(xi, yj) by dividing it by the maximum similarity score
between any pair of words, maxij ϕ(t+1)(xi, yj), after each iteration. In both update rules, the
initial word similarities, ϕ(0)(xi, yj), are computed using pre-trained word embeddings. In our
experiments, we used skip-gram with negative sampling (SGNS) [ ] for learning word
embeddings. Then, ϕ(0)(xi, yj) is computed as the cosine similarity between the word
embeddings corresponding to the words xi and yj.
The additive update rule given by Eq (8) closely resembles the update rule used in imitation
learning [ ], where a learner is required to imitate the training signal provided by an oracle.
In our case, the word similarity scores ϕ(t)(xi, yj) are required to follow $B^{(t)}_{ij}$, the similarity scores
computed using word-alignment counts. On the other hand, the multiplicative update rule
given by Eq (9) can be seen as a weighted similarity score where current similarity scores are
weighted by the corresponding alignment counts. We experimentally compare the different
combinations of word-alignment matrices produced by different sentence similarity measures
and the update rules.
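Since Eqs (8) and (9) are not reproduced in this text, the sketch below uses one plausible reading of the two rules and should be treated as an assumption: the additive rule adds η times the PPMI score to the current similarity, and the multiplicative rule multiplies the current similarity by η times the PPMI score. Both are followed by the division by the maximum score described above. All function names are hypothetical.

```python
def normalise(phi):
    """Divide every score by the largest score so that the maximum is 1."""
    peak = max(max(row) for row in phi)
    if peak <= 0:
        return [row[:] for row in phi]
    return [[v / peak for v in row] for row in phi]

def additive_update(phi, B, eta):
    """Assumed additive rule: phi + eta * B, then rescaling."""
    return normalise([[p + eta * b for p, b in zip(pr, br)]
                      for pr, br in zip(phi, B)])

def multiplicative_update(phi, B, eta):
    """Assumed multiplicative rule: phi * (eta * B), then rescaling."""
    return normalise([[p * eta * b for p, b in zip(pr, br)]
                      for pr, br in zip(phi, B)])

# Toy values: two words, with PPMI support only for the off-diagonal pair.
phi0 = [[1.0, 0.5], [0.5, 1.0]]
B0 = [[0.0, 1.0], [1.0, 0.0]]

phi_add = additive_update(phi0, B0, eta=1.0)
phi_mul = multiplicative_update(phi0, B0, eta=1.0)
```

Under this reading, the multiplicative rule zeroes every entry for which the PPMI score is zero, while the additive rule keeps the existing non-zero similarities.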
In practice, even though two sentences might be similar, not all the words in the two
sentences need to be similar. However, both the maximum similarity method and the bipartite
matching method require all word-pairs from the two sentences to be aligned. This imposes an
unnecessarily strict constraint on word-alignment because two words might get aligned
despite having a small word similarity score. To avoid such word-alignments, we consider
only word-pairs (xi, yj) with similarity ϕ(t)(xi, yj) > θ for the word-alignment process for a fixed
threshold θ ∈ [0, 1]. We experimentally study the effect of θ on the performance of our
proposed method.
Efficient computation of similarity
Calculating the full word-alignment matrix has a computational complexity of $O(n^2 |V|)$,
where n is the total number of sentences in the corpus. However, most sentence pairs in a
corpus will have almost zero similarity scores, and would not contribute to the word-alignment
matrices. To avoid such unproductive computations, we use SimHash [13] to find the most
similar k sentences for each sentence in the corpus, and measure sentence similarity only for
those sentence pairs. Hamming distance over SimHash values of two sentences approximates
the cosine similarity between the corresponding sentences. This method reduces the
computational complexity to $O(nk|V|)$, which is significantly smaller than $O(n^2 |V|)$ for $k \ll n$.
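The candidate selection step can be sketched as follows. This is an illustrative SimHash over unweighted word tokens with 64-bit signatures derived from MD5; the hash function, the signature length and the helper names are assumptions rather than the configuration used in the experiments. For brevity, `top_k` still ranks all signatures by Hamming distance, which is cheap but quadratic; a scalable implementation would use a bucketed lookup over the signatures instead.

```python
import hashlib

BITS = 64  # signature length, an illustrative choice

def word_hash(word):
    """Stable 64-bit hash of a word (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def simhash(words):
    """SimHash signature: per-bit majority vote over the word hashes."""
    tally = [0] * BITS
    for w in words:
        h = word_hash(w)
        for b in range(BITS):
            tally[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(BITS) if tally[b] > 0)

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def top_k(signatures, k):
    """For each sentence index, the k other indices with the closest signatures."""
    result = {}
    for i, sig in enumerate(signatures):
        others = [(hamming(sig, s), j) for j, s in enumerate(signatures) if j != i]
        result[i] = [j for _, j in sorted(others)[:k]]
    return result

corpus = [["i", "love", "dogs"], ["i", "love", "dogs"], ["my", "favorite", "pet"]]
signatures = [simhash(s) for s in corpus]
neighbours = top_k(signatures, k=1)
```

Only the sentence pairs returned by `top_k` would then be passed to the sentence similarity and alignment step.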
We evaluate the accuracy of our method by predicting the similarity between two given
sentences using the SemEval-2015 Task 2 sentence similarity benchmark dataset. We also describe
the sensitivity of the performance to each parameter and to the initial word embeddings.
Sentence similarity measurement
For evaluating the proposed method for measuring sentence similarity, we use the
SemEval-2015 Task 2 dataset (http://alt.qcri.org/semeval2015/task2/) [20]. This dataset includes 3,000
sentence pairs from five different domains: news headlines (Head), image descriptions (Img),
answer pairs from a tutorial dialogue system (Stud), answer pairs from Q&A websites (QA),
and sentence pairs from a committed belief dataset (Bel). Sentence similarity scores that range
from 0 (the two sentences are completely dissimilar) to 5 (the two sentences are completely
equivalent, as they mean the same thing) are obtained via crowdsourcing. A sentence similarity
measure is evaluated against the human ratings in this dataset using the Pearson correlation
coefficient. The Pearson correlation coefficient ranges in [−1, 1], and higher values indicate better
agreement with the human notion of sentence similarity.
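For reference, the evaluation measure can be computed as in the following minimal sketch. The function name `pearson` is hypothetical, and the ratings and predictions are made-up numbers used only to exercise the code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

gold = [0, 1, 2, 3, 4, 5]                    # hypothetical human ratings
predicted = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]   # hypothetical system scores
r_same_order = pearson(gold, predicted)
r_reversed = pearson(gold, predicted[::-1])
```

Because the coefficient is invariant to positive linear rescaling, predicted similarities do not need to be mapped onto the 0 to 5 rating scale before evaluation.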
We use publicly available pre-trained word embeddings (https://code.google.com/archive/p/word2vec/)
trained using SGNS and use cosine similarity to compute the initial word
similarities, ϕ(0)(xi, yj), required by the additive and the multiplicative rules defined respectively by Eqs
(8) and (9). The pre-trained word embeddings are trained on a Google News corpus of about
100 billion words, and provide 300-dimensional vectors for 3 million words. We use 5-fold
cross validation on the training sentence pairs in the SemEval-2015 Task 2 dataset to obtain the
optimal values of θ = 0.4 and t = 3. Moreover, we experimented with different update rate
scheduling methods and found η(t) = 1 to be the best. We analyse the sensitivity of the
performance of the proposed method to those parameters. Because the SemEval-2015 Task 2 dataset
contains only a small number of sentences (ca. 6,000), we do not require the SimHash-based
approximation method for this dataset.
To demonstrate the effectiveness of conducting iterative similarity updates in the proposed
method, we compare it against the following baseline methods that have been frequently used
in prior work that do not perform iterative similarity updates.
Cosine baseline calculates the similarity between two sentences x and y as the
cosine similarity between the two vectors x and y representing the two sentences.
Cosine (add SGNSs) baseline calculates the similarity between two sentences x and y as the
cosine similarity between two sentence embeddings. These sentence
embeddings are composed by adding the word embeddings of the
words in each sentence. Representing sentences via the sum of word
embeddings has been shown to be a strong baseline for creating
sentence embeddings [ ].
SGNS method calculates the similarity between two sentences x and y using
the three sentence similarity measures, Save, Smax, and Shun respectively
using Eqs (1), (2) and (5). It uses the pre-trained word embeddings
learnt using SGNS, and measures the similarity ϕ(xi, yj) between two
words xi and yj as the cosine similarity between the corresponding word
embeddings. This method simulates the proposals made by Song and
Roth [ ] for measuring sentence similarity using word alignments.
This method does not perform any iterative similarity updates as done
by the proposed method, and corresponds to the current
state-of-the-art unsupervised sentence similarity measure.
PPMI baseline uses the PPMI-based word similarity computed using
word-alignment counts as the word similarity function ϕ(xi, yj), and
computes the three sentence similarity measures Save, Smax, and Shun.
Specifically, 6 variants of this baseline are computed by combining the two
word-alignment matrices Amax and Ahun with the three sentence
similarity measures Save, Smax, and Shun.
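The Cosine (add SGNSs) baseline described above can be sketched in a few lines. The two-dimensional vectors below are made up for illustration and stand in for real pre-trained embeddings; skipping out-of-vocabulary words is an assumption, and the function names are hypothetical.

```python
import math

def sentence_vector(words, embeddings):
    """Sum the embeddings of the words in a sentence, skipping unknown words."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w)
        if vec is not None:
            for d in range(dim):
                total[d] += vec[d]
    return total

def cosine(u, v):
    """Cosine similarity between two dense vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

# Toy embeddings, made up for illustration only.
toy = {"cat": [1.0, 0.0], "rabbit": [0.8, 0.6], "car": [0.0, 1.0]}

sim_related = cosine(sentence_vector(["cat"], toy), sentence_vector(["rabbit"], toy))
sim_unrelated = cosine(sentence_vector(["cat"], toy), sentence_vector(["car"], toy))
```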
Table 1 compares the different sentence similarity measures using the Pearson correlation
coefficients with the human ratings for the test sentence pairs in the SemEval-2015 Task 2
dataset. The proposed method (denoted by Prop) is computed for the combinations of 2
word-alignment matrices (Amax and Ahun), 3 sentence similarity measures (Save, Smax, and
Shun), and 2 update rules (additive and multiplicative, denoted respectively by + and ×),
resulting in 12 variants shown in Table 1. The final column, Mean, in Table 1 shows the weighted
mean over the 5 domains for each method. It is computed by weighting the Pearson
correlation coefficient in each domain by the total number of sentence pairs in that domain,
according to the official scoring guidelines in SemEval-2015 Task 2.
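The weighted mean in the final column can be computed as in the sketch below. The per-domain correlations and pair counts in the example are made-up numbers, not values from Table 1, and the function name `weighted_mean` is hypothetical.

```python
def weighted_mean(per_domain):
    """Mean of per-domain correlations weighted by the number of sentence pairs.

    per_domain maps a domain name to a (correlation, number_of_pairs) tuple.
    """
    total_pairs = sum(n for _, n in per_domain.values())
    return sum(r * n for r, n in per_domain.values()) / total_pairs

# Hypothetical example with two domains of different sizes.
example = {"domain_a": (0.80, 750), "domain_b": (0.60, 375)}
mean_r = weighted_mean(example)
```

The larger domain pulls the mean towards its own correlation, so the result differs from the unweighted average of 0.70.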
In Table 1, bold scores indicate the highest performance, and scores marked with a star statistically significantly outperform the SGNS (Smax) baseline.
From Table 1, we see that Prop Amax + Smax is the best performing method among the
different methods compared. In particular, it reports the best correlation coefficients in 4 out of the 5
domains. Moreover, according to the Fisher z-transformation, the correlations reported by the
proposed method are statistically significantly better than those of SGNS Smax, which supports our
proposal that sentence similarities must be computed in an iterative fashion over the entire
corpus considering word-alignment constraints. Overall, the maximum similarity word-alignment
(Amax) with Smax consistently performs well across different domains and baselines.
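The text does not spell out how the Fisher z-transformation was applied, so the following is only a sketch of one common variant: each correlation is transformed with the inverse hyperbolic tangent, and the difference is compared against a normal distribution as if the two correlations came from independent samples. Since both systems are scored on the same sentence pairs, that independence is an approximation. The function names and the example values are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher z-transformation of a correlation coefficient."""
    return math.atanh(r)

def one_sided_p(r_new, r_base, n_new, n_base):
    """One-sided p-value for r_new > r_base, treating the samples as independent."""
    se = math.sqrt(1.0 / (n_new - 3) + 1.0 / (n_base - 3))
    z = (fisher_z(r_new) - fisher_z(r_base)) / se
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical correlations on 100 pairs each.
p_equal = one_sided_p(0.5, 0.5, 100, 100)
p_better = one_sided_p(0.7, 0.5, 100, 100)
```

A result would count as significant at the level used in the figures when the returned p-value falls below 0.05.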
Between the two update rules, additive update outperforms the multiplicative counterpart.
Recall that the word similarity matrix B(t) given by Eq (7) is in practice a sparse matrix.
Therefore, the multiplicative update rule given by Eq (9) results in even sparser similarity
scores ϕ(t+1) than ϕ(t) after each update. On the other hand, the additive update rule given by
Eq (8) would retain the non-zero elements in ϕ(t) during the update. We believe that the extra
sparsification in the multiplicative update rule decreases its performance when measuring the
similarity between sentences.
We study the performance of the Prop Amax + Smax method, which reported the best results
according to Table 1, under different update rate scheduling methods. Specifically, we consider
update rate scheduling methods frequently used in stochastic optimization such as constant
update rates (η(t) = 0.5, 1.0, 1.5), reciprocal update rates (η(t) = 1/t, 1/(2t)), and the inverse
squared update rate (η(t) = 1/t²).
Fig 1. Effect of the different update rate scheduling methods on the performance of the proposed method. The dashed
horizontal line shows the p < 0.05 significance level (Fisher z-transformation) for outperforming the SGNS Smax method. Peak correlation value
and the required number of iterations (t) are shown within brackets.
Fig 1 shows the performance of the proposed method under different update rate
scheduling methods. The dashed horizontal line in Fig 1 is the level of performance a particular
method must obtain in order for that method to statistically significantly outperform the
state-of-the-art SGNS Smax. From Fig 1, we see that our proposed method outperforms SGNS Smax
under all update rate scheduling methods. Therefore, the proposed method is relatively
insensitive to the update rate scheduling method used.
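The schedules compared above can be written as functions of the iteration index. This is a small sketch that assumes iterations are counted from t = 1; the dictionary name `SCHEDULES` and its keys are hypothetical labels.

```python
# Update rate schedules as functions of the iteration index t (t = 1, 2, ...).
SCHEDULES = {
    "constant 0.5": lambda t: 0.5,
    "constant 1.0": lambda t: 1.0,
    "constant 1.5": lambda t: 1.5,
    "1/t": lambda t: 1.0 / t,
    "1/(2t)": lambda t: 1.0 / (2.0 * t),
    "1/t^2": lambda t: 1.0 / (t * t),
}

# Update rates for the first three iterations under each schedule.
rates = {name: [eta(t) for t in (1, 2, 3)] for name, eta in SCHEDULES.items()}
```

The decaying schedules shrink the contribution of later iterations, while the constant schedules weight every iteration equally.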
Moreover, under constant update rates, when we increase the value of η, the Pearson
correlation reaches the maximum value with a smaller number of iterations. Once the Pearson
correlation coefficients have reached these maximum values, the performance converges.
Because it is desirable to converge to the best correlation value with a smaller number of
iterations, η(t) = 1.5 (peak performance achieved after 3 iterations) is a suitable value.
Fig 2 shows the effect of considering only word-pairs with similarity greater than θ during the
sentence similarity measurement process. Considering less similar word-pairs in the alignment
step leads to poor performance because of noisy alignments. On the other hand, high θ values
will limit the number of words that we align between two sentences, leading to feature
sparseness issues. This trade-off can be seen from the three curves shown in Fig 2.
Fig 2. Effect of selecting word-pairs with similarity greater than θ for updating the word-alignment matrix. The dashed horizontal
line shows the p < 0.05 significance level (Fisher z-transformation) for outperforming the SGNS Smax method. Peak correlation value and the
required number of iterations (t) are shown within brackets.
Fig 3. Effect of the number of top-k similar sentences selected using SimHash on the performance of the proposed method.
The dashed horizontal line shows the p < 0.05 significance level (Fisher z-transformation) for outperforming the SGNS Smax method.
Peak correlation value and the required number of iterations (t) are shown within brackets.
To study the effect of selecting top-k similar sentences using SimHash, in Fig 3 we measure
the performance of Prop Amax + Smax against different k values. We see that even when selecting
only the k = 100 most similar sentences for each sentence in the corpus out of
all sentences (ca. 6,000), the proposed method can obtain a high (0.6302) correlation
coefficient. With k = 300 similar sentences we can obtain statistically significant improvements over
SGNS Smax. This is attractive when computing sentence similarities in large corpora. For
example, even for a small corpus such as the SemEval-2015 Task 2 dataset, which has only about 6,000
sentences, the time taken for one iteration is reduced from 24 min to 1.5 min by using k = 100.
To demonstrate the effect of the different initial word embeddings, we initialize using
random vectors and publicly available pre-trained word embeddings: 300-dimensional SGNS
vectors (https://code.google.com/archive/p/word2vec/) for 3 million words, and 50-, 100-, 200- and
300-dimensional GloVe vectors (http://nlp.stanford.edu/projects/glove/) for 400 thousand words.
As shown in Fig 4, our proposed method can significantly improve any initial word similarity
by iterative updating. The better performance of SGNS over GloVe can be explained by the
larger vocabulary covered by SGNS.
Fig 4. Effect of the different initial word embeddings on the performance of the proposed method. The dashed horizontal line
shows the p < 0.05 significance level (Fisher z-transformation) for outperforming the SGNS Smax method. Peak correlation value and the required
number of iterations (t) are shown within brackets.
Sentence similarity complement
We improve an existing sentence similarity measure by combining it with the proposed
method. Specifically, we improve the Word Mover's Distance [ ], a sentence similarity measure
based on the dissimilarity between words.
Table 2 compares the different word dissimilarity measures for the Word Mover's Distance.
The Euclidean baseline uses the Euclidean distance ||xi − yj|| between word xi and word
yj in the SGNS embeddings. The Prop dissimilarity measure is calculated from our updated word
similarity as 1 − ϕ(t)(xi, yj). From Table 2, we can see that the Prop method calculated using our
updated word similarity improves on the Word Mover's Distance [ ] calculated using Euclidean
distance. We confirmed the improvement in performance even on a small dataset (QA)
consisting of only 375 sentence pairs.
We proposed an unsupervised method to measure the similarity between two sentences which
updates both word and sentence similarity scores in an iterative manner, making multiple
passes over the entire corpus. Experimental results showed the effectiveness of the proposed
iterative approach for measuring sentence semantic similarity. In the future, we plan to apply the
proposed method to large-scale paraphrase identification, where we must detect similar
sentence pairs among a potentially large number of dissimilar sentence pairs.
Data curation: TK.
Formal analysis: TK.
Funding acquisition: TK.
Methodology: DB YY KK.
Project administration: DB.
Supervision: YY KK.
Writing – original draft: TK.
Writing – review & editing: DB.
1. Salton G, Buckley C. Introduction to Modern Information Retrieval. McGraw-Hill Book Company; 1983.
2. Kim Y. Convolutional Neural Networks for Sentence Classification. In: Proc. of EMNLP; 2014. p. 1746–1751.
3. Yogatama D, Smith NA. Making the Most of Bag of Words: Sentence Regularization with Alternating Direction Method of Multipliers. In: Proc. of ICML; 2014.
4. Zanzotto FM, Dell'Arciprete L. Efficient kernels for sentence pair classification. In: Proc. of EMNLP; 2009. p. 91–100.
5. Agirre E, Cer D, Diab M, Gonzalez-Agirre A, Guo W. *SEM 2013 shared task: Semantic Textual Similarity. In: Proc. of *SEM; 2013. p. 32–43.
6. Tsatsaronis G, Varlamis I, Vazirgiannis M. Text Relatedness Based on a Word Thesaurus. J Artif Int Res. 2010;37(1):1–40.
7. Kenter T, de Rijke M. Short Text Similarity with Word Embeddings. In: Proc. of CIKM; 2015. p. 1411–1420.
8. Le Q, Mikolov T. Distributed Representations of Sentences and Documents. In: Proc. of ICML; 2014. p. 1188–1196.
9. Yogatama D, Smith NA. Linguistic Structured Sparsity in Text Categorization. In: Proc. of ACL; 2014.
10. Hu B, Lu Z, Li H, Chen Q. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In: Proc. of NIPS; 2014.
11. Ji Y, Eisenstein J. Discriminative Improvements to Distributional Sentence Similarity. In: Proc. of EMNLP; 2013. p. 891–896.
12. Guo W, Diab M. Modeling Sentences in the Latent Space. In: Proc. of ACL; 2012. p. 864–872.
13. Ravichandran D, Pantel P, Hovy E. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering. In: Proc. of ACL; 2005. p. 622–629.
14. Socher R, Huang E, Pennington J, Ng A, Manning C. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In: Proc. of NIPS; 2011.
15. Cheng J, Kartsaklis D. Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning. In: Proc. of EMNLP; 2015. p. 1531–1542.
16. Dagan I, Glickman O, Magnini B. The PASCAL Recognising Textual Entailment Challenge. In: Proc. of MLCW; 2006. p. 177–190.
17. Vilariño D, Pinto D, Tovar M, León S, Castillo E. BUAP: Lexical and Semantic Similarity for Cross-lingual Textual Entailment. In: Proc. of *SEM; 2012. p. 706–709.
18. Yokote K, Bollegala D, Ishizuka M. Similarity is not Entailment – Jointly Learning Similarity Transformations for Textual Entailment. In: Proc. of AAAI; 2012. p. 1720–1726.
19. Coster W, Kauchak D. Simple English Wikipedia: A New Text Simplification Task. In: Proc. of ACL; 2011. p. 665–669.
20. Agirre E, Banea C, Cardie C, Cer D, Diab M, Gonzalez-Agirre A, et al. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In: Proc. of SemEval; 2015. p. 252–263.
34. Niwa Y, Nitta Y. Co-occurrence Vectors from Corpora vs. Distance Vectors from Dictionaries. In: Proc. of COLING; 1994. p. 304–309.
21. Liu Y , Sun C , Lin L , Zhao Y , Wang X . Computing Semantic Text Similarity Using Rich Features . In: Proc. of PACLIC ; 2015 . p. 44 ± 52 .
22. Song Y , Roth D. Unsupervised Sparse Vector Densification for Short Text Similarity . In: Proc. of NAACL-HLT ; 2015 . p. 1275 ± 1280 .
23. Sultan MA , Bethard S , Sumner T. DLS @ CU: Sentence Similarity from Word Alignment and Semantic Vector Composition . In: Proc. of SemEval; 2015 . p. 148 ± 153 .
24. HaÈnig C , Remus R , de la Puente X. ExB Themis: Extensive Feature Extraction from Word Alignments for Semantic Textual Similarity . In: Proc. of SemEval; 2015 . p. 264 ± 268 .
25. Han L , Martineau J , Cheng D , Thomas C. Samsung: Align-and-Differentiate Approach to Semantic Textual Similarity . In: Proc. of SemEval; 2015 . p. 172 ± 177 .
26. Mikolov T , Chen K , Corrado G , Dean J . Efficient Estimation of Word Representations in Vector Space . In: Proc. of ICLR ; 2013 . p. 1 ± 12 .
27. Ganitkevitch J , Van Durme B , Callison-Burch C . PPDB: The Paraphrase Database . In: Proc. of NAACL ; 2013 . p. 758 ± 764 .
28. Mikolov T , Sutskever I , Chen K , Corrado GS , Dean J . Distributed Representations of Words and Phrases and their Compositionality . In: Proc. of NIPS ; 2013 . p. 3111 ± 3119 .
29. Miller GA . WordNet: A Lexical Database for English . Communications of the ACM . 1995 ; 38 ( 11 ): 39 ± 41 . https://doi.org/10.1145/219717.219748
30. Speer R , Havasi C . Representing General Relational Knowledge in ConceptNet 5 . In: Proc. of LREC; 2012 . p. 3679 ± 3686 .
31. Kiros R , Zhu Y , Salakhutdinov R , Zemel RS , Torralba A , Urtasun R , et al. Skip-Thought Vectors . In: Proc. of Advances in Neural Information Processing Systems (NIPS); 2015 . p. 3276 ± 3284 .
32. Hill F , Cho K , Korhonen A . Learning Disributed Representations of Sentences from Unlabelled Data . In: Proc. of NAACL-HLT ; 2016 . p. 1367 ± 1377 .
33. Kuhn HW . The Hungarian Method for the assignment problem . Naval Research Logistics Quarterly . 1955 ; 2 : 83 ± 97 . https://doi.org/10.1002/nav.3800020109
35. Turney P , Pantel P . From frequency to meaning: Vector space models of semantics . Journal of artificial intelligence research . 2010 ; 37 ( 1 ): 141 ± 188 .
36. Ross S , Gordon GJ , Bagnell D. A reduction of imitation learning and structured prediction to no-regret online learning . In: Proc. of ICML ; 2011 . p. 627 ± 635 .
37. Kusner MJ , Sun Y , Kolkin NI , Weinberger KQ . From Word Embeddings To Document Distances . In: Proc. of ICML ; 2015 .