Suppose you have two Excel files corresponding to two collections of bibliographic records such that each row contains a full bibliographic record and columns contain metadata types, including a column containing a unique identifier for each record.
Suppose you want to know, for each record of the first collection, whether there are any similar records in the second collection and, if so, how similar they are. I will describe three methods to do this. One uses Python and SalVe – a text-mining tool based on Lucene able to compare documents for word similarity; the second uses only Python; the third uses a user-defined function in Excel.
1. Comparing 16th century work titles fuzzily
Titles of works in 16th century are funny. Even when we are talking about the same work, the title might differ because e.g. a publisher decided to use genitive in the Latin title of a work, while another publisher has used the title in the nominative or in the accusative. Also our list might contain titles in several different languages, or employing varying orthographies. This means that we would need a method that works on multi-language texts, and on mono-linguistic texts with varying spelling conventions.
2. First Method
This method extracts records as separate .txt files and compares them for similarities word per word. It is only appropriate if our list is in one graphically standardized language for which we have a good stemmer. Note that Latin stemmers aren’t that great or commonly used – most of all, just SalVe doesn’t speak Latin yet. (our best bet seems this.) If this is not your case, check the other methods.
First we need to pull the data out of the Excel files in such a way that SalVe can compare the records. To do this, we need to extract the text corresponding to a single record from each row of the Excel file into a separate .txt file. The .txt files should be named in an insightful way, if possible by using the record identifier, should you have a unique identifier for your records that is insightful for you.
The following code prints the rows from a csv file (thanks to Stackoverflow). Beware of: (1) indents (2) upper-case.
import csv with open("myfile.csv", "rU") as f: reader = csv.reader(f, delimiter="\t") for row in reader: print row
Caveats: (i) the csv file needs to be in the right format: make sure it is saved as CSV (MS-DOS Comma-Separated); (ii) note that the code at line 3 is opening the file in a universal-newline mode (“rU”), not e.g. in binary mode (“rb”); this enables the code to run also when the csv contains non-ASCII character (again, thanks, Stackoverflow); (iii) I think the delimiter is an error, but never mind, and you can check escape sequences here; (iv) the csv.reader step (see csv module) lets you ‘see’ the csv file you import as having lines. This is important because you will need to specify iteration on lines with delimiters in them (such as commas) and their elements (as opposed to operating on characters on strings – see possible errors here for wrong iteration or here for single characters comparison – or tsv text, see possible useful tsv codes here and here.)
3. Second Method
Another option is to compare the strings that form the title character by character and get a percentage out of this comparison directly (get the Levenshtein distance for the closest titles). This solution involves installing the Levenshtein and the Distance module, which can be done using PIP (you have PIP if you followed the instructions linked above) as follows (thanks to Stackoverflow again):
pip install python-levenshtein pip install distance
(you will be advised to upgrade). Let’s see some preliminary steps. Our toy csv file (frutta.csv) contains two columns and four rows. This code prints all four lines of the csv:
import csv with open("py/frutta.csv", "rU") as f: csvreader = csv.reader(f) for line in csvreader: print(line)
The output is
We now want to compute the Levenshtein distance between the first word of each line (in other words, if you created your .csv file from a spreadsheet, you want to compute the Levenshtein distance between A1 and B1, A2 and B2, A3 and B3, and A4 and B4):
import Levenshtein import csv with open("py/frutta.csv", "rU") as f: csvreader = csv.reader(f) for line in csvreader: lev = Levenshtein.ratio(line.lower(), line.lower()) print(lev)
The first line imports the Levenshtein module. We compute the Levenshtein distance between the first unit of our line (line) and the second unit of our line (line) – basically cell A1 and B1, taking their lower case versions
The following gets you Levenshtein and some other comparison methods (difflib, sorensen, jaccard – see also here and here, and it might be useful to know commands such as strip to eliminate newline characters when Reading the file. See here and here):
import difflib, Levenshtein, distance import csv with open("py/frutta.csv","rU") as f: csvreader = csv.reader(f) for line in csvreader: lev = Levenshtein.ratio(line, line) diffl = difflib.SequenceMatcher(None, line, line).ratio() sor = 1 - distance.sorensen(line, line) jac = 1 - distance.jaccard(line, line) print(diffl, lev, sor, jac)
However, for what we need Levenshtein will do. Note that this method only works if you compare strings in the same row. If you want to compare a cell with a cell range, you should look at the third method.
4. Third method
For this method, we use a user-defined formula in Excel to calculate the Levenshtein distance between a cell and a range of cells, so this lets you compare cells wherever they are in a column. Instructions here (but start here if you have never done anything like that) – and check some improvements here.
Note that I have tested this only on Excel for Windows and the result of calculating the similarities of more than 1000 cells each with a range of about 600 is really slow, so improving the second method might be more promising.
5. Other methods
Another method, based on tokenization of strings using Jaccard ratios, relies on an add-on for fuzzy comparison in Excel for Windows. It did not give satisfactory results, among other things because the tokenization used is unspecified and it is arguably too coarse to yield the analyses we want. See also the discussions here and here.