Category Archives: Computer stuff

Comparing 16th century bibliographic records fuzzily

Suppose you have two Excel files corresponding to two collections of bibliographic records such that each row contains a full bibliographic record and columns contain metadata types, including a column containing a unique identifier for each record.

Suppose you want to know, for each record of the first collection, whether there are any similar records in the second collection and, if so, how similar they are. I will describe three methods to do this. One uses Python and SalVe – a text-mining tool based on Lucene able to compare documents for word similarity; the second uses only Python; the third uses a user-defined function in Excel.

1. Comparing 16th century work titles fuzzily

Titles of works in 16th century are funny. Even when we are talking about the same work, the title might differ because e.g. a publisher decided to use genitive in the Latin title of a work, while another publisher has used the title in the nominative or in the accusative. Also our list might contain titles in several different languages, or employing varying orthographies. This means that we would need a method that works on multi-language texts, and on mono-linguistic texts with varying spelling conventions.

2. First Method

This method extracts records as separate .txt files and compares them for similarities word per word. It is only appropriate if our list is in one graphically standardized language for which we have a good stemmer. Note that Latin stemmers aren’t that great or commonly used – most of all, just SalVe doesn’t speak Latin yet. (our best bet seems this.) If this is not your case, check the other methods.

First we need to pull the data out of the Excel files in such a way that SalVe can compare the records. To do this, we need to extract the text corresponding to a single record from each row of the Excel file into a separate .txt file. The .txt files should be named in an insightful way, if possible by using the record identifier, should you have a unique identifier for your records that is insightful for you.

We do this with Python (you can get the basics of Python at Codecademy). If you are on a mac, you first need to setup Python coding following these easy steps (you’ll be done in minutes.)

The following code prints the rows from a csv file (thanks to Stackoverflow). Beware of: (1) indents (2) upper-case.

import csv

with open("myfile.csv", "rU") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        print row

Caveats: (i) the csv file needs to be in the right format: make sure it is saved as CSV (MS-DOS Comma-Separated); (ii) note that the code at line 3 is opening the file in a universal-newline mode (“rU”), not e.g. in binary mode (“rb”); this enables the code to run also when the csv contains non-ASCII character (again, thanks, Stackoverflow); (iii) I think the delimiter is an error, but never mind, and you can check escape sequences here; (iv) the csv.reader step (see csv module) lets you ‘see’ the csv file you import as having lines. This is important because you will need to specify iteration on lines with delimiters in them (such as commas) and their elements (as opposed to operating on characters on strings – see possible errors here for wrong iteration or here for single characters comparison – or tsv text, see possible useful tsv codes here and here.)

3. Second Method

Another option is to compare the strings that form the title character by character and get a percentage out of this comparison directly (get the Levenshtein distance for the closest titles). This solution involves installing the Levenshtein and the Distance module, which can be done using PIP (you have PIP if you followed the instructions linked above) as follows (thanks to Stackoverflow again):

pip install python-levenshtein
pip install distance

(you will be advised to upgrade). Let’s see some preliminary steps. Our toy csv file (frutta.csv) contains two columns and four rows. This code prints all four lines of the csv:

import csv

with open("py/frutta.csv", "rU") as f:
csvreader = csv.reader(f)
for line in csvreader:
print(line[0])

The output is

appel;appel
mango;appel
mango;monga
mango;mango

We now want to compute the Levenshtein distance between the first word of each line (in other words, if you created  your .csv file from a spreadsheet, you want to compute the Levenshtein distance between A1 and B1, A2 and B2, A3 and B3, and A4 and B4):

import Levenshtein
import csv

with open("py/frutta.csv", "rU") as f:
 csvreader = csv.reader(f)
 for line in csvreader:
 
 lev = Levenshtein.ratio(line[0].lower(), line[1].lower())
 print(lev)

The first line imports the Levenshtein module. We compute the Levenshtein distance between the first unit of our line (line[0]) and the second unit of our line (line[1]) – basically cell A1 and B1, taking their lower case versions

line[0].lower()
line[1].lower()

1.0
0.2
0.6
1.0

The following gets you Levenshtein and some other comparison methods (difflib, sorensen, jaccard – see also here and here, and it might be useful to know commands such as strip to eliminate newline characters  when Reading the file. See here and here):

import difflib, Levenshtein, distance
import csv

with open("py/frutta.csv","rU") as f:

    csvreader = csv.reader(f)
    for line in csvreader:

        lev = Levenshtein.ratio(line[0], line[1])
        diffl = difflib.SequenceMatcher(None, line[0], line[1]).ratio()
        sor = 1 - distance.sorensen(line[0], line[1])
        jac = 1 - distance.jaccard(line[0], line[1])

        print(diffl, lev, sor, jac)

However, for what we need Levenshtein will do. Note that this method only works if you compare strings in the same row. If you want to compare a cell with a cell range, you should look at the third method.

4. Third method

For this method, we use a user-defined formula in Excel to calculate the Levenshtein distance between a cell and a range of cells, so this lets you compare cells wherever they are in a column. Instructions here (but start here if you have never done anything like that) – and check some improvements here.

Note that I have tested this only on Excel for Windows and the result of calculating the similarities of more than 1000 cells each with a range of about 600 is really slow, so improving the second method might be more promising.

5. Other methods

Another method, based on tokenization of strings using Jaccard ratios, relies on an add-on for fuzzy comparison in Excel for Windows. It did not give satisfactory results, among other things because the tokenization used is unspecified and it is arguably too coarse to yield the analyses we want. See also the discussions here and here.

Advertisements

When your USB hub corrupts your external hard drive (Mac OS)

Your external hard drive had made a weird sound and has become invisible in Finder.

Take these steps.

NB: This applies to HDs formatted as HFS+ (Mac Extended Journalled). Otherwise, maybe try check this.

(0) Remove all USB devices and do a restart, cold restart and a restart in safe mode. If none work, go to (1).

(1) Connect the External HD, check that the lights are on (if they are) and launch Disk Utility: does the disk appear?

Yes: Great! Choose Repair. Also, maybe check this. If Repair is grayed out, this.
No: Go to (2).
I don’t know since it keeps saying ‘loading disks..’: Wait 5 minutes. If it keeps not opening, go to (2).

(2) Reboot in Safe Mode (CMD + R at startup) and go to Disk Utility. Does the disk appear?

Yes: Great! Choose Repair, and maybe check thisthis and this.
No: Go to (3).
I don’t know since it keeps saying ‘loading disks..’: Wait 5 minutes. If it keeps not opening, go to (3).

(3) Launch Terminal. Also, check this.
Type: diskutil list
Enter. Does your disk appear in the list?
Yes: Great! Jot down the BSD name (something like ‘disk2s1’.)
No: Go to (4).

(4)  Check System Report (About This Mac > System Report > USB). Does the disk appear?

Yes: Great! Jot down the BSD name (something like ‘disk2s1’.)
No: Go to (5). Also check this.

(5) Reboot in Single-User Mode (CMD + S). You might have to turn off FileVault.

Type: sudo fsck_hfs -f /dev/disk2s1
(NB: substitute for disk2s1 the BSD name of your volume!)

Enter.

This will take quite a while depending on how big your disk is.
Also check thisthisthisthis and this.

(6) Use DiskWarrior. Also check this. I did not try (5) as I manage at (4).