English subtitles for clip: File:Wikidata Editing with OpenRefine - Part 2.webm
Jump to navigation
Jump to search
1 00:00:00,000 --> 00:00:02,367 Welcome back to this tutorial 2 00:00:02,367 --> 00:00:04,700 on using OpenRefine to import data 3 00:00:04,700 --> 00:00:06,400 into Wikidata. 4 00:00:06,400 --> 00:00:08,200 In the previous video, 5 00:00:08,200 --> 00:00:10,750 we have matched films against Wikidata items 6 00:00:10,750 --> 00:00:13,100 and checked the quality of these matches. 7 00:00:13,100 --> 00:00:15,050 For each of these films, 8 00:00:15,050 --> 00:00:17,300 we want to add the filming locations 9 00:00:17,300 --> 00:00:19,550 to the Wikidata items. 10 00:00:19,550 --> 00:00:22,150 This requires reconciling the locations as well. 11 00:00:22,150 --> 00:00:25,107 So let's do this. 12 00:00:25,107 --> 00:00:27,104 The locations we have in this dataset 13 00:00:27,104 --> 00:00:29,726 are given as street addresses. 14 00:00:29,726 --> 00:00:31,974 these particular addresses are unlikely 15 00:00:31,974 --> 00:00:33,939 to have a corresponding Wikidata item, 16 00:00:33,939 --> 00:00:36,939 but the streets they are in often have one. 17 00:00:38,665 --> 00:00:40,608 So we are first going to extract 18 00:00:40,608 --> 00:00:41,871 the street names from the addresses. 19 00:00:41,871 --> 00:00:44,000 We use a regular expression 20 00:00:44,000 --> 00:00:47,467 to remove any number at the beginning of the string. 21 00:00:51,504 --> 00:00:54,500 In the preview window, we notice that 22 00:00:55,089 --> 00:00:55,400 our regular expression did not catch 23 00:00:55,400 --> 00:00:57,800 the leading spaces. 24 00:00:57,800 --> 00:01:00,500 This is an indication that these strings 25 00:01:00,500 --> 00:01:03,050 contain non-standard space characters. 26 00:01:03,050 --> 00:01:04,550 They are likely to cause problems 27 00:01:04,550 --> 00:01:07,445 during reconciliation with Wikidata. 28 00:01:07,445 --> 00:01:08,892 So let's just copy these weird characters 29 00:01:08,892 --> 00:01:10,700 and get rid of them 30 00:01:10,700 --> 00:01:13,967 with a first replace function. 31 00:01:15,479 --> 00:01:16,979 The first call to replace 32 00:01:16,979 --> 00:01:18,646 cleans up the whitespace; 33 00:01:19,302 --> 00:01:22,302 the second removes the street numbers. 34 00:01:30,550 --> 00:01:32,538 Pick a name for the column 35 00:01:32,750 --> 00:01:34,600 and create it. 36 00:01:37,050 --> 00:01:40,464 We can now reconcile these streets to Wikidata. 37 00:01:40,464 --> 00:01:43,531 Again, pick "Reconcile" -> "Start reconciling" 38 00:01:43,548 --> 00:01:45,525 and choose the Wikidata service. 39 00:01:48,250 --> 00:01:51,638 In this case, the "street" type is too narrow. 40 00:01:52,071 --> 00:01:54,735 Some locations are parks or bridges 41 00:01:54,735 --> 00:01:57,735 so we manually pick a broader type. 42 00:01:57,936 --> 00:01:59,859 Let's see what other information we could use 43 00:01:59,859 --> 00:02:02,200 to improve the matches. 44 00:02:02,200 --> 00:02:04,754 The postcode looks like a good fit 45 00:02:04,754 --> 00:02:07,300 but unfortunately postcodes are rarely 46 00:02:07,300 --> 00:02:10,300 added on street items. 47 00:02:10,600 --> 00:02:13,000 The last column contains the geographical 48 00:02:13,000 --> 00:02:14,535 coordinates of the locations, 49 00:02:14,535 --> 00:02:17,535 expressed as latitude, comma, longitude. 50 00:02:18,912 --> 00:02:22,379 We can match that to the coordinates of the streets. 51 00:02:22,688 --> 00:02:25,218 The closer these geographical points will be, 52 00:02:25,218 --> 00:02:28,218 the higher the matching score will get. 53 00:02:35,400 --> 00:02:37,949 Once reconciliation is done, 54 00:02:37,949 --> 00:02:39,902 we can inspect the matches. 55 00:02:39,902 --> 00:02:41,644 In this case, we can see that two streets 56 00:02:41,644 --> 00:02:43,034 with the same name 57 00:02:43,034 --> 00:02:45,073 got different matching scores, 58 00:02:45,073 --> 00:02:48,073 thanks to the matching on coordinates. 59 00:02:48,111 --> 00:02:51,111 The first one is the correct one. 60 00:02:52,600 --> 00:02:55,209 This cell was not matched automatically 61 00:02:55,209 --> 00:02:56,550 because the gap between the two scores 62 00:02:56,550 --> 00:02:58,300 is not big enough. 63 00:02:58,300 --> 00:03:00,550 I suspect there are more cases like this, 64 00:03:00,550 --> 00:03:02,900 so I am just going to filter the cells 65 00:03:02,900 --> 00:03:06,100 which were not matched 66 00:03:06,100 --> 00:03:10,431 but whose best candidate score is very high. 67 00:03:12,050 --> 00:03:14,500 I'm also going to add a facet 68 00:03:14,500 --> 00:03:16,238 which computes the string similarity 69 00:03:16,238 --> 00:03:17,819 between the cell content 70 00:03:17,819 --> 00:03:20,202 and the name of the best match 71 00:03:20,202 --> 00:03:23,202 and restrict to the high quality matches. 72 00:03:24,500 --> 00:03:26,127 Let's review these filtered rows 73 00:03:26,127 --> 00:03:27,860 and their best candidates. 74 00:03:48,209 --> 00:03:50,679 All these candidates are correct. 75 00:03:50,679 --> 00:03:52,469 So click "Reconcile" -> "Actions" 76 00:03:52,469 --> 00:03:55,469 -> "Match each cell to its best candidate" 77 00:03:57,015 --> 00:03:58,931 Obviously this operation should be used with care 78 00:03:58,931 --> 00:04:01,931 because it can introduce false positives. 79 00:04:03,650 --> 00:04:05,200 Let's now check the quality 80 00:04:05,200 --> 00:04:07,671 of the matched cells. 81 00:04:08,079 --> 00:04:09,582 For instance, 82 00:04:09,582 --> 00:04:11,100 we can fetch the administrative location 83 00:04:11,100 --> 00:04:15,669 of these streets. 84 00:04:23,676 --> 00:04:24,918 Once these locations are fetched, 85 00:04:24,918 --> 00:04:27,514 we can create a text facet on this column 86 00:04:27,514 --> 00:04:29,168 and sort the facet 87 00:04:29,168 --> 00:04:32,168 by decreasing number of occurrences. 88 00:04:39,550 --> 00:04:42,100 This gives us a broad overview 89 00:04:42,100 --> 00:04:47,241 of the most frequent values. 90 00:04:47,241 --> 00:04:47,800 We can review this list. 91 00:04:47,800 --> 00:04:50,050 All these locations are neighborhoods in Paris, 92 00:04:50,050 --> 00:04:52,947 which is consistent with the dataset. 93 00:05:03,250 --> 00:05:06,551 This is the end of the second part of this tutorial. 94 00:05:06,551 --> 00:05:08,500 In the next video, we are going to 95 00:05:08,500 --> 00:05:10,965 transform our table into statements 96 00:05:10,965 --> 00:05:12,832 and upload them to Wikidata.