English subtitles for clip: File:Wikidata Editing with OpenRefine - Part 1.webm

From Wikimedia Commons, the free media repository
1
00:00:00,000 --> 00:00:05,333
Welcome to this tutorial series on using OpenRefine to import data into Wikidata.

2
00:00:05,333 --> 00:00:06,833
My name is Antonin.

3
00:00:06,850 --> 00:00:09,674
I'm going to walk you through the entire process

4
00:00:09,674 --> 00:00:11,489
of cleaning up the dataset,

5
00:00:11,489 --> 00:00:13,468
matching it with Wikidata items,

6
00:00:13,468 --> 00:00:17,601
and uploading the information as statements on these items.

7
00:00:17,612 --> 00:00:20,133
No previous knowledge of OpenRefine is necessary to follow this tutorial,

8
00:00:20,133 --> 00:00:23,333
but some familiarity with Wikidata will help.

9
00:00:24,078 --> 00:00:26,627
All the links necessary to follow the tutorial

10
00:00:26,627 --> 00:00:28,485
can be found in the description of the video.

11
00:00:28,485 --> 00:00:30,828
So let's get started!

12
00:00:30,828 --> 00:00:35,561
OpenRefine is free software that you can download from openrefine.org.

13
00:00:35,930 --> 00:00:40,330
Once you have installed it, it runs in your browser like this.

14
00:00:40,679 --> 00:00:43,363
In this tutorial, we are going to import data

15
00:00:43,363 --> 00:00:46,496
about shooting locations of films in Paris.

16
00:00:47,592 --> 00:00:49,947
The dataset we are going to work on is available

17
00:00:49,947 --> 00:00:52,947
on the Parisian open data portal

18
00:00:53,455 --> 00:00:55,962
and we can download it as a CSV file.

19
00:00:55,962 --> 00:00:58,501
We can just copy the URL of that file

20
00:00:58,501 --> 00:01:01,501
and paste it into OpenRefine.

21
00:01:01,794 --> 00:01:04,395
We now have a preview of the table

22
00:01:04,395 --> 00:01:06,604
and we are happy with this format

23
00:01:06,604 --> 00:01:10,004
so we give a name to the project and create it.

24
00:01:13,482 --> 00:01:15,824
The first step in importing this data into Wikidata

25
00:01:15,824 --> 00:01:17,324
is to match the film names

26
00:01:17,324 --> 00:01:20,191
with the Wikidata items they correspond to.

27
00:01:20,766 --> 00:01:22,266
Click on the column that contains the names

28
00:01:22,266 --> 00:01:23,600
of the entities that you want to match

29
00:01:23,600 --> 00:01:26,667
and choose "Reconcile" -> "Start reconciling".

30
00:01:27,200 --> 00:01:30,200
Pick the Wikidata reconciliation service.

31
00:01:31,150 --> 00:01:33,100
OpenRefine tries to guess

32
00:01:33,100 --> 00:01:37,100
the type of entity these names correspond to.

33
00:01:37,100 --> 00:01:37,688
In our case,

34
00:01:37,688 --> 00:01:40,688
its best guess is "film"

35
00:01:40,953 --> 00:01:43,638
which looks appropriate.

36
00:01:43,638 --> 00:01:46,572
OpenRefine will only consider instances of that class

37
00:01:46,572 --> 00:01:48,488
or subclasses of it

38
00:01:48,488 --> 00:01:51,472
when looking for matches.

39
00:01:51,472 --> 00:01:54,302
OpenRefine also lets you match on other properties

40
00:01:54,302 --> 00:01:56,993
stored in other columns of the table.

41
00:01:56,993 --> 00:01:59,785
In our case, the "Réalisateur" column

42
00:01:59,785 --> 00:02:02,145
contains the name of the film director,

43
00:02:02,145 --> 00:02:05,021
which is very useful for disambiguation.

44
00:02:05,021 --> 00:02:07,594
So tick that column and select

45
00:02:07,594 --> 00:02:10,114
the Wikidata property it should be matched against.

46
00:02:10,114 --> 00:02:13,066
Click "Start reconciling"

47
00:02:13,066 --> 00:02:16,066
and wait for the process to complete.
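
The reconciliation settings described above (a name, a type constraint, and an extra property taken from another column) can be pictured as a query sent to the reconciliation service. The following is a minimal sketch of such a query, assuming the JSON shape used by the Reconciliation Service API; the helper name `build_recon_query` and the director value are hypothetical, and no network request is made.

```python
import json


def build_recon_query(name, type_id, property_values):
    """Build one reconciliation query: a name, a type constraint,
    and extra property values used for disambiguation."""
    return {
        "query": name,
        "type": type_id,
        "properties": [
            {"pid": pid, "v": value} for pid, value in property_values.items()
        ],
    }


# Q11424 is the Wikidata class "film"; P57 is the property "director".
# "Example Director" is a hypothetical placeholder, not a value from the dataset.
queries = {"q0": build_recon_query("Nadia", "Q11424", {"P57": "Example Director"})}

# Serialise the batch of queries as JSON text.
payload = json.dumps(queries, ensure_ascii=False)
print(payload)
```

OpenRefine builds and sends these queries itself, so the sketch only illustrates which pieces of information the dialog collects.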

48
00:02:26,998 --> 00:02:29,153
Now that reconciliation is done,

49
00:02:29,153 --> 00:02:30,803
some names have turned into blue links

50
00:02:30,803 --> 00:02:34,270
which point to the corresponding Wikidata items.

51
00:02:34,990 --> 00:02:36,969
Others were not matched,

52
00:02:36,969 --> 00:02:39,185
for instance because the director did not match

53
00:02:39,185 --> 00:02:42,185
in the case of this "Nadia" film.

54
00:02:42,411 --> 00:02:44,042
Some other films were not matched

55
00:02:44,042 --> 00:02:47,698
because Wikidata does not know who their director is.

56
00:02:47,698 --> 00:02:49,116
If you have time,

57
00:02:49,116 --> 00:02:51,265
you can go through these unmatched cells

58
00:02:51,265 --> 00:02:53,290
and manually reconcile them.

59
00:02:53,290 --> 00:02:55,097
But you can also leave them as they are:

60
00:02:55,097 --> 00:02:58,430
these rows will just be ignored in the import.

61
00:03:00,100 --> 00:03:02,993
On the left-hand side, you can see two facets.

62
00:03:02,993 --> 00:03:04,530
These can be used to filter rows

63
00:03:04,530 --> 00:03:06,200
based on their matching status

64
00:03:06,200 --> 00:03:08,381
and matching score.

65
00:03:08,381 --> 00:03:10,896
You can select rows where matching succeeded

66
00:03:10,896 --> 00:03:13,896
by clicking on the "matched" status.
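
Conceptually, the two facets act as filters over the rows. Here is a minimal sketch of that idea, assuming each row carries a matching status and a matching score; the field names `judgment` and `score`, the helper `filter_rows`, and the sample rows are all hypothetical.

```python
# Hypothetical rows standing in for reconciled cells.
rows = [
    {"name": "Film A", "judgment": "matched", "score": 100.0},
    {"name": "Film B", "judgment": "none", "score": 42.5},
    {"name": "Film C", "judgment": "matched", "score": 71.0},
]


def filter_rows(rows, judgment=None, min_score=None):
    """Keep rows that satisfy every filter that was supplied."""
    selected = []
    for row in rows:
        if judgment is not None and row["judgment"] != judgment:
            continue
        if min_score is not None and row["score"] < min_score:
            continue
        selected.append(row)
    return selected


# Equivalent of clicking on the "matched" status in the facet.
matched = filter_rows(rows, judgment="matched")
print([row["name"] for row in matched])
```

Combining both arguments corresponds to using the status facet and the score facet at the same time.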

67
00:03:15,450 --> 00:03:17,200
It is important that you check

68
00:03:17,200 --> 00:03:19,500
the quality of these automated matches,

69
00:03:19,500 --> 00:03:21,250
and there are many ways to do this.

70
00:03:21,250 --> 00:03:23,250
In our case, the table contains

71
00:03:23,250 --> 00:03:25,000
the dates of the shootings

72
00:03:25,000 --> 00:03:26,700
so we can compare them

73
00:03:26,700 --> 00:03:28,774
to the release dates of the movies

74
00:03:28,774 --> 00:03:30,440
and check that they are consistent.

75
00:03:30,440 --> 00:03:32,855
Click on the reconciled column,

76
00:03:32,855 --> 00:03:36,000
pick "Edit column" -> "Add column from reconciled values"

77
00:03:36,000 --> 00:03:39,000
and select "publication date".

78
00:03:46,700 --> 00:03:49,050
We will now create a column

79
00:03:49,050 --> 00:03:50,650
that will contain the difference

80
00:03:50,650 --> 00:03:52,150
between the publication date

81
00:03:52,150 --> 00:03:54,350
and the end-of-shooting date.

82
00:03:57,278 --> 00:04:01,211
Pick "Edit column" -> "Add column based on this column".

83
00:04:02,498 --> 00:04:04,800
The language used for the expression here

84
00:04:04,800 --> 00:04:06,750
is called GREL.

85
00:04:06,750 --> 00:04:08,550
It is a simple language

86
00:04:08,550 --> 00:04:10,150
that you can learn on OpenRefine's wiki.

87
00:04:10,150 --> 00:04:12,065
You can also select other languages

88
00:04:12,065 --> 00:04:14,398
if you are more familiar with them.

89
00:04:14,750 --> 00:04:17,588
This expression will compute the difference

90
00:04:17,588 --> 00:04:19,150
between the two dates

91
00:04:19,150 --> 00:04:22,159
as a number of days.
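
For readers more familiar with Python than GREL, the same computation can be sketched as follows. This is not the expression used in the video; the function name `days_between` and the sample dates are hypothetical.

```python
from datetime import date


def days_between(publication, shooting_end):
    """Return publication date minus end-of-shooting date, in whole days.

    The result is positive when the film was released after shooting ended.
    """
    return (publication - shooting_end).days


# Hypothetical dates: shooting ends in June 2016, release in March 2017.
difference = days_between(date(2017, 3, 1), date(2016, 6, 15))
print(difference)
```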

92
00:04:22,159 --> 00:04:24,196
Give the new column a name

93
00:04:24,196 --> 00:04:27,196
and create the column.

94
00:04:31,079 --> 00:04:32,579
We can now create a numeric facet

95
00:04:32,579 --> 00:04:33,682
on our new column

96
00:04:33,682 --> 00:04:37,149
and inspect the distribution of the differences.

97
00:04:39,704 --> 00:04:42,124
Some of these differences are negative

98
00:04:42,124 --> 00:04:44,700
which suggests that we might have matched cells

99
00:04:44,700 --> 00:04:48,443
to movies that were released before the shooting.

100
00:04:48,443 --> 00:04:52,200
In fact, that's just because the release dates for them

101
00:04:52,200 --> 00:04:55,952
have only year precision on Wikidata.
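
A small worked example shows how year precision can produce a negative difference. It assumes that a release date known only to the year is represented as 1 January of that year once converted to a full date; the dates and the helper `consistent_at_year_precision` are hypothetical.

```python
from datetime import date

# Hypothetical end-of-shooting date from the dataset.
shooting_end = date(2016, 6, 15)

# Release year 2016 with year precision, represented as 1 January (assumption).
release_year_only = date(2016, 1, 1)

# The day-level difference is negative although the years agree.
difference = (release_year_only - shooting_end).days
print(difference)


def consistent_at_year_precision(release, shooting_end):
    """Compare only the years, which is all a year-precision date supports."""
    return release.year >= shooting_end.year


print(consistent_at_year_precision(release_year_only, shooting_end))
```

Comparing only the years is one way to confirm that such rows are not mismatches.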

102
00:04:57,041 --> 00:04:59,229
The maximum difference is less than two years

103
00:04:59,229 --> 00:05:00,643
which also makes sense,

104
00:05:00,643 --> 00:05:02,020
so we are confident

105
00:05:02,020 --> 00:05:05,020
that these matches are reliable.

106
00:05:08,515 --> 00:05:11,258
This is the end of the first part of the tutorial.

107
00:05:11,258 --> 00:05:13,315
In the next video, we are going to reconcile

108
00:05:13,315 --> 00:05:16,315
the locations of the shootings.