Data talk:COVID-19 cases in Santa Clara County, California.tab

Contributing

To update this table based on Santa Clara County Public Health's coronavirus data dashboard, first install jq, then run this Bash script. The URLs come from inspecting the main Power BI dashboard's network traffic, then trimming the requests down to just what's needed for the table here.

 – Minh Nguyễn 💬 18:31, 3 May 2020 (UTC)

ADDRESSED (as best as we can): Mid-March discrepancy in the data

There is a discrepancy between the Timeline and this table. For example, the timeline says March 20 has 196 cases but the table says March 19 has 196 cases. From my notes on my brief look the other day, the error occurred somewhere between March 12 and 14. I haven't had a chance to fix it yet. – Preceding unsigned comment added by Michaelcomella (talk • contribs) 16:21, 21 March 2020 (UTC)

I discovered the cause and several other possible inaccuracies:
  • the case count in the table for 03-13 (91) is actually the case count for 03-14 (source). I haven't found a source for the 13th.
  • "Presumed community transmission" on 03-13 in the table is 40. However, that is the count in source above for 03-14. The "Presumed community transmission" data in the table for 03-14 is 52 but that does not align with the source I found.
  • "Close contact" is 28 in the table but 27 in the source.
I'm concerned about the accuracy of the data in the table. Furthermore, I wonder if there are sufficient sources to continue to update this table accurately for the duration of the pandemic: will all of these different categories of numbers be reported accurately? I wonder if it'd be beneficial to consolidate to a smaller number of facets so that this table is easier to maintain accurately, e.g. just the medical cases chart in the main article. I'm happy to be proven wrong. :) Michaelcomella (talk) 00:12, 22 March 2020 (UTC)
I added the missing data for 2020-03-13 and shifted the remaining entries down (diff). However, I still feel there are inaccuracies in the table: for example, the number of deaths on 2020-03-13 is 1 but on 2020-03-14 it's 0. Michaelcomella (talk) 14:26, 23 March 2020 (UTC)
I did another pass over the data, which resulted in shifting things around again. Note that the health department updates figures as of a certain day, but the Wayback Machine may not pick up that update until the following day UTC. Adding to the confusion, sometimes the health department has forgotten to update the timestamp when updating their figures, and often the Mercury News reports a given day's numbers the following day. After all that, there are still a couple of gaps in the data, unfortunately. – Minh Nguyễn 💬 07:32, 24 March 2020 (UTC)
Thanks for doing that! The source citations on each row are a good idea. Given the discrepancy between data sources, we should probably try to prioritize one source over the other going forward. Presumably, this should be the Santa Clara County Public Health website. Should this be checked daily when it updates (i.e. we can't necessarily cite it) or should we take it from archive.org? I wish these websites presented their historical data. :( Michaelcomella (talk) 15:59, 24 March 2020 (UTC)
Mxn, do you feel your pass over the data is enough to mark this issue as "solved"? Michaelcomella (talk) 15:59, 24 March 2020 (UTC)
@Michaelcomella: It isn't ideal how the department is updating their statistics page so haphazardly, but this is probably the best we can do at the moment. Presumably in the future, we'll have a better chance at getting an actual chart from the department. In the meantime, it is appropriate to prefer the archived copies of the department's webpage. The Wayback Machine has been indexing this page with remarkable frequency over the past couple weeks, as part of the Internet Archive's efforts to preserve a historical record of this public health event. At this point, I'm only using the Mercury News to detect when the department updated the page without updating the timestamp, for the couple times where the Wayback Machine missed an update. – Minh Nguyễn 💬 08:12, 26 March 2020 (UTC)
As for the latter columns indicating causes, I think we should use -1 instead of 0 as a placeholder for missing data. – Minh Nguyễn 💬 07:33, 24 March 2020 (UTC)
I noticed we can use an empty string (i.e. ""), which has to be handled differently than -1 so folks ingesting the data are less likely to run into errors. I'd probably prefer that. Michaelcomella (talk) 15:59, 24 March 2020 (UTC)
That seems reasonable to me. – Minh Nguyễn 💬 08:12, 26 March 2020 (UTC)
MediaWiki doesn't allow the empty string for numeric fields, but it does allow null, so I've replaced the zeros with nulls wherever they looked like missing data. – Minh Nguyễn 💬 22:38, 21 April 2020 (UTC)
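
For anyone ingesting the table, here is a minimal Python sketch of the null convention described in this thread. It assumes rows are stored as JSON arrays in column order with the date first; the helper names (to_rows, reported) and the sample values are made up for illustration and are not real county figures.

```python
import json

# Hypothetical subset of the table's columns, date first.
FIELDS = ["date", "deaths", "closeContact"]

def to_rows(records, fields=FIELDS):
    """Turn dicts into row lists; a field that was not reported becomes None."""
    return [[record.get(name) for name in fields] for record in records]

def reported(rows, column, fields=FIELDS):
    """Return (date, value) pairs for cells that actually hold a figure."""
    i = fields.index(column)
    # Compare against None explicitly so a genuine 0 is kept.
    return [(row[0], row[i]) for row in rows if row[i] is not None]

# Invented sample data: the second day has no closeContact figure.
records = [
    {"date": "2020-01-01", "deaths": 0, "closeContact": 5},
    {"date": "2020-01-02", "deaths": 1},
]

rows = to_rows(records)
encoded = json.dumps(rows)  # None is serialized as JSON null
print(encoded)
print(reported(rows, "closeContact"))
```

A consumer that treated null as 0 would misread an unreported breakdown as zero cases, which is why the sketch filters on `is not None` rather than on truthiness.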

We shouldn't expect any more breakdowns on the cause of transmission, so over time, we'll have more unfilled cells than filled cells in those columns. On the bright side, it sounds like the department will be overhauling its website, hopefully with historical data:

Because most of our newly confirmed cases in Santa Clara County are likely associated with community transmission, we are no longer displaying case counts by mode of transmission.

The Public Health Department will launch a new web site tomorrow (Friday, March 27) with additional aggregate data about cases.

 – Minh Nguyễn 💬 01:19, 27 March 2020 (UTC)

Data from March 27 onward

So the Public Health Department just replaced the page we've been relying on with a Microsoft Power BI dashboard, which is definitely informative as far as understanding where we are in the curve. But can anyone figure out how to get actual numbers out of these bar charts? – Minh Nguyễn 💬 03:12, 28 March 2020 (UTC)

@Mxn: I'm able to see the numbers for the bar chart (though it doesn't look easy to copy and paste). I discovered two methods:
  1. I can hover my mouse over a bar in a bar graph, which will display a pop-up with the data point for the bar (e.g. number of cases).
  2. I can right click on a specific graph and click "Show as table" to get all of the data for one graph in a tabular format.
For what it's worth, I'm using Firefox on macOS. Michaelcomella (talk) 23:49, 28 March 2020 (UTC)

Death toll breakdown by day

Unfortunately, the county and CalREDIE haven't released a breakdown of deaths by day the way they've started to break down cases by day. So far, this table has been collecting the death toll as reported on each day, but we haven't been able to retroactively update past days based on deaths that are backdated. This limitation was highlighted today when the county announced a death on February 6. I don't think we can patch in individual deaths like this without the full data, because it would skew the curve, which is the main thing this chart depicts. – Minh Nguyễn 💬 05:50, 23 April 2020 (UTC)

The table now uses the death toll time series on the official dashboard, which allows us to backdate the first death to February 6. – Minh Nguyễn 💬 08:01, 23 June 2020 (UTC)
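
To illustrate why a dated time series makes backdating possible, here is a small Python sketch. It is not the script used to update this table. It assumes a hypothetical mapping from date of death to number of deaths, and the figures are invented.

```python
from datetime import date, timedelta
from itertools import accumulate

def cumulative_by_day(deaths_by_date, start, end):
    """Build the cumulative death toll for every day from start to end.

    deaths_by_date maps a date of death to the number of deaths on that
    date; days without an entry count as zero.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    daily = [deaths_by_date.get(day, 0) for day in days]
    return dict(zip(days, accumulate(daily)))

# Invented figures: two deaths known at first, then one more backdated
# to an earlier day.
known = {date(2020, 2, 8): 2}
before = cumulative_by_day(known, date(2020, 2, 5), date(2020, 2, 9))

known[date(2020, 2, 6)] = 1  # backdated death added later
after = cumulative_by_day(known, date(2020, 2, 5), date(2020, 2, 9))

for day in after:
    print(day.isoformat(), before[day], after[day])
```

Because the whole series is recomputed from dated deaths, a backdated death raises every later cumulative value consistently, instead of appearing as a jump on the day it was announced.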

Schema comparison

To help generalize parts of these workflows, here is a comparison of the county's data with the available state- and nationwide data:

Here is the current tab schema:

date
newCases
totalConfirmedCases
hospitalized
deaths
intlTravelAssoc
closeContact
presumedCommunityTrans
undatedCases
source

Here is the CovidTracking schema:

Date
Positive tests
Negative tests
Pending tests
Hospitalized – Currently
Hospitalized – Cumulative
In ICU – Currently
In ICU – Cumulative
On Ventilator – Currently
On Ventilator – Cumulative
Recovered
Deaths

There is also extended demographic data: racial/ethnic breakdowns of cases and deaths.

Sources: there is a list of 1-4 source links per state, and a matrix with one source per cell [changing very slowly over time; but different bits of data come from different sources]. --SJ+ 22:42, 9 May 2020 (UTC)

@Sj: I agree in principle with aligning these tables to a well-established schema. Unfortunately, I think the county's own reporting methodology will be a limiting factor. For example, they only report cumulative data on confirmed infections (equivalent to positive tests, presumably) and deaths; everything else is a current datapoint. – Minh Nguyễn 💬 22:46, 9 May 2020 (UTC)

Yes, I'm imagining (at least within a state, or across states) a shared set of fields, which not all counties will report identically, starting with {cases, tests; hospitalizations, recovered, deaths}.
I too find that cumulative and current figures are the least consistently reported: most sources include one but not both. It helps to use normalized field names to be clear which it is. --SJ+ 23:16, 9 May 2020 (UTC)
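
As a rough sketch of what normalized field names could look like, the Python below maps a few of this table's columns to names that say whether a figure is cumulative or current. The target names (cases_cumulative and so on) are hypothetical and not taken from any established schema; the cumulative/current split follows the description of the county's reporting in this thread. The sample row is invented.

```python
# Hypothetical mapping from this table's column names to normalized names
# that spell out whether a value is cumulative or a current snapshot.
FIELD_MAP = {
    "date": "date",
    "totalConfirmedCases": "cases_cumulative",
    "deaths": "deaths_cumulative",
    "hospitalized": "hospitalized_current",
}

def normalize(row):
    """Rename known fields and drop any field without a normalized name."""
    return {FIELD_MAP[name]: value for name, value in row.items() if name in FIELD_MAP}

# Invented sample row, not real county data.
sample = {
    "date": "2020-01-01",
    "totalConfirmedCases": 10,
    "hospitalized": 3,
    "deaths": 1,
    "source": "example",
}
print(normalize(sample))
```

Fields that a county does not report would simply be absent (or null) after normalization, so consumers can tell "not reported" apart from a differently defined figure.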

Undated cases column removed

I removed the undatedCases column. It has been broken since May 3, when the "COVID-19 case counts by date" dataset was last updated. Since then, the totalConfirmedCases column has simply been the difference between the known total on May 3 and the latest day's known total. These columns were fixed by migrating back to the Power BI dashboard, which is still being updated three times a week. However, I decided to remove the undatedCases column anyway because it complicates determining the current case count (requiring you to add two cells) and is inconsistent with all the other data tables in this category. I don't think we're tracking any other county's backlog of case investigations in quite the same manner.

The latest day's known total, including both dated and undated cases, is now in the same column as the cumulative case count. I've taken the opportunity to rename this column from totalConfirmedCases to cases for consistency with the other data tables and to signal to transcluding templates and modules that the schema has changed.

 – Minh Nguyễn 💬 21:07, 8 July 2023 (UTC)
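
For templates or scripts that still expect the old columns, here is a hedged Python sketch of the schema change described above. It is an illustration, not the edit that was actually made: the migrate function is hypothetical, the sample rows are invented, and it assumes that undated cases should be added to the cumulative count wherever the undatedCases cell is filled.

```python
def migrate(fields, rows):
    """Drop undatedCases and rename totalConfirmedCases to cases.

    Assumption: where undatedCases is filled, it is added to the
    cumulative count so that a single cell holds the known total.
    """
    total_i = fields.index("totalConfirmedCases")
    undated_i = fields.index("undatedCases")
    new_fields = [
        "cases" if name == "totalConfirmedCases" else name
        for name in fields
        if name != "undatedCases"
    ]
    new_rows = []
    for row in rows:
        total, undated = row[total_i], row[undated_i]
        # Keep null totals as null; treat a null undated cell as nothing to add.
        cases = None if total is None else total + (undated or 0)
        new_rows.append(
            [cases if i == total_i else value
             for i, value in enumerate(row)
             if i != undated_i]
        )
    return new_fields, new_rows

# Invented sample rows in the old column order.
old_fields = ["date", "totalConfirmedCases", "deaths", "undatedCases"]
old_rows = [
    ["2020-01-01", 100, 1, None],
    ["2020-01-02", 120, 2, 30],
]
new_fields, new_rows = migrate(old_fields, old_rows)
print(new_fields)
print(new_rows)
```

Renaming the column means that code looking up totalConfirmedCases fails loudly instead of silently reading a column whose meaning has changed.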