User talk:Dispenser/HTML entities

From Wikimedia Commons, the free media repository
Jump to navigation Jump to search

Caret encoding[edit]

Hi Dispenser, thank you for that list! I just found a different pattern of bad encoding: Look for "^euro" (without ;) or "^tilde" or "^trade" or "^rdquo" or "^sbquo". There are a few others. Best, --Achim (talk) 19:59, 16 October 2017 (UTC)[reply]

...&-195;&-162;^euro,^tilde,... Fixes for decoding to UTF-8
‘ Assume HTML named entities
‘ Mojibake, let try encode in Windows-1252 .encode('cp1252')
e2 80 98 Try to decode as UTF-8 sequence .decode('utf-8')
U+2018 (Left curly apostrophe) Matches character from source URL
@Multichill: How many pipelines did that pass through!?! Matching file names for HTML named entities we 9,719 5,174 12,043 files to rename. Recent upload with this brokenness are:
Dispenser (talk) 23:36, 16 October 2017 (UTC)[reply]
Oh no, I've been fixing these, not making them. BD2412 T 23:50, 16 October 2017 (UTC)[reply]
Looks like your right, I forgot page_is_redirect=0 and page moves seem to break rev_parent_id=0. I've updated the figures. Dispenser (talk) 23:59, 16 October 2017 (UTC)[reply]
-- ; sql --cluster analytics commonswiki_p
SELECT REGEXP_REPLACE(page_title,
  ".*?(\\^\\^[0-9]+,|\\^\\^[xX]?[0-9a-fA-F]+,|\\^\\w{2,10}[,.]).*",
  "\\1") AS entity, COUNT(*)
FROM page
WHERE page_namespace=6 AND page_is_redirect=0
AND page_title REGEXP
"\\^(\\^[0-9]+|\\^\\^[xX]?[0-9a-fA-F]+|quot|amp| ...[lot of HTML entity names]... )\\b,"
GROUP BY 1 ORDER BY 2;
Top caret entities
Entity Files Char Info
^^39, 7116 ' U+0027 (Apostrophe)
^quot, 3584 " U+0022 (Quote Mark)
^amp, 1259 & U+0026 (Ampersand)
^rsquo, 142 U+2019 (Right curly apostrophe)
^gt, 118 > U+003E (Greater-than sign)
^pi, 71 π U+03C0 (pi symbol)
^^093, 47 ] U+005D (Right Bracket)
^^091, 46 [ U+005B (Left Bracket)
^sbquo, 35 U+201A (Openning quote mark)
^lt, 26 < U+003C (Less-than sign)
^ndash, 26 U+2013 (En dash)
^ouml, 23 ö U+00F6 (o with diaeresis)
^uuml, 15 ü U+00FC (u with diaeresis)
^eacute, 10 é U+00E9 (e with acute)
^euro, 10 U+20AC (Euro sign)
^^x0301, 10 ́ U+0301 (Combining acute accent)
^egrave, 9 è U+00E8 (e with grave)
^ldquo, 7 U+201C (Left curly quotation)
^rdquo, 6 U+201D (Right curly quotation)
^aacute, 5 á U+00E1 (a with acute)
Still early stages, but I can't spare any more time. Still haven't cracked how exactly "Горный Алтай" became File:Đ^ldquo,ĐžŃ^euro,Đ˝Ń^lsaquo,Đš Đ^ĐťŃ^sbquo,Đ°Đš - panoramio - Tanya Dedyukhina.jpg. —Dispenser (talk) 16:37, 17 October 2017 (UTC)[reply]
-- ; sql --cluster analytics commonswiki_p
/* Safe replacements for caret encoding of:
 *  ^^39, => &#39; => ' U+0027 and  ^quot, => " U+0022 and ^amp, => &
 */
SELECT page_title, REPLACE(page_title, "^^39,", "'") AS Replacement
FROM page
WHERE page_namespace=6 AND page_is_redirect=0
AND page_title LIKE "%^%,%"
/* Limited ASCII to avoid messing with Mojibake */
HAVING Replacement REGEXP "^[A-Za-z0-9_().,&'\"\"\\-]+$"
   AND Replacement != page_title
   AND page_title  NOT LIKE "%&%"
   AND Replacement NOT REGEXP "&[^_]"
   AND Replacement NOT LIKE "%''%"
   AND Replacement NOT LIKE '%""%';
That should fix 4,845 for ^^39, and 520 for ^quot, Dispenser (talk) 01:24, 18 October 2017 (UTC)[reply]
Added 847 for ^amp,, but restricted since we have double caret encoded titles like File:Barcelona, entrance of Parc G^amp,uuml,ell by Antoni Gaud^amp,iacute - panoramio.jpg. —Dispenser (talk) 14:15, 18 October 2017 (UTC)[reply]
Updated table with total counts. Dispenser (talk) 20:52, 18 October 2017 (UTC)[reply]
Just for the record: ^^39, and ^^39 is fixed now. There have been some cases of bad encoding in conjunction with vowels, where a^^39, was meant to be à or á, same for other vowels like in cafe^^39,. --Achim (talk) 08:59, 18 November 2017 (UTC)[reply]
That's the user not knowing how to enter it. The sequence in Word is Ctrl+' then the letter, on MacOS its  Opt+e, and on Windows either type the alt code (Alt+0233) or copy & paste from the Character Map. We probably should be making a list of such problems.

I'd also note we have 310 files using two single quotes instead of a double quote: SELECT img_name FROM image WHERE img_name REGEXP "(^|[^'])''(?!')([^']|'[^'])+?(?<!')''[^']";Dispenser (talk) 14:28, 18 November 2017 (UTC)[reply]

On that last note, we should consider avoiding © ® ™ too. About 24,000 SELECT img_name FROM image WHERE img_name REGEXP "©|®|™";Dispenser (talk) 15:32, 18 November 2017 (UTC)[reply]