User talk:Dispenser/HTML entities

Caret encoding[edit]

Hi Dispenser, thank you for that list! I just found a different pattern of bad encoding: Look for "^euro" (without ;) or "^tilde" or "^trade" or "^rdquo" or "^sbquo". There are a few others. Best, --Achim (talk) 19:59, 16 October 2017 (UTC)[reply]

...`&-195;&-162;^euro,^tilde,`...	Fixes for decoding to UTF-8
`â€&tilde;`	Assume HTML named entities
`â€˜`	Mojibake, let try encode in Windows-1252 `.encode('cp1252')`
`e2 80 98`	Try to decode as UTF-8 sequence `.decode('utf-8')`
`‘` U+2018 (Left curly apostrophe)	Matches character from source URL

@Multichill: How many pipelines did that pass through!?! Matching file names for HTML named entities we ~~9,719~~ ~~5,174~~ 12,043 files to rename. Recent upload with this brokenness are:

User:Panoramio upload bot (Dec 2016, 3,880 files)
~~User:Fæ (July 2017, 3,945 files)~~
~~User:BD2412 (Oct 2017, 34 files)~~
~~User:Alkari (3 days ago, 71 files)~~

Dispenser (talk) 23:36, 16 October 2017 (UTC)[reply]

Oh no, I've been fixing these, not making them. BD2412 T 23:50, 16 October 2017 (UTC)[reply]

Looks like your right, I forgot page_is_redirect=0 and page moves seem to break rev_parent_id=0. I've updated the figures. Dispenser (talk) 23:59, 16 October 2017 (UTC)[reply]

-- ; sql --cluster analytics commonswiki_p
SELECT REGEXP_REPLACE(page_title,
  ".*?(\\^\\^[0-9]+,|\\^\\^[xX]?[0-9a-fA-F]+,|\\^\\w{2,10}[,.]).*",
  "\\1") AS entity, COUNT(*)
FROM page
WHERE page_namespace=6 AND page_is_redirect=0
AND page_title REGEXP
"\\^(\\^[0-9]+|\\^\\^[xX]?[0-9a-fA-F]+|quot|amp| ...[lot of HTML entity names]... )\\b,"
GROUP BY 1 ORDER BY 2;

Top caret entities
Entity	Files	Char	Info
^^39,	7116	'	U+0027 (Apostrophe)
^quot,	3584	"	U+0022 (Quote Mark)
^amp,	1259	&	U+0026 (Ampersand)
^rsquo,	142	’	U+2019 (Right curly apostrophe)
^gt,	118	>	U+003E (Greater-than sign)
^pi,	71	π	U+03C0 (pi symbol)
^^093,	47	]	U+005D (Right Bracket)
^^091,	46	[	U+005B (Left Bracket)
^sbquo,	35	‚	U+201A (Openning quote mark)
^lt,	26	<	U+003C (Less-than sign)
^ndash,	26	–	U+2013 (En dash)
^ouml,	23	ö	U+00F6 (o with diaeresis)
^uuml,	15	ü	U+00FC (u with diaeresis)
^eacute,	10	é	U+00E9 (e with acute)
^euro,	10	€	U+20AC (Euro sign)
^^x0301,	10	́	U+0301 (Combining acute accent)
^egrave,	9	è	U+00E8 (e with grave)
^ldquo,	7	“	U+201C (Left curly quotation)
^rdquo,	6	”	U+201D (Right curly quotation)
^aacute,	5	á	U+00E1 (a with acute)

Still early stages, but I can't spare any more time. Still haven't cracked how exactly "Горный Алтай" became File:Đ^ldquo,ĐžŃ^euro,Đ˝Ń^lsaquo,Đš Đ^ĐťŃ^sbquo,Đ°Đš - panoramio - Tanya Dedyukhina.jpg. —Dispenser (talk) 16:37, 17 October 2017 (UTC)[reply]

-- ; sql --cluster analytics commonswiki_p
/* Safe replacements for caret encoding of:
 *  ^^39, => &#39; => ' U+0027 and  ^quot, => " U+0022 and ^amp, => &
 */
SELECT page_title, REPLACE(page_title, "^^39,", "'") AS Replacement
FROM page
WHERE page_namespace=6 AND page_is_redirect=0
AND page_title LIKE "%^%,%"
/* Limited ASCII to avoid messing with Mojibake */
HAVING Replacement REGEXP "^[A-Za-z0-9_().,&'\"\"\\-]+$"
   AND Replacement != page_title
   AND page_title  NOT LIKE "%&%"
   AND Replacement NOT REGEXP "&[^_]"
   AND Replacement NOT LIKE "%''%"
   AND Replacement NOT LIKE '%""%';

That should fix 4,845 for ^^39, and 520 for ^quot, Dispenser (talk) 01:24, 18 October 2017 (UTC)[reply]

Added 847 for ^amp,, but restricted since we have double caret encoded titles like File:Barcelona, entrance of Parc G^amp,uuml,ell by Antoni Gaud^amp,iacute - panoramio.jpg. —Dispenser (talk) 14:15, 18 October 2017 (UTC)[reply]

Updated table with total counts. Dispenser (talk) 20:52, 18 October 2017 (UTC)[reply]

Just for the record: ^^39, and ^^39 is fixed now. There have been some cases of bad encoding in conjunction with vowels, where a^^39, was meant to be à or á, same for other vowels like in cafe^^39,. --Achim (talk) 08:59, 18 November 2017 (UTC)[reply]

That's the user not knowing how to enter it. The sequence in Word is Ctrl+' then the letter, on MacOS its ⌥ Opt+e, and on Windows either type the alt code (Alt+0233) or copy & paste from the Character Map. We probably should be making a list of such problems.

I'd also note we have 310 files using two single quotes instead of a double quote: SELECT img_name FROM image WHERE img_name REGEXP "(^|[^'])''(?!')([^']|'[^'])+?(?<!')''[^']"; —Dispenser (talk) 14:28, 18 November 2017 (UTC)[reply]

On that last note, we should consider avoiding © ® ™ too. About 24,000 SELECT img_name FROM image WHERE img_name REGEXP "©|®|™"; —Dispenser (talk) 15:32, 18 November 2017 (UTC)[reply]

User talk:Dispenser/HTML entities

Caret encoding[edit]

Navigation menu

Search