Commons:Graphics village pump/June 2018

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

SVG text rendered incorrectly despite using officially supported fonts

Hi,

meta:SVG fonts explicitly mentions DejaVu Sans and Liberation Sans. Also, Wikimedia pages themselves use the "Linux Libertine" font for section headings. The heading of this section is rendered in Linux Libertine!

I have created four SVGs containing text in Inkscape. The text should be easily editable using a text editor, the file should stay PD-ineligible, and the file size should not increase too much. For these three reasons, I do not want the text to be replaced by paths. Instead, I would like to use a font that allows correct rendering of thumbnails. I tried the three fonts mentioned above and failed.

The four SVGs: File:Smartscreen-warning-1.svg, File:Smartscreen-warning-1-arrow.svg, File:Smartscreen-warning-2.svg, File:Smartscreen-warning-2-arrow.svg

The text in these images has characters which overlap each other for no apparent reason. Instead of a normal text flow, characters are incorrectly moved to the left/right, causing them to stick together visually. It almost looks as if these characters are stored in the SVG one-by-one, each with a specific (wrong) position. This, however, does not actually seem to be the case, as opening the SVG with a text editor shows.

I tried searching the village pump archive and the Help:SVG page, but I might have overlooked something. If the issue is known, I would be happy about someone pointing me to a relevant FAQ entry or previously solved case. ~ ToBeFree (talk) 16:15, 6 June 2018 (UTC)

Additional note: The "more info" text has a "text-decoration: underline;" CSS attribute, which is rendered correctly in Firefox when opening the SVG. The MediaWiki thumbnail renderer, however, appears to completely ignore it. ~ ToBeFree (talk) 16:19, 6 June 2018 (UTC)

You've run into the librsvg small font-size quantization bug: Phab:T36947. Small font escapement and baseline placement have problems. For example, one of your files (File:Smartscreen-warning-2-arrow.svg) uses font sizes of 2.82222223px, 4.23333311px, 6.3499999px, 10.58333302px, 16.93333244px. That's small enough to tickle the bug.

For font-families, the list of supported fonts is not always right. I think your fonts can be found by librsvg, but they may not be found by other user agents (such as my browser). In any event, the font-family should include a generic font such as serif, sans-serif, or monospace.

The coordinate system used in the diagram is contorted (viewBox="0 0 132.29166 66.145833", translate(0,-230.85417)).

Glrx (talk) 02:03, 7 June 2018 (UTC)
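
The font-size and font-family points above lend themselves to a quick automated check. The following is only a minimal Python sketch. The names check_svg_fonts and style_properties are made up for illustration, and the 20 px threshold is an assumption: the bug report gives no exact cutoff, and 20 px is simply above the largest size listed above. It is not how librsvg or the Commons SVG checker works.

```python
import re
import xml.etree.ElementTree as ET

# Generic CSS font families that can act as a last-resort fallback.
GENERIC_FAMILIES = {"serif", "sans-serif", "monospace", "cursive", "fantasy"}


def style_properties(element):
    """Collect font-size/font-family from attributes and the inline style."""
    props = {}
    for name in ("font-size", "font-family"):
        if name in element.attrib:
            props[name] = element.attrib[name]
    # Inline style declarations override presentation attributes.
    for declaration in element.get("style", "").split(";"):
        if ":" in declaration:
            key, value = declaration.split(":", 1)
            props[key.strip()] = value.strip()
    return props


def check_svg_fonts(svg_text, min_font_px=20.0):
    """Return warnings for small font sizes and missing generic fallbacks.

    min_font_px is a hypothetical threshold, not a documented librsvg limit.
    """
    warnings = []
    for element in ET.fromstring(svg_text).iter():
        props = style_properties(element)
        size = props.get("font-size")
        if size:
            # Only plain numbers and px values are checked in this sketch.
            match = re.fullmatch(r"(\d*\.?\d+)(px)?", size)
            if match and float(match.group(1)) < min_font_px:
                warnings.append(f"small font-size: {size}")
        family = props.get("font-family")
        if family:
            names = {n.strip().strip("'\"").lower() for n in family.split(",")}
            if not names & GENERIC_FAMILIES:
                warnings.append(f"no generic fallback in font-family: {family}")
    return warnings
```

Calling check_svg_fonts(open("drawing.svg", encoding="utf-8").read()) on a file (the file name here is a placeholder) returns a list of warning strings; an empty list means neither issue was detected by this simple check.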

Wow Glrx, this is amazing. Thank you very much for taking the time to explain all these points; I would probably never have found out!

I'm copying this to my talk page on en.wikipedia to remind myself to fix these issues as soon as I can. The contorted coordinate system is a really strange thing. I was already surprised that Inkscape's SVG coordinate system seems to begin at the bottom left, but it made sense to me (mathematical x-y-coordinate systems do look like that). I have no idea where the crazy value of "-230.85417" comes from, and I will have a look at SVG manuals to learn what this "translate" attribute does. I'll inspect the source code and do a lot of cleanup there. The phabricator bug is also very nice to know. As for the font-families, that is strange: I would have expected Inkscape to do that sort of thing automatically, at least for the very basic "serif" or "sans-serif" or "monospace" fallback. I wasn't aware that this is likely missing from all SVGs I have ever created!

The Commons SVG checker seems to be an awesome tool and just what I have been looking for, too. Thanks again! ~ ToBeFree (talk) 04:10, 7 June 2018 (UTC)

White space around SVG image

I uploaded an SVG file and did not notice it had a large white space around it. Do I start over or can I edit in Commons?

@BrucePL — Preceding unsigned comment added by BrucePL (talk • contribs) 19:48, 6 June 2018 (UTC)

Hi @BrucePL! You can overwrite existing files. If you no longer have the original file, you can download it from Commons. Edit it in a text editor or your SVG editing tool to remove the whitespace, then use "Upload a new version of this file". The link for that is on the image page.

~ ToBeFree (talk) 21:19, 6 June 2018 (UTC)

Stripping metadata

Tracked in Phabricator
Task T20871

Wikimedia Commons strips IPTC metadata - descriptions, attribution, copyright management information and that sort of thing - from all but its “full resolution” downloads.

Note that I’m not talking about Exif camera logging information here. I’m talking about information that creators deliberately attach to their work to enhance its cultural value and to protect its copyright. (The same information that we require of all contributors, actually.) And note also, that we DO respect that information on ONE of the offered renditions of a photo.

Clearly, this is a bad thing. Karmically, it’s an unkind thing to do to contributors’ works. Legally, decisions are slowly marching toward the day when penalties for the removal of copyright management information will be applied specifically to the removal of embedded metadata. (And regardless of Terms of Use or Safe Harbor defenses.) At scale, those penalties can be ghastly. (As penalties should be, really. That’s the point of them.)

I doubt there would be any argument that we shouldn’t fix this. So, the question is: HOW do we fix this? WHO, exactly, needs to fix this, for that matter?

On most tiny little websites like mine, it’s a matter of a few minutes effort and it’s all fine. For a site this big, it could be a whole different kettle of fish. Or not, it could actually be easier. The first step is to figure out how to proceed.

Now, I’m also not addressing the fairly obvious question of why people - in droves - who are sharing works that depend on attribution would post them without taking the effort to write their names on them. (Ah, because they didn’t know that they could, let alone should?) That’s an education issue. But prerequisite to educating, scolding, and cajoling, we should make sure that we don’t delete contributors’ good efforts.

Education and encouragement is a conversation that we absolutely should have, but later, IMHO.

I advocate on this issue. I have a blog that’s full of sometimes-nerdy information on the subject: metadatamatters.blog

I’ll help in any way I can. — Preceding unsigned comment added by Carlseibert (talk • contribs) 19:01, 21 June 2018 (UTC)

Hi @Carlseibert, thanks for posting here. This particular feature has been debated for years now; you can find some of the history on phab:T20871.

To answer your question "would there be any argument against this": my understanding is that it is done because EXIF data takes up a lot of space (in some cases more than the thumbnail itself).

Hope that helps, Jean-Fred (talk) 07:02, 22 June 2018 (UTC)

OK, so I checked, and some EXIF fields are preserved in thumbnails, including Author and Copyright.

@Carlseibert, could you clarify which fields you would like to see preserved? I realise you said IPTC and I read EXIF :) but it would be helpful nonetheless to understand which fields matter and which do not.

Jean-Fred (talk) 07:17, 22 June 2018 (UTC)

Hi @Jean-Fred,

Let me first do a little bit of background here for the benefit of anybody following along.

Descriptive metadata - IPTC metadata - is stored in two data blocks in an image file (JPEG here, but others are similar). This is the data that I would argue we're karmically and, almost certainly in the near future if not already, legally, obliged to protect.

The IIM block contains basic information, like the caption, author, copyright, title, and some workflow stuff. The IIM is tiny, usually no more than a few hundred bytes.

The XMP block contains duplicates of all of those fields, plus additional fields like Rights/Usage Terms and contact information for the copyright owner, and potentially a bunch more data about rights and model releases and the like. The XMP is padded to be a (usually) consistent size regardless of what's in it, generally a tad less than 4 KB.

Then there's the Exif block. Exif data is log data from the camera, plus (often) one or more thumbnail images, GPS data, and some image information like orientation and the original capture time. Exif data can often be less than a couple of KB, but because of the thumbnails, it is often in the 12 KB to 20 KB range. And I've seen corner cases that are huge, like hundreds of KB.

As far as I know, we preserve all three blocks on the full-size versions of images.
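
To see how large each of the three blocks is in a given JPEG, the segment headers can be walked directly. Here is a rough Python sketch using only the standard library. The function name metadata_block_sizes is hypothetical. The "iim" figure is the size of the whole APP13 Photoshop resource segment that carries the IIM data, so it slightly overstates the IIM itself. The sketch assumes a well-formed file and stops at the start of the image data.

```python
import struct

# Identifying prefixes of the metadata-carrying JPEG segments.
EXIF_HEADER = b"Exif\x00\x00"                     # APP1 (0xE1)
XMP_HEADER = b"http://ns.adobe.com/xap/1.0/\x00"  # APP1 (0xE1)
IIM_HEADER = b"Photoshop 3.0\x00"                 # APP13 (0xED)


def metadata_block_sizes(jpeg_bytes):
    """Return payload sizes in bytes of the Exif, XMP, and APP13 segments."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    sizes = {"exif": 0, "xmp": 0, "iim": 0}
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break  # not at a marker; give up rather than guess
        marker = jpeg_bytes[pos + 1]
        if marker in (0xDA, 0xD9):
            break  # start of scan or end of image: no more header segments
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(EXIF_HEADER):
            sizes["exif"] += len(payload)
        elif marker == 0xE1 and payload.startswith(XMP_HEADER):
            sizes["xmp"] += len(payload)
        elif marker == 0xED and payload.startswith(IIM_HEADER):
            sizes["iim"] += len(payload)
        pos += 2 + length
    return sizes
```

Running it on the bytes of a full-size download and of a smaller rendition of the same photo shows which blocks survive resizing and how many bytes each one costs.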

Most of the time, Exif data is no longer useful after the first couple steps in a workflow. I usually advise readers and clients who are concerned about page load time that they can feel free to delete it. (After first making sure that they've extracted anything they might later need, like maybe GPS data.)

However, Wikimedia Commons does care about Exif data, and we expose some of it on an asset's page. Whether the next person to use the image needs or wants that data in the downloaded file is a matter for consideration.

The XMP block can carry data other than descriptive IPTC data. Applications, like the Adobe programs, might write log data there, for instance. In rare cases, that can make the XMP block exceed 4 KB. Last year, I wrote about a photo where it was over 40 KB. That wasn't an issue in itself, but the information, in that case, was quite unhelpful. But again, that's unusual. Usually, we're looking at 4 KB.

There are three corner case fields in the Exif which are semantically equivalent to three IPTC fields. Cameras rarely write to them, and by the time a photo is ready to post or publish, these fields are normally irrelevant. I suggest for the sake of sanity, we shouldn't worry about them one way or the other.

By the way, some people refer to the IIM as the "IPTC/IIM". That's not incorrect. But it's confusing, so I don't do it. :-)

In general, any or all of the IPTC fields can contain copyright management information, and certainly I would argue that if a creator took the initiative to put certain data on an asset, we should respect it and pass it on. There's no point in picking individual fields to keep or strip anyway, since, barring considerable extra effort, the XMP block is going to be 4 KB regardless.

I'm assuming here that Wikimedia Commons creates the small renditions on ingest and storage is therefore an issue. Many DAM systems create renditions on output. If that's the case, it's a moot point.

I'm also assuming that we do actually want to bother with small renditions. I would never download one myself, but they must be popular, or they wouldn't be here. (I assume)

I see in the other thread you referenced that Wikimedia Commons runs on PHP. That means that there are two main choices for an imaging library: GD or ImageMagick. GD is pretty hopeless. It has barely any metadata functionality at all and by default simply strips the output file clean. ImageMagick by default preserves all the data in all three of those blocks. (And allegedly produces better quality.) On a WordPress or Drupal site, simply enabling ImageMagick makes most metadata stripping problems go away. It's possible that it might be that simple here.

I have encountered systems that use ImageMagick and strip metadata by (usually long forgotten) choice. In that case, changing the arguments for ImageMagick does the trick.

ImageMagick can easily strip the Exif wholesale and leave the IIM and XMP intact. (If we're able to pass arguments to it.) Whether it would be practical to try to address certain fields - like possible XMP application log data or the Exif thumbnails - granularly, is another matter. My guess would be no.

ExifTool can indeed do granular manipulations, but there would be development and server load costs. That said, it would be very attractive to append a note to the caption saying that the image came from Wikimedia Commons, or to add one in Special Instructions or Rights/Usage Terms explaining that the image is released under Creative Commons.

I guess that's a long-winded way of saying that preserving the metadata that we should preserve would cost about 5 KB per file, or 15 or so KB if we choose to keep all of the Exif data.

Hope this helps. - Carl — Preceding unsigned comment added by Carlseibert (talk • contribs) 19:27, 28 June 2018 (UTC)

Hi Jean-Fred (Again)

In that referenced thread there was a lot of discussion about legality. I guess I should go ahead and contribute what I know about that. [Insert standard “I am not a lawyer. This is not legal advice” disclaimer here.]

First, let me get two things off my chest.

In that thread, people were conflating IPTC data with Exif data right and left. That makes me crazy! And it makes it hard to have a meaningful discussion. They are two different things. Please, folks.

Second, this is an Open Source community. Legality here is secondary to moral consideration, IMHO. It’s not “what we can get away with”. For us, it’s “What is the right thing to do.”

That said, here’s what I know:

The Creative Commons license says:

1. If You Share the Licensed Material, You must:

    A. Retain the following if it is supplied by the Licensor with the Licensed Material:
        i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
        ii. a copyright notice;
        iii. a notice that refers to this Public License;
        iv. a notice that refers to the disclaimer of warranties;
        v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
        .....

All subject to the “reasonable to the medium, means, and context” qualification.

Is it “reasonable” to spend 5 KB of storage to retain data supplied by the creator and meet the requirements? My non-lawyer opinion is yes, probably. And my community opinion is that we should bend over backward to comply with our own license, just on general principles.

(If the trade-off had to be eliminating one of the five small renditions to make up the storage space, I would argue that to be a good trade.)

Sections 1202 and 1203 of the US DMCA make it illegal to alter or remove copyright management information conveyed with a work. They define CMI in terms that look as if the law was written with a copy of the IPTC standard open on the desk, and they provide statutory damages for violations of $2,500 USD to $25,000 USD “per instance”.

There is no requirement for registration before infringement or infringement after notice to qualify for statutory damages. This is very significant because it means that copyright owners who otherwise wouldn’t qualify for statutory damages and therefore would have no means of compensating a lawyer now have the practical ability to sue.

Lately, there have been more and more infringement suits claiming section 1202 violations. There have been significant judgements.

Following the judgements, it seems clear to me that judges are applying more and more liberal definitions of what CMI is.

I haven’t yet seen a CMI removal judgement that explicitly refers to embedded metadata. Most of the cases so far focus on visible-to-the-naked-eye CMI, like watermarks and even notices printed elsewhere on web pages. But, having read the law, it’s pretty obvious to this non-lawyer that the day is coming.

I have a lawsuit on my laptop as we speak that IS explicitly about embedded metadata CMI and IS about a Creative Commons licensed picture. Downloaded from Wikimedia Commons, now that I think about it.

(That alleged infringer, who absolutely does strip CMI, BTW, could use as a defense that the victim would have to prove that the (ahem, alleged) infringer downloaded the full resolution version of the picture, rather than a CMI-less rendition of the image. Preserving metadata on all renditions would help protect our contributors. That’s a karmic argument, not a legal one)

In the thread, there was a 2015 quote from a Wikipedia lawyer who seemed to be asserting that Safe Harbor protections would apply to Wikimedia Commons here. It’s 2018 now. Safe Harbor has been wildly abused for years. People are sick of it. People apparently including some judges. Safe Harbor defenses have lately been knocked down like dominos. Considering the way Wikimedia Commons works, my layman’s opinion is that claims to Safe Harbor would be pretty weak in the first place. At this point in time, I wouldn’t bet a week-old donut on Safe Harbor, much less the future of my organization.

In the US DMCA - and just about every big country has one - it’s pretty explicit that a copyright owner can grant permission for the stripping of CMI from their works. So the big US social media outfits can hide their CMI stripping behind their terms of use, which extract a very broad license from contributors. In Europe, in Germany at least, apparently not so much. Last year, Facebook suffered an adverse judgement there for embedded metadata CMI stripping, never mind the TOU.

It seems to me that the legal army is marching on copyright management information stripping. When, or if, it might overrun our village is anybody’s guess.

But, again, that doesn’t matter in my opinion. This website (and every website) should preserve authors’ data simply because it’s the right thing to do.

-Carl — Preceding unsigned comment added by Carlseibert (talk • contribs) 23:25, 28 June 2018 (UTC)
