User:Fæ/sandbox4

From Wikimedia Commons, the free media repository

Geograph mapping proposal for automated tidy-up of raw HTML imports on Geograph image pages

Please see Commons:Bots/Work_requests#Geograph_raw_html_tidy-up for the context of this sandbox analysis. -- (talk) 08:02, 16 September 2012 (UTC)

Mapping example taken from: http://commons.wikimedia.org/w/index.php?title=File:Disused_railway_building_and_platform_-_geograph.org.uk_-_1989471.jpg&oldid=77328823

Source text

{{en|1=Disused railway building and platform, CTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" id="geograph">
<head>
	
	<title>Disused railway building and platform:: OS grid SE8279 :: Geograph Britain and Ireland - photograph every grid square!</title>
		<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
	<meta name="description" content="SE8279 :: Disused railway building and platform, near to Low Marishes, North Yorkshire, Great Britain" />
		<meta name="ICBM" content="54.202974525125, -0.74288542171216"/>	<meta name="DC.title" content="Geograph:: Disused railway building and platform:: OS grid SE8279"/>
	<meta name="pinterest" content="nopin" />
	<link rel="stylesheet" type="text/css" title="Monitor" href="http://s0.geograph.org.uk/templates/basic/css/basic.v7593.css" media="screen" />
	<link rel="shortcut icon" type="image/x-icon" href="http://s0.geograph.org.uk/favicon.ico"/>
			<link rel="alternate" type="application/vnd.google-earth.kml+xml" href="/photo/1989471.kml"/>
		<link rel="search" type="application/opensearchdescription+xml" title="Geograph Britain and Ireland search" href="/stuff/osd.xml" />
	<script type="text/javascript" src="http://s0.geograph.org.uk/js/geograph.v7508.js"></script>
</head>
<body>
<div id="header_block">
  <div id="header">
    <h1 onclick="document.location='/';"><a title="Geograph home page" href="/">Geograph - photograph every grid square</a></h1>
  </div>
</div>
<div class="content_photowhite" id="maincontent_block"><div id="maincontent">
<div style="float:right; position:relative; width:5em; height:4em;"></div>
<div style="float:right; position:relative; width:2.5em; height:1em;"></div>
<div itemscope itemtype="schema.org/Photograph"><meta itemprop="isFamilyFriendly" content="true"/>
<h2><a title="Grid Reference SE8279 :: 22 images" href="/gridref/SE8279">SE8279</a> : Disused railway building and platform</h2>
 <h3 itemprop="contentLocation"><span title="about 2 km from">near to Low Marishes, North Yorkshire, Great Britain.
}}

Pseudocode
  1. Remove the text up to the description meta tag.
  2. Indent the description.
  3. Use the content of the ICBM meta tag to create an ICBM line (this could be processed as a microformat).
  4. Trim the text up to contentLocation.
  5. Use the text of contentLocation to create a Location line.
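
The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the bot's actual code: the function names `meta_content` and `tidy_geograph_description` are hypothetical, the stray HTML is assumed to begin at ", CTYPE html" (or at the first "<"), and the `*` / `**` list markup is assumed to be the wikitext that renders the nested bullets shown in the final result.

```python
import re
from html import unescape


def meta_content(raw, name):
    """Return the content attribute of <meta name="..."> or None if absent."""
    pattern = r'<meta\s+name="' + re.escape(name) + r'"\s+content="([^"]*)"'
    match = re.search(pattern, raw)
    return unescape(match.group(1)).strip() if match else None


def tidy_geograph_description(raw):
    """Hypothetical helper: rebuild a clean {{en|1=...}} block from a raw import."""
    # Title: the text between "{{en|1=" and the start of the stray HTML.
    # Assumption: the stray HTML begins with ", CTYPE html" or with "<".
    body = raw.split("{{en|1=", 1)[-1]
    title = re.split(r",\s*CTYPE html|<", body, maxsplit=1)[0].strip()

    lines = ["{{en|1=" + title, "* Data from Geograph:"]

    # Steps 1-2: keep only the description meta content, as an indented item.
    description = meta_content(raw, "description")
    if description:
        lines.append("** Description: " + description)

    # Step 3: turn the ICBM meta content into its own line.
    icbm = meta_content(raw, "ICBM")
    if icbm:
        lines.append("** ICBM: " + icbm)

    # Steps 4-5: skip ahead to contentLocation and build the Location line
    # from the span's title attribute plus the text that follows it.
    loc = re.search(
        r'itemprop="contentLocation"[^>]*>\s*<span title="([^"]*)">([^<}]*)',
        raw,
    )
    if loc:
        qualifier, place = loc.group(1).strip(), loc.group(2).strip()
        lines.append("** Location: (" + qualifier + ") " + place)

    lines.append("}}")
    return "\n".join(lines)
```

Calling `tidy_geograph_description` on the source text of the example page would be expected to produce a description block with the three Geograph data lines and none of the intervening markup. Regular expressions are used here only because the imported fragment is truncated and not well-formed; a real bot would also need to handle pages where any of the three fields is missing, which this sketch does by simply omitting that line.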

Output text

{{en|1=Disused railway building and platform , CTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" id="geograph"> <head>

<title>Disused railway building and platform:: OS grid SE8279 :: Geograph Britain and Ireland - photograph every grid square!</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta name="description" content="

  • Data from Geograph:
    • Description: SE8279 :: Disused railway building and platform, near to Low Marishes, North Yorkshire, Great Britain" />

<meta name="ICBM" content="

    • {{green|ICBM: 54.202974525125, -0.74288542171216

"/> <meta name="DC.title" content="Geograph:: Disused railway building and platform:: OS grid SE8279"/> <meta name="pinterest" content="nopin" /> <link rel="stylesheet" type="text/css" title="Monitor" href="http://s0.geograph.org.uk/templates/basic/css/basic.v7593.css" media="screen" /> <link rel="shortcut icon" type="image/x-icon" href="http://s0.geograph.org.uk/favicon.ico"/> <link rel="alternate" type="application/vnd.google-earth.kml+xml" href="/photo/1989471.kml"/> <link rel="search" type="application/opensearchdescription+xml" title="Geograph Britain and Ireland search" href="/stuff/osd.xml" /> <script type="text/javascript" src="http://s0.geograph.org.uk/js/geograph.v7508.js"></script> </head> <body> <div id="header_block"> <div id="header"> <h1 onclick="document.location='/';"><a title="Geograph home page" href="/">Geograph - photograph every grid square</a></h1> </div> </div> <div class="content_photowhite" id="maincontent_block"><div id="maincontent"> <div style="float:right; position:relative; width:5em; height:4em;"></div> <div style="float:right; position:relative; width:2.5em; height:1em;"></div> <div itemscope itemtype="schema.org/Photograph"><meta itemprop="isFamilyFriendly" content="true"/> <h2><a title="Grid Reference SE8279 :: 22 images" href="/gridref/SE8279">SE8279</a> : Disused railway building and platform</h2> <h3 itemprop="contentLocation"><span title="

    • Location: (about 2 km from)"> near to Low Marishes, North Yorkshire, Great Britain.

}}


Final result

English: Disused railway building and platform
  • Data from Geograph:
    • Description: SE8279 :: Disused railway building and platform, near to Low Marishes, North Yorkshire, Great Britain
    • ICBM: 54.202974525125, -0.74288542171216
    • Location: (about 2 km from) near to Low Marishes, North Yorkshire, Great Britain.