
Analysis of Bounding Box Sizes Over the Last Eight Years

Posted by b-jazz on 5 August 2020 in English. Last updated on 16 August 2020.

Impetus

I recently read, in one of the weekly OSM newsletters, about a discussion thread on the OSM-talk mailing list about limiting, or adding a warning to editors about, edits that would result in an unusually large bounding box for the changeset. As someone who has made the mistake of accidentally editing nodes in entirely different parts of the country and been horrified to discover the massive bounding box I'd created, I was curious how often this happens and what a typical bounding box size is for your average mapper.

Gathering the Data

I set about gathering the data on changesets and bounding boxes, picking the current month (July at the time) to look at. I found a minutely “feed” of changeset data, which also includes the computed bounding box, in the replication/changesets directory on the planet.openstreetmap.org site (luckily mirrored to a single place in the U.S. that I could use). After my internet connection started glowing red from a day of transferring just one week’s worth of July data, I figured that was probably enough to produce something useful. I wrote a few lines of Python to uncompress the by-minute files, convert them into SQL statements, and load them into a Postgres/PostGIS database. (With a non-trivial detour to learn just enough about working with polygons and WKT, and about calculating area using the right spatial reference system.)
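The script itself was short. Here is a minimal sketch of the parse-and-load step, assuming a hypothetical `changesets` table (the bbox attributes come straight from the replication XML, and casting the envelope to `geography` is what makes `ST_Area` return square meters instead of square degrees):

```python
import gzip
import xml.etree.ElementTree as ET

def changeset_rows(path):
    """Yield (id, min_lon, min_lat, max_lon, max_lat) from one gzipped
    minutely changeset replication file."""
    with gzip.open(path) as f:
        for _, elem in ET.iterparse(f):
            if elem.tag != "changeset":
                continue
            a = elem.attrib
            # Empty changesets carry no bbox attributes at all.
            if "min_lon" in a:
                yield (int(a["id"]),
                       float(a["min_lon"]), float(a["min_lat"]),
                       float(a["max_lon"]), float(a["max_lat"]))
            elem.clear()  # keep memory use flat on large files

# Let PostGIS do the area math (table and column names illustrative).
INSERT_SQL = """
INSERT INTO changesets (id, bbox, area_m2)
VALUES (%s,
        ST_MakeEnvelope(%s, %s, %s, %s, 4326),
        ST_Area(ST_MakeEnvelope(%s, %s, %s, %s, 4326)::geography))
"""
```

Each row’s four corners get bound twice: once to store the polygon itself, and once for the area computation.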

First Look

The first graph I generated was a simple bar chart for the first week of July. I posted the following chart on the OSM US Slack server in a new channel I created called #data-is-beautiful (after a popular subreddit on reddit.com, the self-styled “front page of the internet”).

changeset bounding box frequency

It’s important to note the scale of the X-axis: each bucket of the histogram is twice the size of the previous one. I call out a special bucket of “exactly 0 sq. meters” because changesets with a single edited node show up as having zero area, as do changesets that are empty (something new that I learned was possible). Other than that first bucket, bucket N covers areas greater than 2^(N-1) and less than or equal to 2^N square meters.
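In code, that bucketing rule is just the ceiling of a base-2 logarithm. A sketch (how sub-square-meter boxes get clamped is an implementation detail):

```python
import math

def bucket(area_m2):
    """Histogram bucket for a bounding-box area in square meters:
    0 for the special "exactly 0" bucket, otherwise the N such that
    2^(N-1) < area <= 2^N."""
    if area_m2 == 0:
        return 0
    # Clamp boxes under 1 square meter into the first nonzero bucket
    # rather than allowing negative bucket numbers.
    return max(1, math.ceil(math.log2(area_m2)))
```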

Since I had the workflow in place to gather and process the thousands of files necessary, I decided to gather the rest of the month and see if that changed anything. So I let my scripts take over my internet connection and resigned myself to watching degraded Netflix video streams for the next day or two. (I’m kidding, it really wasn’t that bad.) After all of that was processed, the graph for the month of data looked essentially identical to the week of data (apart from the scale of the Y-axis).

Quick Executive Summary Sidenote

The technical name for the most common bucket is the mode (the “modal” bucket); it’s easy to confuse with the median, but they aren’t the same thing. The modal bucket for this particular sample of data is 2^19 square meters, or roughly half of a square kilometer.

An Innocent Comment

I got some great feedback and comments on my posting in the Slack channel, and Ian Dees made an innocent comment about how cool it would be to see the data over time, represented as a heatmap. Well, I couldn’t just let that idea hang out there unfulfilled, so I set about figuring out how to make it happen on my slow, rural internet connection and a 7-year-old Linux desktop.

My first thought was that sampling was going to be the answer to making this possible. Seeing that the month of data and the week of data in my first attempt showed essentially the same results, I decided to grab just one week of data from each month (the first through the seventh). I wanted an entire week in case weekday mappers (possibly influenced by corporate mappers) behaved differently from weekend mappers (possibly including more “hobby” mappers). I also knew that gathering every single changeset for each week would take longer than I was willing to wait, and would probably take more space than I could store without doing some cleanup and rearranging of data on my computer. So I decided to get every third minutely file available. That seemed like a reasonable amount of data, and something I could collect and store in a reasonable amount of time and space.
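The sampling itself is simple once you know the replication layout: files get sequential nine-digit numbers, one per minute, split into a three-level directory path. A sketch (using the main planet server URL here; I actually pulled from the mirror):

```python
BASE = "https://planet.openstreetmap.org/replication/changesets"

def url_for(seq):
    """Map a sequence number to its AAA/BBB/CCC.osm.gz path."""
    s = f"{seq:09d}"
    return f"{BASE}/{s[:3]}/{s[3:6]}/{s[6:]}.osm.gz"

def sampled_urls(first_seq, last_seq, step=3):
    """Every third minutely file between two sequence numbers (I
    bracketed the first week of each month by hand from timestamps)."""
    return [url_for(seq) for seq in range(first_seq, last_seq + 1, step)]
```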

Visualizing the Data

After waiting about a week for all of the samples to come over the wire, I was finally able to load them into Postgres, do some collating, and dump the result to a file that I could process into a heatmap. I did a little research into what software to use, but just ended up writing a simple Python script to generate an SVG file that I could view in a browser.

Early on, I thought there were two ways I wanted to view the heatmap. The first was to treat each column (a single month) independently, setting the color gradient based on the max value for that particular month. But I also thought it might be interesting to scale against the max value of the entire dataset, so that you could also see the growth in the number of changesets over time.
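The script isn’t much more than a loop emitting `<rect>` elements. A trimmed-down sketch, assuming `counts` is a dict of (month, bucket) → changeset count and `months`/`buckets` are the sorted axis values:

```python
CELL = 8  # pixels per heatmap cell

def heatmap_svg(counts, months, buckets, per_month=True):
    """Render the heatmap as SVG. per_month picks between the two
    scalings described above: per-column max vs. whole-dataset max."""
    global_max = max(counts.values())
    rects = []
    for x, month in enumerate(months):
        month_max = max(counts.get((month, b), 0) for b in buckets)
        scale = month_max if per_month else global_max
        for y, b in enumerate(buckets):
            pct = counts.get((month, b), 0) / scale if scale else 0.0
            shade = int(255 * (1 - pct))  # white (0%) to blue (100%)
            rects.append(f'<rect x="{x * CELL}" y="{y * CELL}" '
                         f'width="{CELL}" height="{CELL}" '
                         f'fill="rgb({shade},{shade},255)"/>')
    width, height = len(months) * CELL, len(buckets) * CELL
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">'
            + "".join(rects) + "</svg>")
```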

These heatmaps are what I came up with:

Percent based on max of each month

and…

Percent based on max of all data

@imagico had a good idea: chart the circumference (perimeter) of the bounding box instead. That avoids the problem where a changeset touching two different parts of the world at similar latitudes produces a very wide but very short box, whose area can come out smaller than some normal in-city editing would. I ran the numbers again using the PostGIS ST_Perimeter() function and came up with the following heatmap:

Perimeter this time
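The re-run was just a different aggregate over the same table; with the bounding box stored as a polygon, casting to geography makes ST_Perimeter return meters (the column names here are only illustrative of how I happened to store things):

```python
PERIMETER_SQL = """
SELECT date_trunc('month', created_at) AS month,
       ST_Perimeter(bbox::geography)   AS perimeter_m
  FROM changesets
"""
```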

Another view of the data is to take the num_changes value for each changeset and compare the area of the changeset to how many objects were edited.

Bounding box area per number of changes
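That cross-tabulation can be done in one pass in SQL, using the same doubling buckets on both axes (num_changes comes straight from the changeset XML; the table layout is again illustrative):

```python
AREA_VS_CHANGES_SQL = """
SELECT ceil(log(2, greatest(area_m2, 1)::numeric))     AS area_bucket,
       ceil(log(2, greatest(num_changes, 1)::numeric)) AS changes_bucket,
       count(*)                                        AS n
  FROM changesets
 WHERE num_changes IS NOT NULL  -- absent before late 2013
 GROUP BY 1, 2
"""
```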

Misc. Notes

  • The changeset data only goes back to late 2012. It would be interesting to see it going all the way back to the start, but this is what was easily available to me.
  • Changesets didn’t start including num_changes until late 2013.
  • The source site (planet.osm.org) is missing a swath of data from April 2013.
  • I’m not good at color, so this isn’t as vibrant as it could be. Sorry about that. If you have suggestions on how to pick a better color palette, I’d love to hear them.
  • I’m happy to share my JSON file in case you want to do your own visualization. It is only 24K, so just let me know where I can email it or drop it for you.
  • It looks like diary entries with images might not allow you to click on them to see the full size image. If that’s the case, the heatmaps are available at https://i.ibb.co/ynJQtmt/image.png and https://i.ibb.co/qxFm57x/image.png.

Future Direction

Now that I have this 9GB database available to me, I want to poke around in it a little more. I might sample some of the larger changesets and see if there is anything interesting to find there. How many of them just edit two nodes across a large area? How many edit a single, very large feature? How does the number of modified objects compare to the size of the changeset for large changesets? Are there particular usernames that tend to make rather large changesets (by accident or on purpose)?

If you have ideas on what else I can do, please let me know. I might as well make use of the disk space that I’ve dedicated to this.