I’m sad to see Data.gov go, but it seems dozens of small startups are rising up to take its place as sources of freely available public data. This week is the Data 2.0 Conference, and it seems that, without really meaning to, the focus of the event will be the collapse of Data.gov.
Yet, not all is lost: there are over 50 startups at the Data 2.0 Conference which specifically aim to make data accessible and useful with or without Data.gov.
As Nick Ducoff, CEO of InfoChimps, wrote:
“It would be very helpful if the government would devote its limited resources on simply pointing us to public data sets wherever they live in the wild. Socrata, Infochimps and others can do the rest of the heavy lifting (appending metadata, making the data findable, etc.). [U.S. CIO] Aneesh Chopra, Todd Park and others have been great cheerleaders for open data and I hope this doesn’t take the wind out of their sails.”
Several of the early-stage data startups pitching at the Data 2.0 Pitch Day (including DataMarket.com, Envirogent.org, opencorporates.com, opensignalmaps.com, and micello.com) are themselves new data sources giving businesses and consumers better access to data.
Hopefully the many startups using Data.gov as their primary source of data can switch to these other companies, and hopefully they will all embrace the same goals of open and transparent access to the data.
I initially didn’t report this, hoping it was a sinister April Fools’ Day joke, but it seems it’s legit and frankly disappointing. Several US Government websites dedicated to open and easily accessible data are being shut down, victims of budget cuts. The list so far:
And more. Now, several of these sites have taken their ‘open’ foundations to the next level and begun handing their code bases over to projects like Code for America, hoping that someone independent will at least attempt to continue their efforts. Others have decided to take their fight right to Congress, and the Sunlight Foundation seems to be spearheading the “Save The Data” initiative, hoping to get Congress to keep these important transparency and accountability projects alive.
I’ve seen some really great stuff coming out of the Sunlight Foundation and Data.gov over the last few years, and they’re just the tip of the iceberg of what I would really like to see for Open Data in government. Let’s do what we can to keep them alive!
Every visualization scientist knows that while we enjoy creating the visualizations, the bulk of our time is spent finding, processing, and formatting the data into some usable form. Over at ProPublica they have a nice, comprehensive series on various tools, applications, and SDKs for handling data in a wide variety of formats.
These recipes may be most helpful to journalists who are trying to learn programming and already know the basics. If you’re already an experienced programmer, you might learn about a new library or tool you haven’t tried yet.
If you are a complete novice and have no short-term plan to learn how to code, it may still be worth your time to find out about what it takes to gather data by scraping web sites — so you know what you’re asking for if you end up hiring someone to do the technical work for you.
As disk sizes grow, so does our risk of losing massive amounts of data in hard drive failures or theft. A new infographic on YouTube shows some of the risks with some truly scary statistics:
Your hard drives have a 1 in 10 chance of failing this year
Human error and faulty media are the two leading causes of data loss
Only 1 in 20 companies that suffer a serious data loss will remain in business
Average time to be “up and running” after a restore is 4 hours
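To get a feel for what that 1-in-10 figure means once you own more than one drive, here is a rough back-of-the-envelope sketch in Python. It assumes every drive fails independently with the same 10% annual probability, which is my own simplification and not something the infographic claims. The function name is made up for illustration.

```python
def prob_any_failure(num_drives: int, annual_failure_rate: float = 0.10) -> float:
    """Chance that at least one of `num_drives` fails within a year.

    Assumes independent failures at an identical rate (an illustrative
    simplification, not a figure taken from the infographic).
    """
    if num_drives < 0:
        raise ValueError("num_drives must be non-negative")
    if not 0.0 <= annual_failure_rate <= 1.0:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    # P(at least one failure) = 1 - P(no drive fails)
    return 1.0 - (1.0 - annual_failure_rate) ** num_drives


for n in (1, 2, 4, 8):
    print(f"{n} drive(s): {prob_any_failure(n):.1%} chance of at least one failure")
```

Under those assumptions the odds climb quickly as drives are added, which is a decent argument for keeping backups regardless of how reliable any single disk looks.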
Got a presentation coming up and need some data for a graph? Well, Data360 aims to please with their open data collection.
Data360 is an open-source, non-profit and free website. The site hosts a common and shared database, from which any organization which is committed to neutrality and non-partisanship (meaning “let the data speak”), can use the site for presentation of their reports and visualizations about the data.
They’ve got a great collection of visualizations and datasets, all freely available. Some of the charts we’ve covered before as they come from the New York Times or USA Today, but there is a lot of new material there.
An article from Jodi McDermott reintroduced me to something I hadn’t heard about in years: Anscombe’s Quartet. For the uninitiated, it’s a collection of four small datasets whose summary statistics are nearly identical (same mean, variance, correlation, and linear regression line).
What’s my point, you might ask? When each one of these datasets is plotted out visually, they have completely different appearances (just click on the link above and you’ll see what I mean). There are outliers where one would not expect to see them — identifying both opportunities and risks in your data depending on what you are analyzing. However, one would never see the variance in data patterns if it was not plotted in a chart or graph (or analyzed data point by data point).
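For anyone who would rather check the numbers than take them on faith, here is a minimal Python sketch that computes a few summary statistics for each of the four datasets. The values are the quartet as published by Anscombe in 1973. The helper names (`pearson`, `summarize`) are my own, and only the standard library is used.

```python
from statistics import mean, variance

# Datasets I-III share the same x values; dataset IV has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    return cov / (sx * sy) ** 0.5


def summarize(xs, ys):
    """Mean and sample variance of x and y, plus their correlation."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }


summary = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in summary.items():
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} corr={s['corr']:.3f}")
```

The four printed rows agree to about two decimal places, yet a scatter plot of each pair looks nothing like the others. Feeding the same lists to any plotting library is the quickest way to see that for yourself.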
New South Wales, the home of Sydney, has created an online data repository that focuses heavily on innovation and visual design.
The website culls data from many different sources, ranging from the usual “archival” suspects like the Australian Bureau of Statistics, the State Records NSW and the National Archives of Australia, to some refreshing collections from the NSW Film and Television Office, the Historic Houses Trust and the Australian Dictionary of Biography.
This means users can do more than search for images and articles on popular topics related to New South Wales: they can also browse artworks, heritage sites, museum artefacts and related information on a map, or explore demographic data and compare different regions to each other.
Information Aesthetics has a great collection of links about the design and construction of the system that make for great reading.
The Visualizar’09 conference, subtitled “Public Data, Data in Public”, has issued a call for projects and papers to be presented at this year’s conference, to be held November 12-27.
At the first two annual Visualizar events, almost a hundred participants from Spain and abroad collaborated with international experts in the field of data visualization in developing seventeen prototypes that tell stories through data: from measuring atmospheric pollution in the streets of Madrid to the conversations held by users of social webs such as Twitter. Each year, Visualizar also includes a specialized symposium, educational activities open to the public, and an exhibit of the projects carried out.
In 2009, Visualizar will highlight the importance of data structures today in public decision-making and governance processes. There is a growing movement in favour of making databases generated by scientific research and the vast quantities of data generated by public administrations available to everyone, in formats that make it possible to reuse them and foster citizen innovation processes. What are the social benefits of fostering a culture of free, open data?
Even if you’re not attending the conference, they’ve collected a good list of public databases and visualizations suitable for a wide variety of research and publication, such as Data.gov and OpenSecrets.
It’s been a year since the Data Intensive Cyber Environments (DICE) group moved from UC San Diego to UNC Chapel Hill, and in that time they’ve worked closely with the Renaissance Computing Institute (RENCI) to build a network of data repositories across the state called the “Data Grid”.
When completed, the Data Grid in action might work like this: Data on development patterns around North Carolina would be stored at RENCI at UNC Charlotte, where researchers at the RENCI engagement center study urban growth patterns and their implications. An urban planner in eastern North Carolina would be able to access that data as well as the software tools that allow it to be viewed in a visual, intuitive format. Those same researchers also would be able to access coastal floodplain maps and storm surge visualizations stored at other data hubs and to use all of the information to plan sustainable coastal developments.
A great solution for visualization as a decision-making tool, if they can deploy it across a wide enough audience.
Anyone working in data analysis and visualization will tell you that the #1 problem facing them is file storage. As datasets get bigger and bigger, moving them from the HPC systems to the visualization resources becomes a bigger pain. Oak Ridge National Laboratory has been facing this problem for a while now, and has just recently stood up a distributed file system named ‘Spider’ to fix it.
Once a project ran an application on Jaguar, it then had to move the data to the Lens visualization platform for analysis. Any problem encountered along the way would necessitate that the cumbersome process be repeated. With Spider connected to both Jaguar and Lens, however, this headache is avoided. “You can think of it as eliminating islands of data. Instead of having multiple file systems all within the NCCS, one for each of our simulation platforms, we have a single file system that is available anywhere. If you are using extremely large data sets on the order of 200 terabytes, it could save you hours and hours.”
While this is nice, it still doesn’t solve the problem of then holding that data in memory. But at least you don’t have to spend a month waiting on an FTP transfer to finish anymore.
Update: I spoke with a source at ORNL, and they corrected a few things:
Spider isn’t new; it’s been around for at least a year.
So if it’s not new, why the press release? I’m not really sure, to be honest. My suspicion is that it was previously in a testing mode, but has just officially entered “production” and general availability.