Featured

UW CSE News reports that UW has waived indirect cost on cloud services, removing a disincentive to the rational selection of research computing and storage options.

eScience Director Ed Lazowska writes:

This decision removes one of several bizarre disincentives to the rational selection of research computing and storage options – disincentives that plague universities nationwide.

Federal guidelines waive indirect cost on purchased equipment – so purchasing a $100K cluster costs a grant budget $100K, despite the fact that this equipment must be housed, powered, cooled, backed up, replaced …

Meanwhile, indirect cost is charged on outsourced cloud services – so purchasing $100K of AWS or Azure services costs $157K (at UW’s rates – different institutions have different markups), despite the fact that the only actual overhead is paying an invoice.

UW IT and the UW Office of Research have now decided to unilaterally waive this nonsensical charge.

Progress! Hopefully others will follow!

Read more here.

Three footnotes:

  1. There is precedent for national action: several years ago it was ruled, nationally, that indirect cost should not be charged on outsourced gene sequencing services.
     
  2. There are additional bizarre disincentives to the rational selection of research computing and storage options. If you want to purchase a large cluster, your NSF program officer will send you to the Major Research Instrumentation program, which is not charged against any specific Program, Division, or Directorate – so it’s “free” to his/her program … what could be finer? And once the cluster arrives at your university, Santa Claus pays for the power, Mrs. Santa Claus pays for the cooling, Rudolph shares his space, and the Elves do the backup … all of these, which have very real costs, appear free to the investigator at most universities.
     
  3. Finally, it goes without saying that cloud services are not the right choice for every application. What UW’s decision does is simply to take one step towards leveling the playing field, leading to rational choice.
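
As a back-of-the-envelope illustration of the budget arithmetic in the quote above, here is a minimal sketch; the 57% indirect cost rate is inferred from the $100K-to-$157K figure, and actual rates vary by institution:

```python
# Illustrative only: how the same $100K of direct cost hits a grant budget
# under the two regimes described above. The 57% rate is an inference from
# the quoted $100K -> $157K example, not an official figure.
INDIRECT_RATE = 0.57

def grant_charge(direct_cost, indirect_applies):
    """Total amount charged to the grant for a given direct cost."""
    multiplier = 1 + INDIRECT_RATE if indirect_applies else 1.0
    return direct_cost * multiplier

print(f"${grant_charge(100_000, indirect_applies=False):,.0f}")  # cluster purchase: $100,000
print(f"${grant_charge(100_000, indirect_applies=True):,.0f}")   # cloud, pre-waiver: $157,000
```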


Data Science Venn Diagram

O'Reilly Media has released the preface to Python Data Science Handbook (Early Release) by Jake VanderPlas, eScience's Senior Data Scientist and Director of Research, Physical Sciences.

"What is data science?" VanderPlas writes. "It's a surprisingly hard definition to nail down, especially given how ubiquitous the term has become. Despite its hype-laden veneer, [data science] is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia."

And why Python? "[It] has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets."

VanderPlas' book is geared toward technically minded students, researchers, and developers with a strong background in writing code and using computational and numerical tools. It is organized around a broad data science "mental model" of overlapping computational, statistical, and domain expertise known as the Data Science Venn Diagram. The first four sections of Python Data Science Handbook focus on the computational component: the language itself and the extensive ecosystem of data-focused tools available within it. The rest of the book discusses fundamental concepts of statistics and mathematics and their use in analyzing datasets. "The goal," says VanderPlas, "is that by the end readers will be poised to use these Python tools to process, describe, model, and draw inferences from the various data they encounter."
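
To give a concrete taste of that workflow, here is a minimal, self-contained sketch in the spirit of the book's pandas material; the dataset below is invented for illustration:

```python
# A tiny example of the "process, describe, model" workflow with pandas.
# The rainfall numbers are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Portland", "Portland"],
    "month": ["Jan", "Jul", "Jan", "Jul"],
    "rainfall_in": [5.6, 0.7, 6.1, 0.5],
})

# Describe: quick summary statistics for one column
print(df["rainfall_in"].describe())

# Process: average rainfall per city via split-apply-combine
print(df.groupby("city")["rainfall_in"].mean())
```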

VanderPlas encourages readers not to think of data science as a new domain or expertise to learn, but "a new set of skills that you can apply within your current area of expertise. Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, my goal is that the content of this book would give you the ability to ask and answer new questions about your chosen subject area."

You can read the preface to Python Data Science Handbook (Early Release) here:
https://beta.oreilly.com/learning/introduction-to-pandas


University of Washington's Andrew Connolly recently spoke with UW Today's Peter Kelley to discuss his work on the Large Synoptic Survey Telescope (LSST), calling it "one of the most exciting experiments in astrophysics today."

Connolly, a professor in UW's Department of Astronomy and a member of the eScience Executive Committee, oversees UW's data management group, which is developing the software to study the data the telescope will collect.

"The LSST isn't the biggest telescope in the world, nor does it have the highest-quality images. What it does have is a very large field of view and the largest digital camera in the world (with 3.2 billion pixels). This means it can survey half of the sky every three nights to discover if anything has changed or moved."

The LSST will begin scanning the sky in 2022 from atop Cerro Pachón in northern Chile.

You can read the full interview with Professor Connolly here:

http://www.washington.edu/news/2015/06/23/visualizing-the-cosmos-uw-astronomer-andrew-connolly-and-the-promise-of-big-data/

DSSG student researchers join eScience staff for a social outing on the program's first afternoon.

The eScience Institute kicked off its inaugural Data Science for Social Good (DSSG) summer program the week of June 15th. Modeled after similar programs at the University of Chicago and Georgia Tech, with elements from our own Data Science Incubator, the goal of the DSSG program is to enable new insight by bringing together data and domain scientists to work on focused, collaborative projects that are designed to impact public policy for social benefit.

The theme for this year’s DSSG projects was Urban Science. We encouraged proposals involving the analysis, visualization, and/or software engineering of data from urban environments, across topic areas including public health, sustainable urban planning, crime prevention, education, transportation, and social justice.

Below are summaries of the four DSSG projects selected.

The objective of Assessing Community Well-Being Through Open Data and Social Media is to give neighborhood communities a better understanding of the factors that affect their well-being. Through crowd-sourced community networks that draw on diverse social media and open data sources, neighborhoods can identify emerging issues, see how they compare with other neighborhoods on key factors, and coordinate a community response. While the project’s goal is to provide tools that can serve all neighborhoods, the team hopes to actively engage underserved neighborhoods in designing the program.

Project Lead: Shelly D. Farnham, Third Place Technologies

http://thirdplacetechnologies.com/

King County Metro Paratransit is an on-demand public transportation program that provides a vital link to mobility for people with disabilities who are unable to use traditional fixed-route services, picking passengers up at or near their doorstep and delivering them to their specified destination. A King County Metro paratransit trip currently costs approximately ten times as much as an equivalent trip on a fixed-route service, and to date there has been little investment or research in the technical complexities of providing ADA paratransit. By analyzing current Metro system information and providing real-time cost analysis, the project aims to help dispatchers and schedulers make informed, more efficient routing decisions that improve the paratransit services offered to passengers while containing the costs of those services.

Project Lead: Anat Caspi, University of Washington, Computer Science & Engineering

http://metro.kingcounty.gov/tops/accessible/programs/access.html

Open Sidewalk Graph for Accessible Trip Planning takes on the challenge of designing an open-source software toolkit and set of algorithms to help people with limited mobility plan a commute. By developing city-wide sidewalk accessibility analytics and applying routing algorithms, the project aims to assemble disconnected sidewalk segments into a coherent graph, providing rapid and convenient routing that avoids steep hills, uncrossable intersections, stairs, and construction that blocks sidewalks; a toy sketch of this routing idea appears below.

Project Lead: Nick Bolten, University of Washington Department of Electrical Engineering

http://www.geekwire.com/2015/app-that-helps-people-in-wheelchairs-plan-travel-routes-wins-first-place-at-civic-hackathon/
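
For readers curious how a sidewalk graph might look in code, here is a toy sketch; the segment data, incline threshold, and cost function are hypothetical assumptions for illustration, not the project's actual toolkit:

```python
# Toy version of accessible routing: sidewalk segments become weighted graph
# edges, steeper segments cost more, and impassable segments are excluded.
import networkx as nx

MAX_INCLINE = 0.08  # assumed accessibility threshold (8% grade)

def edge_cost(length_m, incline, blocked=False):
    """Traversal cost for one sidewalk segment, or None if impassable."""
    if blocked or abs(incline) > MAX_INCLINE:
        return None
    return length_m * (1 + 10 * abs(incline))  # penalize steeper segments

G = nx.Graph()
segments = [  # (from, to, length in meters, incline, blocked)
    ("A", "B", 100, 0.02, False),
    ("B", "C", 80, 0.10, False),   # too steep: excluded from the graph
    ("B", "D", 120, 0.03, False),
    ("D", "C", 90, 0.01, False),
]
for u, v, length, incline, blocked in segments:
    cost = edge_cost(length, incline, blocked)
    if cost is not None:
        G.add_edge(u, v, weight=cost)

print(nx.shortest_path(G, "A", "C", weight="weight"))  # ['A', 'B', 'D', 'C']
```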

Photo courtesy Bill & Melinda Gates Foundation

The mission driving Predictors of Permanent Housing for Homeless Families in King, Snohomish, & Pierce County is to make homelessness rare, brief, and one-time. More than 5,000 homeless families with children in the Puget Sound region spend an average of eight months moving from shelter to shelter. The project’s main objectives are to identify the barriers that prevent homeless families from finding housing, as well as the trends and factors that affect a family’s length of stay in a homeless shelter. The research will be used to improve decision making and resource prioritization, helping homeless families find permanent housing and reducing their length of stay in shelters.

Project Leads: Neil Roche & Anjana Sundaram, Bill & Melinda Gates Foundation

http://www.impatientoptimists.org/Posts/2015/02/Better-Data-to-Reduce-Homelessness#.VWdwjFxVhBd

Overview: The NSF-sponsored Graduate Data Science Workshop will bring together 100 graduate students from diverse science and engineering domains with data scientists from industry and academia to discuss and collaborate on Big Data / Data Science challenges.

Participation: To participate in the workshop, submit a white paper in PDF format that describes a Big Data / Data Science challenge faced by your scientific or engineering discipline, or an idea for a new tool or method addressing a Big Data / Data Science problem. White papers will be reviewed using NSF scoring criteria, and attendees will be selected based on the strength of their white papers. If you are selected to attend, you must bring a poster to present at one of the two poster presentation sessions. The authors of the highest-scoring white papers will be invited to give lightning talks of a few slides during the plenary session describing their challenges or methods. The white paper submission deadline is June 20th, 2015. Invitees will be notified on July 1st, 2015.

Program: In addition to keynote presentations from high-profile speakers, participants will present posters covering their own research and will work collaboratively to begin solving some of the Grand Challenge problems facing data-enabled science and engineering disciplines.

Community building: After the workshop, the output from the collaborative teams will be published in an open-access venue. Through the shared work at the workshop and beyond, participants will form lasting, collaborative relationships with their peers and with senior academic partners and industry participants, including those from companies like Amazon, Google, and Microsoft.

If you are invited to participate, travel support of up to $1,000 will be available, which can be used to cover registration and lodging fees in addition to airfare. Most meals are included. Workshop registration is $200. Lodging for two nights is $123 (two beds) or $195 (single).