A Vision For Smart and Connected Cities

Author: Deepti Kumra, South Hub Intern

During the second South Big Data Hub Smart Cities community call, Daniel Morgan, chief data scientist for the U.S. Department of Transportation (DOT), addressed what he sees as the DOT’s biggest data challenges. The DOT is actively encouraging developers to come up with applications to improve public safety, facilitate access to transportation, and help the department better understand traffic congestion.

The biggest challenge the DOT faces is fatal crashes. In August 2016, the Census of Fatal Crashes revealed that 35,072 lives were lost on the road in 2015, a 7.2% increase over 2014. Hypothesized causes for this increase include an improving economy leading to more travel, climate and weather changes, alcohol involvement, attitudes toward seat belt use, and distracted driving (for example, drivers texting or talking on the phone).

To understand why fatalities are on the rise, the DOT has started working with the private sector to bring more data to the table. One such partner, StreetLight Data, processes a variety of data sources into metrics such as Zone Activity Analysis Metrics, Trip Attribute Metrics, and Traveler Attribute Metrics. This breakdown allows analysis of who is moving on the roads during different times of day and days of the week. Analysts use the tool to understand how traffic conditions change at a local level, which could allow the DOT to offer advice on changing systems to make roads safer.
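
As a rough sketch of what such a breakdown can look like in practice, the example below aggregates a handful of hypothetical trip records into zone-activity counts by day of week and hour. The schema, column names, and values are invented for illustration and do not reflect StreetLight Data’s actual formats.

```python
# Hypothetical example: rolling raw trip records up into zone-activity
# metrics by day of week and hour (schema invented for illustration).
import pandas as pd

trips = pd.DataFrame({
    "zone_id": ["A", "A", "B", "B", "A"],
    "start_time": pd.to_datetime([
        "2016-09-05 08:15", "2016-09-05 17:40", "2016-09-06 08:05",
        "2016-09-10 12:30", "2016-09-10 23:10",
    ]),
    "traveler_type": ["commuter", "commuter", "commercial",
                      "resident", "visitor"],
})

trips["day_of_week"] = trips["start_time"].dt.day_name()
trips["hour"] = trips["start_time"].dt.hour

# Zone activity: how many trips touch each zone, broken down by
# day of week and hour of day.
zone_activity = (
    trips.groupby(["zone_id", "day_of_week", "hour"])
         .size()
         .rename("trip_count")
         .reset_index()
)
print(zone_activity)
```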

Another addition to the DOT’s smart cities focus is the recent Smart City Challenge, which encouraged cities throughout the country to share their most creative ideas for addressing challenges through data analysis and Intelligent Transportation Systems (ITS) technologies. SmartColumbUS was chosen as the winner because of the city’s Connected Columbus Transportation Network, the central infrastructure driving its smart city application.

The DOT found one of the most important elements of the SmartColumbUS initiative to be the connection of human services transportation needs to the city’s smart city application. SmartColumbUS goals include improving safety and mobility, addressing climate change, and creating ladders of opportunity that connect people to healthcare, jobs, and education.

Another key element of the program is an integrated data exchange that keeps all data open throughout the life of a project. The exchange will bring non-transportation data generated by the SmartColumbUS network and traditional transportation data from the city into one place, allowing the private sector, public sector, and academia to collaborate.

In addition to SmartColumbUS, the DOT is bringing even more pilot projects online – like the Advanced Transportation and Congestion Management Technologies Deployment (ATCMTD) awards and Mobility on Demand (MOD) Sandbox awards. These advanced transportation awards help state DOTs fund research that yields more data and solutions.

To learn about the South Big Data Hub and how it is helping people connect to and strengthen Smart Cities, follow or join our working group.

Highlighting Big Data at HBCUs

The C.R.E.D.I.T. Center: Big Military Data at HBCUs

Author: Taylor Mitchell

The Center of excellence in Research and Education for big military Data InTelligence, otherwise known as the C.R.E.D.I.T. Center, is Prairie View A&M University’s premier graduate-level program for the processing and effective sorting of complex data. The C.R.E.D.I.T. Center is one of three centers funded by the Department of Defense (DoD) at Historically Black Colleges and Universities. It is a one-stop shop for engaging students in Big Data education, analytics, and solving complex real-time problems for the military.

Continue reading

Working Group to Host Series of Demos


The South Big Data Hub Data Sharing and Infrastructure working group has enlisted the help of members from the southern region and is collaborating with the Midwest, West, and Northeast Big Data Hubs, including representatives from the National Data Service, XSEDE, the DataNet Federation, and the iRODS Consortium. The working group will conduct a requirements analysis of Hub spokes and members, map existing assets, schedule demos of key components for a federated system, and, through a testbed, demonstrate an analysis integrating NDS Labs, XSEDE, and the Discovery Environment.

Continue reading

Friday Webinar to Discuss Smart and Connected Cities

The explosion of digital data means changes in how we work, play, and interact with each other and with the technologies and devices we depend on. Nowhere is that change more apparent than in the movement to create smart and interconnected cities.

What started as an effort to integrate multiple information and communication technologies with sensors that collect data about transportation systems, power plant usage, water supply networks, and more has evolved into a transformation of urban environments using a data infrastructure that can monitor events, troubleshoot problems, and enable a better quality of life.

Continue reading

PyData Carolinas offers tools and tips for bioinformatics research


Clarence White, PhD student at North Carolina A&T State University

PyData Carolinas 2016 brought together hundreds of professionals, researchers, and students interested in data analysis to discuss how best to apply Python tools to challenges in data management, processing, analytics, and visualization. Among the attendees was Clarence White, one of two students from North Carolina A&T State University sponsored by the South Big Data Hub to attend. The Hub was also a silver sponsor of PyData Carolinas. Below are Clarence’s thoughts on the conference.

My name is Clarence White, a Ph.D. student in computational science and engineering at North Carolina A&T State University.  In my research, I’m working on applying machine learning methods to bioinformatics problems.  Some areas of interest to me have been beta lactamase and phosphorylation site prediction. Beta lactamase is one of the main reasons behind the development of antibiotic resistance among pathogenic bacteria, and protein phosphorylation plays an important role in a wide range of cellular processes.

Continue reading

South Big Data Hub announces awards that apply data science to regional challenges

Ashok Goel of the Georgia Institute of Technology is principal investigator for one of the three research teams that will receive Spoke awards from the South Big Data Hub.  (photo courtesy of Georgia Tech)

The awards are part of $11 million in National Science Foundation Big Data Hub “Spoke” awards

Three research teams in the Southern U.S. will receive funding for projects designed to use data science and data analytics to address challenges related to healthcare, environmental sustainability, and updating and improving power grids. The funding will be awarded through the “Big Data Spokes” program of the National Science Foundation’s (NSF) Big Data Regional Innovation Hubs initiative.

Continue reading

Learning the nuts and bolts of data integration: A DataStart Fellow’s perspective


Aziz Eram reflects on her DataStart experience.

Aziz Eram, a master’s student at the University of Arkansas at Little Rock studying information quality, is one of six graduate students who participated in the South Big Data Hub’s DataStart Program. DataStart provides funding that allows talented graduate students to work as student fellows with startups that need data science talent. She served her summer fellowship with Black Oak Analytics in Little Rock. Below are her thoughts about the program.

My name is Aziz Eram, and I had the opportunity to intern at Black Oak Analytics, a Little Rock data startup, through a DataStart Fellowship managed by the South Big Data Hub. I did not come to the program with any industry knowledge, but I have a bachelor’s degree in computer science and statistics and a master’s in applied mathematics. I was excited to be hired as an intern at Black Oak, and to say that I learned a lot in my internship is an understatement. I grew tremendously, learning foundational data mining and data-driven marketing skills. Black Oak Analytics provides advanced solutions that allow organizations of any size to convert data into recommendations and actions designed to improve profitability, competitiveness, and customer satisfaction.

What does the company actually do? Once an organization gathers large amounts of data about its customers and prospects, the quality controls around that data often remain a low priority. Yet the effectiveness and success of any solution is directly tied to the quality and organization of the data it is based on, and poor data quality can be costly and damage a company’s reputation. By assessing the full lifecycle of an organization’s data, from initial source acquisition through internal and external systems, Black Oak Analytics can identify areas in which the quality and treatment of data can be improved. Black Oak uses software called the High Performance Entity Resolution System (HiPER), an entity identity information management system that supports entity identity information across its full lifecycle. Black Oak also offers a rock-solid data governance plan to help customers make sense of their most valuable asset.

Black Oak’s mission is to become its clients’ trusted partner by helping them manage information as a corporate asset and use it as a competitive differentiator. The talent that surrounded me at Black Oak was fantastic, and I am very fortunate to have worked for a company that values collaboration, creativity, and culture. This internship gave me the opportunity to get my foot in the door while building on my education, helped me develop professionally, and fueled my confidence.

My internship mainly focused on data integration of unstructured entity references. The primary goal of my work was to develop and test a more general approach to the problem of resolving entity references in free-text format. To do this I used HiPER, which runs as a stand-alone entity identity management service and focuses on increasing both the reliability and the match rate of data. HiPER has a plugin interface for building custom comparators in addition to a wide array of built-in, industry-standard comparator functions.
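
HiPER’s actual plugin API is proprietary and not shown in this post, so the minimal Python sketch below only illustrates the general shape of a pluggable comparator: an object with a compare method that scores the similarity of two attribute values, which a resolution engine could invoke alongside its built-in comparators. All class and method names here are hypothetical.

```python
# Hypothetical sketch of a pluggable comparator; HiPER's real interface
# (a commercial system) may look quite different.
from difflib import SequenceMatcher


class Comparator:
    """Minimal plugin interface: compare() returns a score in [0, 1]."""

    def compare(self, value_a: str, value_b: str) -> float:
        raise NotImplementedError


class NormalizedNameComparator(Comparator):
    """Custom comparator: normalize case and whitespace, then fuzzy-match."""

    def compare(self, value_a: str, value_b: str) -> float:
        a = " ".join(value_a.upper().split())
        b = " ".join(value_b.upper().split())
        return SequenceMatcher(None, a, b).ratio()


# A resolution engine would call registered comparators and link two
# entity references when the combined score clears a threshold.
cmp = NormalizedNameComparator()
print(cmp.compare("Freedom  Mtg Corp", "FREEDOM MTG CORP"))  # -> 1.0
```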

Many industries and companies have data that exists in free text format, such as merchant/transaction descriptions on credit card statements, retail inventory details, medical and pharmacy records, etc. I was provided with two main data sets:

  • Lender name data sourced from public record information. This kind of data is mostly used by third party data compilers to create hotline marketing files of new homeowners and new borrowers.
  • Credit card transaction data from one of the top three credit card issuers in the country.

My task was to design and implement two new comparators called Business Parser and MAC (Multi-Valued Attribute Comparator). The Business Parser Comparator helps match different unstructured references to a single structured identifier. For example, “FREEDOM MTG CORP,” “FREEDOM MOBILE HM SALES INC,” and “FREEDOM MTG CONSULTANTS INC” were matched to the single identity “FREEDOM MORTGAGE CORPORATION.” While Business Parser generates a matching link based on only one identity attribute, MAC generates a matching link based on more than one attribute.
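
As a rough illustration of these two ideas (this is not Black Oak’s implementation; the abbreviation table and function names are invented for the example), the sketch below expands common business-name abbreviations so variant spellings map to one canonical form, and performs a multi-attribute match that requires agreement on more than one field.

```python
# Illustrative sketch only: abbreviation expansion for business-name
# matching, plus a simple multi-attribute agreement check.
ABBREVIATIONS = {"MTG": "MORTGAGE", "CORP": "CORPORATION",
                 "INC": "INCORPORATED", "HM": "HOME"}


def parse_business_name(name: str) -> str:
    """Expand abbreviations so variant spellings map to one form."""
    return " ".join(ABBREVIATIONS.get(tok, tok)
                    for tok in name.upper().split())


def business_parser_match(name_a: str, name_b: str) -> bool:
    """Single-attribute link: parsed names must agree."""
    return parse_business_name(name_a) == parse_business_name(name_b)


def mac_match(record_a: dict, record_b: dict, attrs: list) -> bool:
    """Multi-attribute link: every listed attribute must agree."""
    return all(record_a[a] == record_b[a] for a in attrs)


print(business_parser_match("FREEDOM MTG CORP",
                            "FREEDOM MORTGAGE CORPORATION"))  # True
print(mac_match({"name": "FREEDOM MORTGAGE CORPORATION", "state": "NJ"},
                {"name": "FREEDOM MORTGAGE CORPORATION", "state": "NJ"},
                ["name", "state"]))  # True
```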

How are these comparators useful? Once data is in a structured format, organizations can use it in many ways. For example, it can be used to generate more accurate reports, which in turn can result in improved inventory management, elimination of inconsistent pricing, improved sales, and improved operational efficiencies. The majority of my work at Black Oak Analytics dealt with entity resolution practices.

I have learned many different skills during my internship, including data mining, data matching, and data linking, and these will all help me build my career in data science. I want to thank my supervisors, Steve Sample and Dr. John Talburt, the HiPER team, and all the members of Black Oak for supporting and guiding me. Without their help, I would not have been able to complete the project. Last but not least, I am extremely thankful to the DataStart program for giving me this wonderful opportunity to work with these amazing people.

DataStart is supported in part by the Computing Research Association’s Computing Community Consortium.

NIH BD2K offers lecture series on fundamentals of data science

As big data becomes ubiquitous in research and business, more and more people are finding they need guidance on how to make the most of their data and follow best practices. The National Institutes of Health (NIH) Big Data to Knowledge (BD2K) initiative recognizes that for biomedical researchers and clinicians to take full advantage of the data revolution they need…well, data—in the form of training, guidelines, expert advice, use cases, and more.

To meet this need, the BD2K now offers a virtual lecture series on the data science underlying biomedical research, featuring weekly presentations from experts on the fundamentals of data management, representation, computation, statistical inference, data modeling, and other topics relevant to big data in biomedicine. The BD2K Guide to the Fundamentals of Data Science Series offers live streaming presentations every Friday from noon to 1 p.m. Eastern time. The presentations are also recorded and posted online for future viewing and reference.

Two sessions are already online, Introduction to Big Data and the Data Lifecycle and Data Indexing and Retrieval, and both can be viewed on YouTube. The next live presentation, Finding and Accessing Data Sets, Indexing and Identifiers, will be held Sept. 23 and will feature Lucila Ohno-Machado, MD, PhD, chair of the Department of Biomedical Informatics at the University of California, San Diego.

There is no cost for attending or viewing a presentation and no registration is required. For more information about the series, including a list of upcoming lectures, visit the BD2K Training Coordinating Center website.



South Hub Sponsors Materials and Advanced Manufacturing Workshop


On August 25, nearly sixty people from throughout the southern US gathered for a workshop on Data Infrastructure for Materials and Advanced Manufacturing. The event, sponsored by the South Big Data Hub and the Computing Community Consortium, was convened to assess and deliberate on the current state of the data infrastructure supporting the accelerated insertion of new and advanced materials into commercial products.

Stakeholders from industry, academia, national laboratories, and nonprofits convened to share their perspectives on challenges surrounding the use of data and informatics in materials discovery and development, and advanced manufacturing. The expertise of participants spanned materials science and engineering, design and manufacturing sciences, and computer and data sciences.

Speakers from industry included Rick Barto of Lockheed Martin, Kaisheng Wu of Thermo-Calc, Bryce Meredig of Citrine Informatics, Ramesh Subramanian of Siemens, and Rajiv Naik of Pratt & Whitney. Chuck Ward from the Air Force Research Laboratory and Turab Lookman from Los Alamos National Laboratory also presented their perspectives.

Following the talks, a series of smaller concurrent breakout sessions formed to discuss feasible crossover areas between industry and academic research. Michael Valley of Sandia National Laboratories moderated the session “high impact applications of data science in the materials-manufacturing sector.” Daniel Wheeler from the National Institute of Standards and Technology moderated a discussion on “challenges in the automation of the materials data life-cycle.” Raymundo Arroyave from Texas A&M University moderated a session on “education and training in materials-manufacturing data science and informatics.” David Fries of the Florida Institute for Human & Machine Cognition was the moderator of a discussion on “the materials-manufacturing innovation cyber-ecosystem.”

The whole group then reconvened for two all-inclusive round table discussions. Jason Hattrick-Simpers from the University of South Carolina and David McDowell of Georgia Tech led a discussion on developing a set of objectives and an associated roadmap for achieving them. Surya Kalidindi of Georgia Tech led a discussion on establishing an advanced materials and manufacturing “Spoke” at the South Big Data Hub.

After a reception and poster session, the event closed with a call to action: collect resources, create an online community for locating resources and for networking, and develop an administration transition paper. Co-Executive Director Renata Rawlings-Goss is currently seeking volunteers for leadership roles in developing resources for this new community. To participate, please contact her at rrawlings.goss@gatech.edu.

September* Opportunities and Announcements

[*September and beyond]

Dear South Big Data Hub Community,

This post contains the content that went out in our monthly newsletter: a listing of news, events, and opportunities in data science, analytics, engineering, and policy. If you have announcements or information that you would like to submit for next month’s newsletter, or if you would like to contribute a guest post to the South Big Data Hubbub! blog, please use our submission form or email announcements@southbdhub.org.

The South Big Data Hub Team

Continue reading