Duke ITAC - November 5, 2009 Minutes
ITAC Meeting Minutes
November 5, 2009 4:00-5:30
RENCI Engagement Center
- Announcements & meeting minutes
- Research computing directions (Alvy Lebeck, Jeff Chase, John Pormann)
- DSCR cluster growth and constraints (John Pormann)
- VoIP and campus wiring update (Angel Wingate, Bob Johnson)
Announcements & Meeting Minutes
Terry Oas opened by asking ITAC members present at the October 22, 2009 meeting if they had comments on the minutes. Kevin Davis said the minutes had not yet been distributed because the audio recording failed to complete.
Research computing directions - Alvy Lebeck, Jeff Chase, John Pormann
Terry said a number of years ago the Provost decided to evaluate how best to invest university resources in research computing. This question led to numerous discussions on research computing. The Provost approved the administrative structure that was proposed based on these discussions. Recent economic issues changed the priorities of the original administrative plan. Terry said this ITAC discussion was intended to update ITAC members on the changed priorities resulting from new economic realities, specifically which items were removed and which continued.
Alvy Lebeck said the research computing goal was to provide institutional support for common research computing needs. Some of the reasons to do this are to gain economies of scale and reduce administrative overhead. In addition, this would provide a central place for front-line support that primarily serves research. Another goal was to broaden participation, that is, to welcome “new users” by serving as a point of entry into research computing.
Alvy said the proposed idea became the Research Computing Center (RCC). He said there is a high-level faculty advisory board, the Research Computing Advisory Council (RCAC), which Alvy chairs. In addition, each research computing center has a faculty advisory board. Some of these members serve on RCAC. There is also a faculty director (Jeff Chase) and a research director, he said. The two initial research centers were for Scalable Computing and Visualization. Alvy showed a diagram that presented the overall institutional IT support structure for research computing. There are staff who actually implement decisions, and a faculty group that sets policy.
Alvy said RCAC provides oversight for research computing as a whole. The budget impact has been that there is only one research computing center. There is no visualization research computing center. He said the goal was to stand up one scalable computing center first. Alvy showed a list of the members of the Research Computing Advisory Council. The group provides an ear for the university’s core research computing needs.
Jeff Chase, a Computer Science professor, chairs the Scalable Computing Advisory Committee (SCAC). The Scalable Computing Support Center (https://wiki.duke.edu/display/SCSC/Home) is led by John Pormann and currently has one dedicated staff member. There is another current opening for a second consultant. The SCSC also depends heavily on OIT for operational infrastructure support.
Jeff said this is not just for High Performance Computing. It is also for “medium scale computing.” The purpose is not exclusively for jobs, clusters, or the Duke Shared Cluster Resource (https://wiki.duke.edu/display/SCSC/DSCR). This effort is an umbrella for all of these things. The center’s priority is to maximize the research ROI. This includes not just the capital expenditure, but also reducing faculty startup time. The center is very data driven in order to make informed, quality decisions about how resources are used. Another component is to expand access and provide support for different computing models and groups. This is an effort to ensure Duke University can fit into, and gain visibility within, the larger international cyber-infrastructure.
Jeff said the DSCR is really a hardware substrate. He added that the DSCR service runs on top of that substrate. The center wants to move towards using the infrastructure for multiple uses. An example of another service that uses the same hardware substrate is the Virtual Computing Lab (VCL). An additional effort is to measure and improve the “green effect” of these environments.
The model behind the DSCR, which Jeff called “Gang Computing,” aggregates shareholders’ environments into a common substrate and places that substrate under the management of a single Provider. This Provider group, OIT, provides additional hardware substrates (network, power, etc.). This model means faculty should have better ease of use. In addition, faculty do not have to spend money on managing that substrate since the Provider does it. Customers also have a chance to use other resources at times when they are not being actively used by other groups. The Provider also gets economies of scale and efficiency gains. Jeff said the Provider may be able to take advantage of surplus resources, should policies permit it.
Jeff said there are policy questions to ensure faculty get ready access to their resources. Faculty want to make sure they get what they paid for.
Jeff said there are other policy questions. For example, how does Duke manage the lifecycle? How does Duke set standards for what equipment can go into the cluster? There are concerns regarding retirement of machines from the cluster. What are the Provider’s requirements regarding turnaround time for installing new equipment?
Jeff said the SCAC identified two core issues to address. One is to develop methods to encourage owners of older machines that may be limiting overall cluster performance to retire that equipment. Another issue is ensuring the Provider, OIT in this case, is committed to providing good service, supporting the infrastructure, and quickly deploying newly delivered machines, he said.
Jeff said some trends have been identified. One is the need to calculate the growing cost of the substrate socket, specifically, the storage and networking components. Virtualization support is becoming more lightweight. In addition, there are more multi-core machines and that is driving demand. Jeff said there are changing demands for applications and storage access. The current batch computing process runs from a single NetApp Filer service that has experienced some issues. Lastly, Jeff noted a trend that groups are asking for faster interconnections, and the SCSC has been exploring heterogeneous computing, such as GPUs.
John Board asked Jeff to characterize the demands of new users. John Pormann said the usage of the environment is over 90%, in part due to some challenges bringing some new equipment online. He said if they could keep the usage closer to historical averages of 70-80%, they would likely not experience some of those same issues. The long-term utilization rate of 70-80% in an 800-machine cluster translates to about 100,000 monthly CPU hours that are not being used.
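The idle-capacity figure can be checked with a short calculation. This is a minimal sketch that assumes one CPU per machine and a 730-hour month, neither of which was stated in the meeting. Under those assumptions the result is of the same order as the figure quoted above, and multi-core machines would raise it.

```python
HOURS_PER_MONTH = 730  # assumption: 365 days * 24 hours / 12 months


def idle_cpu_hours(machines, utilization, cpus_per_machine=1, hours=HOURS_PER_MONTH):
    """Monthly CPU hours left unused at a given utilization (0.0 to 1.0)."""
    return machines * cpus_per_machine * hours * (1.0 - utilization)


# 800 machines at the historical 70-80% utilization range from the minutes.
low = idle_cpu_hours(800, 0.80)
high = idle_cpu_hours(800, 0.70)
print(f"{low:,.0f} to {high:,.0f} idle CPU hours per month")
```

Passing a larger `cpus_per_machine` (a hypothetical parameter name) shows how quickly the idle pool grows once multi-core machines are counted per core.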
John B. asked if there have been concerns raised about the fairness of use. John P. said that all users have equal access to low priority machines. If a group has purchased equipment, they have high priority access to those machines which provides those customers the ability to direct traffic to their machines. Jeff said jobs that come into the high priority queues may result in lower priority queues being suspended.
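The queue behavior described above can be sketched as a toy model. This is not the DSCR scheduler's actual configuration; the `Machine` class, the `submit` function, and the group names are hypothetical and only illustrate the policy that an owner's job suspends a low-priority job on the owner's machine.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Machine:
    owner: str                                   # group that purchased the machine
    running: Optional[Tuple[str, bool]] = None   # (job name, is high priority)
    suspended: List[str] = field(default_factory=list)
    waiting: List[str] = field(default_factory=list)


def submit(machine: Machine, job: str, group: str) -> str:
    """Place a job on a machine and report what happened to it."""
    high = group == machine.owner  # only the owning group gets high priority
    if machine.running is None:
        machine.running = (job, high)
        return "started"
    running_job, running_is_high = machine.running
    if high and not running_is_high:
        # The owner's job suspends the low-priority job already on the machine.
        machine.suspended.append(running_job)
        machine.running = (job, True)
        return "preempted"
    machine.waiting.append(job)
    return "waiting"


node = Machine(owner="igsp")
print(submit(node, "guest-job", "physics"))    # started
print(submit(node, "owner-job", "igsp"))       # preempted
print(submit(node, "late-job", "chemistry"))   # waiting
```

In this sketch a suspended job is only recorded, not resumed; a real scheduler would restart it once the high-priority job finishes.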
Terry stated that if in principle all the nodes were “equal”, then users would not care how to submit their jobs. He asked if customers are required to specify the machine that they want to run their jobs on. John P. said those jobs are currently tied to specific physical machines. Robert Wolpert said it is very difficult to keep a homogeneous environment. Jeff said there are many different issues around reducing the diversity of the underlying substrates.
Terry asked to what level the socket enforces a different level of homogeneity in terms of the substrate. Jeff said the socket is standard network file services and IP. In that sense, it does not limit the substrate at all, he said. In the case of the blade environments, whose goal is to create something that was more manageable, there are questions about what specifically the faculty are paying for. Terry clarified that when faculty purchase a node, they add the infrastructure that supports the node in addition to the node itself.
Dave Richardson said it is possible for faculty to “get their feet wet.” What is the process for faculty to go from evaluation to a “full time member”? John P. said some users have only one or two machines at a cost of about $3000/machine. He added that the main driver for adding their own hardware is to have access to the high priority queues.
Jeff said new customers don’t necessarily have to buy machines. The environment may have surplus availability and policy may allow that to be used for this group of users. Dave R. asked if there is a cost structure to do the small sampling. Jeff said the cost is effectively the power consumption. Jeff said an open question is what is the cluster’s right to restrict a faculty member’s job submission if there are in fact available resources to execute the job.
Susan Gerbeth-Jones asked if faculty can write or execute parallel code that could impact other faculty. John P. said only 10% of the jobs are parallel jobs. Robert W. described these jobs as ones that coordinate multiple processors simultaneously.
John P. said BioInformatics may launch as many as 500 to 1000 jobs all at once but those each only use one CPU at a time. Jeff described a recent job that created unexpected spikes.
Robert W. asked about jobs that can leverage CUDA and GPU. He said it may also be possible to leverage some government NSF or NIH computing resources. He asked if the DSCR served as a clearinghouse for these types of services. John P. said they have worked with RENCI to access the Open Science Grid. John P. said with a Gigabit Ethernet connected system, jobs will not scale beyond ten to twenty machines to these outside networks.
Jeff said Duke was in the 2006 Petascale proposal. They surveyed the faculty to see what kinds of applications might be able to take advantage of that leadership class computing but found insufficient need for it. Terry said this may speak to a common ITAC refrain of improved publicity and communication. Jeff said John P. runs numerous training courses through the SCSC so people can find out about these technologies. The last SCAC meeting suggested cultivating specific contacts in departments.
Rafael asked if they have run into software licensing issues. John P. said that the cluster is willing to run the software, but not to host it. He added that much of the software is not priced competitively for clusters. Robert asked if this was a common issue. John P. said the DSCR is seeing an increase in Matlab and Mathematica use.
Dave R. asked what kind of software and hardware are running in this environment. The cluster is running CentOS 4.5 and will soon be upgraded to CentOS 5. The compilers are from Intel. Jeff said upgrading the OS across the cluster is a significant operational staffing commitment.
Michael Ansel asked if the DSCR was going to be virtualized or stay “bare metal”. Jeff said the DSCR substrate is running bare metal. He noted that virtualization is getting cheaper. Deploying virtualization could be viewed as taking away resources from some groups. Jeff said he thought this would be a productive move when the time was right.
DSCR cluster growth and constraints - John Pormann
John P. showed a chart that highlighted the growth in machines added to the DSCR cluster. The machine counts listed were not indicative of the amount of overall processing power. He highlighted that there has been significant annual variation in the number of machines added. Jeff added that the last 500 machines added actually have eight cores. John said in 2003 the machines were generally dual core, in 2007 they were generally quad-core, and in 2008 they were dual quad-core. When this is accounted for, there is potentially a “2-3x performance increase per processor.” This represents greater than an order of magnitude improvement for overall computer performance, John P. said.
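The order-of-magnitude claim can be reproduced roughly from the numbers given. This is a minimal sketch that reads the quoted "2-3x" gain as a per-core figure, which is an interpretation rather than something stated in the meeting.

```python
CORES_2003 = 2          # "generally dual core"
CORES_2008 = 8          # "dual quad-core"
PER_CORE_GAIN = (2, 3)  # quoted "2-3x", read here as a per-core gain (assumption)


def machine_speedup(old_cores, new_cores, per_core_gain):
    """Per-machine speedup: core-count ratio times the per-core gain."""
    return (new_cores / old_cores) * per_core_gain


speedups = [machine_speedup(CORES_2003, CORES_2008, g) for g in PER_CORE_GAIN]
print(speedups)  # [8.0, 12.0]
```

Under this reading a 2008 machine is roughly 8 to 12 times a 2003 machine, so the upper end of the range exceeds an order of magnitude.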
John said his group has begun decommissioning machines from 2004. John P. noted that there were some periods where data center constraints limited the ability to add new nodes. John noted that the data he showed was not indicative of interest in the service. Specifically, a lull in new machines in July 2004 was a result of moving data centers, so new machines could not be added. In June 2006, there was a need to add overhead cooling so, again, new machines were postponed. In July 2007, CSEM shut down and there were concerns about who to talk to in terms of adding machines to the cluster.
The cluster has grown by an average of 140 machines per year. John P. said that the Duke Institute for Genome Sciences and Policy (http://www.genome.duke.edu/) added many machines last year. John said over the last six to nine months his group has removed 190 machines. John P. showed a chart that examined the growth in cores, despite the drop in the number of machines. John B. suggested the data reflected a ten-to-one difference in performance.
John P. said the DSCR is in 145 North Building. They have nineteen racks. Each rack supports three Dell chassis. At that point, the room’s AC is maxed out. Nine hundred and twelve Dell blades will max the physical rack capacity out. John P. said Duke could be even more compact, for example, putting four chassis in a rack. This would require an additional 14 tons of AC and running more power from the street. John said if Duke needed additional capacity, this data center could be utilized. Robert asked if there was another place, like CIEMAS. John said that the Fitzpatrick phases were four and nine kilowatts per rack. He added that the Dell blade requirements exceed that. Jim Siedow asked if the changes would really add significant abilities.
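The rack figures above imply a blade count per chassis that was not stated directly. A minimal sketch of the arithmetic follows; the value of 16 blades per chassis is inferred from the numbers in the minutes, and the four-chassis total is a projection under the same inference.

```python
RACKS = 19              # from the minutes
CHASSIS_PER_RACK = 3    # from the minutes
MAX_BLADES = 912        # from the minutes

# Inferred, not stated in the meeting: blades that fit in one chassis.
blades_per_chassis = MAX_BLADES // (RACKS * CHASSIS_PER_RACK)


def blade_capacity(racks, chassis_per_rack, per_chassis):
    """Total blades the room can hold for a given rack layout."""
    return racks * chassis_per_rack * per_chassis


current = blade_capacity(RACKS, CHASSIS_PER_RACK, blades_per_chassis)
denser = blade_capacity(RACKS, 4, blades_per_chassis)
print(blades_per_chassis, current, denser)  # 16 912 1216
```

Moving to four chassis per rack would add room for 304 more blades in the same nineteen racks, which is the capacity the extra AC and power would buy.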
Robert asked if all of this would actually fit the need. John said the target would be to replace machines at the five-year mark. In addition, the annual increase would need to remain steady at around 140 in order to remain in 145 North Building. John B. said the opportunity cost of leaving old machines in place can be significant for others. Jeff C. said the SCAC understands that these expectations need to be in an agreement that commits members to upgrade.
Steve Woody asked if there were any concerns about storage needs. John P. said they have largely stayed out of the storage game. This environment has a fourteen Terabyte NetApp Filer. Groups generally stay at or below the preset 100GB quota, though increases can be made.
Robert asked if there are users with mass data needs. John said that IGSP has their own data files on their own servers. Fiber was run to connect the DSCR to the IGSP storage, John P. said. He added that multi-Terabyte customers present a challenge. Michael A. asked if customers could purchase their own storage. Jeff said that is the IGSP solution. John added that cost is a concern, in addition to Quality of Service.
Susan asked how remote access is provided to the data. John P. said it is FTP and SCP. He added that WebDAV is being considered.
John P. said there is some capacity currently available. There are currently ten open chassis slots. There are plans to decommission 172 machines over the coming months. In effect, 275 new blades can be added over the coming year.
John P. said GPU processing can actually be a significant increase in the power draw. He said some GPU cards could perform a trillion operations per second.
John P. said the traditional 1U systems draw about 300 watts per rack unit. The blade-based systems require around 480 watts per rack unit. John P. said some GPU rack configurations result in twice the power density.
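The power figures can be related to the per-rack limits mentioned earlier for Fitzpatrick. This is a rough sketch that reads both draw figures as watts per rack unit and assumes each blade chassis occupies ten rack units; the chassis height was not given in the meeting, and `rack_power_kw` is a hypothetical helper.

```python
WATTS_PER_U_1U = 300               # traditional 1U systems, from the minutes
WATTS_PER_U_BLADE = 480            # blade systems, read here as per rack unit
FITZPATRICK_KW_PER_RACK = (4, 9)   # per-rack limits quoted earlier in the minutes


def rack_power_kw(watts_per_unit, occupied_units):
    """Power drawn by one rack, in kilowatts."""
    return watts_per_unit * occupied_units / 1000


# Assumption: three chassis per rack at ten rack units each.
blade_rack_kw = rack_power_kw(WATTS_PER_U_BLADE, 3 * 10)
print(blade_rack_kw)                                 # 14.4
print(blade_rack_kw > max(FITZPATRICK_KW_PER_RACK))  # True
```

Under these assumptions a three-chassis blade rack draws more than the nine kilowatts available per rack in Fitzpatrick, which matches the statement that the Dell blade requirements exceed that limit.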
Julian Lombardi asked how much cloud computing could help. John P. said studies show that cloud computing computational capability in support of scientific research is not significant. However, some jobs could be sent to the cloud. Terry asked if customers are looking to process tasks in the cloud, if the SCSC would be the conduit for that. John P. said some of that could be handled behind the scenes. Terry asked if it was realistic for the SCSC to serve as the central clearinghouse for all cloud computing concerns with its current staffing levels. Jeff said not for all of the institution’s needs. Jeff said cloud computing does not require groups to go through the SCSC.
Susan asked if the center would be able to provide storage. John P. said that is not something that they are looking to do.
VoIP and campus wiring update - Angel Wingate, Bob Johnson
Bob Johnson said the original scope was to replace end-of-life phones. The goal was to converge to a single platform. The number of lines to migrate is about 30,000. The project is approximately 31% complete (http://www.oit.duke.edu/vvw/telecom/voip/university.php). This does not account for all the discovery work and consultation with departments. Jim Siedow asked if this will cover all of campus. Bob confirmed it would. Alvy asked if this project was going by building or by department. Bob said the goal is to migrate departments jointly to maintain commonality of service.
Bob said the campus is approximately 60% complete. The deployment rate slowed in November to accommodate some large construction projects.
The campus operators’ migration, a very technical project, is scheduled to be completed in January 2010. Bob said the Automatic Call Distribution Centers (ACD) will be migrated by March 2010.
Bob said the Avaya system that is at end of life will be retired in August 2010. Robert W. asked what is going to happen to the old G3. Angel said this equipment is EOL and out of support.
Robert asked if there are any concerns with analog devices, like faxes, working in the new environment. Angel said they will continue to function though some dialing information might need to be updated on the devices. Robert W. asked if departments can move phones without contacting OIT. Angel said they will be able to do that for digital phones.
Bob said that the overall project is estimated to complete around July 2011.
Bob said we have experienced issues with the soft phone deployment. This is an application on computers rather than a physical phone. He added that the Macintosh OS had more challenges with this application.
Another concern was the need to harden critical phone services when designated. In these environments, there is a new process for installing remote power supplies (RPS). In addition, OIT is working with contacts to identify critical services. The plan is to have closets dual-corded into two different power supplies to mitigate that risk. Bob Newlin added that this “hardening” of the network closet is in effect improving the computing network reliability as well.
Another effort recently begun was to have Cisco Advanced Services certify our environment. OIT is working to bring our current environment up to their recommended configuration and explore a Service Level Agreement with Cisco. The Cisco assessment provided high marks on redundancy and availability. Bob said the recommendations were primarily around future manageability. Bob said the recommendations were for both approach and future direction. One example was that Duke has never looked at voicemail as an essential service.
Robert W. asked if power over Ethernet (POE) interferes with any services. Bob said that the campus has been in that environment for a while.
Ed Gomes asked if departmental contacts have worked with technical staff. Bob said a 9000-port survey showed that Duke averaged about 20-21MB per port. Bob said the replacement crews have been trained to inspect the layout of equipment in offices to ensure optimal configurations. Angel said other groups identified the contacts OIT works with. She said OIT relies on those staff members to engage the appropriate IT staff.
Bob Newlin said the phones do not have to use POE: they can be powered separately. Rafael noted that they removed all of those external power connections because it caused problems in their areas.
Terry asked to what extent VoIP impacts existing Internet access for departments. Bob said OIT testing had not shown an issue with this. He asked that any perceptions of service degradation be reported. There are contingencies for dealing with areas with unique bandwidth needs.
Angel said the estimated annual savings from the VoIP project is $2.7 million. Angel said another possible savings could be in reduced power consumption. Rafael said for new buildings the goal is to have the same phone system and reduced wiring costs.
Bob N. added that there is enhanced functionality with the new phones.
Terry asked if there are any Long Distance savings. Angel said Long Distance charges are still incurred. Angel said OIT is looking at how to structure the long distance packages into different allocation options, for example, unlimited long distance in the US. Rafael said he expected additional savings for intra-Duke Medicine calls when the conversion was complete.
Terry said that he sometimes uses Skype for international collaboration. He asked if VoIP would change anything about that. John B. said it would not change anything about that use case.
Rafael said in places with call centers, Duke could locate agents in different locations. He added that this would likely introduce new business and workflow models.