DUKE ITAC - December 4, 2003 Minutes
December 04, 2003
Members present: Pakis Bessias, John Board, Brian Eder, Tracy Futhey, Michael Gettes, Patrick Halpin, Billy Herndon, David Jamieson-Drake represented by Bob Newlin, Kyle Johnson, Roger Loyd, Greg McCarthy, Melissa Mills, George Oberlander, Lynne O'Brien, Mike Pickett, Rafael Rodriguez, Molly Tamarkin, Robert Wolpert, Steve Woody
Guests present: Jim Roberts, Provost's Office; Dan McCarriar, OIT; Chris Meyer, OIT; Chris Cramer, OIT; Amy Campbell, CIT; Ginny Cake, OIT; David Menzies, OIT; Deb Johnson, Provost's Office; Neal Caidin, CIT; Michael Garvin, OIT; Heather Flanagan, OIT; Sean Dilda, OIT; Bill Rankin, CSEM
I. Review of minutes and announcements:
Tracy Futhey talks briefly about Dan McCarriar and his transition into a new role as Director of Network Services.
Mike Pickett mentions the upcoming digital archive Futures Forum. Paul Conway will coordinate. Mike says that anyone wishing to describe their archiving approach at the Futures Forum should talk to Paul.
II. Password strength checking
Presented by Chris Cramer
Chris Cramer explains that a year and a half ago - April 2002 - we talked about checking the strength of NetID passwords, thought it was a very good idea, and agreed we should go forward and do it. A year and a half later we have the equipment, the tools are working, and in January we will start doing regular checking of the NetID database. There will be a week-and-a-half of strength checking. Ginny Cake and Chris Cramer have started discussions about how to disseminate this information to users, and whether any type of information campaign is necessary.
III. Blackboard Outage Report
Presented by Michael Gettes
Michael Gettes provides a summary of a report on the issue (handout). In September there were some systems administration mistakes and some management mistakes in communications leading to a Blackboard failure. The failure was not a technology failure. The communication breakdowns caused a loss of Blackboard data. OIT has taken a number of steps to identify problems, as noted in the report, having to do with how data is backed up, what to do in case of a system failure, and what best practices should be employed when implementing changes.
[Michael is asked to clarify the reporting structures.]
Michael says in the context of Blackboard, there is a team of folks working together. The system wasn't being treated as an enterprise system like SAP, and that caused some breakdown in how the team was communicating. The system administrators report up to Heather Flanagan and up to Michael Gettes, and Carl Ross up to Pat Driver and Billy Herndon.
Billy Herndon comments that the model did not map out like PeopleSoft. In this case, we did not have one group in the applications group to coordinate. Chris Meyer is now the owner for this application within OIT.
[Question is asked: what is the difference between this system and others?]
Tracy Futhey says the issue is with the internal operating structure; everybody had responsibilities but nobody had "the buck stops here" responsibility. Blackboard isn't the only one with this issue, but it's the one that has been identified. This was an OIT problem. Many enterprise systems have developed with operating responsibility within OIT; this one had shared ownership. OIT is looking more broadly at other systems - Blackboard gave us the wake-up call. Regardless of the origins of a system, someone in OIT will be responsible.
Billy Herndon says that, having found where best practices needed improvement, we're looking through our whole inventory of support to see where changes and adjustments need to be applied in other places.
IV. High performance computing update
Presented by Mike Pickett
Mike Pickett provides a starting point for the discussion: with the demise of the North Carolina Computing Center, a need was created that is starting to be filled by the Linux high-performance computing cluster. Bill Rankin of CSEM and the OIT team stepped up and built a core cluster; Tracy Futhey provided funding for the cluster and, along with CS, has provided space down in North 011. Faculty members can bring funding to the table to have nodes purchased by OIT that will be installed and taken care of by OIT and CSEM; if you fund a node in the cluster, you will have priority access. When the node is not being used, it will be shared with other users. OIT has hired Sean Dilda to be the cluster systems administrator. Another exciting resource is the Sunfire 12000. The Center for Human Genetics received a grant for it, and OIT hired a Sunfire systems administration expert, Mike Garvin. Mike and Sean report to Heather, and make sure these large systems keep running. Bill Rankin and John Pormann provide algorithm guidance to scientists and help guide the cluster design. Mike Pickett discusses the large memory model machine (Sunfire 12000) - unlike a small PC node, this machine can handle very large memory projects. John Harer is the leader of the CSEM efforts, and has created a steering committee, mostly composed of faculty. This committee will allocate time on the machines; if you need additional nodes but have no funding, you can still apply for spare time.
Bill Rankin says the original core was 68 Dell Xeon nodes. Tom Keppler's group brings extra contributions, as does the Environmental School. Other groups want to participate. Tracy Futhey says in terms of capacity we are upgrading air handling equipment.
[Question is asked, what will we max out at?]
Bill Rankin replies we should be able to handle a couple more racks, after the new air handler is added - the room should be a lot cooler. Perhaps 200 more nodes - we are not space-bound right now, but the cooling and power will need to be watched carefully.
[Melissa Mills asks how depreciation and backup are handled.]
Bill Rankin replies that currently there is around 1TB of disk space backed up to a local tape library five nights a week. We will keep using that on an interim basis until we run out of disk space; we have had conversations about centralized backup and adding more disk. Tracy Futhey says anyone who buys into the cluster buys a three-year presence. You fund the nodes and they will be available for the three-year term. Melissa Mills says disk failures will accelerate over time.
Tracy Futhey says she assumes we can keep them replaced and up and running. A three-year maintenance purchase is part of the configuration.
Melissa Mills says it is very good to have minimum requirements for the type of node you will provide.
Brian Eder asks about minimum specs: is there a certain vendor, and will the specs keep escalating?
Bill Rankin says we tried to stay off the bleeding edge of the technology curve, but tried to keep up to date. Whatever is current on the technology curve, we tried to be vendor agnostic. Up to this point it's been Dell but that's not to say it will stay that way. Virginia Tech did a Mac cluster of 1100 G5s, and there are other large Dell clusters at Penn State.
[Question is asked: right now we have to copy files back to the data environment; are we waiting to come up with something better?]
Bill Rankin says there is technology out there that can allow for data sharing - we will have to look at slightly more expensive options, and it will have to be demand-driven, on a case-by-case basis.
[Question is asked about Itanium versus Xeons.]
Bill Rankin says there are a couple of options, and mentions that this could be a segue to Mike Garvin's presentation on Sun. They can be integrated into the cluster, and it is a fine growth model once we look to go into that.
Robert Wolpert mentions two different participation models - one is to supply a box and share cycles; the other is more private, in that your machine would be yours all the time.
Tracy Futhey says there is no model in place to do the latter yet.
Robert Wolpert says those interested are in the process of writing grants, but a certain amount of support for grant writing is needed. Who will that come from? Duke is supplying cost-sharing; we need somebody to help with the accounting.
Tracy Futhey says John Harer is working on some boilerplate for the grant process.
Melissa Mills says this came up earlier in the summer. Someone wrote John Harer about infrastructure for computer sharing - it would be really useful to have one place where we can not only describe what is there but also all the facilities - some central place where we can send new information.
Michael Garvin comments that the unique situation at Duke is that we are providing researchers with alternatives with the Linux cluster and 64-bit systems. The Sunfire 12000 was brought up and divided into three parts: one for database, one for applications like genome sequencing, and the third for campus high performance computing. There is 96GB of memory allocated among the resources, and the entire 96GB is addressable. We have just completed the initial setup with Sun on the system, and are looking for availability to campus by early January. We will be providing resources for folks who need to bring codes over. Over the long term we will be working between the two systems to allow researchers to run codes between resources and pick codes for unique applications. Over time as we work with folks on an NC grid we will make it easier for researchers to access NC grid resources and vice-versa. They can run their codes on the 12k without any modification. Mike Pickett says the nice thing about the collaborative approach between CSEM and OIT is we have people who know system architecture as well as other people who know the scientific research aspects.
[Comment is made that participation models are significantly different for the Sunfire model.]
Tracy Futhey says the Sunfire does have the headroom, so one could imagine that if there was a grant there is room to add more hardware.
Michael Gettes says it remains attractive because you don't have to add on to the rest of the Sunfire.
Michael Garvin adds that an additional board and processors cost $125,000-$150,000 for four one-gig processors.
V. Introduction to Deborah Johnson
Jim Roberts says one of the themes last year was the PeopleSoft upgrade, and the other was to look at the delivery of student information services. As an extension of the system work, a working group would be convened to think through a vision of student administrative services. With PeopleSoft in place, are we optimally organized to provide service to students and departments? We had a series of presentations regarding best practices - discussed with Peter and Tallman - and concluded that while we offer some very good services within the silos, we can also look outside the silos. The organizational layer of PeopleSoft has been harder to attend to. We are calling our domain student administrative services - billing and financial aid at the core, but it should extend to any domain where students sign up and pay for anything. It doesn't mean organizations will be merged into any different kind of reporting structure. Our report did a couple of things: it discussed why quality matters to Duke, set out principles and general standards for examining current services, and identified key assets - organization, technology, facilities (all about delivery of student services). The presence of Bursar and Registrar in prime Allen space has always been questioned. We left off work with the impetus that we can do better than we are doing and, with some creativity, a whole lot better - but we need someone whose job is to do just that in order to move ahead. Deb Johnson was hired; she comes from a long career in student administrative services as a financial aid specialist, and has been at Duke as chief administrative officer within School of Medicine academic programs, responsible for financial oversight.
Deb Johnson says she has only been at it a few short days but is looking forward to the challenges. She is already working together with Ginny Cake on a pilot for the undergraduate portal. [Comment is made that many of these projects have serious academic implications - would like faculty to have their voices heard in the process.]
Jim Roberts says interactions with the Registrar's office are particularly important - also thinking about students and faculty. They are in the process of designing working groups and governance committees, and are happy to involve faculty members and establish connections. They are happy to come back to this group as invited. It will be six months for low-hanging fruit and then some serious planning - the portal experiment/pilot is certainly important.
Tracy Futhey says thanks for another good term; this is the last meeting for the calendar year. She wishes everyone a good end of the semester and a happy holiday season.