Turing Interview No. 41: Interview with Professor Jeffrey Ullman, Author of "Big Data" (Revised English Version)

Chinese version: "Professor Jeffrey Ullman: Big Data Is Not a Gimmick, and It Is Worth Our Effort" (Turing Interview)

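Jeffrey David Ullman is a computer scientist and a professor at Stanford University. His textbooks on compilers (popular across their various editions and known as the "Dragon Book"), on the theory of computation (known as the "Cinderella Book"), and on data structures and databases are all regarded as standards in the field. In 1995 he became a Fellow of the Association for Computing Machinery (ACM), and in 2000 he was awarded the Knuth Prize. Together with John Hopcroft, he also received the 2010 IEEE John von Neumann Medal.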

1. “Big data” is now a very hot topic in China: conferences, books, and talks about it draw more attention than ever. What role do you think massive-dataset mining plays in this big picture?

I think “massive dataset mining” means essentially the same thing as “big data.” That is not to say that the MMDS book covers everything about big data. Anand Rajaraman and I were pretty selective about the algorithms we covered. In particular, we stayed away from what is now called “machine learning.” There is a very powerful community of researchers who call what they are doing “machine learning,” even though many of the algorithms they deal with, such as clustering or gradient descent, were known and studied seriously long before “machine learning” became the hot topic it is today. In reality, “machine learning” is just a label for certain classes of algorithms, and there are other algorithms, just as important or more so, that are needed to analyze data effectively. The most extreme case is “locality-sensitive hashing” (LSH), which is not thought of as machine learning, and was not invented by machine-learning people. As I travel around and listen to people with computational challenges, the most common idea I find missing is knowledge of LSH techniques. As a result, we decided to give LSH the prominence it deserves in our text.
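As an illustration for readers who have not met LSH (this sketch is not part of the interview), the following shows one common variant: MinHash signatures combined with banding, used to find candidate pairs of similar documents without comparing every pair. The function names, the 3-character shingle size, the 20 hash functions, and the 10 bands are all assumptions chosen for the example, and texts are assumed to be at least as long as the shingle size.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-character shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=20):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def candidate_pairs(signatures, bands=10):
    """Bucket documents by each band of their signature.

    Two documents that agree on every row of at least one band land in
    the same bucket and are reported as a candidate pair.
    """
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


docs = {
    "a": "the quick brown fox",
    "b": "the quick brown fox",   # exact duplicate of "a"
    "c": "zzzzyyyyxxxxwwww",      # shares no shingles with "a" or "b"
}
signatures = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
pairs = candidate_pairs(signatures)
```

Only the pairs returned by `candidate_pairs` would then be checked with an exact similarity measure. Using more rows per band makes the filter stricter, while using more bands makes it more permissive.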

2. As a scholar and an educator, how do you react to the “big data” fever? Why do you want to keep the book Mining of Massive Datasets free and updated?

“Big data” represents a real class of challenging and important problems. Unlike many other ideas that become popular buzzwords for a few years and give us nothing, I think the study of big-data algorithms is worth all the effort that is being applied. But you really asked two different questions: why are we updating the MMDS book and why is it free?

Why update the book? We are fortunate that Jure Leskovec joined the Stanford faculty a few years ago, and he has taken over much of the teaching responsibilities for the data-mining courses. Jure has a somewhat different viewpoint from Anand or me; he is more favorable to machine learning algorithms, and his personal research involves social networks and related graph problems. So Jure has now joined as an additional author, and we have deployed a chapter on algorithms for social network analysis. In the near future, we hope to deploy chapters on large-scale machine-learning algorithms and on large-scale dimensionality-reduction algorithms. In addition, through my personal research with people at Google, Stanford, and elsewhere, I have learned a lot about the nature of good map-reduce algorithms. So I recently updated Chapter 2 to incorporate this new knowledge.

Why is the book free? There are a number of reasons. Probably the biggest is that neither Anand nor I needed the small amount of royalties that might come from the book. We were happy that Cambridge University Press was willing to publish the book even while a free version was available, as they have with a number of other books recently. Their point of view, as expressed to me by the editor David Tranah, is that they wish they could make money on their activities, but for hundreds of years they have recognized that, as a nonprofit university publisher, their primary responsibility was to disseminate information.

That leads us to a second reason: the for-profit publishers have become more and more greedy and raised the prices of books in the United States way beyond anything that can be justified by the cost. As a result, no one buys the book, or if they do they resell it, so total sales of books are small compared with what they were in the 1970's or 1980's. It is becoming more rational for authors to choose to make their book free and see many more people use their book, rather than make a little bit of money through a for-profit publisher. As an example, the MMDS book gets about a quarter of a million downloads a year. That's 10 times the yearly sales of any of my hardcopy books. That much attention to what Anand and I have done is worth a lot to us.

But the real reason we decided to make the book free is that illegal file-sharing systems have no respect for our intellectual property anyway, so the only ones who would buy our books are the honest people who don't use these pirate systems. Those are the people from whom we'd least want to take money.

3. Map-Reduce frameworks focus more on offline processing. What computing framework would you recommend for on-line processing? And besides Map-Reduce, what other frameworks for mining massive datasets do you recommend?

I see two meanings of “on-line processing.” One is transaction processing. Data mining in general does not need transaction processing, so surely “big-data” applications do not need transactions. The second meaning is ad-hoc querying, where you type whatever query pops into your head, see the result in a few seconds and then type another query if it turns out that was not what you wanted. Big-data applications tend to require long execution time, and so are not really suitable for ad-hoc querying. There are some new systems that go beyond map-reduce and often can offer responses to queries on truly massive amounts of data in a few seconds. You might look at the Dremel system, http://www.wking-china.com/xpjylc/pubs/pub36632.html , which I understand is being cloned as an open-source system called “Drill.”

4. Many readers say that this book is rich in practical examples. Did Dr. Rajaraman contribute a lot to that part? Some also say the book is really tough to understand without practical experience. Do you have any advice for these readers?

Anand contributed to almost all parts of the book. He was interested in certain applications because he was involved in a startup, Kosmix, at the time, and Kosmix was doing some of these applications, including advertising and collaborative-filtering. I do agree that a really good education in this subject, or any Computer-Science subject for that matter, should involve some implementation. When Anand and I were coteaching the data-mining course, we would ask students to organize themselves into small teams, select a project based on what they had learned in the course, and implement their project. That didn't work as well as we had hoped, because students needed to spend most of the course learning the material before they could apply it. So when Jure joined the faculty, we split the course into two. The first quarter, Jure lectures, and in the second quarter, we select student teams to do a project that they have already designed based on what they learned in Jure's course. Each of us – Anand, Jure, and I – coaches about four teams.

5. We plan to publish the Chinese edition of Foundations of Computer Science, and I saw that you recommend this book very highly. Would you mind sharing why you think such an “old” book still holds up today?

When the “Foundations” book was published by Freeman & Co., it did not sell well, and was eventually taken out of print. Al Aho and I always thought it was the right way to present Computer-Science theory: viewing mathematics and programming as two sides of the same coin. For example, we explain that inductive proofs and recursive programming really come from the same idea. However, the book did not become widely used until we made it available for free on the Web. I believe that is no coincidence. Faculty in the US are reluctant to ask students to buy an expensive textbook, even though the students’ tuition plus lost opportunity cost (the fact that they are not earning money while they are in school) may be 100 times the cost of the textbook. This viewpoint is foolish, but I blame the publishers for making books too expensive in the US, and thus killing their own market.
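The parallel between inductive proofs and recursive programs mentioned above can be seen in a very small example. The sketch below is an illustration only and is not taken from the book; the function name `sum_to` is hypothetical. The claim to prove is that `sum_to(n)` equals n(n+1)/2. The basis of the proof matches the base case of the code, and the inductive step matches the recursive call: assuming `sum_to(n - 1)` equals (n-1)n/2, adding n gives n(n+1)/2.

```python
def sum_to(n):
    """Return 0 + 1 + ... + n for a non-negative integer n."""
    if n == 0:
        # Base case, mirroring the basis of the proof: the empty sum is 0.
        return 0
    # Recursive step, mirroring the inductive step: trust the result
    # for n - 1 and extend it by one more term.
    return sum_to(n - 1) + n


# Check the closed form that the inductive proof establishes.
closed_form_holds = all(sum_to(n) == n * (n + 1) // 2 for n in range(50))
```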

6. You have taught so many brilliant students. Some have co-written books with you, and some started Google. Who is your favorite student? Are there any interesting stories from teaching them?

I really shouldn't answer a question about who is my favorite. We can all agree that Sergey Brin is the most successful of my students. But I really didn't teach him anything. Anand Rajaraman and his cofounders Venky Harinarayan and Ashish Gupta (all three were founders of Junglee, and Anand and Venky went on to found Kosmix) are all very successful as well. But I didn't help them very much either. Two students were actually responsible for pushing my own research in a good direction. Matt Hecht got me started on code optimization, and Allan Van Gelder got me into logic programming, which led to the whole Datalog branch of database research. But probably the students I feel best about are a few where I think they would never have gotten to finish a PhD if I had not intervened to put them on a new path. I'm not going to name names, obviously.

7. Do you think there is a hacker culture among college students in the U.S.? How does this culture interact with academic studies?

There are several interpretations of “hacker.” In one meaning, it is someone who is skilled at breaking into other people's computer systems and exploiting them. In that sense, no; there are very few students who are inclined in that direction.

A second meaning is someone who is preoccupied with programming and technology in general. We see some of that at Stanford, but not all that much. Even the best software students tend to have other interests. Stanford does not even allow students to focus exclusively on one subject. In order to get an undergraduate degree at Stanford, only about 1/3 of the work can be in your major subject. That is quite typical for US schools.

But what is unusual about the Stanford culture is the idea that anyone can start a company. Perhaps more try than really should, and not all are successful. But it is amazing how many students think in terms of getting start-up money when they leave school, rather than going to work for an existing company. There are a few courses on “entrepreneurship,” but really the culture is passed from student to student.

8. From a teacher's point of view, how do you see the Chinese students in your classes? Is there any advice you would like to give them?

You might be amused to know that when I teach a lecture course, typically more than half the students are Chinese. Many are US-born masters students, but also a good number come from China and other places in Asia. I don't tell them anything I don't tell all students. First, trust yourselves, not your elders; look how young the founders of the great computer companies – Microsoft, Oracle, Apple, Google, Yahoo!, Amazon, Facebook – were at the founding. Second, don't be afraid to fail. If you aren't failing more often than you succeed, you probably aren't tackling problems worthy of being solved.

9. Recently, Prof. John Hopcroft has been actively visiting China to give speeches to students and to run training classes for college instructors. Do you have any plans to visit China too?

No; I do not plan to visit China until the Chinese people are free to speak and to access the Internet as they will, rather than as the Chinese government permits them to. I do not wish to be in a place where I could be jailed for saying what I believe. Let me emphasize that I love the Chinese people, but I hate governments whose power comes not from the people they govern.

I understand the views of Kung Fu-Tse that government has a right to rule but a responsibility to further the interests of the people. In many ways the current government of China does further the people's interests, as measured by the rapid economic growth. But without the ability of the people to debate policy, mistakes will happen and cannot be corrected easily. Probably the clearest recent example was the “cultural revolution”, where the few in power somehow decided that it would be a good thing for educated people to be sent to work on farms. That set China back decades. Think where China would be now if its people had been able to vote out of office the rulers responsible. History shows that people prosper best when the fundamental freedoms, including freedom of speech and Internet access, are present. Compare North and South Korea, for example.

So when Google can serve search queries in China and deliver whatever documents are most relevant on a topic, rather than what the government wants its people to see, invite me then. If I am still around, I'd be happy to visit.

10. Who designed the cover of Mining of Massive Datasets? Are there stories behind it, as there are for the “dragon book” (which is nowadays an indispensable book in computer science)?

Both the Dragon and MMDS covers were designed by my son Scott.

11. What features do you think a dataset-oriented operating system should possess?

I don't see data mining as an operating-systems issue. There are issues regarding the suitability of various database management systems, e.g., traditional relational systems versus “no-SQL” systems. I do note that SQL seems not to have gone away, and people are finding ways to integrate it into several of the platforms for managing massive data.

