Big Data Reaches Into Literature, Art and Film

Dickens, Austen and Twain, Through a Digital Lens

ANY list of the leading novelists of the 19th century, writing in English, would almost surely include Charles Dickens, Thomas Hardy, Herman Melville, Nathaniel Hawthorne and Mark Twain.

But they do not appear at the top of a list of the most influential writers of their time. Instead, a recent study has found, Jane Austen, author of “Pride and Prejudice,” and Sir Walter Scott, the creator of “Ivanhoe,” had the greatest effect on other authors, in terms of writing style and themes.

These two were “the literary equivalent of Homo erectus, or, if you prefer, Adam and Eve,” Matthew L. Jockers wrote in research published last year. He based his conclusion on an analysis of 3,592 works published from 1780 to 1900. It was a lot of digging, and a computer did it.

The study, which involved statistical parsing and aggregation of thousands of novels, made other striking observations. For example, Austen’s works cluster tightly together in style and theme, while those of George Eliot (a.k.a. Mary Ann Evans) range more broadly, and more closely resemble the patterns of male writers. Using similar criteria, Harriet Beecher Stowe was 20 years ahead of her time, said Mr. Jockers, whose research will soon be published in a book, “Macroanalysis: Digital Methods and Literary History” (University of Illinois Press).

These findings are hardly the last word. At this stage, this kind of digital analysis is mostly an intriguing sign that Big Data technology is steadily pushing beyond the Internet industry and scientific research into seemingly foreign fields like the social sciences and the humanities. The new tools of discovery provide a fresh look at culture, much as the microscope gave us a closer look at the subtleties of life and the telescope opened the way to faraway galaxies.

“Traditionally, literary history was done by studying a relative handful of texts,” says Mr. Jockers, an assistant professor of English and a researcher at the Center for Digital Research in the Humanities at the University of Nebraska. “What this technology does is let you see the big picture — the context in which a writer worked — on a scale we’ve never seen before.”

Mr. Jockers, 46, personifies the digital advance in the humanities. He received a Ph.D. in English literature from Southern Illinois University, but was also fascinated by computing and became a self-taught programmer. Before he moved to the University of Nebraska last year, he spent more than a decade at Stanford, where he was a founder of the Stanford Literary Lab, which is dedicated to the digital exploration of books.

Today, Mr. Jockers describes the tools of his trade in terms familiar to an Internet software engineer — algorithms that use machine learning and network analysis techniques. His mathematical models are tailored to identify word patterns and thematic elements in written text. The number and strength of links among novels determine influence, much the way Google ranks Web sites.
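
The idea that links among novels determine influence, much as links among Web pages determine rank, can be sketched with a small PageRank-style calculation. This is only an illustration under stated assumptions, not Mr. Jockers's actual model: the function name `rank_influence`, the placeholder book names, and the link strengths are all invented for the example, and a link is assumed to run from a later book back to an earlier book it resembles.

```python
def rank_influence(links, damping=0.85, iterations=100):
    """PageRank-style scores for books.

    `links` maps (earlier_book, later_book) -> similarity strength.
    Each later book passes its score back to the earlier books it
    resembles, in proportion to the strength of each link.
    All names and numbers used with this function are hypothetical.
    """
    books = sorted({book for pair in links for book in pair})
    n = len(books)

    # Total link strength leaving each later book.
    out_weight = {book: 0.0 for book in books}
    for (earlier, later), weight in links.items():
        out_weight[later] += weight

    score = {book: 1.0 / n for book in books}
    for _ in range(iterations):
        # Books that resemble no earlier book spread their score evenly.
        dangling = sum(score[b] for b in books if out_weight[b] == 0.0)
        new_score = {
            b: (1.0 - damping) / n + damping * dangling / n for b in books
        }
        for (earlier, later), weight in links.items():
            share = weight / out_weight[later]
            new_score[earlier] += damping * score[later] * share
        score = new_score
    return score


# Invented toy network: novel_A is the earliest book, novel_D the latest.
toy_links = {
    ("novel_A", "novel_B"): 0.9,
    ("novel_A", "novel_C"): 0.8,
    ("novel_B", "novel_C"): 0.3,
    ("novel_A", "novel_D"): 0.7,
    ("novel_C", "novel_D"): 0.2,
}
scores = rank_influence(toy_links)
for book in sorted(scores, key=scores.get, reverse=True):
    print(book, round(scores[book], 3))
```

In this toy network the book that many later books resemble strongly ends up with the highest score, which is the intuition behind ranking by the number and strength of links.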

It is this ability to collect, measure and analyze data for meaningful insights that is the promise of Big Data technology. In the humanities and social sciences, the flood of new data comes from many sources including books scanned into digital form, Web sites, blog posts and social network communications.

Data-centric specialties are growing fast, giving rise to a new vocabulary. In political science, this quantitative analysis is called political methodology. In history, there is cliometrics, which applies econometrics to history. In literature, stylometry is the study of an author’s writing style, and these days it leans heavily on computing and statistical analysis. Culturomics is the umbrella term used to describe rigorous quantitative inquiries in the social sciences and humanities.

“Some call it computer science and some call it statistics, but the essence is that these algorithmic methods are increasingly part of every discipline now,” says Gary King, director of the Institute for Quantitative Social Science at Harvard.

Cultural data analysts often adapt biological analogies to describe their work. Mr. Jockers, for example, called his research presentation “Computing and Visualizing the 19th-Century Literary Genome.”

Such biological metaphors seem apt, because much of the research is a quantitative examination of words. Just as genes are the fundamental building blocks of biology, words are the raw material of ideas.

“What is critical and distinctive to human evolution is ideas, and how they evolve,” says Jean-Baptiste Michel, a postdoctoral fellow at Harvard.

Mr. Michel and another researcher, Erez Lieberman Aiden, led a project to mine the virtual book depository known as Google Books and to track the use of words over time, compare related words and even graph them.

Google cooperated and built the software for making graphs open to the public. The initial version of Google’s cultural exploration site began at the end of 2010, based on more than five million books, dating from 1500. By now, Google has scanned 20 million books, and the site is used 50 times a minute. For example, type in “women” in comparison to “men,” and you see that for centuries the number of references to men dwarfed those for women. The crossover came in 1985, with women ahead ever since.
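
Tracking two words over time and looking for a crossover can be approximated in a few lines. The sketch below assumes a tiny made-up corpus keyed by year; the years, the texts, and the helper names `relative_frequency` and `first_crossover` are hypothetical and have nothing to do with the real Google Books data or its 1985 result.

```python
from collections import Counter


def relative_frequency(corpus_by_year, word):
    """Return {year: share of that year's tokens equal to `word`}."""
    freq = {}
    for year, texts in corpus_by_year.items():
        tokens = [tok for text in texts for tok in text.lower().split()]
        freq[year] = Counter(tokens)[word] / len(tokens) if tokens else 0.0
    return freq


def first_crossover(freq_a, freq_b):
    """First year in which series b exceeds series a, or None."""
    for year in sorted(freq_a):
        if freq_b.get(year, 0.0) > freq_a[year]:
            return year
    return None


# Invented three-year corpus, for illustration only.
toy_corpus = {
    1900: ["men and men and women"],
    1950: ["men and women and men"],
    2000: ["women and men and women"],
}
men = relative_frequency(toy_corpus, "men")
women = relative_frequency(toy_corpus, "women")
crossover = first_crossover(men, women)
print(crossover)
```

Dividing by the number of tokens in each year matters because the amount of text varies from year to year; raw counts alone would mostly track how much was published.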

In work published in Science magazine in 2011, Mr. Michel and the research team tapped the Google Books data to find how quickly the past fades from books. For instance, references to “1880,” which peaked in that year, fell to half by 1912, a lag of 32 years. By contrast, “1973” declined to half its peak by 1983, only 10 years later. “We are forgetting our past faster with each passing year,” the authors wrote.
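
The "time to fall to half the peak" measure is simple arithmetic once the peak year and the half-peak year are known. The sketch below first recomputes the two lags from the years reported above, then shows one way such a half-peak year might be located in a yearly frequency series. The series `toy_series` and the function name `years_to_half_peak` are invented for illustration and are not the study's data or code.

```python
def years_to_half_peak(series):
    """Given {year: frequency}, return (peak_year, half_year, lag).

    half_year is the first year after the peak in which the frequency
    is at or below half of the peak value; None if that never happens.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return peak_year, year, year - peak_year
    return peak_year, None, None


# Lags computed from the (peak year, half-peak year) pairs in the article.
reported = {"1880": (1880, 1912), "1973": (1973, 1983)}
lags = {label: half - peak for label, (peak, half) in reported.items()}
print(lags)

# Invented series, only to show how a half-peak year could be found.
toy_series = {2000: 40, 2001: 100, 2002: 80, 2003: 60, 2004: 45, 2005: 30}
toy_result = years_to_half_peak(toy_series)
print(toy_result)
```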

Jon Kleinberg, a computer scientist at Cornell, and a group of researchers approached collective memory from a very different perspective.

Their work, published last year, focused on what makes spoken lines in movies memorable. Sentences that endure in the public mind are evolutionary success stories, Mr. Kleinberg says, comparing “the fitness of language and the fitness of organisms.”

As a yardstick, the researchers used the “memorable quotes” selected from the popular Internet Movie Database, or IMDb, and the number of times that a particular movie line appears on the Web. Then they compared the memorable lines to the complete scripts of the movies in which they appeared — about 1,000 movies.

To train their statistical algorithms on common sentence structure, word order and most widely used words, they fed their computers a huge archive of articles from news wires. The memorable lines consisted of surprising words embedded in sentences of ordinary structure. “We can think of memorable quotes as consisting of unusual word choices built on a scaffolding of common part-of-speech patterns,” their study said.

Consider the line “You had me at hello,” from the movie “Jerry Maguire.” It is, Mr. Kleinberg notes, basically the same sequence of parts of speech as the quotidian “I met him in Boston.” Or consider this line from “Apocalypse Now”: “I love the smell of napalm in the morning.” Only one word separates that utterance from this: “I love the smell of coffee in the morning.”
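
Both comparisons can be checked mechanically on these four sentences. The sketch below is a toy, not the researchers' method: instead of a trained part-of-speech tagger it uses a small hand-written tag table (`HAND_TAGS`, an assumption made for this example, with "hello" tagged as a noun because it functions as one in the line), and the helper names are hypothetical.

```python
# Coarse part-of-speech tags assigned by hand for this example only.
HAND_TAGS = {
    "you": "PRON", "i": "PRON", "me": "PRON", "him": "PRON",
    "had": "VERB", "met": "VERB",
    "at": "ADP", "in": "ADP",
    "hello": "NOUN", "boston": "NOUN",
}


def tokenize(sentence):
    """Lowercase the sentence, drop periods and commas, split on spaces."""
    return sentence.lower().replace(".", "").replace(",", "").split()


def pos_pattern(sentence):
    """Sequence of hand-assigned tags; raises KeyError on unknown words."""
    return [HAND_TAGS[token] for token in tokenize(sentence)]


def differing_words(first, second):
    """Positions where two equal-length sentences use different words."""
    a, b = tokenize(first), tokenize(second)
    if len(a) != len(b):
        raise ValueError("sentences must have the same number of words")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


same_pattern = (
    pos_pattern("You had me at hello") == pos_pattern("I met him in Boston")
)
diff = differing_words(
    "I love the smell of napalm in the morning.",
    "I love the smell of coffee in the morning.",
)
print(same_pattern, diff)
```

A real analysis would replace the hand-written table with a statistical tagger and would score word choices against a large reference corpus, as the study did with news-wire text.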

This kind of analysis can be used for all kinds of communications, including advertising. Indeed, Mr. Kleinberg’s group also looked at ad slogans. Statistically, the ones most similar to memorable movie quotes included “Quality never goes out of style,” for Levi’s jeans, and “Come to Marlboro Country,” for Marlboro cigarettes.

