New algorithm for learning languages

Cornell University and Tel Aviv University researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences.

The development — which has a patent pending — has implications for speech recognition and for other applications in natural language engineering, as well as for genomics and proteomics. It also offers new insights into language acquisition and psycholinguistics.

“The algorithm — the computational method — for language learning and processing that we have developed can take a body of text, abstract from it a collection of recurring patterns or rules and then generate new material,” explained Shimon Edelman, a computer scientist who is a professor of psychology at Cornell and co-author of a new paper, “Unsupervised Learning of Natural Languages,” published in the Proceedings of the National Academy of Sciences (PNAS, Vol. 102, No. 33).

“This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical new sentences and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics,” he said.

Unlike previous attempts at developing computer algorithms for language learning, the new method, called Automatic Distillation of Structure (ADIOS), successfully identifies complex patterns in raw texts. The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.

For example, the sentences “I would like to book a first-class flight to Chicago,” “I want to book a first-class flight to Boston” and “Book a first-class flight for me, please” may give rise to the pattern “book a first-class flight,” provided this candidate pattern passes the novel statistical significance test that is the core of the algorithm.
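The idea of looking for overlapping parts can be illustrated with a toy sketch in Python. This is not the ADIOS implementation: the function name `shared_phrases` and its parameters are hypothetical, and the statistical significance test described in the article is replaced here by a plain count of how many sentences contain a word sequence.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous word sequences of length n."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_phrases(sentences, min_len=2, min_support=None):
    """Return the longest word sequences found in at least `min_support` sentences.

    Assumption: a simple support count stands in for the significance
    test that the real algorithm applies to candidate patterns.
    """
    tokenized = [s.lower().replace(",", "").split() for s in sentences]
    if min_support is None:
        min_support = len(tokenized)  # by default, require every sentence
    longest_possible = min(len(words) for words in tokenized)

    # Try the longest candidates first and stop at the first length that works.
    for n in range(longest_possible, min_len - 1, -1):
        counts = Counter()
        for words in tokenized:
            counts.update(set(ngrams(words, n)))  # count each sentence once
        hits = sorted(" ".join(g) for g, c in counts.items() if c >= min_support)
        if hits:
            return hits
    return []


sentences = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
]
result = shared_phrases(sentences)
print(result)  # ['book a first-class flight']
```

On the three example sentences, the only four-word sequence common to all of them is the candidate pattern named in the text. A real system would still have to decide whether such a candidate is statistically meaningful rather than a coincidence.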

If the system also encounters the sentences “I need to book a direct flight from New York to Tel Aviv” and “I would like to book an economy flight,” it may infer that the phrases “first-class,” “direct” and “economy” are equivalent in the context of the new pattern. “Because such equivalence sets can contain other patterns (in turn containing further patterns, and so on), the resulting body of knowledge grows recursively, as a sort of forest of branching trees of possibilities,” said Edelman.
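The equivalence step can be sketched in the same toy style. Again, this is an illustration and not the published method: the function `slot_fillers` and its parameters are hypothetical, the context “book a/an ___ flight” is supplied by hand, and treating “a” and “an” as interchangeable is an assumption made only to keep the example short.

```python
def slot_fillers(sentences, prefix="book", articles=("a", "an"), suffix="flight"):
    """Collect the words that fill the slot in `prefix article ___ suffix`.

    Assumption: the surrounding context is given by hand. The real
    algorithm would discover both the pattern and its slot from the data.
    """
    fillers = []
    for sentence in sentences:
        words = sentence.lower().replace(",", "").split()
        for i in range(len(words) - 3):
            matches_context = (
                words[i] == prefix
                and words[i + 1] in articles
                and words[i + 3] == suffix
            )
            if matches_context and words[i + 2] not in fillers:
                fillers.append(words[i + 2])  # keep first-seen order, no repeats
    return fillers


sentences = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
    "I need to book a direct flight from New York to Tel Aviv",
    "I would like to book an economy flight",
]
fillers = slot_fillers(sentences)
print(fillers)  # ['first-class', 'direct', 'economy']
```

The resulting list plays the role of an equivalence set: any of its members can be substituted into the slot to produce a sentence the system has not seen, such as “I want to book a direct flight to Boston.”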

He added, “ADIOS relies on a statistical method for pattern extraction and on structured generalization — two processes that have been implicated in language acquisition. Our experiments show that it can acquire intricate structures from raw data, including transcripts of parents’ speech directed at 2- or 3-year-olds. This may eventually help researchers understand how children, who learn language in a similar item-by-item fashion and with very little supervision, eventually master the full complexities of their native tongue.”

In addition to child-directed language, the algorithm has been tested on the full text of the Bible in several languages, on artificial context-free languages with thousands of rules and on musical notation. It also has been applied to biological data, such as nucleotide base pairs and amino acid sequences. In analyzing proteins, for example, the algorithm was able to extract from amino acid sequences patterns that were highly correlated with the functional properties of the proteins.

The new method was developed jointly with David Horn and Eytan Ruppin, professors of physics and computer science, respectively, at Tel Aviv University, and with Zach Solan, a doctoral student there and the lead author on the paper. Their collaboration with Edelman was supported in part by the U.S.-Israel Binational Science Foundation.

From Cornell University

43 thoughts on “New algorithm for learning languages”

Anonymous
April 14, 2009 at 10:12 am
Yes, it’s great feature for spam prevention development.
Anonymous
February 9, 2008 at 1:10 am
Yes it’s pity but it is so:(
Search engines become smarter only because spa*mers getting smarter and invent such linguistic tools.
[email protected]
February 24, 2006 at 8:30 am
create program , that can easily generate texts with content, that answers exactly to human’s questions. I think it’ll be possible, when smth about background knowledge will be created.
Anonymous
February 24, 2006 at 8:28 am
I study Applied Linguistics in my university, Computational and other lingustics. My purpose is to
[email protected]
January 29, 2006 at 5:37 pm
Well this is interesting I imagine the algorithm is in some version of C+ or something. There was once three or four competing lists of rules for English syntax, about 33 or so, so this is a good thing for computers to do, so to speak, sort of a real-time syntactical concordance analyses.
Anonymous
September 3, 2005 at 7:40 pm
I once studied syntax and transformational grammar at Stony Brook University, along with other linguistic classes for Anthropology and find this an interesting development. When computer languages got started there was one SNOBOL which processed language, instead of numbers, which I thought might some day be developed, why even Bill Gates once promised a SNOBOL for Windows. (Where is it Mr. Gates?) Well this is interesting I imagine the algorithm is in some version of C+ or something. There was once three or four competing lists of rules for English syntax, about 33 or so, so this is a good thing for computers to do, so to speak, sort of a real-time syntactical concordance analyses. Bravo!
George J Myers, Jr. (my first post here, I’m awed)
Anonymous
September 1, 2005 at 9:57 am
I need a French/English version now please…before it’s too late.
Anonymous
September 1, 2005 at 1:25 am
or even better, a spam generator!
-bugmenot.com-
Anonymous
August 31, 2005 at 10:44 pm
Since spam often consists of randomly generated sentences could this algorithm be used as a spam filter?
