Gary Marcus presents a systematic critique of deep learning

While practical applications of AI technology are gradually getting onto the right track, the pioneers of artificial intelligence have already turned their attention to the horizon. At the beginning of 2018, Gary Marcus, a professor at New York University and former director of Uber AI Labs, published a long article critically examining the current state and limitations of deep learning. In it, Marcus argues that we must look beyond deep learning if we are to reach truly general AI.

Although the history of deep learning stretches back decades, the method, and even the term "deep learning", only became popular about five years ago, when the field was reignited by papers such as the now-classic work of Krizhevsky, Sutskever, and Hinton on deep network models for ImageNet.

What has the field discovered in the five years that followed? Against a background of considerable progress in areas such as speech recognition, image recognition, and game playing, and considerable enthusiasm in the mainstream media, I present ten concerns about deep learning, and suggest that if we want to achieve general artificial intelligence, deep learning must be supplemented by other techniques.

For most problems where deep learning has enabled transformationally better solutions (vision, speech), we entered diminishing-returns territory during 2016-2017. - François Chollet, Google, author of Keras, 2017.12.18

"Science is marching on the funeral," and the future is determined by the students who have questioned everything I said. - Geoffrey Hinton, Deep Learning Godfather, Google Brain Leader, 2017.9.15

1. Is deep learning hitting a wall?
Although the roots of deep learning go back decades (Schmidhuber, 2015), until five years ago it attracted only limited attention. Everything changed in 2012 with the publication of a series of high-profile papers, notably Krizhevsky, Sutskever, and Hinton's "ImageNet Classification with Deep Convolutional Neural Networks" (Krizhevsky, Sutskever, & Hinton, 2012), which achieved top results in the ImageNet object recognition challenge (Deng et al.). Other laboratories were already doing similar work at the time (Cireşan, Meier, Masci, & Schmidhuber, 2012). By the end of 2012, deep learning had made the front page of the New York Times, and it quickly became the best-known technique in artificial intelligence. The idea of training multilayer neural networks was not new, but owing to increases in computing power and data, deep learning became practical for the first time.

Since then, deep learning has produced many top results in areas such as speech recognition, image recognition, and language translation, and plays an important role in many current AI applications. Large companies have begun investing hundreds of millions of dollars to hire deep learning talent. Andrew Ng, one of deep learning's key advocates, has gone further, saying, "If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future." (Ng, 2016). A recent New York Times Sunday Magazine article about deep learning suggested that the technology is "poised to reinvent computing itself."

Now, however, deep learning may be approaching a wall, much as I anticipated at the start of deep learning's rise (Marcus, 2012), and as leading figures like Hinton (Sabour, Frosst, & Hinton, 2017) and Chollet (2017) have been hinting for months.

What exactly is deep learning? What has it shown about the nature of intelligence? What can we expect of it, and where will it break down? How far are we from general artificial intelligence, and how close? When will machines be as flexible as humans in dealing with unfamiliar problems? The purpose of this article is both to temper irrational exuberance and to consider the directions in which the field needs to advance.

The paper is written both for researchers in the field and for AI "consumers" who lack a technical background but may want to understand it. Accordingly, in the second part I give a brief, non-technical introduction to what deep learning systems can do and why they do it well. The third part then introduces the weaknesses of deep learning, the fourth part addresses misunderstandings about deep learning's abilities, and finally I outline directions in which we might move forward.

Deep learning is not likely to die, nor should it. But five years after the rise of deep learning, it seems time for a critical reflection on what deep learning can and cannot do.

2. What is deep learning? What can deep learning do?
Deep learning is essentially a statistical technique for classifying patterns, based on sample data, using multilayer neural networks. A neural network in the deep learning literature consists of a set of input units that stand for things like pixels or words, multiple hidden layers containing hidden units (also called nodes or neurons) - the more layers, the deeper the network - and a set of output units, with connections running between the nodes. In a typical application, such a network might be trained on a large set of handwritten digits (the inputs, represented as images) and labels (the outputs) identifying the categories to which the inputs belong.

Over time, an algorithm called back-propagation emerged, which allows the connections between the units to be adjusted through a process of gradient descent, so that any given input comes to yield the corresponding output.

In general, we can understand the relationship between inputs and outputs that a neural network learns as a mapping. Neural networks, especially those with multiple hidden layers, are particularly good at learning input-output mappings.
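To make the mechanics just described concrete, here is a minimal sketch written for this article (not code from Marcus's paper): a small fully connected network, trained by back-propagation and gradient descent, learns a toy input-output mapping. The data, architecture, and hyperparameters are all invented for illustration.

```python
# A minimal sketch of a multilayer network learning an input-output mapping
# via back-propagation and gradient descent (illustrative toy example).
import numpy as np

rng = np.random.default_rng(0)

# Toy mapping: label 2-D points by whether their coordinates share a sign.
X = rng.uniform(-1, 1, size=(512, 2))                      # inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # target labels

W1 = rng.normal(0, 0.5, size=(2, 16)); b1 = np.zeros(16)   # hidden layer
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)    # output layer

lr = 0.5
for step in range(2000):
    # Forward pass: compute the network's current output for every input.
    h = np.tanh(X @ W1 + b1)                  # hidden activations
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # sigmoid output in (0, 1)

    # Backward pass: gradients of cross-entropy loss w.r.t. each weight.
    dlogit = (p - y) / len(X)
    dW2 = h.T @ dlogit; db2 = dlogit.sum(0)
    dh = dlogit @ W2.T * (1 - h ** 2)         # tanh derivative
    dW1 = X.T @ dh; db1 = dh.sum(0)

    # Gradient descent: adjust the connections to reduce the error.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", ((p > 0.5) == y).mean())
```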

Such systems are often described as neural networks because input, hidden, and output nodes are similar to biological neurons but have been greatly simplified. The connections between nodes resemble connections between neurons.

Most deep learning networks make heavy use of a technique called convolution (LeCun, 1989), which constrains the neural connections in the network so that they innately capture translational invariance. This is essentially the property that an object can slide around an image while maintaining its identity: a circle seen in the top-left corner can be recognized in the bottom-right corner, even without direct experience of it there.
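The following toy example (my own illustration, not from the paper) shows the sense in which convolution buys translation invariance: because one set of filter weights is slid across every position, a pattern produces the same pooled response wherever it appears.

```python
# Translation invariance via convolution plus pooling (illustrative sketch):
# the same filter weights are applied at every position, so a shifted
# pattern yields a shifted feature map, and max-pooling over positions
# gives the same response either way.
import numpy as np

pattern = np.array([1.0, 2.0, 0.5])           # the "object" to detect
filt = pattern[::-1]                          # flipped, so convolve == correlate

def detect(signal):
    fmap = np.convolve(signal, filt, mode="valid")  # slide filter everywhere
    return fmap.max()                               # max-pool over positions

a = np.zeros(16); a[2:5] = pattern            # object near the left edge
b = np.zeros(16); b[10:13] = pattern          # same object, shifted right

print(detect(a), detect(b))                   # identical responses: 5.25 5.25
```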

Deep learning is also known for its ability to self-generate intermediate representations, such as internal units that respond to things like horizontal lines or more complex elements of pictorial structure.

In principle, given infinite data, a deep learning system is powerful enough to represent any finite deterministic "mapping" between a given set of inputs and a corresponding set of outputs, though in practice whether the system can learn such a mapping depends on many factors. One common concern is the trap of local minima, in which the system settles on a suboptimal solution with no better solution nearby in the space of solutions. (Experts use a variety of techniques to avoid such problems and achieve better results.) In practice, results with large data sets are usually good, across a wide range of possible mappings.

For example, in speech recognition, a neural network learns a mapping between sets of speech sounds and sets of labels (such as words or phonemes). In object recognition, a neural network learns a mapping between a set of images and a set of labels. In DeepMind's Atari game system (Mnih et al., 2015), a neural network learns a mapping between pixels and joystick positions.

Deep learning systems are most often used as classification systems, in that their mission is to decide which category (defined by the output units of the neural network) a given input belongs to. With enough imagination, the power of classification is immense; the outputs can represent almost anything, such as words or positions on a Go board.

In a world with unlimited data and computing resources, other technologies may not be needed.

3. The limits of deep learning
Deep learning's limitations begin with a simple negative thesis: the world we live in does not supply infinite data. Systems that rely on deep learning frequently have to generalize beyond the data they have seen, and their ability to guarantee high-quality performance when they do so is limited.

We can think of generalization as coming in two flavors: interpolation between known examples, and extrapolation to data that lie beyond the space of the known training examples (Marcus, 1998a).

For a neural network to generalize well, there must usually be a large amount of data, and the test data must be similar to the training data, so that new answers can be interpolated among old ones. In Krizhevsky et al.'s paper (Krizhevsky, Sutskever, & Hinton, 2012), a nine-layer convolutional neural network with 60 million parameters and 650,000 nodes was trained on roughly a million distinct examples drawn from approximately a thousand categories.

This kind of brute-force approach works well on a bounded data set such as ImageNet, where all external stimuli can be sorted into a modest number of categories. It also works well in stable domains: in speech recognition, for example, data can be mapped in a regular way onto a limited set of speech-sound categories. But for many reasons, deep learning is not a universal solution for artificial intelligence.

The following are ten challenges facing current deep learning systems:

3.1 Current deep learning requires large amounts of data
Humans can learn abstract relationships in just a few trials. If I tell you that a schmister is a sister between the ages of 10 and 21, you may need only one example to immediately infer that you have no schmister, that your best friend has no schmister, that your children or parents have no schmister, and so on.

You don't need hundreds, let alone millions, of training examples. You can give schmister an exact definition using an abstract relationship between a few algebra-like variables.
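As a hedged illustration of what "an exact definition over a few variables" buys, here is the schmister example written as a single symbolic rule; the Person class and the sample data are invented for this sketch.

```python
# One symbolic rule over variables, defined once, applies to anyone at once
# (illustrative sketch; the Person class and data are made up).
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    age: int
    sisters: list = field(default_factory=list)

def has_schmister(person: Person) -> bool:
    """schmister: a sister between the ages of 10 and 21."""
    return any(10 <= s.age <= 21 for s in person.sisters)

alice = Person("Alice", 40, sisters=[Person("Beth", 35)])
carol = Person("Carol", 12, sisters=[Person("Dana", 15)])

print(has_schmister(alice))  # False: no sister in the 10-21 range
print(has_schmister(carol))  # True
```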

Humans can learn such abstractions, whether through explicit definition or more implicit means (Marcus, 2001). Indeed, even a 7-month-old infant can learn abstract language-like rules from a small number of unlabeled examples in just two minutes (Marcus, Vijayan, Bandi Rao, & Vishton, 1999). A subsequent study by Gervain and colleagues (2012) suggested that newborns are capable of similar learning.

Deep learning currently lacks a mechanism for learning abstractions through explicit verbal definition. It works best when there are millions or even billions of training examples, as in DeepMind's work on board games and Atari. As Brenden Lake and his colleagues have recently emphasized in a series of papers, humans are far more efficient than deep learning systems at learning complex rules (Lake, Salakhutdinov, & Tenenbaum, 2015; Lake, Ullman, Tenenbaum, & Gershman, 2016; see also related work by George et al., 2017). My work with Steven Pinker comparing over-regularization errors in children and neural networks makes a similar point.

Geoff Hinton has also expressed concern about deep learning's reliance on large numbers of labeled examples. He makes this point in his recent capsule network research, noting that convolutional neural networks may face "exponential inefficiencies" that could lead to their demise. One problem is that convolutional networks have difficulty generalizing to novel viewpoints. The ability to handle translation (one kind of invariance) is built into the network, but for other common kinds of transformation invariance we must choose between replicating feature detectors across a grid, whose computational cost grows exponentially, or increasing the size of the labeled training set, which likewise grows exponentially.

For problems without large amounts of data, deep learning is usually not the ideal solution.

3.2 Deep learning thus far is shallow and has limited capacity for transfer
It is important to realize that "deep" in deep learning is a technical, architectural property (the use of many hidden layers in modern neural networks), not a conceptual one (the representations such networks acquire do not naturally apply to abstract concepts like "justice," "democracy," or "meddling").

Even concrete concepts like "ball" or "opponent" are difficult for deep learning to acquire. Consider DeepMind's research on Atari games using deep reinforcement learning, which combines deep learning with reinforcement learning. The results seem superb: using a single set of "hyperparameters" (which govern properties of the network such as the learning rate), and with no knowledge of the specific games or even their rules, the system reached or beat human experts on a large sample of games. But it is easy to over-interpret the results. For example, according to a widely circulated video of the system learning to play the brick-breaking Atari game Breakout, "after 240 minutes of training, the system realized that digging a tunnel through the wall was the most effective technique for earning a high score."

But in fact, the system has learned no such thing: it does not understand what a tunnel is, or what a wall is; it has only learned specific contingencies for specific scenarios. Transfer tests, in which a deep reinforcement learning system is confronted with scenarios that differ slightly from those it was trained on, suggest that deep reinforcement learning's solutions are often extremely superficial. For example, a team at Vicarious showed that a more advanced successor to DeepMind's technique, the Atari system "Asynchronous Advantage Actor-Critic" (also known as A3C), failed on a variety of minor perturbations of Breakout, such as shifting the Y coordinate of the paddle or adding a wall in the middle of the screen. These counterexamples demonstrate that deep reinforcement learning does not learn generalizable concepts like wall or paddle; rather, such commentary reflects what in comparative psychology would be called over-attribution. The Atari system never genuinely acquired a robust concept of a wall; it merely learned to break through walls superficially, within a narrow set of highly trained scenarios.

I found similar results in ski-game scenarios studied by the research team at the startup Geometric Intelligence (later acquired by Uber). In 2017, a team of researchers at Berkeley and OpenAI found that it was easy to construct adversarial examples in a variety of games that rendered DQN (the original DeepMind algorithm), A3C, and other related techniques ineffective (Huang, Papernot, Goodfellow, Duan, & Abbeel, 2017).

Recent experiments by Robin Jia and Percy Liang (2017) make a similar point in a different domain: language. They trained a variety of neural networks on a question-answering task known as SQuAD (the Stanford Question Answering Dataset), in which the goal is to highlight the words in a particular passage that answer a given question. For example, one trained system could, based on a short passage, correctly identify the winner of Super Bowl XXXIII as John Elway. But Jia and Liang showed that merely inserting distractor sentences (such as one claiming that Google's Jeff Dean won a different Bowl) caused accuracy to plummet: across sixteen models, mean accuracy dropped from 75% to 36%.

In general, the patterns extracted by deep learning are more superficial than they first appear.

3.3 Deep learning thus far has no natural way to deal with hierarchical structure
To a linguist like Chomsky, the troubles Robin Jia and Percy Liang documented would come as no surprise. Fundamentally, most current deep-learning-based language models represent sentences as mere sequences of words. Chomsky, by contrast, has long argued that language has a hierarchical structure, in which smaller components are recursively combined into larger structures. (For example, in the sentence "the teenager who previously crossed the Atlantic set a record for flying around the world", the main clause is "the teenager set a record for flying around the world", while "who previously crossed the Atlantic" is an embedded clause specifying which teenager.)

In the 1980s, Fodor and Pylyshyn (1988) expressed similar concerns about an earlier branch of neural network research. In my 2001 book, I likewise conjectured that simple recurrent networks (SRNs; Elman, 1990) - the predecessors of today's more sophisticated deep learning approaches based on recurrent neural networks (RNNs) - would have difficulty systematically representing and extending recursive structure to unfamiliar sentences (see the original paper for the specific type).

Earlier in 2017, Brenden Lake and Marco Baroni tested whether such pessimism was still correct. As they put it in the title of their article, contemporary neural networks are "still not systematic after all these years". RNNs could "generalize well when the differences between training and test... are small, but when generalization requires systematic compositional skills, RNNs fail spectacularly."

Similar problems are likely to surface in other domains, such as planning and motor control, that require complex hierarchical structure, especially when new environments are encountered. We can see this indirectly in the difficulties with the Atari game AI mentioned above. More generally, in robotics, systems typically fail to generalize abstract plans to brand-new environments.

At a minimum, the core problem for deep learning at present is that it learns feature sets that are relatively flat, or non-hierarchical - like a simple, unstructured list in which every feature is equal. Hierarchical structure (such as syntactic trees that distinguish main clauses from embedded clauses in a sentence) is neither inherent in, nor directly represented by, such systems. As a result, deep learning systems are forced to rely on a variety of fundamentally inadequate proxies, such as the sequential position of a word in a sentence.

Systems like Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) represent individual words as vectors with reasonable success. A number of systems have also used clever tricks to try to represent complete sentences in deep-learning-compatible vector spaces (Socher, Huval, Manning, & Ng, 2012). But, as Lake and Baroni's experiments demonstrate, the capacity of recurrent networks remains limited: insufficient to represent and generalize rich structural information accurately and reliably.

3.4 Deep learning thus far cannot handle open-ended inference
If you can't grasp the difference between "John promised Mary to leave" and "John promised to leave Mary," you can't tell who is leaving whom, or what is likely to happen next. Current machine reading systems have achieved some success on tasks like SQuAD, in which the answer to a given question is either explicitly contained in the text, or integrated across multiple sentences (so-called multi-hop inference), or combined with a few explicit sentences of background knowledge without being directly marked in the text. Humans, by contrast, routinely make wide-ranging inferences as they read, forming new and implicit conclusions - for instance, determining a character's intentions from dialogue alone.

Although Bowman and colleagues (Bowman, Angeli, Potts, & Manning, 2015; Williams, Nangia, & Bowman, 2017) have taken some important steps in this direction, there is, for now, no deep learning system that can perform open-ended reasoning based on real-world knowledge with human-level accuracy.

3.5 Deep learning thus far is not sufficiently transparent
The "black box" character of neural networks has been a focus of discussion in the past few years (Samek, Wiegand, & Müller, 2017; Ribeiro, Singh, & Guestrin, 2016). A typical current deep learning system has millions or even billions of parameters, identifiable to its developers not in the kind of human-readable labels programmers conventionally use ("last_character_typed"), but only in a geography-like form within a complex network (e.g., the activity value of the ith node in layer j of network module k). Although visualization tools let us see the contributions of individual nodes in complex networks (Nguyen, Clune, Bengio, Dosovitskiy, & Yosinski, 2016), most observers would agree that neural networks as a whole remain black boxes.

How much this matters in the long run remains unclear (Lipton, 2016). If systems are robust and self-contained, it may not matter; but if a neural network occupies an important place within a larger system, its debuggability becomes crucial.

The problem of transparency is potentially fatal for deep learning in domains such as finance or medical diagnosis, where humans must understand how the system makes its decisions. As Cathy O'Neil (2016) has pointed out, such opacity can also lead to serious problems of bias.

3.6 Deep learning thus far has not been well integrated with prior knowledge
A dominant approach in deep learning is hermetic: it isolates itself from other, potentially useful knowledge. Work in deep learning typically consists of finding a training data set that relates outputs to inputs, and then learning the relationship between input and output using whatever sophisticated architectural variants and data cleaning and/or augmentation techniques are available. With only a few exceptions, such as LeCun's convolutional constraint on neural network connections (LeCun, 1989), prior knowledge is deliberately minimized.

Thus, for example, in a system proposed by Lerer et al. (2016) that learns the physical properties of falling towers of blocks, there is no prior knowledge of physics (beyond what is implied by convolution). Newton's laws are not encoded; the system instead learns to approximate them (to some limited degree) from raw pixel-level data. As I point out in a forthcoming paper, deep learning researchers seem to have a strong bias against prior knowledge, even when (as with physics) that knowledge is well established.

In general, integrating prior knowledge into deep learning systems is not easy, in part because the knowledge represented in such systems consists mainly of (largely opaque) correlations between features, rather than abstract, quantified statements - such as universally quantified one-to-one mappings like "all mortals eventually die" (see the discussion in Marcus, 2001), or generics, claims that hold true but admit exceptions, such as "dogs have four legs" or "mosquitoes carry West Nile virus" (Gelman, Leslie, Was, & Koch, 2015).

This problem is rooted in a machine learning culture that emphasizes systems that are self-contained and competitive, requiring not even a modicum of prior knowledge. The Kaggle machine learning competition platform exemplifies this phenomenon: contestants strive for the best results on a given task with a given data set, where all the information needed for any given problem is neatly packaged in the relevant input and output files. Great progress has been made within this paradigm (mainly in image recognition and speech recognition).

The problem, of course, is that life is not a Kaggle competition; children don't get all their data neatly packaged into a single directory. In the real world we must learn from more fragmented data, and problems are not so neatly encapsulated. Deep learning works well on problems like speech recognition, where there are many labeled examples, but scarcely anyone knows how to apply it to more open-ended problems. How do you free a rope caught in a bicycle chain? Should I major in mathematics or neuroscience? No training set will tell us.

The further a problem is from classification, and the closer it is to common sense, the less it can be solved by deep learning. In a recent inventory of commonsense reasoning that I wrote with Ernie Davis (2015), we began with a series of easily drawn inferences: Who is taller, Prince William or his baby son Prince George? Can you make a salad out of a polyester shirt? If you stick a pin into a carrot, does it make a hole in the carrot or in the pin?

As far as I know, no deep learning system can answer questions like these.

Questions like these, trivially simple for humans, require integrating knowledge from many disparate sources, and thus lie far from the sweet spot of deep-learning-style classification. They may instead be an indication that if we want human-level flexible cognition, we will need tools entirely different from deep learning.

3.7 Deep learning thus far has not distinguished causation from correlation
If causation is not in fact the same thing as correlation, the distinction between the two is another serious problem for deep learning. Roughly speaking, deep learning learns complex correlations between input features and output features, but has no inherent representation of causality. A deep learning system can easily learn that, across people as a whole, height and vocabulary are correlated, but it has a much harder time representing how that correlation derives from growth and development (children learn more words as they grow older, but growing taller does not cause them to learn more words, and learning more words does not cause them to grow taller). Causality has been a central factor in some other approaches to artificial intelligence (Pearl, 2000), but, perhaps because deep learning was not aimed at such problems, the field has traditionally done relatively little to address them.
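A toy simulation (mine, not Marcus's) makes the height/vocabulary point concrete: when age drives both variables, they correlate strongly even though neither causes the other, and the correlation vanishes once age is accounted for.

```python
# Correlation without causation: age is the common cause of both height
# and vocabulary in this synthetic data (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(2, 12, 5000)                       # the common cause
height = 80 + 6 * age + rng.normal(0, 3, 5000)       # cm, grows with age
vocab = 300 * age + rng.normal(0, 200, 5000)         # words, grows with age

print(np.corrcoef(height, vocab)[0, 1])              # strong correlation

# Partial out age: the residuals are (nearly) uncorrelated.
h_res = height - np.polyval(np.polyfit(age, height, 1), age)
v_res = vocab - np.polyval(np.polyfit(age, vocab, 1), age)
print(np.corrcoef(h_res, v_res)[0, 1])               # approximately 0
```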

3.8 Deep learning presumes a largely stable world, in ways that may be problematic
The logic of deep learning works best in highly stable worlds, such as Go, whose rules never change, and less well in ever-changing domains such as politics and economics. Even applied to tasks like stock prediction, deep learning is likely to meet the fate of Google Flu Trends, which predicted epidemiological data well from search trends - until it completely missed the 2013 flu season (Lazer, Kennedy, King, & Vespignani, 2014).

3.9 Deep learning thus far works well as an approximation, but its answers cannot be fully trusted
This problem follows in part from the other problems raised in this section. Deep learning systems work quite well much of the time in a given domain, yet they remain easy to fool.

A growing number of papers have demonstrated this flaw, from the language examples of Jia and Liang mentioned above to a wide range of cases in vision, in which deep-learning-based image captioning systems have mistaken yellow-and-black stripe patterns for school buses (Nguyen, Yosinski, & Clune, 2014) and mislabeled a sticker-covered parking sign as a refrigerator filled with food (Vinyals, Toshev, Bengio, & Erhan, 2014), all while reporting high confidence.

There have also been cases in which real-world stop signs, lightly defaced, were mistaken for speed limit signs (Evtimov et al., 2017), and 3D-printed turtles were mistaken for rifles (Athalye, Engstrom, Ilyas, & Kwok, 2017). A recent news item reported that a system used by British police has trouble distinguishing nudes from sand dunes.

The paper that first pointed out the "spoofability" of deep learning systems was probably Szegedy et al. (2013). Four years have passed, and despite a great deal of active research, no robust solution has yet been found.

3.10 Deep learning thus far is difficult to engineer with
A further fact follows from the problems above: it is still hard to do robust engineering with deep learning. As a Google research team put it in the title of an important 2014 paper (Sculley, Phillips, Ebner, Chaudhary, & Young, 2014) whose concerns remain unanswered, machine learning is "the high-interest credit card of technical debt": it is relatively easy (the short-term gain) to create systems that work in some limited set of circumstances, but quite difficult (the long-term debt) to guarantee that they will also work in other circumstances, with novel data that may differ from the training data - especially when a system is used as part of another, larger system.

In an important talk at ICML, Leon Bottou (2015) compared machine learning with aircraft engine development, pointing out that while aircraft design relies on building complex systems out of simpler systems for which reliable guarantees can be given, machine learning lacks the capacity to produce comparable guarantees. As Google's Peter Norvig noted in 2016, machine learning currently lacks the incrementality, transparency, and debuggability of classical programming, trading a kind of simplicity for deep challenges in achieving robustness.

Henderson and colleagues have recently extended these points with a focus on deep reinforcement learning, noting that the field faces serious problems related to robustness and reproducibility (Henderson et al., 2017).

Although there has been some progress in automating the development of machine learning systems (Zoph, Vasudevan, Shlens, & Le, 2017), there is still a long way to go.

3.11 Discussion
Of course, deep learning is, by itself, just mathematics; none of the problems given above arises from a flaw in deep learning's underlying math. In general, deep learning is a perfectly fine way of optimizing a complex system that represents a mapping between inputs and outputs, given a sufficiently large data set.

The real problem lies in misunderstanding what deep learning is, and is not, good for. The technique excels at solving closed-end classification problems, mapping a large range of potential signals onto a limited number of categories, given enough available data and a test set that closely resembles the training set.

Deviations from these assumptions can cause problems; deep learning is just a statistical technique, and all statistical techniques suffer when their assumptions are violated.

Deep learning systems work less well when the available training data are limited, when the test set differs importantly from the training set, or when the space of examples is broad and full of novelty. And under real-world constraints, some problems cannot be thought of as classification problems at all. Open-ended natural language understanding, for example, should not be seen as a mapping between one large finite collection of sentences and another, but rather as a mapping between a potentially infinite range of input sentences and an equally vast array of meanings, many never previously encountered. Using deep learning on such a problem is like forcing a square peg into a round hole: at best a rough approximation, when the real solution must lie elsewhere.

Some intuition about where deep learning currently goes wrong can be gained by considering a series of experiments I did long ago (1997), when I tested some simple aspects of language development on a class of neural networks that were then popular in cognitive science. These networks were much simpler than current models: they used no more than three layers (one input layer, one hidden layer, one output layer) and no convolution, but they did use back-propagation.

In language, this problem is called generalization. Having heard a sentence like "John pilked a football to Mary," I can grammatically infer "John pilked Mary the football," and if I know what pilk means, I can infer the meaning of a new sentence like "Eliza pilked the ball to Alec," even on first hearing.

Distilling the broad problem of language down to a simple example that I believe is still of concern today, I ran a series of experiments training three-layer perceptrons (fully connected, no convolution) on the identity function, f(x) = x.

Training examples were represented as a set of input nodes (and corresponding output nodes) standing for binary digits; the number 7, for instance, would activate the input nodes representing 4, 2, and 1. To test generalization, I trained the networks on various sets of even numbers, and tested them on both even and odd inputs.

Across a wide variety of parameters, the result was the same: the network would correctly apply the identity function to the even numbers it had been trained on (unless it got stuck in a local minimum), and to some other even numbers, but it failed on all odd numbers, computing, for example, f(15) = 14.

In general, the neural networks I tested could learn their training examples and generalize to a set of points surrounding those examples in n-dimensional space (the training space), but they could not extrapolate beyond that training space.

Odd numbers lay outside this training space, and the networks could not generalize the identity function outside that space. Adding more hidden units or more hidden layers did not help: simple multilayer perceptrons cannot generalize outside their training space (Marcus, 1998a; Marcus, 1998b; Marcus, 2001).
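For readers who want to see the phenomenon for themselves, here is a small reconstruction in the spirit of that experiment; the architecture and training details are my assumptions, not Marcus's original code.

```python
# Train a fully connected network on the identity function f(x) = x over
# binary-coded even numbers, then test on odd numbers, which lie outside
# the training space (reconstruction sketch, not the original experiment).
import numpy as np

def to_bits(n, width=5):
    return np.array([(n >> i) & 1 for i in range(width)], dtype=float)

evens = np.array([to_bits(n) for n in range(0, 32, 2)])   # training set
odds = np.array([to_bits(n) for n in range(1, 32, 2)])    # test set

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (5, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 5)); b2 = np.zeros(5)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(20000):                       # plain full-batch backprop
    h = sigmoid(evens @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    d_out = out - evens                      # cross-entropy grad; target = input
    dW2 = h.T @ d_out; db2 = d_out.sum(0)
    dh = d_out @ W2.T * h * (1 - h)
    dW1 = evens.T @ dh; db1 = dh.sum(0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.1 * g

pred = lambda X: (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(float)
print("even accuracy:", (pred(evens) == evens).all(1).mean())  # ~1.0
print("odd accuracy:", (pred(odds) == odds).all(1).mean())     # ~0.0
# The lowest-order bit is always 0 in training, so the network never learns
# to turn it on: exactly the failure described above (e.g., f(15) = 14).
```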

This extrapolation challenge persists in today's deep learning networks, some twenty years later. Many of the problems discussed in this article - data hungriness, vulnerability to fooling, difficulty with open-ended inference and transfer - can be seen as extensions of this fundamental issue. Modern neural networks generalize well on data close to their core training data, but their generalization begins to break down on data that differ greatly from the training examples.

The widespread use of convolution guarantees a solution to one particular class of problems analogous to my identity problem: so-called translational invariance, in which an object retains its identity after a change in position. But this solution does not apply to all problems, as Lake's recent demonstrations show. (Data augmentation, which extends the space of training examples, offers another way of coping with deep learning's extrapolation challenge, but such techniques are more useful in 2-D vision than in language.)

There is currently no general solution to the extrapolation problem in deep learning. For that reason, if we want to achieve general artificial intelligence, we will need to rely on different solutions.

4. The potential risks of over-hype
One of the biggest risks of the current overhyping of AI is another AI winter, like the one of the 1970s. Although there are far more AI applications now than in the 1970s, hype remains a major concern. When high-profile figures like Andrew Ng write in the Harvard Business Review that automation is imminent (at considerable variance with reality), over-expectation creates risk. Machines in fact cannot do many things that an ordinary person can do in a second, from understanding the world to understanding sentences. No healthy human would mistake a turtle for a rifle or a parking sign for a refrigerator.

People who have invested heavily in AI may end up disappointed, especially in natural language processing. Some large projects have already been abandoned, such as Facebook's project M, launched in August 2015 with the stated ambition of building a general-purpose personal virtual assistant, and later scaled back to helping users perform a small number of well-defined tasks, such as calendar entries.

It is fair to say that chatbots have not lived up to the hype of a few years ago. If, for example, driverless cars prove unsafe after large-scale deployment, or merely fall short of the full autonomy so often promised, disappointing the public relative to the early hype, then the whole AI field could suffer a major downturn in both enthusiasm and funding. We may already be seeing the first signs, as in Wired's recent article "After peak hype, self-driving cars enter the trough of disillusionment".

There are many other serious concerns, and not just the doomsday scenarios (which, for now, still seem like science fiction). My own biggest worry is that the AI field could get trapped in a local minimum of its own, dwelling too heavily on the wrong part of the space of intelligence, over-focusing on models that are accessible but limited, eager to pick low-hanging fruit while neglecting riskier "side roads" that might ultimately lead to a more robust path forward.

I am reminded of Peter Thiel's famous remark: "We wanted flying cars, instead we got 140 characters." I still dream of Rosie the Robot, the full-service domestic robot; but for now, sixty years into AI's history, our robots still mostly just play music, vacuum floors, and bid on ads.

It would be a shame if no further progress were made. AI carries risks, but also enormous potential. I believe AI's greatest contribution to society should ultimately come in areas like automated scientific discovery. But to get there, the field must first make sure it does not get stuck in a local minimum.

5. What would be better?
Despite all the problems I have sketched, I don't think we need to abandon deep learning. Rather, we need to reconceptualize it: not as a universal solvent, but as one tool among many. We have power screwdrivers, but we also need hammers, wrenches, and pliers, not to mention drills, voltmeters, logic probes, and oscilloscopes.

In perceptual classification, where vast amounts of data are available, deep learning is a valuable tool. In other, richer cognitive domains, it often falls short. So the question is: where should we look instead? Here are four possible directions.

5.1 Unsupervised learning
Recently, deep learning pioneers Geoffrey Hinton and Yann LeCun have both pointed to unsupervised learning as a key way to move beyond supervised, data-hungry deep learning. To be clear, deep learning and unsupervised learning are not logical opposites. Deep learning has mainly been used in supervised settings with labeled data, but there are also ways of using it in unsupervised settings. Still, in many domains there is good reason to move away from the massive labeled data sets that supervised deep learning demands.

Unsupervised learning, as the term is commonly used, covers several kinds of systems that do not require labeled data. One common type "clusters" together inputs that share properties, even without explicit labels marking them as members of a class. Google's cat detection model (Le et al., 2012) is perhaps the most prominent example of this approach.

Another approach, advocated by Yann LeCun among others (Luc, Neverova, Couprie, Verbeek, & LeCun, 2017), and not mutually exclusive with the first, is to replace labeled data sets with data that change over time, such as movies. Intuitively, a system trained on video can use each pair of successive frames as a substitute training signal, learning to predict the next frame. Using frame t to predict frame t+1 requires no human labeling at all.
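Here is a deliberately tiny sketch of that training signal (a toy example of the idea, not LeCun's actual system): a linear model learns to predict frame t+1 from frame t of a synthetic "video", with no human labels anywhere.

```python
# Next-frame prediction as free supervision: frame t is the input, frame
# t+1 is the target, so no labels are needed (illustrative sketch).
import numpy as np

T, D = 200, 8
video = np.zeros((T, D))
video[np.arange(T), np.arange(T) % D] = 1.0   # a dot sweeping across 8 pixels

X, Y = video[:-1], video[1:]                  # input: frame t; target: frame t+1

W = np.zeros((D, D))                          # linear next-frame predictor
for _ in range(500):
    grad = X.T @ (X @ W - Y) / len(X)         # least-squares gradient
    W -= 0.5 * grad

# The learned W approximates a one-pixel shift operator.
print(np.round(W, 1))
```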

My view is that both of these approaches are useful (and there are others this article does not discuss), but neither, by itself, solves the problems raised in section 3. Such systems still lack, for example, explicit variables, and I see no sign in them of open-ended inference, interpretability, or debuggability.

That said, there is a different notion of unsupervised learning, rarely discussed but deeply interesting: the kind of unsupervised learning that children do. Children often set themselves novel tasks, like building a tower of Lego bricks or climbing through the window-like gap in a chair. Often, such exploratory problem solving involves (or at least appears to involve) setting large numbers of self-chosen goals (what should I do?), high-level problem solving (how do I get my arm through the chair, now that the rest of my body is through?), and the integration of abstract knowledge (how bodies work, what kinds of openings various objects have and whether one could fit through them, and so on). If we could build systems that set their own goals and reason and solve problems at this more abstract level, major advances in artificial intelligence might follow.

5.2 Symbol manipulation and the need for hybrid models
Another area we should look at is classical symbolic AI, sometimes called GOFAI (Good Old-Fashioned AI). Symbolic AI takes its name from the idea, central to mathematics, logic, and computer science, that abstractions can be represented directly by symbols. Equations like f = ma allow us to calculate outputs for a wide range of inputs, regardless of whether we have ever observed any particular values before. Computer programs do the same thing (if the value of variable x is greater than the value of variable y, perform action a).
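The point is trivially illustrated in code; the function below is an invented example, but it shows how a symbolic rule, defined once over variables, extrapolates to inputs of any magnitude with no training data at all.

```python
# A symbolic rule over variables: f = m * a, defined once, valid everywhere
# (illustrative sketch; no learning involved).
def force(mass: float, acceleration: float) -> float:
    return mass * acceleration

print(force(2.0, 9.8))       # 19.6 N
print(force(1e30, 1e-12))    # works for magnitudes never "trained" on
```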

Symbol manipulation systems by themselves have often proven brittle, but they were developed largely in an era with vastly less data and computing power than we have today. The right move now may be to combine deep learning, which excels at perceptual classification, with symbolic systems, which excel at inference and abstraction. One might think of this potential merger by analogy to the brain: perceptual input systems like primary sensory cortex seem to do something like what deep learning does, but other areas, such as Broca's area and prefrontal cortex, seem to operate at much higher levels of abstraction. The power and flexibility of the brain comes in part from its capacity to dynamically integrate many different kinds of computation. The process of scene perception, for instance, seamlessly combines direct sensory information with complex abstractions about objects and their properties, light sources, and so forth.

Some tentative research has already begun to explore how existing approaches might be integrated, including neuro-symbolic modeling (Besold et al., 2017) and recent work on differentiable neural computers (Graves et al., 2016), programming with differentiable interpreters (Bošnjak, Rocktäschel, Naradowsky, & Riedel, 2016), and neural programming with discrete operations (Neelakantan, Le, Abadi, McCallum, & Amodei, 2016). While none of this work has yet fully scaled to anything like full-service general artificial intelligence, I have long argued (Marcus, 2001) that integrating more microprocessor-like operations into neural networks could be extremely valuable.

By extension, the brain might be viewed as consisting of "a broad array of reusable computational primitives - elementary units of processing akin to sets of basic instructions in a microprocessor - perhaps wired together in parallel, as in the reconfigurable integrated circuits known as field-programmable gate arrays," as I have argued elsewhere (Marcus, Marblestone, & Dean, 2014). Steadily enriching the instruction set on which our computational systems are built could be of great benefit.

5.3 More insights from cognitive and developmental psychology
Another area of potential value is human cognition (Davis & Marcus, 2015; Lake et al., 2016; Marcus, 2001; Pinker & Prince, 1988). Machines need not literally replicate the human mind, which is, after all, error-prone and far from perfect. But in many domains, from natural language understanding to commonsense reasoning, humans retain a clear advantage; borrowing from the underlying mechanisms could drive progress in AI, even though the goal is not, and should not be, an exact copy of the human brain.

To many people, learning from the human brain means neuroscience; in my view, that may be premature. We do not yet know enough neuroscience to genuinely reverse-engineer the brain. AI may help us decipher the brain, rather than the other way around.

Either way, it should be possible to use techniques and insights from cognitive and developmental psychology to build more robust and comprehensive artificial intelligence: models driven not just by mathematics, but also by clues from human psychology.

Understanding the innate machinery of the human mind could be a good starting place, serving as a source of hypotheses that might contribute to the development of AI. In a companion piece to this paper (Marcus, in preparation), I summarize some possibilities, some drawn from my own earlier work (Marcus, 2001) and others from Elizabeth Spelke's (Spelke & Kinzler, 2007). Those drawn from my own work focus on possible ways of representing and manipulating information, such as symbolic mechanisms for representing variables and the distinctions between kinds and individuals within a class; Spelke's work focuses on how infants represent concepts like space, time, and objects.

A second focal point might be commonsense knowledge: how it develops (some of it perhaps from innate capacities, but most of it learned), how it is represented, and how we put it to use in our interactions with the real world (Davis & Marcus, 2015). Recent work by Lerer et al. (2016), Watters and colleagues (2017), Tenenbaum and colleagues (Wu, Lu, Kohli, Freeman, & Tenenbaum, 2017), and by Davis and myself (Davis, Marcus, & Frazier-Logue, 2017) offers several different ways of thinking about this problem within the domain of everyday physical reasoning.

A third focus might be human understanding of narrative, a notion of long standing, proposed by Roger Schank and Abelson in 1977 and recently revisited (Marcus, 2014; Kočiský et al., 2017).

5.4 Bolder challenges
Whether deep learning retains its current form, morphs into something new, or is replaced altogether, one might consider a variety of challenge problems that push systems beyond what can be learned from large data sets via supervised learning. Here are some suggestions, drawn in part from a recent special issue of AI Magazine devoted to moving beyond the Turing Test, which I edited with Francesca Rossi and Manuela Veloso (Marcus, Rossi, & Veloso, 2016):

- A comprehension challenge (Paritosh & Marcus, 2016; Kočiský et al., 2017), which would require a system to watch an arbitrary video (or read a text, or listen to a broadcast) and answer open-ended questions about its content (Who is the protagonist? What is their motivation? What would have happened if the antagonist had succeeded?). No dedicated supervised training set could cover all the possible contingencies; inference and the integration of real-world knowledge would be essential.
- Scientific reasoning and understanding, as in the Allen AI Institute's eighth-grade science challenge (Schoenick, Clark, Tafjord, Turney, & Etzioni, 2017; Davis, 2016). While the answers to many basic science questions can easily be retrieved by web search, others require reasoning beyond what is explicitly stated, together with the integration of general knowledge.
- General game playing (Genesereth, Love, & Pell, 2005) with transfer between games (Kansky et al., 2017), such that, for example, learning one first-person shooter improves performance on another with entirely different images, equipment, and so forth. (A system that can learn many games separately, with no transfer between them, such as DeepMind's Atari game system, would not qualify; the point is to acquire cumulative, transferable knowledge.)
- A physically embodied test of an AI-driven robot that could build things - from a tent to an IKEA shelf - based on instructions and real-world physical interaction with the parts, rather than massive trial and error (Ortiz Jr, 2016).
No single challenge is likely to be sufficient on its own. Natural intelligence is multidimensional (Gardner, 2011), and given the complexity of the world, general artificial intelligence will necessarily be multidimensional as well.

By pushing beyond perceptual classification and into a fuller integration of inference and knowledge, artificial intelligence will advance greatly.

6. Conclusions
As a measure of progress, it is worth looking back at a somewhat pessimistic piece I wrote for The New Yorker five years ago, conjecturing that "deep learning is only part of the larger challenge of building intelligent machines" because "such techniques lack ways of representing causal relationships (such as between diseases and their symptoms)" and face challenges in acquiring abstract ideas like "sibling" or "identical to." They have no obvious ways of performing logical inference, and are still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used.

As we have seen, despite major advances in specific domains such as speech recognition, machine translation, and board games, and despite equally impressive progress in infrastructure, data volume, and computing power, many of these concerns remain.

Interestingly, over the past year a growing number of other scholars have begun to emphasize similar limitations from different angles, among them Brenden Lake and Marco Baroni (2017), François Chollet (2017), Robin Jia and Percy Liang (2017), Dileep George and his Vicarious colleagues (Kansky et al., 2017), and Pieter Abbeel and his Berkeley colleagues (Stoica et al., 2017).

Perhaps the most notable of all has been Geoffrey Hinton, who has had the courage to upend his own revolution. In an interview with Axios last August, he said he was "deeply suspicious" of back-propagation, because of his concerns about its dependence on labeled data sets.

Instead, he suggested, we should "develop entirely new methods." Like Hinton, I am deeply excited about what comes next.
