Floating-point and integer workloads. It really bugs me that the highest-level option we currently have for writing vector code is dropping down to intrinsics in C++. I should be able to just add two arrays, or express matrix multiplication in such a way that I know it will compile to vectorized code. For now auto-vectorization only exists as an optimization in C++, and even then it's far from obvious whether a given piece of code can be auto-vectorized. You can't even express "please give me a compiler error if the following code is not vectorizable". The problem is that the code is scalar code that has to be reverse-engineered into parallel code, rather than parallel code that can be naturally translated into parallel instructions, just like scalar C code translates naturally into scalar assembly.
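To make the contrast concrete, here is a minimal sketch (assuming x86 with SSE; the function names are just illustrative). The scalar loop may or may not be auto-vectorized depending on compiler, flags, and aliasing; the intrinsics version is the "drop down to intrinsics" option I'm complaining about:

    #include <immintrin.h>

    // Scalar code: the compiler has to reverse-engineer the parallelism,
    // and there is no way to demand an error if it fails to do so.
    void add_scalar(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }

    // Explicit SSE intrinsics: guaranteed vector instructions, but low-level.
    // (Assumes n is a multiple of 4 and the pointers are 16-byte aligned.)
    void add_sse(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    }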
Architecturally speaking, for many generations Intel abandoned enhancements to the integer portion of its SIMD pipelines. Basically what we've had is SSE front-end instructions shoehorned onto half of the floating-point AVX data-path. Personally, I've always considered it a kludge to maintain backwards compatibility with SSE more than anything. With AVX2 (in Haswell), though, Intel finally has proper support across both workloads.
Compiler-wise, you can usually make low-level language compilers tell you whether they are promoting operations to vector equivalents (http://gcc.gnu.org/p...torization.html; ICC and VS have options as well). When you say that you have to reverse-engineer code to be parallel, I don't really see what effective difference it makes whether you have SIMD operations in the language or not. Suppose a language has extensions that let you natively write vector operations. In order to use them, you need to decompose your problem into a parallel SIMD form. Similarly, even without the language extensions, you still need to decompose your problem into a parallel SIMD form. The only effective difference is that you explicitly write the loop and the compiler generates the vector operation for you, as opposed to you writing out the operation yourself. There are intrinsic instructions if you don't want to do the loop yourself, though. The point is that the process of decomposition and expression is there regardless.
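For example (a sketch; the exact flags vary by compiler and version), you can ask the compiler to report, loop by loop, what it did:

    // vec.cpp -- compile with, e.g.:
    //   g++ -O3 -fopt-info-vec vec.cpp     (GCC: reports vectorized loops;
    //                                       -fopt-info-vec-missed for failures)
    //   cl /O2 /Qvec-report:2 vec.cpp      (MSVC: per-loop success/failure)
    void scale(float* x, int n, float k) {
        for (int i = 0; i < n; ++i)   // the report tells you whether this vectorized
            x[i] *= k;
    }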
Again, these are high-level languages that aren't purposed for high performance (in fact their ISAs lack any notion of vectorization support). They were explicitly designed to forgo low-level control in exchange for automatic optimization and memory management. Manual vectorization would be against those goals. So, to answer the question: yes, you could feasibly enable vectorization opportunities through language extensions, but that goes against the design of the languages. Really the point is that it is the wrong class of languages for such semantics. If you want low-level control for optimization opportunities, then use or interface with a low-level language.
async really changes the way programmers think about task parallelism, because now they can write concurrent code just like sequential code: no more callbacks, natural exception handling, etc.; the transformation is done systematically by the compiler. That's a great development, but at the same time it's an extension on top of essentially sequential languages. I wonder how much further we could go by designing a language to be concurrent from the ground up, rather than sequential first with concurrency sprinkled on top. Perhaps that stuff is already well figured out in academia, but I'm still using a language whose name starts with "C" and staring at a switch case inside a for loop right now, and I know that none of that can ever be automatically parallelized.
Async/await is just fork-join parallelism, where you spawn a task and then later join back to it. It's the typical style of parallelism that most languages and threading libraries support (pthreads, OpenMP, Windows threads, etc.). The difference here is that it has simplified semantics compared to traditional threading libraries, so it's easier to start/manage/join tasks. I don't see why this changes the way programmers think about parallelism. Regardless, the programmer has to go through the process of identifying fork-join parallelism and decomposing programs into tasks. It isn't as if they can get away with thinking sequentially.
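To illustrate the fork-join shape in code (a minimal C++ sketch using std::async; C#'s async/await plays the same structural role):

    #include <future>
    #include <numeric>
    #include <vector>

    long parallel_sum(const std::vector<int>& v) {
        auto mid = v.begin() + v.size() / 2;
        // Fork: spawn a task for the first half...
        std::future<long> lo = std::async(std::launch::async,
            [&] { return std::accumulate(v.begin(), mid, 0L); });
        // ...do the second half on this thread...
        long hi = std::accumulate(mid, v.end(), 0L);
        // Join: block until the spawned task finishes.
        return lo.get() + hi;
    }

Note that splitting the work in half is still the programmer's decision; the library only makes the fork and the join convenient.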
If you want languages that are concurrent from the ground up, then look into functional languages (Haskell, Lisp, etc.). These languages consist of expressions that are semantically concurrent, and thus, given infinite resources, could be perfectly executed in parallel (as much as data dependencies allow). The problem is they don't map well to real machines. Real machines have limited resources, and the granularity of tasks is important for performance. By switching your language to be inherently parallel, you are giving the compiler the task of automatically (auto-magically) mapping parallelism onto limited resources, or more strictly put: the compiler is responsible for optimizing perfectly parallel code to fit on a machine with limited parallel resources (chunking things into tasks and SIMD instructions). This doesn't necessarily end well, because it is actually a fairly difficult problem from a compiler standpoint. Hand-tuned manual parallelism using control-flow languages tends to yield better results. And that's where you always end up, because you don't have infinite resources, so you are either (1) parallelizing a sequential program, (2) letting a compiler parallelize a sequential program, or (3) letting a compiler sequentialize a parallel program.
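A rough C++-land analogue of that granularity problem is something like Intel TBB's parallel_for (a sketch, assuming TBB is available): you declare that the whole range is parallel and leave the chunking to the runtime, which is exactly the mapping-onto-limited-resources job described above:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <vector>

    void square_all(std::vector<float>& v) {
        // Semantically parallel over the whole range; the runtime decides
        // how to chunk it into tasks for the machine's limited resources.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    v[i] *= v[i];
            });
    }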
How is it "not all that bad"? It's like writing assembly code: low-level, error-prone, and highly remote from the actual domain logic. We got rid of the need to write assembly code in the 70s, and by the 90s programs had gotten so complex that it was downright impossible to write large programs in assembly code. OpenCL is the assembly code of parallel programming. We need to develop some language on top of it so anyone can write complex programs. Right now everyone is writing his own framework on top of it, but isn't that proof that none of these are satisfactory solutions?
If you are going to make comparisons on difficulty, then this is a more valid comparison: x86 SIMD assembly is akin to PTX assembly (NVIDIA's assembly). Those are at similar levels of difficulty. OpenCL/CUDA is not on the level of either of those things, because you don't need to know specific ISA details to write OpenCL/CUDA code. Most things are abstracted away from you, and it has actual high-level language concepts like variables, memory management, tasking, etc. Such concepts don't exist in assembly. Of course, these languages are designed to map closely to the architectures they run on and give you relatively low-level control, so they aren't exactly super easy to use.
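For a feel of the abstraction level, here is a minimal CUDA sketch (names made up; error handling omitted). There are typed variables, ordinary expressions, and explicit memory management, none of which you get at the PTX level:

    // add.cu -- element-wise add; compare with the PTX that nvcc emits for it.
    __global__ void add(const float* a, const float* b, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element am I?
        if (i < n)
            out[i] = a[i] + b[i];   // ordinary typed code, no ISA details
    }

    // Host side: low-level-ish control (you manage device memory and pick a
    // launch shape), but still far above assembly.
    //   cudaMalloc(&d_a, n * sizeof(float));
    //   cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
    //   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_out, n);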
You seem to be making two arguments: you want more control in higher-level languages, but less control in lower-level languages. Complex programs require complex languages, and you aren't going to get a silver-bullet easy-to-use-with-lots-of-power-and-runs-well language. The fact is, no one has an answer for what the perfect language or runtime is; it's simply an open research topic. Currently, from what I've seen, people think the best alternative is a layered approach. At the low level you have languages that map well to your architecture, and at a higher level you have languages that are transformed into the lower-level ones. You expose architectural details in the low-level languages, but not in the high-level ones. On top of that, you expose some sort of hinting system at the high level to help with parallel optimizations (e.g. "group these tasks"). This is just a vague trend I'm alluding to that has been occurring in HPC.
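OpenMP pragmas are a familiar small-scale example of that kind of hinting layer (a sketch): the source stays an ordinary sequential loop, and the hint tells the compiler where the parallelism is and lets it do the mapping:

    void saxpy(float a, const float* x, float* y, int n) {
        // A hint, not a rewrite: "this loop may be split across threads."
        // Compile with -fopenmp (GCC/Clang) or /openmp (MSVC).
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }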