If by intrinsic you mean the same thing as I do, i.e. low-level wrappers for CPU-specific vector instructions (MMX, SSE, AVX), then it's definitely not the same thing. C = A .+ B means add two arrays element-wise. This could be performed using different means depending on the concrete CPU where the code is executed: a scalar implementation on simpler CPUs, an SSE implementation on older CPUs and an AVX implementation on newer ones. Furthermore, to code this using intrinsics you'll still have to write a for loop, paying close attention to bounds (what happens if the number of elements isn't divisible by the width of a vector register?), etc. It's not obvious, and it's remote from the logic you're actually trying to express, i.e. element-wise addition.
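To make the bounds problem concrete, here's a minimal sketch of the hand-written version in C with SSE intrinsics (the function name add_f32 is mine, purely illustrative). Note how much of it is bookkeeping around the tail elements rather than the addition itself:

    #include <stddef.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    /* c[i] = a[i] + b[i] for i in [0, n). The SSE loop handles 4 floats
     * per iteration; the scalar tail loop covers the leftovers when n
     * is not a multiple of 4. */
    static void add_f32(const float *a, const float *b, float *c, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);   /* unaligned 4-float load */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
        }
        for (; i < n; ++i)                     /* scalar tail */
            c[i] = a[i] + b[i];
    }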
Note that a wrapper over the intrinsics can select SSE when compiled for a platform where AVX isn't available. So if you take my previous example of VECTOR_ADD(), it would do exactly what you are saying here: perform AVX or SSE depending on the concrete CPU. Or, if you were compiling for some processor that didn't have SSE instructions, it would just do scalar arithmetic operations (though AFAIK none exist anymore).
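VECTOR_ADD() itself isn't shown in this thread, so this is just my guess at its shape: a compile-time dispatch on the predefined macros the compiler sets for the target CPU.

    /* Hypothetical reconstruction of VECTOR_ADD(): the implementation is
     * chosen at compile time from macros the compiler defines for the
     * target (__AVX__ under -mavx, __SSE__ under -msse, etc.). Note the
     * operand type changes with the target (__m256 vs __m128), which is
     * part of why these wrappers are less generic than they look. */
    #if defined(__AVX__)
      #include <immintrin.h>
      typedef __m256 vec_t;
      #define VECTOR_ADD(a, b) _mm256_add_ps((a), (b))  /* 8 floats per op */
    #elif defined(__SSE__)
      #include <xmmintrin.h>
      typedef __m128 vec_t;
      #define VECTOR_ADD(a, b) _mm_add_ps((a), (b))     /* 4 floats per op */
    #else
      typedef float vec_t;
      #define VECTOR_ADD(a, b) ((a) + (b))              /* scalar fallback */
    #endif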
I DID forget the very important difference that you mention above (bolded): the boundaries. So you are right, it is fundamentally different. It isn't like you can just use generic-size arrays with the intrinsics themselves.
Exactly. The C language is built for an abstract machine that closely resembles the machines that were around at the time. We need languages built for an abstract machine that resembles today's machines. Maybe Haskell is the parallel panacea, but immutable data structures pose their own performance problems. A common approach in F# is to design things functional-first and then refactor to arrays and mutability where the bottlenecks are. I don't think the ideal language is infinitely, ideally parallelizable, but it should allow the programmer to express parallelizable constructs as easily as sequential ones. You mention "the simple threading people know how to do"... I honestly don't know which manual threading is simple, or who knows how to do it. In the language I develop most in, C#, it's very hard to reason about races using just sequential code, threads and locks.
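To illustrate why races are hard to reason about with just threads and locks, here's the textbook lost-update bug (written in C with pthreads to keep this thread's examples in one language; the same program in C# misbehaves the same way): two threads increment a shared counter without synchronization, and the final count almost never reaches 2,000,000.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;  /* shared, unsynchronized */

    static void *bump(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; ++i)
            counter++;        /* read-modify-write: races with the other thread */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld\n", counter);  /* almost always less than 2000000 */
        return 0;
    }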
To be fair, I'm coming from a different perspective when I say simple threading. I mean that people conceptually understand threading, not that it is easy to use correctly in practice with shared state. Most people can't program correctly sequentially, let alone avoid races in parallel code. But for an example of something people would consider harder, take CnC. You have the additional concepts of steps, items, and tags. In this analogy, items are the data arrays you produce and consume, steps are the functions that operate on the items, and tags are the control mechanism that specifies when steps can fire. So in the end your program describes the relationships between all of these things, and then separately you have a tuning phase where you can specify optimizations between the different items and steps (e.g. that certain steps should not be scheduled before some other ones). CnC has a write-once policy, and since you explicitly map the dependencies between your steps, you don't need any kind of locking up front. However, it would be more difficult for me to write a parallel LU decomposition using this methodology than it would be to just use arrays, chunk my data across threads, and use barriers between the phases (see the sketch below). You do have correctness guarantees if you use the methodology correctly, but decomposing this way is not as simple as with conventional semantics.
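For reference, this is the shape of the barrier-based decomposition I have in mind. It's a minimal sketch of the chunk-per-thread, barrier-between-phases pattern, not an actual LU decomposition; the row updates are elided:

    #include <pthread.h>

    #define NTHREADS 4
    #define N        1024
    #define NPHASES  8

    static double data[N][N];
    static pthread_barrier_t phase_barrier;

    /* Each thread owns a chunk of rows; the barrier keeps everyone in
     * lockstep between phases (as between the elimination steps of an
     * LU decomposition, simplified away here). */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        int lo = (int)(id * (N / NTHREADS));
        int hi = lo + N / NTHREADS;

        for (int phase = 0; phase < NPHASES; ++phase) {
            for (int i = lo; i < hi; ++i) {
                /* ... update rows [lo, hi) for this phase ... */
            }
            pthread_barrier_wait(&phase_barrier);  /* wait for all threads */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        pthread_barrier_init(&phase_barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; ++i)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(threads[i], NULL);
        pthread_barrier_destroy(&phase_barrier);
        return 0;
    }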
That's the beauty of async: it doesn't necessarily mean "on another thread". All different async methods could run on the same thread, unless they actually need to run concurrently. They could even run on the caller thread if the caller thread was idle (say in a UI-driven application). By speaking a higher-level language, you express your actual intent (asynchrony) rather than a specific implementation (threads), and that's a real, semantic difference.
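A toy sketch of what "same thread" can mean here (the names post and run_loop are mine, purely illustrative): callbacks are queued and later drained on the caller's own thread, so the work is asynchronous, i.e. it runs later rather than at the call site, without any second thread existing.

    /* A toy single-threaded "async" runtime: callbacks are queued and
     * drained on the caller's own thread when it goes idle. No second
     * thread exists, yet the calls are asynchronous. (No bounds
     * checking; this is a sketch.) */
    #define MAXQ 64
    static void (*queue[MAXQ])(void);
    static int qlen = 0;

    void post(void (*cb)(void))   /* "start an async operation" */
    {
        queue[qlen++] = cb;
    }

    void run_loop(void)           /* drained when the caller is idle */
    {
        for (int i = 0; i < qlen; ++i)
            queue[i]();
        qlen = 0;
    }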
Let's be specific: what do we mean by running an async method on the same thread versus another thread? Are we talking about the same hardware resources versus other hardware resources? Fundamentally, calling an async method just puts a new task in a queue, and the runtime system decides when to schedule it. It could end up running on the same hardware resources or different hardware resources. Threads don't really exist conceptually here, and there's certainly no a priori grouping of tasks to run together.
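Here's "puts a new task in a queue" in miniature, a minimal sketch with names of my own invention (spawn, worker): the caller only enqueues; which thread, and therefore which core, eventually runs the task is entirely the scheduler's business.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct task {
        void (*fn)(void *);   /* what to run */
        void *arg;
        struct task *next;
    } task_t;

    static task_t *head = NULL;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    /* "Calling an async method" in miniature: enqueue and return.
     * The caller has no say over where or when the task runs. */
    void spawn(void (*fn)(void *), void *arg)
    {
        task_t *t = malloc(sizeof *t);
        t->fn = fn; t->arg = arg;
        pthread_mutex_lock(&lock);
        t->next = head; head = t;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }

    /* The runtime side: a worker loop that pops and runs tasks. */
    void *worker(void *unused)
    {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == NULL)
                pthread_cond_wait(&nonempty, &lock);
            task_t *t = head; head = t->next;
            pthread_mutex_unlock(&lock);
            t->fn(t->arg);    /* runs on whatever core the OS gave this worker */
            free(t);
        }
        return NULL;
    }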
Similarly, if you are using threads in .NET, what happens when you create a new thread? The native OS thread-creation call just puts a new thread in the system's thread queue, and then the operating system decides when to schedule it. It could end up running on the same hardware resources or different hardware resources. Moreover, even that isn't guaranteed: if the threading system is a user-space threading system, then the runtime system decides when to schedule it --> not the OS.
In both cases you are expressing asynchrony in the same manner, via spawn/join semantics; the only effective difference is where the scheduling is done. And again, as I said, if you have a user-space threading system (if you decided to use thread pools, for example) then it ends up being the runtime system's responsibility either way. The one important difference I do see is that the overhead of creating threads is going to be higher than creating async tasks IF you don't use thread pools. So you wouldn't want to go off spawning a bunch of threads that don't do much work if you aren't using thread pools.
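For comparison with the task-queue sketch above, the raw-thread version of the same spawn/join shape. Each spawn here pays for a full OS thread (its stack and kernel bookkeeping), which is exactly the overhead you'd avoid with a pool:

    #include <pthread.h>

    /* One OS thread per task: same spawn/join shape as the task queue
     * above, but every spawn creates a whole OS thread, which is why
     * you wouldn't do this for lots of tiny tasks. */
    static void *do_work(void *arg)
    {
        /* ... the task body ... */
        return arg;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, do_work, NULL);  /* spawn */
        /* ... caller keeps working ... */
        pthread_join(t, NULL);                    /* join */
        return 0;
    }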
At the very least you have to put that in the plural, because with multiple processors you have many "currently executing statement"s. Secondly, it's now a many-to-many relationship between programming-language statements and machine instructions: a single line of C can translate to many instructions, and a single, say, vector instruction can map to several lines of C. Also, processors support out-of-order execution and can even execute two branches of a conditional simultaneously (on the same core). So the idea that "this line of code is what the CPU is now doing" is wrong today on several levels. It has caused many hard-to-track bugs in multi-threaded code.
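To illustrate the many-to-many point (the instruction breakdown in the comment is approximate, not the output of any particular compiler):

    /* One C statement, many instructions (roughly):
     *   a[i] = b[i] + c[i];
     * expands to something like
     *   load  b[i] -> reg1
     *   load  c[i] -> reg2
     *   add   reg1, reg2 -> reg3
     *   store reg3 -> a[i]
     * And with auto-vectorization, one SIMD add covers 4-8 iterations,
     * so a single instruction maps back to several "executions" of the
     * same source line. */
    void add(float *a, const float *b, const float *c, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    }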
You have multiple cores, so you have multiple executing statements, but that doesn't throw out the concept of a PC or the dependencies between instructions. Out-of-order execution refers to executing already-decoded instructions once their dependencies are available, from a pipeline point of view -- nothing more. They are still fetched, decoded and committed in order. The point is that if you have fetched and decoded instructions that don't have dependencies, you can execute them in whatever order their operands become available. This allows you to reduce stalls in the pipeline. But those instructions are never committed to the register file out of order, so from the perspective of anyone outside the CPU, the instructions were executed completely in order -- the only difference being that there are fewer stall cycles in the pipeline. At the end of the day, the important parts are the instruction fetch and commit orders --> those have to be in-order, otherwise you break the memory consistency rules.

Also, I don't know what you mean by executing two branches of a conditional simultaneously; are you referring to branch prediction? Because that is just one path or the other, and is strictly about pushing instructions into the pipeline sooner instead of waiting until the conditional hits the execution stage. It's similar to OoOE -- it reduces pipeline stalls, and again it occurs completely in-order from a commit perspective. The point is that the pipeline of each processor operates in-order from an outside perspective, so the notion of currently executing statements absolutely still stands.
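You can actually see the stall-reduction effect from plain C. In this sketch, both functions do the same additions, but the second keeps two independent dependency chains in flight, so an out-of-order core typically finishes it noticeably faster; either way, everything commits in order and the observable result is the same (timings are illustrative and vary with CPU and compiler; compile without -ffast-math so the compiler can't reassociate the sums itself):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000L

    /* One dependency chain: every add must wait for the previous one. */
    static double sum_one_chain(const double *a)
    {
        double s = 0.0;
        for (long i = 0; i < N; ++i)
            s += a[i];                 /* each iteration depends on the last */
        return s;
    }

    /* Two independent chains: the core can overlap them in the pipeline,
     * yet still fetches and commits everything in program order. */
    static double sum_two_chains(const double *a)
    {
        double s0 = 0.0, s1 = 0.0;
        for (long i = 0; i < N; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        return s0 + s1;
    }

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        for (long i = 0; i < N; ++i)
            a[i] = 1.0;

        clock_t t0 = clock();
        double r1 = sum_one_chain(a);
        clock_t t1 = clock();
        double r2 = sum_two_chains(a);
        clock_t t2 = clock();

        printf("one chain:  sum=%.0f  ticks=%ld\n", r1, (long)(t1 - t0));
        printf("two chains: sum=%.0f  ticks=%ld\n", r2, (long)(t2 - t1));
        free(a);
        return 0;
    }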