AMD 2014 Roadmap Discussion (CPU/APU)

amd cpu apu

49 replies to this topic

#31 TheExperiment

TheExperiment

    Reality Bomb

  • Tech Issues Solved: 1
  • Joined: 11-October 03
  • Location: Everywhere
  • OS: 8.1 x64

Posted 14 January 2014 - 18:42

I think that for the same price, you can get better performance out of discrete components, but the advantage here, and where this APU will really shine, is if software begins to take advantage of things like Mantle, HSA and TrueAudio. If the software support doesn't get there, then I don't know what AMD can do. Intel is so far ahead of them in single-core performance.

IMO the software support is already there.  A lot of the stuff I'm most excited about is, anyway (though obviously not all.)

 

I've never gotten super excited about non-game benchmarks, and it's doing great in games.  With 8.1's hybrid support, apps can drop back to the integrated GPU for things that don't need the power, and Mantle will make multiple different GPUs useful instead of needing them both to be the same.  I just can't order my A10-7850K yet, but I'm glad it did well  :)




#32 +snaphat (Myles Landwehr)

snaphat (Myles Landwehr)

    Electrical & Computer Engineer

  • Tech Issues Solved: 29
  • Joined: 23-August 05
  • OS: Win/Lin/Bsd/Osx
  • Phone: dumb phone

Posted 15 January 2014 - 00:24

@snaphat I'm not really advocating SIMD intrinsics in C# or other high-level languages, what I'd like to see is more natural ways to express parallel or vector operations that can then be easily compiled into vector/parallel code without the need for sophisticated reverse-engineering. Again, going back to simply adding two arrays. I should be able to just say

 

int[] c = a + b;

 

where a and b are arrays, and c is an array containing the sum of their respective elements. That is something that can be implemented sequentially, or with SIMD instructions, or that can even run on multiple threads if the arrays are sufficiently large.

Right, I understood what you meant by natural vector expression. It's what MATLAB would do if you did C = A .+ B. But this is effectively the same thing as an intrinsic. It's just syntactic sugar that would be directly transformed into an intrinsic. That's why I was saying that you are going through the process of decomposition either way. The user has to identify that they want to do a vector operation, so what's the difference if you throw in an intrinsic saying VECTOR_ADD(c, a, b) instead?  It's just a one-liner that looks nice versus one that looks crappier.

 

On a side note, you'd generally never want to create multiple threads or do some form of SMP scheduling automatically with a vector notation like that. The problem is that the optimization search space is just too large to make it run well without giving explicit hints (e.g. split into N threads with X work each). And at that point, if you are hinting, you are just doing manual fork-join parallelism.

 

There is some of that already in functional constructs like LINQ and functional languages like F#, but needless to say the runtime support for actual vectorization/parallelisation isn't there. So there is work to do on runtimes, but there's also work at the language level. Functional languages, although favoring immutable data and therefore being naturally more amenable to parallelism, were not really designed with this purpose in mind. What if a language was built from the ground up for parallelism?

LINQ and F# aren't strictly functional languages and, more importantly, they aren't pure functional languages. So like you said, there are side effects there and that limits parallelism. Those really aren't designed with parallelism in mind. However, if we are talking about pure functional languages, then parallelism is only limited by the data dependencies inherent in the algorithm. If you know the dependencies then you can do a perfectly parallel execution. Pure functional languages can technically be considered the "panacea" for achieving maximal parallelism. This is what I mean by being built ground up for parallelism. Unfortunately, like I said before, there are numerous practical concerns, and vectorization would fall under that. Given the data dependencies you'd know exactly what you could vectorize, but you wouldn't know what you should vectorize, so you end up with an optimization problem.
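
To make the "if you know the dependencies" point concrete, here is a minimal Python sketch (not from the thread) that groups tasks into waves that could run in parallel once their inputs are ready. The task names and the `deps` graph are made up for illustration; it relies on the standard library's `graphlib`, so it assumes Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps):
    """Group tasks into waves; every task in a wave has all of its inputs ready."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here has no unfinished dependency left.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency graph: task -> set of tasks it depends on.
deps = {
    "a": set(),
    "b": set(),
    "c": {"a", "b"},
    "d": {"a"},
    "e": {"c", "d"},
}
waves = parallel_waves(deps)
print(waves)
```

Note that this only answers what could run together. Whether each wave should actually be split across cores or vector lanes is the optimization problem mentioned above, and the sketch says nothing about it.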

 

But, being built ground up for parallelism doesn't necessarily make something usable on real machines, because it doesn't necessarily fit with how a real machine operates or looks. When you are talking about something being built from the ground-up for parallelism you are stipulating more that the language should reflect the parallelism inherent in the machine. At least, that's how I'm interpreting what you are saying. I consider the former to be something different. Also, as the Intel guy mentioned in the talk I linked, there are plenty of languages that have been designed around parallelism: http://en.wikipedia....mming_languages. None of them catch on because they aren't a silver bullet and they are far removed from the simple threading people know how to do.

 

 

I wonder if one day optimizers won't beat humans at that game too like they did for assembly code decades ago. In the meantime, you can have declarative/functional constructs and manual parallelism, for instance you insert the AsParallel() extension method where you think the granularity is optimal and everything else stays declarative, so that's already something.

The general rule of thumb I've heard is that for a new architecture, it takes a decade for a compiler to become good enough at optimizing code to beat a human who is writing sequential code in assembly. Take that with a grain of salt though, because I've never seen statistics to back it up, just people in HPC saying it off and on. Generally speaking though, no one is really placing stock in compilers doing automatic parallelization (outside of instruction-level parallelism). Most compilers struggle with any form of inter-procedural analysis. It's an NP-hard problem, so I wouldn't count on compilers solving it in our lifetime.

 

 

When I say async changes the way programmers think, I mean they can look at a given piece of asynchronous code and reason about it in the same way they do a sequential piece of code: the two look identical with the exception of a few additional keywords, and apart from the asynchrony the logic is the same. Whereas previously you had to build this complicated state machine with callbacks and while the end result was the same, the code looked nothing like the logic you were trying to implement, it was all about handling asynchrony rather than doing what you were actually trying to do.  Here's a good example of that: http://mnajder.blogs...ment-using.html

I'd say this: those constructs lower the curve for a programmer to actually use this form of parallelism. Before, it would have been more of a pain in .NET if you wanted to fire off some asynchronous tasks and wait for them, but design-wise you could do the same thing with thread/join in C#. I mean StartMenu(), StartGame(), and startTimer() are fundamentally just threads that you join back to at some time. I don't see any reason why you'd do callbacks, build a state machine, etc.

 

 

I pretty much agree with the alternative you describe. I just think that the kind of high level language we use now does not always lend itself well to be transformed into a low-level parallel form, ...

The languages/runtimes I'm talking about are built with parallelism in mind. They are essentially high-level tasking frameworks if you will. CnC and Cilk Plus are examples of this from Intel (the latter is lower level). But, you aren't going to like these because they are not simple to use. Introducing thoughtful parallel constructs is actually very difficult. Sure it's easy when you are talking about vectorization, but not so much otherwise.

 

 

... all the "C" languages have the obsolete notion of a "currently executing statement", no notion of vector variables, no parallel constructs except for library extensions, shared global state everywhere, etc. Only Rust and perhaps Go I think are currently attempting to tackle these issues.

"Currently executing statements" is not an obsolete notion. The machines we run on are program counter based and inherently operate this way. As for the languages, they aren't doing anything particularly novel. The features that they have are found elsewhere (see the list of concurrent languages I mentioned above). The difference is that features are in popular languages now.



#33 Andre S.

Andre S.

    Asik

  • Tech Issues Solved: 14
  • Joined: 26-October 05

Posted 16 January 2014 - 01:02

Right, I understood what you meant by natural vector expression. It's what matlab would do if you did C = A .+ B. But this is effectively the same thing as an intrinsic. 

 

If by intrinsic you mean the same thing as I do, i.e. low-level wrappers for CPU-specific vector instructions (MMX, SSE, AVX), then it's definitely not the same thing. C = A .+ B means add two arrays element-wise. This could be performed using different means depending on the concrete CPU where the code is executed. It could have a scalar implementation on simpler CPUs, an SSE implementation on older CPUs and an AVX implementation on newer CPUs. Furthermore, to code this using intrinsics you still have to write a for loop, paying close attention to bounds (what happens if the number of elements isn't divisible by the width of a vector register?), etc. It's not obvious, and it's remote from the logic you're actually trying to express, i.e. element-wise addition.
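
The bounds issue can be sketched in plain Python (an illustration, not real intrinsics): the loop walks the arrays in chunks of a hypothetical register `width`, and a second scalar loop handles whatever does not fill a whole chunk. The function name `vector_add` and the default width of 4 are assumptions chosen for the example.

```python
def vector_add(a, b, width=4):
    """Element-wise sum processed in chunks of `width`, plus a scalar tail."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    n = len(a)
    c = [0] * n
    full = n - n % width  # end of the part covered by whole chunks
    for i in range(0, full, width):
        # Stand-in for one SIMD instruction operating on `width` lanes.
        c[i:i + width] = [x + y for x, y in zip(a[i:i + width], b[i:i + width])]
    for i in range(full, n):
        # Remainder that does not fill a whole "register".
        c[i] = a[i] + b[i]
    return c


result = vector_add([1, 2, 3, 4, 5, 6, 7], [10, 20, 30, 40, 50, 60, 70])
print(result)
```

With seven elements and a width of four, one full chunk is processed and the last three elements go through the scalar tail. That bookkeeping is exactly what a one-line `c = a + b` would hide from the programmer.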

 

When you are talking about something being built from the ground-up for parallelism you are stipulating more that the language should reflect the parallelism inherent in the machine. At least, that's how I'm interpreting what you are saying. 

 

Exactly. The C language is built for an abstract machine that closely resembles machines that were around at the time. We need languages built for an abstract machine that resembles today's machines. Maybe Haskell is the parallel panacea but immutable data structures pose their own performance problems. A common approach in F# is to design things functional first and then refactor to arrays and mutability where the bottlenecks are. I don't think the ideal language is infinitely, ideally parallelizable, but allows the programmer to express parallelizable constructs as easily as sequential ones. You mention "the simple threading people know how to do"... I honestly don't know what manual threading is simple and who knows how to do it. In the language I develop most in, C#, it's very hard to reason about races using just sequential code, threads and locks. 

 

there are plenty of languages that have been designed around parallelism: http://en.wikipedia....mming_languages. None of them catch on 

 

Several of these are actually very popular: C#, Scala, Python, to mention only these. D is now used in production at Facebook, and Rust looks simply brilliant. Other than historical inertia I don't see what would prevent new languages, more adapted to today's needs, from catching on.

 

I'd say this: those constructs lower the curve for a programmer to actually use this form of parallelism. Before, it would have been more of a pain in .NET if you wanted to fire off some asynchronous tasks and wait for them, but design-wise you could do the same thing with thread/join in C#. I mean StartMenu(), StartGame(), and startTimer() are fundamentally just threads that you join back to at some time. I don't see any reason why you'd do callbacks, build a state machine, etc.

 

That's the beauty of async: it doesn't necessarily mean "on another thread". All different async methods could run on the same thread, unless they actually need to run concurrently. They could even run on the caller thread if the caller thread was idle (say in a UI-driven application). By speaking a higher-level language, you express your actual intent (asynchrony) rather than a specific implementation (threads), and that's a real, semantic difference.
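
As a rough illustration of "async does not have to mean another thread", here is a small sketch using Python's `asyncio` rather than C#. The coroutine name `start_timer` and the delays are invented for the example. Each coroutine records the thread it runs on, and all three end up interleaved on the single thread that runs the event loop.

```python
import asyncio
import threading


async def start_timer(name, delay, seen_threads):
    # Record which OS thread this coroutine is running on.
    seen_threads.add(threading.get_ident())
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking
    return name


async def main():
    seen_threads = set()
    results = await asyncio.gather(
        start_timer("menu", 0.02, seen_threads),
        start_timer("game", 0.01, seen_threads),
        start_timer("timer", 0.03, seen_threads),
    )
    return results, seen_threads


results, seen_threads = asyncio.run(main())
print(results, len(seen_threads))
```

This only shows the Python behaviour. In .NET, where a continuation resumes depends on the synchronization context and the task scheduler, so the same-thread outcome should be treated as one possible case rather than a guarantee.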

 

"Currently executing statements" is not an obsolete notion. 

 

At the very least you have to put that in the plural because with multiple processors you have many "currently executing statement"s. Secondly, it's now a many-to-many relationship between programming language statements and machine code statements: a single line of C can translate to many instructions, and a single, say, vector instruction can map to several lines of C. Also processors support out-of-order execution and can even execute two branches of a conditional simultaneously (on the same core). So the idea that "this line of code is what the CPU is now doing" is wrong today on several levels. It has caused many hard to track bugs in multi-threaded code.



#34 Mr Nom Nom's

Mr Nom Nom's

    Neowinian Senior

  • Joined: 08-January 11
  • OS: OS X 10.10.1
  • Phone: iPhone 6 128GB

Posted 16 January 2014 - 01:27

I haven't done OpenCL specifically, but I've done CUDA and Cell (the latter is much worse). Everything I work with these days tends to be worse than those though  :laugh:. I think the problem is that we don't know where to go paradigm-wise for task parallelism. We have data parallelism pretty much down (compiler and semantic expression wise). It may not be "easy" from a programmer standpoint, but it IS doable. You may see CUDA/OpenCL as horrible, but they are really not all that bad a way to express data parallelism (remember it is not meant for anything but HPC programming). For task parallelism, we really have no idea.

 

 

There is C++ AMP if you're wanting to go higher level but I guess it depends on mindshare outside of the Windows world - there are implementations of C++ AMP on non-Microsoft platforms where the output is OpenCL code but I guess it comes down to whether it is adopted in serious numbers outside of maybe a few niche areas.



#35 +snaphat (Myles Landwehr)

snaphat (Myles Landwehr)

    Electrical & Computer Engineer

  • Tech Issues Solved: 29
  • Joined: 23-August 05
  • OS: Win/Lin/Bsd/Osx
  • Phone: dumb phone

Posted 16 January 2014 - 04:15

If by intrinsic you mean the same thing as I do, i.e. low-level wrappers for CPU-specific vector instructions (MMX, SSE, AVX), then it's definitely not the same thing. C = A .+ B means add two arrays element-wise. This could be performed using different means depending on the concrete CPU where the code is executed. It could have a scalar implementation on simpler CPUs, an SSE implementation on older CPUs and an AVX implementation on newer CPUs. Furthermore, to code this using intrinsics you still have to write a for loop, paying close attention to bounds (what happens if the number of elements isn't divisible by the width of a vector register?), etc. It's not obvious, and it's remote from the logic you're actually trying to express, i.e. element-wise addition.

Note that AVX intrinsics use SSE when on a platform where AVX isn't available. So if you take my previous example of VECTOR_ADD(), it would do exactly what you are saying here: perform AVX or SSE depending on the concrete CPU. Or, if you were compiling on some processor that didn't have SSE instructions, it would just do scalar arithmetic operations (though AFAIK none exist).

 

I DID forget the very important difference that you mention above (bolded): the boundaries. So you are right, it is fundamentally different. It isn't like you can just use generic size arrays with the intrinsics themselves.

 

 

Exactly. The C language is built for an abstract machine that closely resembles machines that were around at the time. We need languages built for an abstract machine that resembles today's machines. Maybe Haskell is the parallel panacea but immutable data structures pose their own performance problems. A common approach in F# is to design things functional first and then refactor to arrays and mutability where the bottlenecks are. I don't think the ideal language is infinitely, ideally parallelizable, but allows the programmer to express parallelizable constructs as easily as sequential ones. You mention "the simple threading people know how to do"... I honestly don't know what manual threading is simple and who knows how to do it. In the language I develop most in, C#, it's very hard to reason about races using just sequential code, threads and locks. 

To be fair, I'm coming from a different perspective when I say simple threading. I mean that people conceptually understand threading, not necessarily that it is really easy to use in practice with shared state.  Most people can't program correctly sequentially, let alone avoid races in parallel code. But, for an example of something people would consider harder, take CnC. You have additional concepts called steps, items, and tags. In this analogy, items are the data arrays you produce and consume. Steps are the functions that operate on the items. And tags are the control mechanism that specifies when steps can fire. So in the end, your program ends up describing the relationships between all of these things, and then separately you have a tuning phase where you can specify optimizations between the different items and steps (e.g. that certain steps should not be scheduled before some other ones). CnC has a write-once policy, and since you explicitly map dependencies between your steps, you don't need any kind of locking up front. However, it would be more difficult for me to write a parallel LU decomposition using this methodology than it would be to just use arrays, chunk my data into threads, and use barriers in between the phases. You do have correctness guarantees if you use the methodology correctly, but decomposing this way is not as simple as with conventional semantics.
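
For comparison, the "chunk my data into threads and use barriers in between the phases" style can be sketched in Python like this. It is a toy two-phase normalization, not an LU decomposition, and the function name `normalize` and the worker count are assumptions. It also assumes the data sums to a nonzero value.

```python
import threading


def normalize(data, workers=4):
    """Phase 1: per-chunk sums. Barrier. Phase 2: divide each element by the global total."""
    values = list(data)
    partial = [0] * workers
    out = [0.0] * len(values)
    barrier = threading.Barrier(workers)
    chunk = (len(values) + workers - 1) // workers  # ceiling division

    def work(rank):
        lo = min(rank * chunk, len(values))
        hi = min(lo + chunk, len(values))
        partial[rank] = sum(values[lo:hi])  # phase 1: touches only this chunk
        barrier.wait()                      # nobody proceeds until every partial is written
        total = sum(partial)                # phase 2: needs all partial sums
        for i in range(lo, hi):
            out[i] = values[i] / total

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


shares = normalize([1, 1, 2, 4])
print(shares)
```

Each thread writes only its own slot of `partial` and its own slice of `out`, so no locks are needed; the barrier is the only synchronization point. The price is that the programmer has to get the chunking and the phase boundaries right by hand.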

 

 

That's the beauty of async: it doesn't necessarily mean "on another thread". All different async methods could run on the same thread, unless they actually need to run concurrently. They could even run on the caller thread if the caller thread was idle (say in a UI-driven application). By speaking a higher-level language, you express your actual intent (asynchrony) rather than a specific implementation (threads), and that's a real, semantic difference.

Let's be specific: what do we mean by running an async method on the same thread versus another thread? Are we talking about the same hardware resources versus other hardware resources? Fundamentally, calling an async method just puts a new task in a queue and the runtime system decides when to schedule it. It could end up running on the same hardware resources or different hardware resources. Threads don't really exist conceptually here, and there's certainly no a priori grouping of tasks to run together.

 

Similarly, if you are using threads in .NET, what happens if you create a new thread? The native OS thread_create method just puts a new thread in the system thread queue, and then the operating system decides when to schedule it. It could end up running on the same hardware resources or different hardware resources. Moreover, this isn't even necessarily the case. If the threading system is a user space threading system then the runtime system decides when to schedule it --> not the OS.

 

In both cases you are expressing asynchrony in the same manner via spawn/join semantics, the only effective difference is where the scheduling is done. And again as I said, if you have a userspace threading system (if you decided to use threadpools for example) then it ends up being the runtime system's responsibility either way. Note, the one important difference that I see is that the overhead of creating threads is going to be higher than creating async tasks IF you don't use threadpools. So you wouldn't want to go off spawning a bunch of threads that don't do much work if you aren't using threadpools.
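
The spawn/join comparison can be sketched in Python (as an analogy, not .NET code). The first function creates one thread per input and joins them; the second submits the same work to a pool and waits on the futures. The helper names `with_threads` and `with_pool` are made up for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def with_threads(inputs):
    """One dedicated thread per work item: explicit spawn, explicit join."""
    results = [None] * len(inputs)

    def run(i, x):
        results[i] = square(x)

    threads = [threading.Thread(target=run, args=(i, x)) for i, x in enumerate(inputs)]
    for t in threads:
        t.start()  # spawn
    for t in threads:
        t.join()   # join
    return results


def with_pool(inputs, workers=2):
    """Same spawn/join shape, but a pool decides which worker runs each task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(square, x) for x in inputs]  # spawn
        return [f.result() for f in futures]                # join


print(with_threads([1, 2, 3, 4]), with_pool([1, 2, 3, 4]))
```

Both versions have the same structure from the caller's point of view. The difference is who schedules the work and how much each spawn costs, which is why the pool version is the better fit for many small tasks.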

 

 

At the very least you have to put that in the plural because with multiple processors you have many "currently executing statement"s. Secondly, it's now a many-to-many relationship between programming language statements and machine code statements: a single line of C can translate to many instructions, and a single, say, vector instruction can map to several lines of C. Also processors support out-of-order execution and can even execute two branches of a conditional simultaneously (on the same core). So the idea that "this line of code is what the CPU is now doing" is wrong today on several levels. It has caused many hard to track bugs in multi-threaded code.

You have multiple cores so you have multiple executing statements, but that doesn't throw out the concept of a PC or the dependencies between instructions. Out-of-order execution refers to executing already decoded instructions when their dependencies are available from a pipeline point of view -- nothing more. They are still fetched, decoded and committed in order. The point is that if you have fetched and decoded instructions that don't have dependencies, you can execute those in whatever order their operands become available. This allows you to reduce stalls in the pipeline. But those instructions are never committed to the register file out of order, so from the perspective of anyone outside of the CPU, the instructions were executed completely in order -- the only difference being that there are reduced stall cycles in the pipeline. At the end of the day, the important parts are only the instruction fetch and commit orders --> those have to be in order, otherwise you break the memory consistency rules. Also, I don't know what you mean by executing two branches of a conditional simultaneously. Are you referring to branch prediction? Because that is just one or the other path and is strictly about pushing instructions into the pipeline sooner instead of waiting until the conditional hits the execution stage of the pipeline. It's similar to OoOE -- it reduces pipeline stalls, and again occurs completely in order from a commit perspective. The point is that the pipeline of each processor operates in order from an outside perspective, so absolutely the notion of currently executing statements still stands.
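
A toy model of "execute out of order, commit in order" might look like the following Python sketch. It is heavily simplified and every detail is an assumption made for illustration: instructions are treated as independent, issue cycles and latencies are invented, and any number of instructions may commit in the same cycle.

```python
def simulate(program):
    """program: list of (name, issue_cycle, latency) in program order.

    Returns (completion_order, commit_log), where commit_log lists
    (name, commit_cycle) in program order.
    """
    finish = [
        (issue + latency, idx, name)
        for idx, (name, issue, latency) in enumerate(program)
    ]
    # Execution can finish in any order, depending only on latency here.
    completion_order = [name for _, _, name in sorted(finish)]

    commit_log = []
    last_commit = 0
    for done_cycle, _, name in finish:  # walk in program order
        # An instruction cannot commit before everything older has committed.
        last_commit = max(last_commit, done_cycle)
        commit_log.append((name, last_commit))
    return completion_order, commit_log


program = [("load", 0, 5), ("add", 1, 1), ("mul", 2, 3)]
completion_order, commit_log = simulate(program)
print(completion_order)
print(commit_log)
```

In this run the `add` finishes executing before the older `load`, but it still commits after it, so an outside observer sees results appear in program order.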



#36 illegaloperation

illegaloperation

    Neowinian Senior

  • Joined: 24-October 09

Posted 16 January 2014 - 04:31

Also, since Keller is back at AMD (the original Phenom 2 architect) I doubt they will just leave the FX line hanging like that, and they will make some changes to bring the CPUs back into the game. Maybe a bit late with the next-gen arch, but this is my opinion.

 

What do you guys think? :)

I am afraid that Vishera is the end of the line.

 

[Image: Rg9fKas.png]



#37 TheExperiment

TheExperiment

    Reality Bomb

  • Tech Issues Solved: 1
  • Joined: 11-October 03
  • Location: Everywhere
  • OS: 8.1 x64

Posted 16 January 2014 - 04:34

Hmm, FM2+ has another arch to go yet eh?

 

Nice.

 

I wish I could order my proc already, heh. :)



#38 Athernar

Athernar

    ?

  • Joined: 15-December 04

Posted 16 January 2014 - 04:45

That roadmap above is fake by the way.



#39 illegaloperation

illegaloperation

    Neowinian Senior

  • Joined: 24-October 09

Posted 16 January 2014 - 04:57

That roadmap above is fake by the way.

It's the unofficial road map. Whether it's fake or not is up for debate.

 

Anyway, here's the official road map. At least there won't be anything new in 2014.

 

[Image: AMD_Desktop_Roadmap_2013-2014.png]



#40 TheExperiment

TheExperiment

    Reality Bomb

  • Tech Issues Solved: 1
  • Joined: 11-October 03
  • Location: Everywhere
  • OS: 8.1 x64

Posted 16 January 2014 - 05:01

It's the unofficial road map. Whether it's fake or not is up for debate.

 

Anyway, here's the official road map. At least there won't be anything new in 2014.

Pretty sure that's actually true.  But IIRC they said they haven't announced their plans for the FX line after that.



#41 Athernar

Athernar

    ?

  • Joined: 15-December 04

Posted 16 January 2014 - 05:09

It was outright stated to be a fake and IIRC an AMD exec also stated they were not going to abandon the FX line.

 

Personally, I speculate that if what was said in the Anandtech review of Kaveri is accurate, the reason Steamroller FX was a no-show is down to the GloFo 28nm process not allowing for competitive clock rates.

 

Essentially, the FX line will jump straight to Excavator which when combined with the improvements of Steamroller, could allow AMD to be more competitive with Intel's offerings.



#42 illegaloperation

illegaloperation

    Neowinian Senior

  • Joined: 24-October 09

Posted 16 January 2014 - 05:22

It was outright stated to be a fake and IIRC an AMD exec also stated they were not going to abandon the FX line.

 

AMD will say whatever it takes for you to buy into its platform.

 

If AMD said that Vishera is indeed the end, many people wouldn't want to buy into a platform that is a dead end.



#43 Athernar

Athernar

    ?

  • Joined: 15-December 04

Posted 16 January 2014 - 05:33

AMD will say whatever it takes for you to buy into its platform.

 

If AMD said that Vishera is indeed the end, many people wouldn't want to buy into a platform that is a dead end.

 

Sorry, but your reasoning here doesn't make any sense.

 

If AMD were planning to EOL the FX-line in favour of APUs, they would've done so. They have nothing to gain and everything to lose by being vague and non-committal.

 

People are more likely to say "screw it, I'll wait" and not buy anything, or jump ship to Intel.



#44 +snaphat (Myles Landwehr)

snaphat (Myles Landwehr)

    Electrical & Computer Engineer

  • Tech Issues Solved: 29
  • Joined: 23-August 05
  • OS: Win/Lin/Bsd/Osx
  • Phone: dumb phone

Posted 16 January 2014 - 05:33

There is C++ AMP if you're wanting to go higher level but I guess it depends on mindshare outside of the Windows world - there are implementations of C++ AMP on non-Microsoft platforms where the output is OpenCL code but I guess it comes down to whether it is adopted in serious numbers outside of maybe a few niche areas.

I don't place stock in it. I had actually forgotten it existed; I thought you were referring to OpenACC at first. Based on what I see at university level, I don't think that many researchers use OpenCL in practice either. The GPU-based research I've seen is almost always CUDA this and CUDA that. I wonder if that will change. I have a feeling not, because it's always the last ounce of performance that matters for papers and everyone seems to have NV cards or clusters.



#45 TheExperiment

TheExperiment

    Reality Bomb

  • Tech Issues Solved: 1
  • Joined: 11-October 03
  • Location: Everywhere
  • OS: 8.1 x64

Posted 16 January 2014 - 05:52

Speaking of HSA/OpenCL

http://tinyurl.com/lhsxq6a

Interesting early results.  Can't say anything will come of it (or won't) but interesting nonetheless.