AMD 2014 Roadmap Discussion (CPU/APU)



Anyone who keeps an eye on AMD products has probably already seen the new 2014 roadmap, which consists mostly of new-generation APUs.

 

Now, this is speculation on my part, and an attempt to "debunk" some of the rumors floating around the internet.

 

A lot of talk is going around that AMD is not going to release new CPUs this year, since there's no sign of the Steamroller core coming to CPUs, which is a bit disappointing to say the least. I'm no real big fan of APUs; I'm one of the people with a mindset of "moar cores, moar GHz, to hell with the TDP". But it doesn't say anywhere that AMD won't release new CPUs with Vishera cores. Sure, the FX-9590 is a beast and the 8350 is actually already more than enough, but I'm still hoping for something new.

 

Also, since Jim Keller is back at AMD (the lead architect of the original Athlon 64/K8), I doubt they will just leave the FX line hanging like that; they'll make changes to bring the CPUs back into the game. Maybe a bit late, with a next-gen architecture, but that's my opinion.

 

What do you guys think? :) 


No Steamroller FX is disappointing, to say the least. That said, Kaveri is basically the Xbox One/PS4 architecture coming to the desktop, and it's bigger than people think. It's currently very hard to gain any performance improvement using GPGPU, except in very specific scenarios, simply because of the synchronization costs between the two memories (VRAM and system RAM). That restriction vanishes with a unified memory architecture, meaning the GPU can be used to accelerate even small, ordinary tasks. It's not unthinkable that we'll see auto-"GPU"ing code optimizers the same way we already have auto-vectorizing optimizers.
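To put rough numbers on that, here's a back-of-the-envelope sketch, not a benchmark: every bandwidth, rate and overhead figure in it is an assumption I picked purely for illustration. It compares "run it on the CPU", "copy to a discrete GPU and back", and "unified memory" across working-set sizes:

```cpp
#include <cstdio>

// Toy cost model only. All constants below are assumptions for illustration:
//   pcie_gbps       - effective host<->VRAM copy bandwidth of a discrete card
//   cpu_gbps        - rate at which the CPU chews through the data
//   gpu_speedup     - how much faster the GPU runs the kernel itself
//   launch_discrete - fixed per-offload overhead on a discrete card
//   launch_unified  - fixed dispatch overhead with unified memory
int main() {
    const double pcie_gbps       = 8.0;    // GB/s, assumed
    const double cpu_gbps        = 2.0;    // GB/s, assumed
    const double gpu_speedup     = 10.0;   // assumed
    const double launch_discrete = 40e-6;  // seconds, assumed
    const double launch_unified  = 5e-6;   // seconds, assumed

    std::printf("%10s %12s %12s %12s\n", "size (MB)", "CPU (us)", "dGPU (us)", "unified (us)");
    for (double mb = 0.125; mb <= 128.0; mb *= 4.0) {
        const double gb       = mb / 1024.0;
        const double cpu      = gb / cpu_gbps;                                   // stay on the CPU
        const double kernel   = cpu / gpu_speedup;                               // raw GPU kernel time
        const double discrete = launch_discrete + 2.0 * gb / pcie_gbps + kernel; // copy in + copy out
        const double unified  = launch_unified + kernel;                         // no copies at all
        std::printf("%10.3f %12.1f %12.1f %12.1f\n",
                    mb, cpu * 1e6, discrete * 1e6, unified * 1e6);
    }
    // With these made-up numbers the discrete card only pays off once the working
    // set is several MB, while the unified case wins even for tiny tasks, which is
    // exactly the "accelerate small ordinary tasks" argument.
    return 0;
}
```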

 

On that note, we really need new programming languages designed from the ground up with parallelism and asynchrony in mind, rather than based on 70s era assumptions (all the C-family) (/editorial).

 

CPUs will not get much faster than Sandy Bridge in the foreseeable future, so heterogeneous computing is really where the big performance gains will be. Kaveri has a huge advantage over traditional CPUs on that front. I can only see Intel catching up with AMD.

 

Also, Kaveri will simply kill the low-end discrete GPU, and probably even mid-range GPUs in laptops. AMD demoed it running BF4 at 30 fps on medium settings using Mantle.


No Steamroller FX is disappointing, to say the least. That said, Kaveri is basically the Xbox One/PS4 architecture coming to the desktop, and it's bigger than people think. It's currently very hard to gain any performance improvement using GPGPU, except in very specific scenarios, simply because of the synchronization costs between the two memories (VRAM and system RAM). That restriction vanishes with a unified memory architecture, meaning the GPU can be used to accelerate even small, ordinary tasks. It's not unthinkable that we'll see auto-"GPU"ing code optimizers the same way we already have auto-vectorizing optimizers.

 

On that note, we really need new programming languages designed from the ground up with parallelism and asynchrony in mind, rather than based on 70s era assumptions (all the C-family) (/editorial).

GPU auto-tuners (and, more generally, load-balancing frameworks where you write one generic piece of code and run it everywhere: CPU, GPU, Xeon Phi, etc.) are an active research topic even with current non-unified memory architectures. One thing I think is important to consider is that, from an HPC perspective, you will never see a globally unified memory architecture at the end of the day. You may get it on single nodes (and even then it is NUMA), but you won't get it beyond that because it simply doesn't scale.

 

On the topic of languages, I was at a conference last year where an Intel guy (Tim Mattson) discussed languages (he's a software guy, not a hardware guy, just to be clear). Essentially, he said that languages are a dime a dozen because people continuously create new parallel programming models instead of using or improving existing ones. In the video below, about 20 minutes in, he has this discussion (this was a year before I saw him). It's pretty interesting:

 

 

EDIT: on a side note, during my talk at said conference he also asked me how I foresee the handling of soft errors in HPC, and then really did not like my answer and was very vocal about it -- he's very vocal about his opinions and beliefs. I'm still a bit miffed about that, but I digress.

 

 

Also, Kaveri will simply kill the low-end discrete GPU, and probably even mid-range GPUs in laptops. AMD demoed it running BF4 at 30 fps on medium settings using Mantle.

There's no surprise there. I expect the mid-range component market to stop existing eventually. The general trend is that more and more of the feature set migrates on-die as time goes on and process shrinks occur.


GPU auto-tuners (and, more generally, load-balancing frameworks where you write one generic piece of code and run it everywhere: CPU, GPU, Xeon Phi, etc.) are an active research topic even with current non-unified memory architectures. One thing I think is important to consider is that, from an HPC perspective, you will never see a globally unified memory architecture at the end of the day. You may get it on single nodes (and even then it is NUMA), but you won't get it beyond that because it simply doesn't scale.

Oh, certainly not; I'm just talking about personal computers, not distributed computing.

 

On the topic of languages, I agree with the guy in the talk that automatic parallelisation of scalar code will never work. The more fundamental problem is that our programming languages are scalar by nature. Why is it that in 2014 there's no programming language (AFAIK) where the sum of two arrays is an array containing the sums of their respective elements? Oh sure, you can call into a specific library that tries to shoehorn parallel constructs on top, but at the language level there's still very little. There's async in C#/F#/VB, which is a step in the right direction, but we're far from general vectorizable/parallelizable language constructs that let us naturally write parallel code. In particular, the whole .NET runtime knows nothing of vector or massively parallel operations. We should be able to treat collections like we treat any other variable; why should a variable be something that fits into a scalar register?
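Funnily enough, C++ does have one mostly forgotten corner that works that way: std::valarray overloads the arithmetic operators elementwise, although nothing in the standard requires it to compile down to SIMD. A minimal sketch:

```cpp
#include <cstdio>
#include <valarray>

int main() {
    std::valarray<float> a = {1, 2, 3, 4};
    std::valarray<float> b = {10, 20, 30, 40};

    // Elementwise: c[i] = a[i] + b[i], written as plain "a + b".
    std::valarray<float> c = a + b;

    // Whole-array expressions compose too: d[i] = 2*a[i] + b[i].
    std::valarray<float> d = 2.0f * a + b;

    for (std::size_t i = 0; i < c.size(); ++i)
        std::printf("%.0f %.0f\n", c[i], d[i]);
    return 0;
}
```

It never caught on, and the mapping to vector hardware is left entirely to the implementation, which is really my point: the language can express the operation, but nothing ties that expression to the parallel silicon underneath.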

 

Meanwhile our SSE/AVX registers and thousands of programmable shaders sit idle, and that's if you're lucky enough that your program actually uses all CPU cores, because doing that with threads and locks is mind-bogglingly hard.

 
Have you ever written OpenCL code? Man, that is horrible. It's like writing assembly code: low-level, error-prone, and far removed from the actual logic you're trying to implement. Somehow I refuse to believe a computer cannot be more effective than a human at figuring this stuff out.
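To show what I mean, here's roughly what the host side of a trivial vector add looks like (a sketch: the kernel and variable names are mine, and error checking, build-log handling and the clRelease cleanup are all omitted, which only makes it look better than real code):

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSrc =
    "__kernel void vadd(__global const float* a,"
    "                   __global const float* b,"
    "                   __global float* c) {"
    "    size_t i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main() {
    const size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // A pile of setup calls before a single useful instruction runs.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    // Explicit buffers and copies: the synchronization cost discussed earlier.
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);

    // The kernel itself is compiled from a string at runtime.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vadd", nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    clEnqueueNDRangeKernel(queue, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);

    std::printf("c[0] = %f\n", c[0]);  // all of that ceremony for c[i] = a[i] + b[i]
    return 0;
}
```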
 
I realize that I'm only proposing vague ideas and haven't done any research on their feasibility, but I just feel like there's something wrong with basing our modern languages on assumptions made in the 70s. After all, C was just syntactic sugar over popular assembly-level patterns; we need syntactic sugar that matches what good assembly code looks like today, and for-looping over a collection one element at a time isn't generally a good solution anymore.

On the topic of languages, I agree with the guy in the talk that automatic parallelisation of scalar code will never work. The more fundamental problem is that our programming languages are scalar by nature. Why is it that in 2014 there's no programming language (AFAIK) where the sum of two arrays is an array containing the sums of their respective elements? Oh sure, you can call into a specific library that tries to shoehorn parallel constructs on top, but at the language level there's still very little. There's async in C#/F#/VB, which is a step in the right direction, but we're far from general vectorizable/parallelizable language constructs that let us naturally write parallel code. In particular, the whole .NET runtime knows nothing of vector or massively parallel operations. We should be able to treat collections like we treat any other variable; why should a variable be something that fits into a scalar register?

 

Meanwhile our SSE/AVX registers and thousands of programmable shaders sit idle, and that's if you're lucky enough that your program actually uses all CPU cores, because doing that with threads and locks is mind-bogglingly hard.

Languages tend to reflect the hardware they run on. For the most part, commodity processors are not really vector processors except in floating-point workloads, so languages tend to reflect that in their scalar nature. There are some languages capable of expressing vector operations, Fortran and Matlab for example, and Fortran is by far the language the majority of useful HPC applications are coded in.

 

I would caution that there are fundamentally two different topics when it comes to parallelization: data parallelism and task parallelism. The former is arguably easier to express but much more restrictive in terms of application than the latter (read: task parallelism is an open research topic). SIMD parallelization, in the form of vectorization or shaders, is a form of data parallelism. Semantically, it is fairly easy to express at the language level, and as such it could be done. But I'm not sure it needs to be. Vectorization can be, and is, done at the compiler level, and it is generally considered a compiler optimization. On the GPU front, we have CUDA/OpenCL instead of direct shader programming (technically CUDA is a language extension, if we are being strict).

 

So, I said above that task parallelism is an open research topic. Let me expand on that. For a few decades in the commodity sector we were running on sequential machines, so there was simply no need to extend parallel language semantics (or runtime semantics). We had simple threading libraries/semantics, which were enough at that point. What we are seeing now are extensions driven by well-researched task-parallelism semantics (async, futures, promises, etc.) that are a logical extension of the existing threading functionality. You are not likely to see much else, because paradigm shifts are, simply put, difficult. Getting programmers to understand or use fundamentally different programming semantics is almost impossible (in HPC, people are still using MPI and Fortran for that reason). Research in the last 5-6 years has been on how to express dependencies between tasks, but, again, this reflects fundamental changes in how programmers express parallelism in applications, so, again, it is difficult. In the consumer sphere, I do not think we are going to see languages or runtimes that deviate much from what is currently available. What you'll see is simply more logical extensions to existing frameworks, and that's about it.
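To make the "logical extension of existing threading functionality" point concrete, fork-join in C++11 terms looks like this (a toy sketch; the array-summing task and the 50/50 split are just made-up examples):

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Fork-join: spawn a task, keep working, then join on the result.
// Same model as pthreads create/join or async/await, just with lighter semantics.
double sum_range(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
    return std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
}

int main() {
    std::vector<double> data(1 << 20, 1.0);

    // Fork: the first half is summed on another thread.
    auto half = std::async(std::launch::async, sum_range,
                           std::cref(data), std::size_t{0}, data.size() / 2);

    // The "parent" keeps going with the second half.
    double rest = sum_range(data, data.size() / 2, data.size());

    // Join: block until the forked task finishes.
    double total = half.get() + rest;
    std::printf("total = %f\n", total);
    return 0;
}
```

The decomposition into tasks is still entirely on the programmer; the library only makes the spawn/join plumbing painless.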

 

 

Have you ever written OpenCL code? Man, that is horrible. It's like writing assembly code: low-level, error-prone, and far removed from the actual logic you're trying to implement. Somehow I refuse to believe a computer cannot be more effective than a human at figuring this stuff out.

 
I realize that I'm only proposing vague ideas and haven't done any research on their feasibility, but I just feel like there's something wrong with basing our modern languages on assumptions made in the 70s. After all, C was just syntactic sugar over popular assembly-level patterns; we need syntactic sugar that matches what good assembly code looks like today, and for-looping over a collection one element at a time isn't generally a good solution anymore.

 

I haven't done OpenCL specifically, but I've done CUDA and Cell (the latter is much worse). Everything I work with these days tends to be worse than those, though :laugh:. I think the problem is that we don't know where to go, paradigm-wise, for task parallelism. We have data parallelism pretty much down (compiler- and semantic-expression-wise). It may not be "easy" from a programmer's standpoint, but it IS doable. You may see CUDA/OpenCL as horrible, but they are really not that bad a way to express data parallelism (remember, they're not meant for anything but HPC programming). For task parallelism, we really have no idea.


Languages tend to reflect the hardware they run on. For the most part, commodity processors are not really vector processors except in floating-point workloads, so languages tend to reflect that in their scalar nature. There are some languages capable of expressing vector operations, Fortran and Matlab for example, and Fortran is by far the language the majority of useful HPC applications are coded in.

Floating-point and integer workloads. It really bugs me that the highest-level option we currently have for writing vector code is dropping down to intrinsics in C++. I should be able to just add two arrays, or to express matrix multiplication in such a way that I know it will compile to vectorized code. For now, auto-vectorisation only exists as an optimization in C++, and even then it's far from obvious whether a given piece of code can be auto-vectorized. You can't even express "please give me a compiler error if the following code is not vectorizable". The problem is that the code is scalar code that has to be reverse-engineered into parallel code, rather than parallel code that can be naturally translated into parallel instructions, just like scalar C code translates naturally into scalar assembly code.
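For the record, this is what "dropping down to intrinsics" looks like for something as trivial as adding two arrays (an SSE sketch; it assumes the length is a multiple of four just to stay short):

```cpp
#include <cstddef>
#include <immintrin.h>

// What you'd like to write, hoping the optimizer vectorizes it:
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// What you write when you want a guarantee: explicit SSE intrinsics.
// (Sketch: assumes n % 4 == 0 and that unaligned loads/stores are acceptable.)
void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
```

The second version is guaranteed to use the vector unit but is welded to one ISA width and says nothing about intent; the first says exactly what I mean but comes with no guarantee at all. That gap is what I'd like the language to close.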
 
Never mind the fact that apart from C/C++ there's no auto-vectorization anywhere: not in .NET, not in Java (AFAIK), not in JavaScript, nowhere. And I don't really blame the authors of these runtimes, because auto-vectorizing scalar code is a really hard problem and you can rarely guarantee that the behavior remains the same (C even added a restrict keyword just to give aliasing hints to the compiler). But if our languages allowed us to express parallel data operations, then the translation to vectorized code would be much more obvious, no?
 
So, I said above that task parallelism is an open research topic. Let me expand on that. For a few decades in the commodity sector we were running on sequential machines, so there was simply no need to extend parallel language semantics (or runtime semantics). We had simple threading libraries/semantics, which were enough at that point. What we are seeing now are extensions driven by well-researched task-parallelism semantics (async, futures, promises, etc.) that are a logical extension of the existing threading functionality. You are not likely to see much else, because paradigm shifts are, simply put, difficult. Getting programmers to understand or use fundamentally different programming semantics is almost impossible (in HPC, people are still using MPI and Fortran for that reason). Research in the last 5-6 years has been on how to express dependencies between tasks, but, again, this reflects fundamental changes in how programmers express parallelism in applications, so, again, it is difficult. In the consumer sphere, I do not think we are going to see languages or runtimes that deviate much from what is currently available. What you'll see is simply more logical extensions to existing frameworks, and that's about it.

 

async really changes the way programmers think about task parallelism, because now they can write concurrent code just like sequential code: no more callbacks, natural exception handling, etc.; the transformation is done systematically by the compiler. That's a great development, but at the same time it's an extension bolted onto essentially sequential languages. I wonder how much farther we could go by designing a language to be concurrent from the ground up, rather than sequential first with concurrency sprinkled on top. Perhaps that stuff is already well figured out in academia, but I'm still using a language whose name starts with "C" and staring at a switch case inside a for loop right now, and I know that none of that can ever be automatically parallelized.
 
I'm probably not being very constructive, but at least I get to express my thoughts to someone who understands them, so thanks for the conversation.
 
I haven't done OpenCL specifically, but I've done CUDA and Cell (the latter is much worse). Everything I work with these days tends to be worse than those, though :laugh:. I think the problem is that we don't know where to go, paradigm-wise, for task parallelism. We have data parallelism pretty much down (compiler- and semantic-expression-wise). It may not be "easy" from a programmer's standpoint, but it IS doable. You may see CUDA/OpenCL as horrible, but they are really not that bad a way to express data parallelism (remember, they're not meant for anything but HPC programming). For task parallelism, we really have no idea.

 

How is it "not all that bad"? It's like writing assembly code: low-level, error-prone, and far removed from the actual domain logic. We got rid of the need to write assembly code in the 70s, and by the 90s programs had gotten so complex that it was downright impossible to write large ones in assembly. OpenCL is the assembly code of parallel programming. We need to develop some language on top of it so anyone can write complex programs. Right now everyone is writing their own framework on top of it, but isn't that proof that none of these are satisfactory solutions?


A lot of talk is going around that AMD is not going to release new CPUs this year, since there's no sign of the Steamroller core coming to CPUs, which is a bit disappointing to say the least. I'm no real big fan of APUs; I'm one of the people with a mindset of "moar cores, moar GHz, to hell with the TDP". But it doesn't say anywhere that AMD won't release new CPUs with Vishera cores. Sure, the FX-9590 is a beast and the 8350 is actually already more than enough, but I'm still hoping for something new.

I don't know if their dedicated processor line is actually done. They implied recently that it isn't; they just had nothing to announce.

 

As to the other discussion: did OpenCL 2.0 change much? I haven't looked at it, as there isn't enough OpenCL software out there yet for me to get terribly interested.


Never mind the fact that apart from C/C++ there's no auto-vectorization anywhere: not in .NET, not in Java (AFAIK), not in JavaScript, nowhere. And I don't really blame the authors of these runtimes, because auto-vectorizing scalar code is a really hard problem and you can rarely guarantee that the behavior remains the same (C even added a restrict keyword just to give aliasing hints to the compiler). But if our languages allowed us to express parallel data operations, then the translation to vectorized code would be much more obvious, no?

 

I'm not 100% sure about this, but I think the JVM can generate SIMD instructions; a cursory search seems to agree. I'm no expert, though, just interested in the topic.


I'm not 100% sure about this, but I think the JVM can generate SIMD instructions; a cursory search seems to agree. I'm no expert, though, just interested in the topic.

You're right; it looks like the HotSpot VM supports it (source). Then again, that article argues in the same direction I do: give programmers better tools at the language level to make use of SIMD instructions.


 

Floating-point and integer workloads. It really bugs me that the highest-level option we currently have for writing vector code is dropping down to intrinsics in C++. I should be able to just add two arrays, or to express matrix multiplication in such a way that I know it will compile to vectorized code. For now, auto-vectorisation only exists as an optimization in C++, and even then it's far from obvious whether a given piece of code can be auto-vectorized. You can't even express "please give me a compiler error if the following code is not vectorizable". The problem is that the code is scalar code that has to be reverse-engineered into parallel code, rather than parallel code that can be naturally translated into parallel instructions, just like scalar C code translates naturally into scalar assembly code.

Architecturally speaking, for many generations Intel abandoned enhancements to the integer portions of their SIMD pipelines. Basically, what we've had is SSE front-end instructions shoehorned onto half of the floating-point AVX data path. Personally, I've always considered it a kludge to maintain backwards compatibility with SSE more than anything. With AVX2 (in Haswell), though, Intel finally has proper support across workloads.

 

Compiler-wise, you can usually make low-level language compilers tell you whether they are promoting operations to vector equivalents (http://gcc.gnu.org/projects/tree-ssa/vectorization.html; ICC and VS have options too). When you say you have to reverse-engineer code into a parallel form, I don't really see what effective difference it makes whether you have SIMD operations in the language or not. Suppose the language has extensions and you can natively write vector operations. To use them, you need to be able to decompose your problem into a parallel SIMD form. Without the language extensions, you still need to decompose your problem into a parallel SIMD form. The only effective difference is that you write the loop explicitly and the compiler generates the vector operation for you, as opposed to writing out the operation yourself. There are generally intrinsic instructions if you don't want to write the loop yourself, though. The point is that the process of decomposition and expression is there regardless.
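Concretely, something like the following; the exact flag spellings vary by compiler and version, so treat the command lines in the comment as approximate rather than gospel:

```cpp
#include <cstddef>

// A loop the vectorizer can handle, provided it can prove the arrays don't alias.
// __restrict is the common non-standard C++ spelling of C99's restrict; it is an
// unchecked promise from the programmer to the compiler.
void saxpy(float* __restrict y, const float* __restrict x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Asking the compiler what it did (approximate; exact flags depend on the version):
//   g++ -O3 -ftree-vectorize -fopt-info-vec saxpy.cpp        (newer GCC)
//   g++ -O3 -ftree-vectorizer-verbose=2 saxpy.cpp            (older GCC)
//   icc -O3 -vec-report2 saxpy.cpp
//   cl /O2 /Qvec-report:2 saxpy.cpp                          (MSVC)
// Drop the __restrict qualifiers and watch the report change, or the loop stay scalar.
```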

 

 

Never mind the fact that apart from C/C++ there's no auto-vectorization anywhere: not in .NET, not in Java (AFAIK), not in JavaScript, nowhere. And I don't really blame the authors of these runtimes, because auto-vectorizing scalar code is a really hard problem and you can rarely guarantee that the behavior remains the same (C even added a restrict keyword just to give aliasing hints to the compiler). But if our languages allowed us to express parallel data operations, then the translation to vectorized code would be much more obvious, no?

Again, these are high-level languages that aren't meant for high performance (in fact their ISAs/bytecodes lack any notion of vectorization support). They were explicitly designed to forgo low-level control in exchange for automatic optimization and memory management. Manual vectorization would be against those goals. So, to answer the question: yes, you could feasibly enable vectorization opportunities through language extensions, but that goes against the design of the languages. Really, the point is that this is the wrong class of languages for such semantics. If you want low-level control for optimization opportunities, then use, or interface with, a low-level language.

 

 

async really changes the way programmers think about task parallelism, because now they can write concurrent code just like sequential code: no more callbacks, natural exception handling, etc.; the transformation is done systematically by the compiler. That's a great development, but at the same time it's an extension bolted onto essentially sequential languages. I wonder how much farther we could go by designing a language to be concurrent from the ground up, rather than sequential first with concurrency sprinkled on top. Perhaps that stuff is already well figured out in academia, but I'm still using a language whose name starts with "C" and staring at a switch case inside a for loop right now, and I know that none of that can ever be automatically parallelized.

Async/await is just fork-join parallelism: you spawn a task and later join back to it. It's the typical style of parallelism that most languages and threading libraries support (pthreads, OpenMP, Windows threads, etc.). The difference here is that it has simplified semantics compared to traditional threading libraries, so it's easier to start/manage/join tasks. I don't see why this changes the way programmers think about parallelism. Regardless, the programmer has to go through the process of identifying fork-join parallelism and decomposing programs into tasks. It isn't as if they can get away with thinking sequentially.

 

If you want languages that are concurrent from the ground up, then look into functional languages (Haskell, Lisp, etc.). These languages consist of statements that are semantically concurrent, and thus, given infinite resources, could be executed perfectly in parallel (as much as data dependencies allow). The problem is that they don't map well to real machines. Real machines have limited resources, and the granularity of tasks is important for performance. By switching your language to be inherently parallel, you are giving the compiler the task of automatically (auto-magically) mapping parallelism onto limited resources, or, more strictly put: the compiler is responsible for optimizing perfectly parallel code to fit on a machine with limited parallel resources (chunking things into tasks and SIMD instructions). This doesn't necessarily end well, because it is actually a fairly difficult problem from a compiler standpoint. Hand-tuned manual parallelism using control-flow languages tends to yield better results. And that's where you always end up, because you don't have infinite resources, so you are either (1) parallelizing a sequential program, (2) letting a compiler parallelize a sequential program, or (3) letting a compiler sequentialize a parallel program.

 

 

How is it "not all that bad"? It's like writing assembly code: low-level, error-prone, and far removed from the actual domain logic. We got rid of the need to write assembly code in the 70s, and by the 90s programs had gotten so complex that it was downright impossible to write large ones in assembly. OpenCL is the assembly code of parallel programming. We need to develop some language on top of it so anyone can write complex programs. Right now everyone is writing their own framework on top of it, but isn't that proof that none of these are satisfactory solutions?

If you are going to make comparisons about difficulty, then this is the more valid comparison: x86 SIMD assembly is akin to PTX assembly (NVIDIA assembly). Those are at similar levels of difficulty. OpenCL/CUDA is not on the level of either of those, because you don't need to know specific ISA details to write OpenCL/CUDA code. Most things are abstracted away from you, and it has actual high-level language concepts like variables, memory management, tasking, etc. Such concepts don't exist in assembly. Of course, these languages are designed to map closely to the architectures they run on and to give you relatively low-level control, so they aren't exactly super easy to use.

 

You seem to be making two arguments: you want more control in higher-level languages, but less control in lower-level languages. Complex programs require complex languages, and you aren't going to get a silver-bullet easy-to-use-with-lots-of-power-and-runs-well language. The fact is, nobody has an answer for what the perfect language or runtime is: it's simply an open research topic. Currently, from what I've seen, people think the best alternative is a layered approach. At the low level you have languages that map well to your architecture, and at a higher level you have languages that are transformed into the lower-level ones. You expose architectural details in the low-level languages, but not in the high-level ones. On top of that, you expose some sort of hinting system at the high level to help with parallel optimizations (e.g. group these tasks). This is just a vague trend I'm alluding to that has been occurring in HPC.


Architecturally speaking, for many generations Intel abandoned enhancements to the integer portions of their SIMD pipelines. Basically, what we've had is SSE front-end instructions shoehorned onto half of the floating-point AVX data path. Personally, I've always considered it a kludge to maintain backwards compatibility with SSE more than anything. With AVX2 (in Haswell), though, Intel finally has proper support across workloads.

 

I take it what you mean here is that the SSE ops under AVX still only address xmm registers instead of ymm, correct?

 

Excuse the newbie question, by the way, but I have to say I absolutely love these kinds of discussions, and it's a shame there aren't more like them.


I take it what you mean here is that the SSE ops under AVX still only address xmm registers instead of ymm, correct?

 

Excuse the newbie question, by the way, but I have to say I absolutely love these kinds of discussions, and it's a shame there aren't more like them.

Yeah, and it also adds features like variable vector shifts and gather (non-adjacent element accesses) to the ISA. I'm not being entirely fair in my assessment, but Intel really didn't seem to care about it all that much. It seems to me to have been more of an afterthought; they probably only now got the die space in the redesigns to do it properly.
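For reference, this is roughly what those two additions look like as intrinsics (a Haswell-only sketch; the function is purely illustrative and needs AVX2 enabled at compile time, e.g. -mavx2):

```cpp
#include <immintrin.h>

// Two of the AVX2 integer additions: per-lane variable shifts and gathers from
// non-contiguous addresses. Neither existed for integer code under SSE/AVX.
__m256i shift_and_gather(const int* table, __m256i values, __m256i shifts, __m256i indices) {
    // Variable shift: each 32-bit lane of 'values' is shifted left by its own count.
    __m256i shifted = _mm256_sllv_epi32(values, shifts);

    // Gather: load table[indices[i]] for all eight lanes in one instruction
    // (scale = 4 bytes per int).
    __m256i gathered = _mm256_i32gather_epi32(table, indices, 4);

    return _mm256_add_epi32(shifted, gathered);
}
```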


Yeah, and it also adds features like variable vector shifts and gather (non-adjacent element accesses) to the ISA. I'm not being entirely fair in my assessment, but Intel really didn't seem to care about it all that much. It seems to me to have been more of an afterthought; they probably only now got the die space in the redesigns to do it properly.

 

AVX-512 seems to support this from the start, so they seem to have "learned from their mistake" so to speak.

 

I'm curious, however: how well does doubling the register size / element count scale? To my mind, logic would dictate a 2x speedup per doubling, but reality is often not best-case.

 

This seems to be another case for standardising instruction-set-extension (ISE) development/implementation, in my opinion (likewise with the FMA3/FMA4 mess), not that Intel would allow that, mind you.


AVX-512 seems to support this from the start, so they seem to have "learned from their mistake" so to speak.

 

I'm curious, however: how well does doubling the register size / element count scale? To my mind, logic would dictate a 2x speedup per doubling, but reality is often not best-case.

 

This seems to be another case for standardising instruction-set-extension (ISE) development/implementation, in my opinion (likewise with the FMA3/FMA4 mess), not that Intel would allow that, mind you.

It is double the performance in well-optimized code. You can't necessarily pack instructions perfectly (or maybe at all in some cases) unless you are doing something simple like matrix multiplication, though, so in real-world applications it is less.

 

EDIT: speaking of FMA3/4, yeah, that business was a headache. Though I don't think it was Intel's fault, per se. I was under the impression that both Intel and AMD switched halfway through to the opposite variant. It looks like FMA3 is the standard going forward, so it should be settled.

 

EDIT2: Also, to be fair, that doesn't necessarily get rid of all the issues. Some AMD cores used to share FMAC units (e.g. one unit per two cores), where you could either issue a single 256-bit-wide FMA instruction on a single core or two 128-bit-wide FMAs on adjacent cores. In practice you saw better performance doing the latter. The point being that you don't end up with unified optimizations regardless.
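For reference, the two code paths I'm describing look like this with FMA3 intrinsics (a sketch; it needs an FMA-capable part and the compiler's FMA switch, e.g. -mfma):

```cpp
#include <immintrin.h>

// FMA3: a * b + c as one rounded operation. On the shared-FMAC AMD modules described
// above, two of the 128-bit form on adjacent cores tended to beat one 256-bit
// instruction per module; on Intel you would normally just use the 256-bit form.

// 256-bit: eight floats per instruction.
__m256 fma_wide(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}

// 128-bit: four floats per instruction, issued independently on each core.
__m128 fma_narrow(__m128 a, __m128 b, __m128 c) {
    return _mm_fmadd_ps(a, b, c);
}
```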


It is double the performance in well-optimized code. You can't necessarily pack instructions perfectly (or maybe at all in some cases) unless you are doing something simple like matrix multiplication, though, so in real-world applications it is less.

 

EDIT: speaking of FMA3/4, yeah, that business was a headache. Though I don't think it was Intel's fault, per se. I was under the impression that both Intel and AMD switched halfway through to the opposite variant. It looks like FMA3 is the standard going forward, so it should be settled.

 

EDIT2: Also, to be fair, that doesn't necessarily get rid of all the issues. Some AMD cores used to share FMAC units (e.g. one unit per two cores), where you could either issue a single 256-bit-wide FMA instruction on a single core or two 128-bit-wide FMAs on adjacent cores. In practice you saw better performance doing the latter. The point being that you don't end up with unified optimizations regardless.

 

Yeah, unfortunate for people with bdver1 based chips, but bdver2+ supports FMA3.

 

As to the FMAC stuff, yeah, I hinted at as much in my initial post referencing the improvements in Excavator/bdver4. (Presumably 2x 256-bit FMACs?)

 

What I don't understand, though, is that with the changes landing post-Bulldozer (like Steamroller's additional decoder), aren't AMD essentially just edging slowly back toward a "standard" architecture? It seems rather odd to make that change just as their HSA work is reaching early maturity.


Yeah, unfortunate for people with bdver1 based chips, but bdver2+ supports FMA3.

 

As to the FMAC stuff, yeah, I hinted at as much in my initial post referencing the improvements in Excavator/bdver4. (Presumably 2x 256-bit FMACs?)

 

What I don't understand, though, is that with the changes landing post-Bulldozer (like Steamroller's additional decoder), aren't AMD essentially just edging slowly back toward a "standard" architecture? It seems rather odd to make that change just as their HSA work is reaching early maturity.

Do you mean why 2x 256-bit FMACs? Probably because it's easier to design and lay out, and it uses less die space. I'm sure they aren't wasting any space in their design.

 

But yeah, they'll continue to move toward fewer and fewer shared parts of the pipeline in future designs. At the end of the day it's always going to be a balance of how much more you can shove into the die, though. Isn't AMD still a process node behind Intel? If so, that's probably a constraint that keeps them sharing resources.


Do you mean why 2x 256-bit FMACs? Probably because it's easier to design and lay out, and it uses less die space. I'm sure they aren't wasting any space in their design.

 

But yeah, they'll continue to move toward fewer and fewer shared parts of the pipeline in future designs. At the end of the day it's always going to be a balance of how much more you can shove into the die, though. Isn't AMD still a process node behind Intel? If so, that's probably a constraint that keeps them sharing resources.

 

The "presumably" part was in regard to what constitutes Excavator's "FPU improvements"; I don't think they've been detailed yet. All that's known thus far is that it supports AVX2, amongst other things.

 

As far as process size goes, I had read that the issue was due to contractual obligations (soon to end) with the spun-off GloFo; the Radeon 7xx0 parts fabbed at TSMC are/were 28nm vs GloFo's 32nm.


The "presumably" part was in regard to what constitutes Excavator's "FPU improvements"; I don't think they've been detailed yet. All that's known thus far is that it supports AVX2, amongst other things.

 

As far as process size goes, I had read that the issue was due to contractual obligations (soon to end) with the spun-off GloFo; the Radeon 7xx0 parts fabbed at TSMC are/were 28nm vs GloFo's 32nm.

Eh, they're still a few years behind Intel in that regard, then (that's always going to be the case at this point, methinks). It's funny, because Intel doesn't necessarily have to make great designs. They can just ride the benefits of a better process and make easy improvements. That's not to say they don't do good designs, but they'd stay afloat even if they didn't.


Eh, they're still a few years behind Intel in that regard, then (that's always going to be the case at this point, methinks). It's funny, because Intel doesn't necessarily have to make great designs. They can just ride the benefits of a better process and make easy improvements. That's not to say they don't do good designs, but they'd stay afloat even if they didn't.

Or pay OEMs to ignore the competition when they can't match them.


Or pay OEMs to ignore the competition when they can't match them.

And then have the resulting lawsuit yield just a slap on the wrist :-D

