AMD 2014 Roadmap Discussion (CPU/APU)



Anyone who keeps an eye on AMD products should already have seen the new 2014 roadmap, which consists mostly of new-gen APUs.

 

Now, this is speculation and an attempt to "debunk" the rumors on the internet.

 

A lot of talk is going around that AMD is not going to release new CPUs this year, since there's no sign of the Steamroller core coming to CPUs, which is a bit disappointing to say the least. I'm no real big fan of APUs. I'm one of those people with a mindset of "Moar cores, moar ghz, hell with the TDP". But it doesn't say anywhere that AMD won't release new CPUs with Vishera cores. Sure, the FX-9590 is a beast and the 8350 is actually more than enough already, but I'm still hoping for something new.

 

Also, since Keller (lead architect of the original K8/Athlon 64) is back at AMD, I doubt they will leave the FX line hanging just like that; I expect some changes to bring their CPUs back into the game. Maybe a bit late, with a next-gen architecture, but that's my opinion.

 

What do you guys think? :) 

No Steamroller FX is disappointing to say the least. That said, Kaveri is basically the Xbox One/PS4 architecture coming for desktop and it's bigger than people think. It's currently very hard to gain any performance improvement using GPGPU, except in very specific scenarios, simply because of the synchronizing costs between the two memories (VRAM and system RAM). This restriction vanishes with a unified memory architecture, meaning the GPU can be used to accelerate even small ordinary tasks. It's not unthinkable that we'll see auto-"GPU"ing code optimizers in the same way we already have auto-vectorizing optimizers.

 

On that note, we really need new programming languages designed from the ground up with parallelism and asynchrony in mind, rather than based on 70s era assumptions (all the C-family) (/editorial).

 

CPUs will not get much faster than Sandy Bridge in the foreseeable future, so heterogeneous computing is really where the big performance gains will be. Kaveri has a huge advantage over traditional CPUs on that front. I can only see Intel catching up with AMD.

 

Also, Kaveri will simply kill the low-end discrete GPU, and probably even mid-range GPUs on laptops. AMD demoed it at 30fps in BF4 with medium settings using Mantle.

GPU auto-tuners (and, more generally, load-balancing frameworks where you write one generic code and run it everywhere: CPU, GPU, Xeon Phi, etc.) are an active research topic even with current non-unified memory architectures. One thing I think is important to consider is that, at the end of the day, from an HPC perspective you will never see a global unified memory architecture. You may get it on single nodes (even then it is NUMA), but you won't get it beyond that because it simply doesn't scale.

 

On the topic of languages, I was at a conference last year where this Intel guy (Tim Mattson) discussed languages (he's a software guy, not a hardware guy just to be clear). Essentially, he said that languages are a dime a dozen because people continuously create new parallel programming models instead of using or improving existing models. In the video below, about 20 minutes in he has this discussion (this is a year before I saw him). It's pretty interesting:

 

 

EDIT: on a side note, he also asked me how I foresee handling of soft-errors in HPC during my talk at said conference and then really did not like my answer and was very vocal about it -- he's very vocal about his opinions and beliefs. I'm still a bit miffed about that, but I digress.

 

 

As for Kaveri killing the low-end discrete GPU, there's no surprise there. I expect the mid-range component market to stop existing eventually. The general trend is that more and more of the feature set migrates on-die as time goes on and process shrinks occur.

Oh, certainly not at HPC scale; I'm just talking about unified memory on personal computers, not distributed computing.

 

On the topic of languages, I agree with the guy in the talk that automatic parallelisation of scalar code will never work. The more fundamental problem is that our programming languages are by nature scalar. Why is it that in 2014, there's no programming language (AFAIK) where the sum of two arrays is an array containing the sum of their respective elements? Oh sure you can call into a specific library that tries to shoehorn parallel constructs on top, but at the language level there's still very little. There's async in C#/F#/VB which is a step in the right direction, but we're far from general vectorizable/parallelizable language constructs allowing us to naturally write parallel code. In particular the whole .NET runtime knows nothing of vector or massively parallel operations. We should be able to treat collections like we treat any other variable; why should a variable be something that fits into a scalar register?

 

Meanwhile our SSE/AVX registers and thousands of programmable shaders sit idle, and that's if you're lucky enough that your program actually uses all CPU cores, because doing that with threads and locks is mind-bogglingly hard.

 
Have you ever written OpenCL code? Man that is horrible. This is like writing assembly code; it's low-level, error-prone and highly remote from the actual logic you're trying to implement. Somehow I can't believe a computer cannot be more effective than a human at figuring this stuff out.
 
I realize that I'm only proposing vague ideas and haven't done any research on their feasibility, but I just feel like there's something wrong with basing our modern languages on assumptions made in the 70s. After all, C was just syntactic sugar on popular assembly-level patterns; we need syntactic sugar that matches current-day good assembly code, and for-looping over a collection one element at a time isn't a generally good solution anymore.

Languages tend to reflect the hardware they run on. For the most part, commodity processors are not really vector processors except in floating-point workloads, so languages tend to reflect that in their scalar nature. There are some languages that are capable of expressing vector operations: Fortran and Matlab, for example. Fortran is by far the language that the majority of useful HPC applications are coded in.
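To make the "sum of two arrays is an array" idea concrete, here is a minimal sketch in Python, picked only for readability. The `Vec` class is hypothetical and not from any library; Fortran expresses the same thing natively with whole-array expressions such as `c = a + b`. Nothing here is actually vectorized. It only illustrates the semantics: the programmer states the element-wise operation and leaves the looping to the language or runtime.

```python
class Vec:
    """Hypothetical fixed-length vector with element-wise addition."""

    def __init__(self, *items):
        self.items = list(items)

    def __add__(self, other):
        # Element-wise sum; both operands must have the same length.
        if len(self.items) != len(other.items):
            raise ValueError("length mismatch")
        return Vec(*(a + b for a, b in zip(self.items, other.items)))

    def __eq__(self, other):
        return isinstance(other, Vec) and self.items == other.items

    def __repr__(self):
        return f"Vec{tuple(self.items)}"


a = Vec(1.0, 2.0, 3.0)
b = Vec(10.0, 20.0, 30.0)
c = a + b  # reads like scalar code, but operates on whole collections
```

With this kind of construct the element-wise intent is stated directly in the source instead of being buried in a hand-written loop.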

 

I would caution that there are fundamentally two different topics when it comes to parallelization: data parallelism and task parallelism. The former category is arguably easier to express but much more restrictive in terms of application than the latter (read as: task parallelism is an open research topic). SIMD parallelization in the form of vectorization or shaders is a form of data-parallelism. Semantically, they are fairly easy to express at the language level, and as such, it could be done. But I'm not sure it needs to be. Vectorization can be and is done at the compiler level and it is generally considered to be a compiler optimization. On the GPU front, we have CUDA/OpenCL instead of direct shader programming (technically CUDA is a language extension if we are being strict).
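The distinction can be sketched in a few lines of Python using the standard `concurrent.futures` module. The function names below are made up for illustration, and CPython threads will not actually speed up CPU-bound work like this; the point is only the shape of the decomposition. In the data-parallel case the same operation runs over slices of one collection, while in the task-parallel case different, independent operations run side by side.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(chunk, factor):
    """The one operation applied to every element of a chunk."""
    return [x * factor for x in chunk]


def data_parallel_scale(values, factor, workers=4):
    """Data parallelism: split one collection, run the SAME function on each piece."""
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        parts = list(pool.map(scale, chunks, [factor] * len(chunks)))
    return [x for part in parts for x in part]


def task_parallel_stats(values):
    """Task parallelism: run DIFFERENT, independent functions concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        total = pool.submit(sum, values)
        largest = pool.submit(max, values)
        return total.result(), largest.result()
```

The first function is easy to generalize because every chunk is handled identically. The second has to know which pieces of work are independent, which is the part that gets hard once tasks depend on each other.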

 

So, I said above that task parallelism is an open research topic. Let me expound upon that. For a few decades, in the commodity sector, we were running on sequential machines, so there was simply no need to extend parallel language semantics (or runtime semantics). We had simple threading libraries/semantics, which were enough at that point. What we are seeing now is extensions driven by well-researched task parallelism semantics (async, futures, promises, etc.) that are a logical extension of the existing threading functionality. You are not likely to see much else because paradigm shifts, simply put, are difficult. Getting programmers to understand or use fundamentally different programming semantics is almost impossible (in HPC, people are still using MPI and Fortran for that reason). Research in the last 5-6 years has been on how to express dependencies between tasks, but, again, this reflects fundamental changes in how programmers express parallelism in applications, so it is difficult. In the consumer sphere, I do not think we are going to see either languages or runtimes that deviate much from what is currently available. What you'll see is simply more logical extensions to existing frameworks, and that's about it.

 

 


I haven't done OpenCL specifically, but I've done CUDA and Cell (the latter is much worse). Everything I work with these days tends to be worse than those though :laugh:. I think the problem is that we don't know where to go paradigm-wise for task parallelism. We have data parallelism pretty much down (compiler-wise and in terms of semantic expression). It may not be "easy" from a programmer's standpoint, but it IS doable. You may see CUDA/OpenCL as horrible, but they are really not all that bad a way to express data parallelism (remember, it is not meant for anything but HPC programming). For task parallelism, we really have no idea.

SIMD covers floating-point and integer workloads, not just floating point. It really bugs me that the highest-level option we currently have for writing vector code is dropping down to intrinsics in C++. I should be able to just add two arrays, or I should be able to express matrix multiplication in such a way that I know it will compile to vectorized code. For now, auto-vectorisation only exists as an optimization in C++, and even then it's far from obvious whether a given piece of code can be auto-vectorized. You can't even express "please give me a compiler error if the following code is not vectorizable". The problem is that the piece of code is scalar code that has to be reverse-engineered into parallel code, rather than parallel code that can naturally be translated into parallel instructions, just like scalar C code translates naturally into scalar assembly code.
 
Never mind the fact that apart from C/C++ there's no auto-vectorization anywhere. Not on .NET, not on Java (AFAIK), not in JavaScript, nowhere. And I don't really blame the authors of these runtimes, because auto-vectorizing scalar code is a really hard problem and you can rarely guarantee the behavior remains the same (C99 actually added the restrict keyword, which C++ compilers offer as an extension such as __restrict, just to give hints to the compiler). But if our languages allowed us to express parallel data operations, then the translation to vectorized code would be much more obvious, no?
 
async really changes the way programmers think about task parallelism because now they can write concurrent code just like sequential code: no more callbacks, natural exception handling, etc.; the transformation is done systematically by the compiler. That's a great development, but at the same time it's an extension on top of essentially sequential languages. I wonder how much farther we could go by designing a language to be concurrent from the ground up, rather than sequential first with concurrency sprinkled on top. Perhaps that stuff is already well figured out in academia, but I'm still using a language whose name starts with "C" and staring at a switch case inside a for loop right now, and I know that none of that can ever be automatically parallelized.
 
I'm probably not being very constructive, but at least I get to express my thoughts at someone that understands them, so thanks for the conversation.
 
How is it "not all that bad"? It's like writing assembly code: low-level, error-prone, and highly remote from the actual domain logic. We got rid of the need to write assembly code in the 70s, and by the 90s programs had gotten so complex that it was downright impossible to write large programs in assembly code. OpenCL is the assembly code of parallel programming. We need to develop some language on top of it so anyone can write complex programs. Right now everyone is writing his own framework on top of it, but isn't that proof that none of these are satisfactory solutions?

Back on the roadmap: I don't know if AMD's dedicated CPUs are actually done. They implied recently that they weren't; they just had nothing to announce.

 

As to the other discussion, did OpenCL 2.0 change much?  I haven't looked at it as there isn't enough OpenCL stuff yet for me to get terribly interested.

I'm not 100% sure on this, but I think the JVM can generate SIMD instructions; a cursory search seems to agree. I'm no expert though, just interested in the topic.

You're right, it looks like the HotSpot VM supports it (source). Then again that article argues in the same direction as I do: giving programmers better tools at the language level to make use of SIMD instructions.

 

Architecturally speaking, for many generations Intel abandoned enhancements to the integer portions of their SIMD pipelines. Basically, what we've had is SSE front-end instructions shoehorned onto half of the floating-point AVX data path. Personally, I've always considered it a kludge to maintain backwards compatibility with SSE more than anything. With AVX2 (in Haswell), though, Intel finally has proper support across workloads.

 

Compiler-wise, you can usually make low-level language compilers tell you whether they are promoting operations to vector equivalents (http://gcc.gnu.org/projects/tree-ssa/vectorization.html; ICC and VS have options also). When you say that you have to reverse-engineer code to be parallel, I don't really see the effective difference whether you have SIMD operations in the language or not. Suppose the language has extensions and you can natively do vector operations in it. In order to use those, you need to be able to decompose your problem into a parallel SIMD form. Similarly, even without the language extensions, you still need to decompose your problem into a parallel SIMD form. The only effective difference is that you have to explicitly write the loop and the compiler generates the vector operation for you, as opposed to you writing out the operation yourself. There are generally intrinsics if you don't want to write the loop yourself, though. The point is that the process of decomposition and expression is there regardless.

 

 

Again, .NET, Java, and JavaScript are high-level languages that aren't designed for high performance (in fact, the ISAs lack any notion of vectorization support). They were explicitly designed to forgo low-level control in favor of automatic optimization and memory management. Manual vectorization would be against those goals. So, to answer the question: yes, you could feasibly enable vectorization opportunities through language extensions, but that goes against the design of the languages. Really, the point is that it is the wrong class of languages for such semantics. If you want low-level control for optimization opportunities, then use or interface with a low-level language.

 

 

Async/await is just fork-join parallelism, where you spawn a task and then later join back to it. It's the typical style of parallelism that most languages and threading libraries support (pthreads, OpenMP, Windows threads, etc.). The difference here is that it has simplified semantics compared to traditional threading libraries, so it's easier to start, manage, and join tasks. I don't see why this changes the way programmers think about parallelism. Regardless, the programmer has to go through the process of identifying fork-join parallelism and decomposing programs into tasks. It isn't as if they can get away with thinking sequentially.
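The spawn-then-join shape described here can be sketched with Python's standard `asyncio` module (Python is used purely for illustration; the thread is about C#-style async). The `fetch_length` helper and its delays are hypothetical stand-ins for real asynchronous work.

```python
import asyncio


async def fetch_length(text, delay):
    """Hypothetical stand-in for some asynchronous operation."""
    await asyncio.sleep(delay)
    return len(text)


async def fork_join():
    # Fork: both tasks are scheduled and make progress concurrently.
    first = asyncio.create_task(fetch_length("steamroller", 0.02))
    second = asyncio.create_task(fetch_length("kaveri", 0.01))
    # Join: wait for each result before combining them.
    return (await first) + (await second)


def run_fork_join():
    return asyncio.run(fork_join())
```

The code reads sequentially, but the programmer still had to decide which pieces of work are independent enough to be forked and where the results have to be joined.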

 

If you want languages that are concurrent from the ground up, then look into functional languages (Haskell, Lisp, etc.). These languages consist of statements that are semantically concurrent and thus, given infinite resources, could be executed perfectly in parallel (as much as data dependencies allow). The problem is that they don't map well to real machines. Real machines have limited resources, and the granularity of tasks is important for performance. By switching to an inherently parallel language, you are giving the compiler the task of automatically (auto-magically) mapping parallelism to limited resources or, put more strictly, the compiler is responsible for optimizing perfectly parallel code to fit on a machine with limited parallel resources (chunking things into tasks and SIMD instructions). This doesn't necessarily end well, because it is actually a fairly difficult problem from a compiler standpoint. Hand-tuned manual parallelism using control-flow languages tends to yield better results. And that's where you always end up: because you don't have infinite resources, you are either (1) parallelizing a sequential program yourself, (2) letting a compiler parallelize a sequential program, or (3) letting a compiler sequentialize a parallel program.

 

 

If you are going to compare difficulty, then this is a more valid comparison: x86 SIMD assembly is akin to PTX assembly (NV assembly). Those are at similar levels of difficulty. OpenCL/CUDA is not on the level of either of those because you don't need to know specific ISA details to write OpenCL/CUDA code. Most things are abstracted away from you, and it has actual high-level language concepts like variables, memory management, tasking, etc. Such concepts don't exist in assembly. Of course, these languages are designed to map closely to the architectures they run on and give you relatively low-level control, so they aren't exactly super easy to use.

 

You seem to be making two arguments to me: you want more control in higher-level languages, but you want less control in lower-level languages. Complex programs require complex languages, and you aren't going to get a silver bullet of an easy-to-use-with-lots-of-power-and-runs-well language. The fact is, no one has an answer for what the perfect language or runtime is: it's simply an open research topic. Currently, from what I've seen, people think the best alternative is a layered approach. At the low level you have languages that map well to your architecture, and at a higher level you have languages that are transformed into lower-level languages. You expose architectural details in the low-level languages, but not in the high-level languages. On top of that, you expose some sort of hinting system at the high level to help with parallel optimizations (e.g. group these tasks). This is just a vague trend I'm alluding to that has been occurring in HPC.

On the SSE-over-AVX point, I take it what you mean is the fact that the SSE ops under AVX still only address xmm instead of ymm, correct?

 

Excuse the newbie question by the way, but I have to say I absolutely love these kinds of discussions and it's a shame there aren't more like it.

Yeah, and AVX2 also adds features like vector shifts and non-adjacent element accesses to the ISA. I'm not exactly being fair in my assessment, but Intel really didn't seem to care about it all that much. It seems to me to be more of an afterthought: they probably only now got the die space in the redesigns to do it properly.

AVX-512 seems to support this from the start, so they seem to have "learned from their mistake" so to speak.

 

I'm curious however, how well does doubling register size / element count scale? To my mind logic would dictate a 2x speedup per increase, but reality is often not best-case.

 

This seems to be another case for having standardisation of ISE development/implementation in my opinion (likewise with the mess of FMA3/4 etc), not that Intel would allow that mind you.

It is double the performance in well-optimized code. You can't necessarily pack instructions perfectly (or maybe at all in some cases) unless you are doing something simple like matrix multiplication, though, so in real-world applications it is less.
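A rough way to put numbers on "it is less" is an Amdahl's-law style estimate. This model is an assumption for illustration, not a measurement: it supposes that some fraction of the runtime scales perfectly with vector width and the rest is unaffected. The function and parameter names are made up.

```python
def width_doubling_speedup(vector_fraction, width_ratio=2.0):
    """Estimated overall speedup when vector width grows by width_ratio.

    vector_fraction: share of runtime (0.0 to 1.0) assumed to scale
    perfectly with vector width; the remainder is assumed unchanged.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / width_ratio)


best_case = width_doubling_speedup(1.0)  # fully vectorized code: 2.0
typical = width_doubling_speedup(0.5)    # half vectorized: about 1.33
```

Under this model the ideal 2x only shows up when essentially all of the runtime is in perfectly packed vector code; anything scalar drags the overall gain down quickly.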

 

EDIT: Speaking of FMA3/4: yeah, that business was a headache. I don't think it was Intel's fault per se, though. I was under the impression that both Intel and AMD switched halfway through to the opposite. It looks like FMA3 is the standard going forward, so it should be settled.

 

EDIT2: Also, to be fair, that doesn't necessarily get rid of all issues. Some AMD cores used to share FMAC units (e.g. one unit per two cores), where you could either issue a single 256-bit FMA instruction on a single core or two 128-bit FMAs on adjacent cores. In practice you saw better performance by doing the latter. The point is that you don't end up with unified optimizations regardless.

Yeah, unfortunate for people with bdver1 based chips, but bdver2+ supports FMA3.

 

As to the FMAC stuff, yeah, I hinted at as much in my initial post referencing improvements in Excavator/bdver4. (Presumably 2x 256-bit FMACs?)

 

What I don't understand though, is with the changes that are landing post-Bulldozer (Like Steamroller's additional decoder), aren't AMD essentially just edging slowly back to having a "standard" architecture? Seems rather odd to make that change just when their HSA stuff is reaching early maturity.

Do you mean why 2x 256-bit FMACs? Probably because it's easier to design and lay out, and uses less die space. I'm sure they aren't wasting any space in their design.

 

But yeah, they'll continue to move forward with fewer and fewer shared parts of the pipeline in future designs. It's always going to be a balance of what more you can shove into the die at the end of the day, though. Isn't AMD still a process size behind Intel? If so, that'd probably be a constraint in continuing to share resources.

 

The "presumably" part was in regard to what constitutes Excavator's "FPU Improvements"; I don't think they've been detailed yet. All that's known thus far is that it supports AVX2, amongst other things.

 

As far as process size goes, I had read that the issue was due to contractual obligations (soon to end) with the spun-off GloFo; the Radeon 7xx0 parts fabbed at TSMC are/were 28nm vs GloFo's 32nm.

Eh, they are still a few years behind Intel in that regard then (that's always going to be the case at this point, methinks). It's funny because Intel doesn't necessarily have to make great designs. They can just ride the benefits of a better process and make easy improvements. That's not to say that they don't do good designs, but they'd stay afloat even if they didn't.

Or pay OEMs to ignore the competition when they can't match them.

This topic is now closed to further replies.